Semiparametric posterior limits
under local asymptotic exponentiality
Bas Kleijn (Korteweg-de Vries Institute for Mathematics, University of Amsterdam)
Bartek Knapik (Department of Mathematics, VU University Amsterdam)
February 2015
Abstract
Consider a nonparametric model of densities that display discontinuities; estimation
of only the location of discontinuity is an example of an irregular semiparametric question
in which likelihoods are approximated using Ibragimov and Has’minskii’s local asymptotic
exponentiality (LAE). It is shown that, under certain conditions on model and prior, the
marginal posterior for an LAE parameter of interest displays Bernstein–von Mises-type
asymptotic behaviour with exponential distributions forming the limiting sequence. The
main result is applied to examples of semiparametric estimation of LAE location and scaling parameters. Although the ML and MAP estimators are inefficient under this form of
irregularity due to bias, posterior means attain minimax risk bounds. This suggests that
with regard to asymptotic optimality, the posterior is more informative than the maximum of its density alone, and it urges focus on efficiency of Bayesian methods in their
own right, rather than on comparison with frequentist procedures like the MLE.
Keywords: Asymptotic posterior exponentiality; Posterior limit distribution; Local asymptotic exponentiality; Semiparametric statistics; Irregular estimation; Bernstein–von Mises;
Densities with jumps; Efficiency in irregular models; Exponential limit experiment.
1 Introduction
In recent years, asymptotic efficiency of Bayesian semiparametric methods has enjoyed much
attention. The general question concerns a nonparametric model P in which exclusive interest goes to the estimation of a sufficiently smooth, finite-dimensional functional of interest.
Asymptotically, regularity of the estimator combined with the Cramér-Rao bound in the
Gaussian location model that forms the limit experiment [28] fixes the rate of convergence
to n−1/2 and poses a bound to the accuracy of regular estimators expressed, e.g. through
Hájek’s convolution [12] and asymptotic minimax theorems [13]. In the regular Bayesian context,
efficiency of estimation is best captured by the Bernstein–von Mises limit (see, e.g. Le Cam
and Yang (1990) [31]).
Since frequentist parametric theory for regular estimates extends quite effortlessly to regular semiparametric problems, one may be tempted to believe that semiparametric extensions
of Bernstein–von Mises-type asymptotic behaviour should proceed without essential problems.
However, the matter appears more delicate than that: although far from developed fully, some
general considerations of Bayesian semiparametric efficiency are found in [4, 7, 34, 36, 1]. In
addition, there are many model- and prior-specific instances of the Bernstein–von Mises limit,
e.g. [3, 5, 6, 17, 18, 24, 25, 26]. When combined, these results appear to urge caution and
make clear that there are no simple, straightforward answers, not even when one restricts
attention to the relatively well-behaved context of Gaussian (e.g. white-noise) models with
Gaussian process priors. Limits of posteriors on sieves are considered in Ghosal (1999, 2000)
[9, 10] and Bontemps (2011) [2]. Kim and Lee (2004) [19], Kim (2006, 2009) [20, 21] even
consider infinite-dimensional limiting posteriors.
The quintessential example of a non-regular estimation problem calls for estimation of
a point of discontinuity of a density: consider an almost-everywhere differentiable Lebesgue
density on R that displays a jump at some point θ ∈ R. Examples of such problems arise, e.g.,
in auction models where the density of accepted bids jumps from zero to a positive value at
the auction price (see Section 2 for more details and other examples). Estimators for θ exist
that converge at rate n−1 with exponential limit distributions [16]. To illustrate the form
that this conclusion takes in a simple Bayesian context, consider the more specific example
of a θ ∈ R parametrizing Fθ (x) = (1 − e−λ(x−θ) ) ∨ 0, where λ > 0 is fixed and known. Let
X1 , X2 , . . . form an i.i.d. sample from Fθ0 , for some θ0 . It is easy to see that the maximum
likelihood estimator θ̂n is equal to the minimum of the sample X(1) . Moreover, n(X(1) − θ0 ) is
exponentially distributed with rate λ for every n ≥ 1, showing that the maximum likelihood
estimator is consistent at rate n−1 . However the MLE has a bias of size 1/(nλ) and the
de-biased estimate outperforms the MLE in terms of asymptotic variance (see, e.g., Le Cam
(1990) [32]):
$$P_0\bigl(n(X_{(1)} - \theta_0)\bigr)^2 = \frac{2}{\lambda^2}, \qquad P_0\Bigl(n\Bigl(X_{(1)} - \frac{1}{n\lambda} - \theta_0\Bigr)\Bigr)^2 = \frac{1}{\lambda^2}, \tag{1}$$
(where P0 f denotes the expectation of a random variable f under θ0 ). So the role of the bias
does not wash out completely in the asymptotic limit and the MLE is inefficient in that it
does not minimize asymptotic L2 -loss. The following theorem describes the posterior limit in
this example.
Theorem 1.1. Assume that X1 , X2 , . . . form an i.i.d. sample from Fθ0 , for some θ0 . Let
π : R → (0, ∞) be a continuous Lebesgue probability density. Then the associated posterior
distribution satisfies,
$$\sup_A \Bigl|\Pi_n\bigl(\theta \in A \mid X_1,\ldots,X_n\bigr) - \mathrm{Exp}^-_{X_{(1)},\,n\lambda}(A)\Bigr| \xrightarrow{P_{\theta_0}} 0,$$
where $\mathrm{Exp}^-_{X_{(1)},\,n\lambda}$ is the negative exponential distribution with rate $n\lambda$ supported on $(-\infty, X_{(1)}]$.
While the MAP estimator can be expected to follow the MLE, by contrast the means $\tilde\theta_n = X_{(1)} - 1/(n\lambda)$ of the limiting exponential distributions follow the de-biased version of
the MLE. Consequently, the asymptotic variance of the posterior mean lies strictly below that
of the ML or MAP estimates! As a matter of fact the posterior mean is optimal here, since
1/λ2 is the lower bound for the (localized) squared risk in the exponential experiment (see,
e.g., Theorem V.5.3 in [16]). At the least, this suggests strongly that posteriors are more
informative than the maxima of their densities alone and thus urges focus on efficiency (or
minimax optimality) of Bayesian methods in their own right, without immediate comparison
with ML methods.
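The risk comparison in display (1) is easy to check numerically. The following sketch (a Monte Carlo illustration under assumed values of $\theta_0$, $\lambda$, $n$; not part of the paper) simulates the MLE $X_{(1)}$ and the de-biased estimator, which coincides with the mean of the limiting posterior $\mathrm{Exp}^-_{X_{(1)},\,n\lambda}$.

```python
import numpy as np

# Monte Carlo check of the localised risks in display (1); theta0, lam,
# n and reps are illustrative choices, not quantities from the paper.
rng = np.random.default_rng(0)
theta0, lam, n, reps = 2.0, 1.5, 200, 20000

# i.i.d. samples from F_theta0(x) = (1 - exp(-lam (x - theta0))) v 0
X = theta0 + rng.exponential(scale=1.0 / lam, size=(reps, n))
mle = X.min(axis=1)                  # maximum likelihood estimator X_(1)
post_mean = mle - 1.0 / (n * lam)    # mean of Exp^-_{X_(1), n lam}

risk_mle = np.mean((n * (mle - theta0)) ** 2)        # approx 2 / lam^2
risk_pm = np.mean((n * (post_mean - theta0)) ** 2)   # approx 1 / lam^2
print(risk_mle, risk_pm)
```

The localised squared risk of the MLE comes out close to $2/\lambda^2$, while the posterior mean attains the smaller value $1/\lambda^2$, in line with the minimax bound cited from [16].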
As a frequentist semiparametric problem, estimation of a support boundary point is well understood [16]: assuming that the distribution Pθ of X is supported on the half-line [θ, ∞)
and an i.i.d. sample X1 , X2 , . . . , Xn is given, one estimates θ using X(1) . If Pθ has an absolutely
continuous Lebesgue density of the form pθ (x) = η(x − θ) 1{x ≥ θ}, its rate of convergence is
determined by the behaviour of the quantity $\epsilon \mapsto \int_0^\epsilon \eta(x)\,dx$ for small values of $\epsilon$. If $\eta(x) > 0$ for $x$ in a right neighbourhood of $0$, then,
$$n\bigl(X_{(1)} - \theta\bigr) = O_{P_\theta}(1).$$
For densities of this form, for any sequence θn that converges to θ at rate n−1 , the Hellinger
distance d obeys (see Theorem VI.1.1 in [16]):
$$n^{1/2}\, d(P_{\theta_n}, P_\theta) = O(1). \tag{2}$$
If we substitute the estimators θn = θ̂n (X1 , . . . , Xn ) = X(1) , uniform tightness of the sequence
in the above display signifies rate optimality of the estimator (cf. Le Cam (1973, 1986)
[29, 30]). Regarding minimal asymptotic variance (or other measures of dispersion of the
limit distribution), note that the (one-sided) distributions for n(X(1) − θ) can always be
improved upon by de-biasing (see, e.g., Section VI.6, examples 1–3 in [16]). However in the
present semiparametric context, there is no immediate de-biasing recipe.
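For the shifted-exponential family of the introduction, the Hellinger distance admits a closed form, which makes the bound (2) easy to illustrate. The sketch below (assumed values for $\lambda$ and the localisation constant $c$; not from the paper) uses that closed form.

```python
import numpy as np

lam, c = 1.5, 3.0   # illustrative rate and localisation constant

def hellinger_shift(delta: float, lam: float) -> float:
    # For p_theta(x) = lam exp(-lam (x - theta)) 1{x >= theta}, the Hellinger
    # affinity between P_{theta+delta} and P_theta is exp(-lam |delta| / 2),
    # hence d(P_{theta+delta}, P_theta) = sqrt(1 - exp(-lam |delta| / 2)).
    return np.sqrt(1.0 - np.exp(-lam * abs(delta) / 2.0))

# With theta_n = theta + c/n, the normalised distance n^{1/2} d(P_{theta_n}, P_theta)
# stays bounded (it increases towards sqrt(lam c / 2)), in line with display (2).
vals = [np.sqrt(n) * hellinger_shift(c / n, lam) for n in (10**2, 10**4, 10**6)]
print(vals)
```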
As a Bayesian semiparametric question, the matter of estimating support boundaries is
not settled by the above: for the posterior, it is the local limiting behaviour of the likelihood
around the point of convergence (see, e.g., Theorems VI.2.1–VI.2.3 in [16]) that determines
convergence rather than the behaviour of any particular statistic. The goal of this paper is
to shed some light on the behaviour of marginal posteriors for the parameter of interest in
semiparametric, irregular estimation problems, through a study of the Bernstein–von Mises
phenomenon. Only the prototypical case of a density of bounded variation, supported on
the half-line [θ, ∞) or on the interval [0, θ], with a jump at θ, is analysed in detail. We
offer some abstraction by considering the class of models that exhibit a weakly converging
expansion of the likelihood called local asymptotic exponentiality (LAE) [16], to be compared
with local asymptotic normality (LAN) in regular problems. As in the parametric case of
Theorem 1.1, this type of asymptotic behaviour of the likelihood is expected to give rise to
an approximately (negative-)exponential marginal posterior, expressed through the irregular
Bernstein–von Mises limit:
$$\sup_A \bigl|\Pi_n\bigl(h \in A \mid X_1,\ldots,X_n\bigr) - \mathrm{Exp}^-_{\Delta_n,\,\gamma_{\theta_0,\eta_0}}(A)\bigr| \xrightarrow{P_0} 0, \tag{3}$$
where h = n(θ − θ0 ) and the random sequence ∆n converges weakly to exponentiality (see
Definition 2.1). The constant γθ0 ,η0 determines the scale in the limiting exponential distribution and, as such, is related to the asymptotic bound for estimators of θ. In this paper, we
explore general sufficient conditions on model and prior to conclude that the limit (3) obtains.
The main theorem is applied in two semiparametric LAE example models, one for a
shift parameter and one for a scale parameter (compare with the two regular semiparametric
questions in Stein (1956) [37]). The former one is an extension of the setting considered in
Theorem 1.1, and is closely related to regression problems with one-sided errors, often arising
in economics. The latter includes the problem of estimating the scale parameter in the family of uniform distributions on [0, θ], θ > 0.
The paper is structured as follows: in Section 2 we first introduce the notion of local
asymptotic exponentiality and then present two semiparametric LAE models satisfying the
exponential Bernstein–von Mises property (3) asymptotically. In Section 3 we give the main
theorem and a corollary that simplifies the formulation. In Section 4, the proof of the main
theorem is built up in several steps, from a particular type of posterior convergence, to an LAE
expansion for integrated likelihoods and on to posterior exponentiality of the type described
by (3). Section 5 contains the proofs of auxiliary results needed in the proof of the main
theorem, as well as verification of the conditions of the simplified corollary for the two models
presented in Section 2.
Notation and conventions
The (frequentist) true distribution of each of the data points in the i.i.d. sample X n =
(X1 , . . . , Xn ) is denoted P0 and assumed to lie in the model P. Associated order statistics are
denoted X(1) , X(2) , . . .. The location-scale family associated with the exponential distribution
is denoted $\mathrm{Exp}^+_{\Delta,\lambda}$ and its negative version by $\mathrm{Exp}^-_{\Delta,\lambda}$. We localise θ by introducing h =
n(θ −θ0 ) with inverse θn (h) = θ0 +n−1 h. The expectation of a random variable f with respect
to a probability measure $P$ is denoted $Pf$; the sample average of $g(X)$ is denoted $\mathbb{P}_n g(X) = (1/n)\sum_{i=1}^n g(X_i)$ and $\mathbb{G}_n g(X) = n^{1/2}(\mathbb{P}_n g(X) - Pg(X))$. If $h_n$ is stochastic, $P^n_{\theta_n(h_n),\eta} f$ denotes the integral $\int f(\omega)\,\bigl(dP^n_{\theta_n(h_n(\omega)),\eta}/dP^n_0\bigr)(\omega)\,dP^n_0(\omega)$. The Hellinger distance between $P$ and
$P'$ is denoted $d(P, P')$ and induces a metric $d_H$ on the space of nuisance parameters $H$ by $d_H(\eta, \eta') = d(P_{\theta_0,\eta}, P_{\theta_0,\eta'})$, for all $\eta, \eta' \in H$. A prior on (a subset Θ of) $\mathbb{R}^k$ is said to be
thick (at θ ∈ Θ) if it is Lebesgue absolutely continuous with a density that is continuous and
strictly positive (at θ).
2 Local asymptotic exponentiality and estimation of support boundary points
The problem of estimating the support boundary of a distribution has important practical
motivations, arising in many areas of finance, like one-sided bond auctions or one-sided risks
in portfolio hedging, search models, production frontier models, as well as truncated- or
censored-regression models. Consider, for instance, a common mechanism to auction bonds:
a set amount of debt is to be issued in the form of bonds; dealers and investors bid, by
stating volume-price pairs they are willing to transact for those bonds. When bids are in,
the issuer fills orders, accepting first the highest prices bid and working downward until the
entire amount of the issuance is filled. The last accepted price is called the auction price;
bids below the auction price are discarded. For many practical purposes, one is not too
concerned with the details of the distribution of accepted prices; the quantity of interest is
the auction price, that is, the lower boundary of the support of the distribution of accepted
bids. Clearly, from the perspective of the investor, accurate estimates of future auction prices are required so as not to overbid (while still having bids accepted), to which end historical auctions
are studied statistically. As a second example, consider also i.i.d. data (X1 , Y1 ), (X2 , Y2 ), . . .,
generated by the regression model,
$$Y_i = f(X_i) + \theta + \epsilon_i,$$
where f denotes a smooth function satisfying f (0) = 0. For simplicity, assume independence
between covariates and errors. Also, assume that the density of $\epsilon_i$ is continuous and supported on $[0, \infty)$ (with non-zero value at 0). In certain applications like the hedging of credit portfolios, where the $Y_i$ commonly represent returns subject to one-sided risk, the main quantity of interest is the parameter θ (which can be viewed as the boundary of the support of the random variable $\theta + \epsilon_i = Y_i - f(X_i)$).
For more details, generality and other examples we refer the reader to Hirano and Porter
(2003) [15], Chernozhukov and Hong (2004) [8] and Hall and van Keilegom (2009) [14] and
references therein.
Throughout this paper we consider estimation of a functional θ : P → R on a nonparametric model P based on a sample X1 , X2 , . . ., distributed i.i.d. according to some unknown
P0 ∈ P. We assume that P is parametrized in terms of a one-dimensional parameter of interest θ ∈ Θ and a nuisance parameter η ∈ H so that we can write P = {Pθ,η : θ ∈ Θ, η ∈ H},
and that P is dominated by a σ-finite measure on the sample space with densities pθ,η . The
set Θ is open in R, and (H, dH ) is an infinite-dimensional metric space (to be specified further at later stages). Assuming identifiability, there exist unique (θ0 , η0 ) ∈ Θ × H such that
P0 = Pθ0 ,η0 . Assuming measurability of the map (θ, η) 7→ Pθ,η and priors ΠΘ on Θ and ΠH
on H, the prior Π on P is defined as the product prior ΠΘ × ΠH on Θ × H, lifted to P. The
subsequent sequence of posteriors [11] takes the form,
$$\Pi_n\bigl(A \mid X_1,\ldots,X_n\bigr) = \int_A \prod_{i=1}^n p(X_i)\,d\Pi(P) \bigg/ \int_{\mathcal{P}} \prod_{i=1}^n p(X_i)\,d\Pi(P), \tag{4}$$
where A is any measurable model subset.
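In the parametric example of Theorem 1.1 the posterior (4) can be evaluated on a grid, which illustrates the total-variation convergence claimed there. The sketch below is an illustration only; the standard normal prior and all numerical values are assumed choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, lam, n = 0.5, 2.0, 200        # illustrative true parameter and sample size
X = theta0 + rng.exponential(1.0 / lam, size=n)
x1 = X.min()

# The likelihood is proportional to exp(n lam theta) 1{theta <= X_(1)}, so the
# posterior density is proportional to pi(theta) exp(n lam (theta - x1)) there;
# pi is a standard normal prior density (thick at theta0); shift by x1 for stability.
grid = np.linspace(x1 - 20.0 / (n * lam), x1, 100_001)
dg = grid[1] - grid[0]
post = np.exp(-0.5 * grid**2) * np.exp(n * lam * (grid - x1))
post /= np.sum(post) * dg             # normalise on the grid

limit = n * lam * np.exp(n * lam * (grid - x1))   # density of Exp^-_{X_(1), n lam}

tv = 0.5 * np.sum(np.abs(post - limit)) * dg      # total-variation distance
print(tv)
```

The total-variation distance is already small at this sample size, because the prior density is nearly constant over the posterior's $O(1/(n\lambda))$-width support near $X_{(1)}$.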
Throughout most of this paper, the parameter of interest θ is represented in localised
form, by centering on θ0 and rescaling: h = n(θ − θ0 ) ∈ R. (We also make use of the inverse
θn (h) = θ0 + n−1 h.) The following (irregular) local expansion of the likelihood is due to
Ibragimov and Has’minskii (1981) [16].
Definition 2.1 (Local asymptotic exponentiality). A model (θ, η) 7→ Pθ,η is said to be locally
asymptotically exponential (LAE) in the θ-direction at θ0 ∈ Θ and η ∈ H if there exists a
sequence of random variables (∆n ) and a positive constant γθ0 ,η such that for all (hn ), hn → h,
$$\prod_{i=1}^n \frac{p_{\theta_0 + n^{-1}h_n,\,\eta}}{p_{\theta_0,\eta}}(X_i) = \exp\bigl(h\,\gamma_{\theta_0,\eta} + o_{P_{\theta_0,\eta}}(1)\bigr)\,1\{h \le \Delta_n\},$$
with $\Delta_n$ converging weakly to $\mathrm{Exp}^+_{0,\,\gamma_{\theta_0,\eta}}$.
This definition should be viewed as an irregular variation on Le Cam’s local asymptotic normality (LAN) [27], which forms the smoothness requirement in the context of the
Bernstein–von Mises theorem (see, e.g. Le Cam and Yang (1990) [31]). An LAE expansion
is expected to give rise to a one-sided, exponential marginal posterior limit, c.f. (3). In the
main result of the paper we use a slightly stronger version of local asymptotic exponentiality.
We say that the model is stochastically LAE if the LAE property holds for every random sequence $(h_n)$ that is bounded in probability; in that case, $h$ in the expansion is replaced by $h_n$.
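In the shifted-exponential family of the introduction the LAE remainder vanishes identically, so the expansion of Definition 2.1 can be checked exactly. The following sketch (illustrative parameter values; $\lambda$ plays the role of $\gamma_{\theta_0,\eta}$) verifies it for a single draw.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, lam, n = 0.5, 2.0, 500   # illustrative values; lam plays the role of gamma
X = theta0 + rng.exponential(1.0 / lam, size=n)

def loglik(theta: float) -> float:
    # log-likelihood of the shifted Exp(lam) model; -inf off the support
    if theta > X.min():
        return -np.inf
    return n * np.log(lam) - lam * float(np.sum(X - theta))

Delta_n = n * (X.min() - theta0)     # Delta_n ~ Exp^+_{0, lam} for every n
h = 0.7 * Delta_n                    # any h <= Delta_n
log_lr = loglik(theta0 + h / n) - loglik(theta0)
# LAE expansion: log LR = h * gamma exactly (zero remainder) when h <= Delta_n,
# and the likelihood ratio is zero (log LR = -inf) when h > Delta_n.
print(log_lr, h * lam)
```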
In the remainder of this section, we turn to two detailed semiparametric examples for
which the likelihood displays an LAE expansion. In Subsection 2.1 the parameter of interest
is a shift parameter, while in Subsection 2.2 we consider a semiparametric scaling family.
2.1 Semiparametric shifts
The so-called location problem is one of the classical problems in statistical inference: let
X1 , X2 , . . . be i.i.d. real-valued random variables, each with marginal Fθ : R → [0, 1], where
θ ∈ R is the location, i.e. the distribution function Fθ is some fixed distribution F shifted over θ: Fθ (x) = F (x − θ). Depending on the nature of F , the corresponding location estimation
problem can take various forms: for instance, in case F possesses a density $f : \mathbb{R} \to [0,\infty)$ that is symmetric around 0 (and satisfies the regularity condition $\int (f'/f)^2(x)\,dF(x) < \infty$), the location θ is estimated at rate $n^{-1/2}$ (equally well whether we know f or not [37]). If F
has a support that is contained in a half-line in R (i.e. if there is a domain boundary), the
problem of estimating the location might become easier, as noticed in the example given in
the introduction.
In this subsection we consider a model of densities with a discontinuity at θ: we assume
that p(x) = 0 for x < θ and p(θ) > 0 while p : R → [0, ∞) is continuous at all x ≥ θ. Observed
is an i.i.d. sample X1 , X2 , . . . with marginal P0 . The distribution P0 is assumed to have a
density of above form, i.e. with unknown location θ for a nuisance density η in some space
H. Model distributions Pθ,η are then described by densities,
$$p_{\theta,\eta} : [\theta,\infty) \to [0,\infty) : x \mapsto \eta(x - \theta),$$
for η ∈ H and θ ∈ Θ ⊂ R. As for the family H of nuisance densities, our interest does not lie in modelling the tail; we concentrate on specifying the behaviour at the discontinuity. For that reason (and in order to connect with Theorem 3.1), we consider the following construction
of the nuisance space H: fix α > 0 and 0 < S < α. Let L denote the ball of radius S in the space $(C[0,\infty], \|\cdot\|_\infty)$ of continuous functions from the extended half-line to $\mathbb{R}$ with uniform
norm. Elements $\dot\ell$ of the ball L are bounded continuous functions with a limit at infinity. We
define the nuisance space H as the image of the ball L under an Esscher transform of the
form,
$$\eta_{\dot\ell}(x) = \frac{e^{-\alpha x + \int_0^x \dot\ell(t)\,dt}}{\int_0^\infty e^{-\alpha y + \int_0^y \dot\ell(t)\,dt}\,dy}, \tag{5}$$
for $x \ge 0$.
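A minimal numerical sketch of the transform (5), assuming a particular element $\dot\ell$ of the ball L and truncating the half-line integral at a point where the $e^{-(\alpha-S)x}$ envelope is negligible:

```python
import numpy as np

alpha, S = 2.0, 1.0                    # fix 0 < S < alpha, so (5) is integrable
ell = lambda t: S * np.cos(t)          # an element of L: sup-norm bounded by S

# Discretise [0, T]; the integrand is dominated by exp(-(alpha - S) x), so
# truncating the half-line at T = 40 loses a negligible amount of mass.
T, m = 40.0, 400_001
x = np.linspace(0.0, T, m)
dx = x[1] - x[0]
# cumulative integral of ell by the trapezoid rule: cum[i] ~ int_0^{x_i} ell
cum = np.concatenate(([0.0], np.cumsum(0.5 * (ell(x[1:]) + ell(x[:-1])) * dx)))
unnorm = np.exp(-alpha * x + cum)                  # numerator of (5)
Z = np.sum(0.5 * (unnorm[1:] + unnorm[:-1])) * dx  # normalising integral of (5)
eta = unnorm / Z                                   # nuisance density eta_ell

jump = eta[0]   # eta(0) = 1/Z > 0: the jump size gamma_{theta0, eta} in the LAE
print(jump)
```

By construction η integrates to one on the (truncated) half-line, decays exponentially, and has a strictly positive value η(0), the discontinuity whose size plays the role of γ in Definition 2.1.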
Properties of this mapping (c.f. Lemma 5.2) guarantee that H consists of functions of
bounded variation, hence Theorem V.2.2 in Ibragimov and Has’minskii (1981) [16] confirms
that the model exhibits local asymptotic exponentiality in the θ-direction, uniformly in η
in a neighbourhood of η0 . In the notation of Definition 2.1, the constant γθ0 ,η equals η(0),
the size of the discontinuity at zero. Since it is not difficult to find a prior on a space
of bounded continuous functions (see, e.g. Lemma 5.8 below), (Borel) measurability of the
Esscher transform as a map between L and H enables a push-forward prior on H.
Theorem 2.2. Let X1 , X2 , . . . be an i.i.d. sample from the location model introduced above
with P0 = Pθ0 ,η0 for some θ0 ∈ Θ, η0 ∈ H. Endow Θ with a prior that is thick at θ0 and L
with a prior ΠL such that L ⊂ supp(ΠL ). Then the marginal posterior for θ satisfies,
$$\sup_A \bigl|\Pi\bigl(n(\theta - \theta_0) \in A \mid X_1,\ldots,X_n\bigr) - \mathrm{Exp}^-_{\Delta_n,\,\gamma_{\theta_0,\eta_0}}(A)\bigr| \xrightarrow{P_0} 0, \tag{6}$$
where $\Delta_n$ is exponentially distributed with rate $\gamma_{\theta_0,\eta_0} = \eta_0(0)$.
Details of the proof of Theorem 2.2 can be found in Subsection 5.3. In the above result α
and S are fixed to facilitate the presentation. For the interested reader, let us also sketch the
situation when α and S are n-dependent: inspection of the proofs in Subsection 5.3 shows
that, as long as αn − Sn stays bounded away from zero, the assertions of Lemmas 5.3 and 5.7
continue to hold. Moreover, if αn +Sn = O(n), also Lemmas 5.4 and 5.6 remain valid. Finally,
αn + Sn should diverge slowly enough in order to verify condition (iii) of Corollary 3.1, with
the help of Lemma 5.5. To exploit these observations in a more Bayesian setting, one can
view α and S as hyper-parameters and endow them with hyper-priors satisfying exponential
upper bounds on mass in the tails, related to the constraints on αn and Sn .
2.2 Semiparametric scaling
Another prototypical statistical problem is related to the scale or dispersion of a probability distribution: let X1 , X2 , . . . be i.i.d. real-valued random variables, each with marginal
Fθ : R → [0, 1], where θ ∈ (0, ∞) is the scale, i.e. the distribution function Fθ is some fixed
distribution F scaled by θ: Fθ (x) = F (x/θ). Again, depending on the nature of F , the corresponding scale estimation problem can take various forms: for instance, in case F possesses
a density $f : \mathbb{R} \to [0,\infty)$ with support $\mathbb{R}$ that is absolutely continuous (and satisfies the regularity condition $\int (1+x^2)(f'/f)^2(x)\,dF(x) < \infty$), the scale θ is estimated at rate $n^{-1/2}$
(equally well whether we know f or not, as conjectured in [37], and studied later in [41] and
[33]). If F is supported on [0, ∞) (or (−∞, 0]), the problem can be reparametrized and viewed
as a regular location problem. When F has a support that is a closed interval with only one
unknown boundary point (i.e. only one point of the support varies with scale), the estimation
problem becomes easier, depending on the amount of probability close to θ. Probably the best
known example of this type is estimation of the scale parameter in the family of uniform distributions on [0, θ], θ > 0, in which $X_{(n)}$ estimates θ at rate $n^{-1}$.
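In the uniform benchmark the localised error $n(\theta - X_{(n)})$ is approximately exponential with rate $1/\theta$, so $X_{(n)}$ is $n^{-1}$-consistent. A short Monte Carlo illustration (with assumed values of θ and the replication count):

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, reps = 3.0, 10000      # illustrative scale and number of replications

means = []
for n in (100, 500):
    X = rng.uniform(0.0, theta0, size=(reps, n))
    nabla = n * (theta0 - X.max(axis=1))   # localised error of X_(n)
    # nabla converges weakly to an exponential distribution with rate
    # 1/theta0 (the uniform density at the moving boundary is 1/theta0),
    # so its mean is close to theta0 for every n.
    means.append(nabla.mean())
print(means)
```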
In this subsection we consider an extension of this uniform example: we assume that
p(x) > 0 for x ∈ [0, θ] and 0 otherwise while p : [0, θ] → [0, ∞) is continuous at all x ∈ (0, θ).
Observed is an i.i.d. sample X1 , X2 , . . . with marginal P0 . The distribution P0 is assumed to
have a density of above form, i.e. with unknown scale θ for a nuisance density η in some space
H. Model distributions Pθ,η are then described by densities,
$$p_{\theta,\eta} : [0,\theta] \to [0,\infty) : x \mapsto \frac{1}{\theta}\,\eta\Bigl(\frac{x}{\theta}\Bigr), \tag{7}$$
for η ∈ H and θ ∈ Θ ⊂ (0, ∞). To properly frame the question and make application of our
results possible, we define the nuisance space H with an Esscher transform again: fix S > 0
and let L denote the ball of radius S in the normed space $(C[0,1], \|\cdot\|_\infty)$ of continuous functions from the unit interval to $\mathbb{R}$ with uniform norm. The space H is defined as the image
of the ball L under the following Esscher transform:
$$\eta_{\dot\ell}(x) = \frac{e^{Sx + \int_0^x \dot\ell(t)\,dt}}{\int_0^1 e^{Sy + \int_0^y \dot\ell(t)\,dt}\,dy}, \tag{8}$$
for $x \in [0,1]$. (Note that the resulting densities are monotone increasing, which facilitates verification of the conditions of Theorem 3.1.)
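A numerical sketch of the transform (8) (with an illustrative choice of $\dot\ell$, not from the paper) confirming the parenthetical remark: the exponent has derivative $S + \dot\ell(x) \ge 0$, so the resulting density is monotone increasing with a positive value η(1) at the moving boundary.

```python
import numpy as np

S = 1.5
ell = lambda t: 0.9 * S * np.sin(2 * np.pi * t)  # in the ball L: sup-norm <= S

m = 200_001
x = np.linspace(0.0, 1.0, m)
dx = x[1] - x[0]
cum = np.concatenate(([0.0], np.cumsum(0.5 * (ell(x[1:]) + ell(x[:-1])) * dx)))
unnorm = np.exp(S * x + cum)                       # numerator of (8)
Z = np.sum(0.5 * (unnorm[1:] + unnorm[:-1])) * dx  # denominator of (8)
eta = unnorm / Z

# The exponent S x + int_0^x ell has derivative S + ell(x) >= 0.1 S > 0 here,
# so eta is increasing; eta(1) > 0 is the jump entering gamma = eta(1)/theta0.
increasing = bool(np.all(np.diff(eta) > 0))
print(increasing, eta[-1])
```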
Theorem V.2.2 in [16] verifies local asymptotic exponentiality in the θ-direction, uniformly
in η, in its positive version. Although it is necessary to flip some signs, this does not pose
problems in applying the results of this paper: we maintain the sign for h and write ∆n = −∇n ,
where ∇n = n(θ0 − X(n) ). In the notation of Definition 2.1, γθ0 ,η = η(1)/θ0 , i.e. the constant
in the limiting exponential distribution equals the size of the discontinuity at the varying
boundary point of the support. Again, we use a push-forward prior on H based on the above
construction and a prior on L .
Theorem 2.3. Let X1 , X2 , . . . be an i.i.d. sample from the scale model introduced above with
P0 = Pθ0 ,η0 for some θ0 ∈ Θ, η0 ∈ H. Endow Θ with a prior that is thick at θ0 , and L with
a prior ΠL such that L ⊂ supp(ΠL ). Then the marginal posterior for θ satisfies,
$$\sup_A \bigl|\Pi\bigl(n(\theta - \theta_0) \in A \mid X_1,\ldots,X_n\bigr) - \mathrm{Exp}^+_{-\nabla_n,\,\gamma_{\theta_0,\eta_0}}(A)\bigr| \xrightarrow{P_0} 0, \tag{9}$$
where $\nabla_n$ is exponentially distributed with rate $\gamma_{\theta_0,\eta_0} = \eta_0(1)/\theta_0$.
As already noted, our scaling and location problems are both LAE, and the parametrizations and solutions we formulate appear to be closely related. However, the nuisance parametrizations are quite different and the relation between the two models is a subtle one. Therefore the location theorem of the previous subsection and the scaling theorem above are very similar in appearance, but answer quite distinct questions. Details of the proof
of Theorem 2.3 can be found in Subsection 5.4.
3 General results
In order to establish the limit (3) (also (6) and (9)), we study posterior convergence of a
particular type, termed consistency under perturbation in [1]. One can compare this type
of consistency with ordinary posterior consistency in nonparametric models, except here the
nonparametric component is the nuisance parameter η and we allow for (stochastic) perturbation by (local) deformations of the parameter of interest θn (hn ) = θ0 + n−1 hn . In regular
situations, this gives rise to accumulation of posterior mass around so-called least-favourable
submodels, but here the parameter of interest is irregular and the situation is less involved:
accumulation of posterior mass occurs around (θn (hn ), η0 ). Therefore, posterior consistency
under perturbation describes concentration in dH -neighbourhoods of the form, (ρ > 0),
D(ρ) = {η ∈ H : dH (η, η0 ) < ρ}.
(10)
To guarantee sufficiency of prior mass around the point of convergence, we use Kullback–
Leibler-type neighbourhoods of the form,
$$K_n(\rho, M) = \Bigl\{\eta \in H : P_0 \sup_{|h| \le M} -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}} \le \rho^2,\ P_0 \sup_{|h| \le M} \Bigl(-1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}}\Bigr)^2 \le \rho^2\Bigr\}, \tag{11}$$
and $K(\rho) = K_n(\rho, 0)$, where in the present LAE setting,
$$A_{\theta_n(h),\eta} = \Bigl\{x : \frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}}(x) > 0\Bigr\}.$$
Note that $\prod_{i=1}^n 1_{A_{\theta_n(h),\eta}}(X_i) = 1\{h \le \Delta_n\}$, as in the LAE expansion.
Suppose that A in (4) is of the form A = B × H for some measurable B ⊂ Θ. Since we
use a product prior ΠΘ × ΠH , the marginal posterior of the parameter θ ∈ Θ depends on the
nuisance factor only through the integrated likelihood,
$$S_n : \Theta \to \mathbb{R} : \theta \mapsto \int_H \prod_{i=1}^n \frac{p_{\theta,\eta}}{p_{\theta_0,\eta_0}}(X_i)\,d\Pi_H(\eta), \tag{12}$$
and its localised version, $h \mapsto s_n(h) = S_n(\theta_0 + n^{-1}h)$. In the subsequent theorem we assume
uniform stochastic LAE expansion of the likelihood in the θ-direction, for all nuisance parameters η in shrinking neighborhoods of η0 . To be precise, for every (ρn ) with ρn ↓ 0 and every
bounded, stochastic (hn ) we define,
$$R_n(h_n) = \sup_{\eta \in D(\rho_n)} \Bigl|\log \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i) - \log \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i) - h_n \gamma_{\theta_0,\eta}\Bigr|,$$
and require Rn (hn ) = oP0 (1). We also consider the rate-free uniform stochastic LAE condition
by taking ρn ≡ ρ > 0. We then say that the uniform stochastic LAE holds for all η in a dH neighbourhood of η0 . In order to control the expansion of the localised integrated likelihood
we also impose a condition based on the following quantity,
$$R'_n(\rho, h_n) = \sup_{\eta \in D(\rho)} \bigl|h_n \gamma_{\theta_0,\eta} - h_n \gamma_{\theta_0,\eta_0}\bigr|,$$
controlling the deviation of the exponent in the LAE expansion for nuisance parameters in
the neighbourhood of η0 .
Theorem 3.1 (Irregular semiparametric Bernstein–von Mises). Let X1 , X2 , . . . be distributed
i.i.d.-P0 , with P0 ∈ P. Let ΠH and ΠΘ be priors on H and Θ and assume that ΠΘ is
thick at θ0 . Suppose that there exists a sequence (ρn ) with ρn ↓ 0, nρ2n → ∞. Furthermore,
assume that θ 7→ Pθ,η is uniformly stochastically LAE in the θ-direction, for all η in shrinking
neighborhoods D(ρn ) of η0 and that γθ0 ,η0 > 0. Assume also that for large enough n, the map
h 7→ sn (h) is continuous on (−∞, ∆n ], P0n -almost-surely. Furthermore, assume that
(i) for all M > 0, there exists a K > 0 such that $\Pi_H(K_n(\rho_n, M)) \ge e^{-Kn\rho_n^2}$, for large enough n,
(ii) and for all n large enough, the Hellinger metric entropy satisfies $N(\rho_n, H, d_H) \le e^{n\rho_n^2}$.
Moreover, for every bounded stochastic (hn ),
(iii) the model satisfies $R'_n(\rho_n, h_n) = o_{P_0}(1)$,
(iv) and for all L > 0, Hellinger distances satisfy the uniform bound,
$$\sup_{\eta \in H \setminus D(L\rho_n)} \frac{d\bigl(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}\bigr)}{d\bigl(P_{\theta_0,\eta}, P_0\bigr)} = o(1).$$
Finally, suppose that,
(v) for every $(M_n)$, $M_n \to \infty$, the posterior satisfies
$$\Pi_n\bigl(|h| \le M_n \mid X_1,\ldots,X_n\bigr) \xrightarrow{P_0} 1.$$
Then the sequence of marginal posteriors for θ satisfies,
$$\sup_A \bigl|\Pi_n\bigl(h \in A \mid X_1,\ldots,X_n\bigr) - \mathrm{Exp}^-_{\Delta_n,\,\gamma_{\theta_0,\eta_0}}(A)\bigr| \xrightarrow{P_0} 0, \tag{13}$$
i.e. the posterior converges to a negative exponential distribution in total variation.
Regarding the nuisance rate of convergence ρn , conditions like (i) and (ii) are expected
in some form or other, in order to achieve consistency under perturbation. As stated, they
almost coincide with requirements for nonparametric convergence at rate (ρn ) [11]. Concerning
condition (iii), note that smoothness conditions on model densities pθ,η bound $R'_n$ uniformly
over the balls D(ρn ) appropriately. For condition (iv), note that, typically, the numerator of
the fraction converges to zero at rate O(n−1/2 ) (see (2)), while the denominator goes to zero
at slower, nonparametric rate. Hence condition (iv) is considered a weak condition that rarely
poses a true restriction to the applicability of the theorem. Condition (v) appears to be the
hardest to verify in applications; yet it cannot be weakened, since (v) also follows from (13).
Besides condition (i), only condition (v) implies a requirement on the nuisance prior ΠH .
Experience with the examples of Section 2 suggests that conditions (i)–(iv) are relatively
weak in applications, while (v) harbours the potential for negative surprises, mainly due
to semiparametric bias leading to sub-optimal asymptotic variance, sub-optimal marginal
rate or even marginal inconsistency. On the other hand, there are conditions under which
condition (v) can be verified: in Section 4.3 we present a model condition that guarantees
marginal posterior convergence according to (v) for any choice of the nuisance prior ΠH
satisfying also condition (i).
Additionally we give a simplified version of Theorem 3.1 that does not make reference to
any specific nuisance rate ρn . In this case, conditions on prior mass and entropy numbers ((i)
and (ii)) essentially require nuisance consistency (at some rate rather than a specific one),
thus weakening requirements on model and prior. More specifically, the new conditions are
comparable to those of Schwartz’s consistency theorem (see Schwartz (1965) [35]).
Corollary 3.1 (Rate-free irregular semiparametric Bernstein–von Mises). Let X1 , X2 , . . . be
distributed i.i.d.-P0 , with P0 ∈ P and let ΠΘ be thick at θ0 . Suppose that θ 7→ Pθ,η is uniformly
stochastically LAE in the θ-direction, for η in a dH -neighbourhood of η0 and that γθ0 ,η0 is
strictly positive. Also assume that for large enough n, the map h 7→ sn (h) is continuous on
(−∞, ∆n ] P0n -almost-surely. Furthermore, assume that,
(i) for all ρ > 0, the Hellinger metric entropy satisfies N (ρ, H, dH ) < ∞, and the nuisance
prior satisfies ΠH (K(ρ)) > 0,
(ii) for every M > 0, there exists an L > 0 such that for all ρ > 0 and large enough n, $K(\rho) \subset K_n(L\rho, M)$,
and that for every bounded, stochastic (hn ),
(iii) there is a function g satisfying g(r) ↓ 0 as r ↓ 0 such that $R'_n(r, h_n) \le g(r)$ (for r small enough),
(iv) and that Hellinger distances satisfy $\sup_{\eta \in H} d(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) = O(n^{-1/2})$.
Finally, assume that,
(v) for every (Mn ), Mn → ∞, the posterior satisfies,
$$\Pi_n\bigl(|h| \le M_n \mid X_1,\ldots,X_n\bigr) \xrightarrow{P_0} 1.$$
Then marginal posteriors for θ converge in total variation to a negative exponential distribution,
$$\sup_A \bigl|\Pi_n\bigl(h \in A \mid X_1,\ldots,X_n\bigr) - \mathrm{Exp}^-_{\Delta_n,\,\gamma_{\theta_0,\eta_0}}(A)\bigr| \xrightarrow{P_0} 0.$$
Proof Under conditions (i), (ii), and (iv) the assertion of Corollary 4.1 holds with some rate ρn. Since ρn ↓ 0, the rate-free uniform stochastic LAE assumption implies the uniform stochastic LAE for η in the shrinking neighbourhoods D(ρn) of η0, for large enough n. Also, due to condition (iii), condition (iii) of Theorem 3.1 is satisfied for large enough n, so the assertion of Theorem 4.2 holds. Condition (v) then suffices for the assertion of Theorem 4.3.
4 Asymptotic posterior exponentiality
In this section we construct the proof of Theorem 3.1 in several steps: the first step (Subsection 4.1) asserts consistency under perturbation under a condition on the nuisance prior ΠH
and a testing condition, and is based on Theorem 3.1 and Corollary 3.3 in Bickel and Kleijn
(2012) [1]. In Subsection 4.2 it is shown that the integral of the likelihood with respect to
the nuisance prior displays an LAE-expansion, if consistency under perturbation obtains and
the model is stochastically LAE in an appropriate, locally uniform way. In the third step,
also discussed in Subsection 4.2, we show that an LAE-expansion of the integrated likelihood
gives rise to a semiparametric exponential limit for the posterior in total variation, if the
marginal posterior for the parameter of interest converges at n−1 -rate. The rate of marginal
convergence depends on control of likelihood ratios, which is discussed in Subsection 4.3. Put
together, the results constitute a proof of Theorem 3.1. Stated conditions are verified in
Section 5 for the two examples of Section 2.
4.1 Posterior convergence under perturbation
Given a rate sequence (ρn ), ρn ↓ 0, we say that the conditioned nuisance posterior is consistent
under n−1 -perturbation at rate ρn , if, for all bounded, stochastic sequences (hn ),
\[ \Pi_n\bigl( H \setminus D(\rho_n) \bigm| \theta = \theta_0 + n^{-1}h_n,\, X_1,\ldots,X_n \bigr) \xrightarrow{\;P_0\;} 0, \]
with D(ρ) as in (10). For a more elaborate discussion of this property, the reader is referred
to Bickel and Kleijn (2012) [1].
Theorem 4.1 (Posterior convergence under perturbation). Assume there is a sequence (ρn), ρn ↓ 0, nρn² → ∞, with the property that for all M > 0 there exists a K > 0 such that,
\[ \Pi_H\bigl( K_n(\rho_n, M) \bigr) \ge e^{-K n \rho_n^2}, \qquad N(\rho_n, H, d_H) \le e^{n \rho_n^2}, \]
for large enough n. Assume also that for all L > 0 and all bounded, stochastic (hn ),
\[ \sup_{\eta \in H \setminus D(L\rho_n)} \frac{d(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta})}{d(P_{\theta_0,\eta}, P_0)} = o(1). \tag{14} \]
Then, for every bounded, stochastic (hn ) there exists an L > 0 such that,
\[ \Pi_n\bigl( H \setminus D(L\rho_n) \bigm| \theta = \theta_0 + n^{-1}h_n,\, X_1,\ldots,X_n \bigr) = o_{P_0}(1). \]
The proof of this theorem can be broken down into two separate steps, with the following
testing condition in between: for every bounded, stochastic (hn ) and all L > 0 large enough,
a test sequence (φn ) and constant C > 0 must exist, such that,
\[ P_0^n \phi_n \to 0, \qquad \sup_{\eta \in H \setminus D(L\rho_n)} P^n_{\theta_n(h_n),\eta}(1 - \phi_n) \le e^{-C L^2 n \rho_n^2}, \tag{15} \]
for large enough n. According to Lemma 3.2 in [1], the metric entropy condition and “cone
condition” (14) suffice for the existence of such a test sequence. While the above testing
argument is instrumental in the control of the numerator of (4), the denominator of the posterior is lower-bounded with the help of the following lemma, which adapts Lemma 8.1 in [11] to the n^{-1}-perturbed, irregular setting. The proof of Theorem 4.1 then follows that of Theorem 3.1 in [1].
Lemma 4.1. Let (hn) be stochastic and bounded by some M > 0. Then,
\[ P_0^n\biggl( \Bigl\{ \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) < e^{-(1+C)n\rho^2}\, \Pi_H\bigl( K_n(\rho, M) \bigr) \Bigr\} \cap \{ h_n \le \Delta_n \} \biggr) \le \frac{1}{C^2 n \rho^2}, \]
for all C > 0, ρ > 0 and n ≥ 1, where θn(hn) = θ0 + n^{-1}hn.
A proof of this lemma can be found in Section 5.
In many applications, (ρn ) does not play an explicit role because consistency at some rate is
sufficient. The following provides a possible formulation of weakened conditions guaranteeing
consistency under perturbation. Corollary 4.1 is based on the family of Kullback–Leibler
neighbourhoods K(ρ) (see the definition right after (11)) that would also play a role for
marginal posterior consistency of the nuisance with known θ0 (as in [11]).
Corollary 4.1. Assume that for all ρ > 0, N (ρ, H, dH ) < ∞ and ΠH (K(ρ)) > 0. Furthermore, assume that for every stochastic, bounded (hn ),
(i) for every M > 0, there exists an L > 0 such that for all ρ > 0 and large enough n,
K(ρ) ⊂ Kn (Lρ, M ).
(ii) Hellinger distances satisfy supη∈H d(Pθn (hn ),η , Pθ0 ,η ) = O(n−1/2 ).
Then there exists a sequence (ρn ), ρn ↓ 0, nρ2n → ∞, such that the conditional nuisance
posterior converges under n−1 -perturbation at rate (ρn ).
Proof See the proof of Corollary 3.3 in Bickel and Kleijn (2012) [1].
4.2 Marginal posterior asymptotic exponentiality
To see how the irregular Bernstein–von Mises assertion (3) arises, we note the following: the marginal posterior density πn : Θ → ℝ for the parameter of interest with respect to the prior ΠΘ is given by,
\[ \pi_n(\theta) = \int_H \prod_{i=1}^n \frac{p_{\theta,\eta}}{p_{\theta_0,\eta_0}}(X_i)\, d\Pi_H(\eta) \bigg/ \int_\Theta \int_H \prod_{i=1}^n \frac{p_{\theta,\eta}}{p_{\theta_0,\eta_0}}(X_i)\, d\Pi_H(\eta)\, d\Pi_\Theta(\theta), \]
P0 -almost-surely. This form resembles that of a parametric posterior density on Θ if one replaces the ordinary, parametric likelihood by the integral of the semiparametric likelihood with
respect to the nuisance prior, c.f. Sn (θ) in (12). If Sn (θ) displays properties similar to those
that lead to posterior asymptotic normality in the smooth parametric case, we may hope that
in the irregular, semiparametric setting the classical proof can be largely maintained. More
specifically, we shall replace the LAN expansion of the parametric likelihood by a stochastic
LAE expansion of the likelihood integrated over the nuisance as in (12). Theorem 4.3 uses
this observation to reduce the proof of the main theorem of this paper to a strictly parametric
discussion.
In this subsection, we prove marginal posterior asymptotic exponentiality in two parts:
first we show that Sn (θ) satisfies an LAE expansion of its own, and second, we use this to
obtain Bernstein–von Mises assertion (3), proceeding along the lines of proofs presented in
Le Cam and Yang (1990) [31], Kleijn and van der Vaart (2012) [23] and Kleijn (2003) [22].
We restrict attention to the case in which the model itself is stochastically LAE and the
posterior is consistent under n−1 -perturbation (although other, less stringent formulations
are conceivable). Both theorems are proven in Section 5.
Theorem 4.2 (Integrated Local Asymptotic Exponentiality). Suppose that for some (ρn), ρn ↓ 0, the model is uniformly stochastically locally asymptotically exponential in the θ-direction, for η in shrinking neighbourhoods D(ρn) of η0, with D(ρ) as in (10), and that condition (iii) of Theorem 3.1 is satisfied. Furthermore, assume that model and prior ΠH are such that for every bounded, stochastic (hn),
\[ \Pi_n\bigl( H \setminus D(\rho_n) \bigm| \theta = \theta_0 + n^{-1}h_n;\, X_1,\ldots,X_n \bigr) \xrightarrow{\;P_0\;} 0. \]
Then the integral LAE-expansion holds, i.e.,
\[ \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) = \int_H \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, d\Pi_H(\eta)\, \exp\bigl( h_n \gamma_{\theta_0,\eta_0} + o_{P_0}(1) \bigr)\, 1_{\{h_n \le \Delta_n\}}, \]
for any stochastic sequence (hn ) ⊂ R that is bounded in P0 -probability.
The following theorem uses the integrated LAE expansion in conjunction with a marginal
posterior convergence condition to derive the exponential Bernstein–von Mises assertion.
Marginal posterior convergence forms the subject of the next subsection.
Theorem 4.3 (Posterior asymptotic exponentiality). Let Θ be open in ℝ with thick prior ΠΘ. Suppose that for every n ≥ 1, h ↦ sn(h) is continuous on (−∞, Δn], P0^n-almost-surely. Assume that for every stochastic sequence (hn) ⊂ ℝ that is bounded in probability,
\[ \frac{s_n(h_n)}{s_n(0)} = \exp\bigl( h_n \gamma_{\theta_0,\eta_0} + o_{P_0}(1) \bigr)\, 1_{\{h_n \le \Delta_n\}}, \tag{16} \]
for some positive constant γθ0,η0, where sn(h) = Sn(θ0 + n^{-1}h), cf. (12). Suppose that for every Mn → ∞, we have,
\[ \Pi_n\bigl( |h| \le M_n \bigm| X_1,\ldots,X_n \bigr) \xrightarrow{\;P_0\;} 1. \tag{17} \]
Then the sequence of marginal posteriors for θ satisfies,
\[ \sup_A \Bigl| \Pi_n\bigl( h \in A \bigm| X_1,\ldots,X_n \bigr) - \mathrm{Exp}^{-}_{\Delta_n,\gamma_{\theta_0,\eta_0}}(A) \Bigr| \xrightarrow{\;P_0\;} 0, \tag{18} \]
i.e. it converges to a negative exponential in total variation, in P0-probability.
Note that Theorem 1.1 is a special case of the above: for given λ > 0, consider Θ = ℝ, H = {λ} and pθ,η(x) = λe^{−λ(x−θ)} for x ≥ θ. Condition (16) is trivially satisfied with γθ0,η0 = λ and condition (17) can be verified using Lemma 4.2 below.
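The special case above can also be checked numerically. The sketch below is an illustration only (the sample size, the rate λ and the grid are arbitrary choices of ours, not taken from the paper): under a flat prior, the posterior of the local parameter h = n(θ − θ0) in the shifted exponential model is proportional to exp(λh) on (−∞, Δn], and the code compares its normalized density with that of Exp⁻_{Δn,λ}.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, theta0, n = 1.5, 0.0, 200
x = theta0 + rng.exponential(1 / lam, size=n)

# The likelihood of theta is proportional to exp(n*lam*theta) on theta <= X_(1),
# so under a flat prior the posterior of h = n(theta - theta0) is proportional
# to exp(lam*h) on (-infty, Delta_n], with Delta_n = n(X_(1) - theta0).
delta_n = n * (x.min() - theta0)
h = np.linspace(delta_n - 12 / lam, delta_n, 4001)
post = np.exp(lam * h)                      # unnormalized posterior density
post /= post.sum() * (h[1] - h[0])          # normalize by a Riemann sum

# Density of Exp^-_{Delta_n, lam}: lam * exp(lam*(h - Delta_n)) for h <= Delta_n.
exact = lam * np.exp(lam * (h - delta_n))
print(np.max(np.abs(post - exact)))         # small: truncation/quadrature error only
```

Here the agreement is exact up to discretisation, since for fixed n the posterior already is a (randomly located) negative exponential in this model.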
4.3 Marginal posterior convergence at n^{-1}-rate
One of the conditions in the main theorem is marginal consistency at rate n^{-1}, so that the posterior measure of a sequence of model subsets of the form,
\[ \Theta_n \times H = \bigl\{ (\theta, \eta) \in \Theta \times H : n|\theta - \theta_0| \le M_n \bigr\}, \]
converges to one in P0-probability, for every sequence (Mn) such that Mn → ∞. Marginal
(semiparametric) posteriors have not been studied extensively or systematically in the literature. As a result, fundamental questions (for example, on semiparametric bias) surrounding
marginal posterior consistency have not yet received the attention they deserve.
Here, we present a straightforward formulation of sufficient conditions, based solely on
bounded likelihood ratios. This has the advantage of leaving the nuisance prior completely
unrestricted but may prove to be too stringent a model condition in some applications. Conceivably [6], the nuisance prior has a much more significant role to play in questions on
marginal consistency. The inadequacy of Lemma 4.2 manifests itself primarily through the
occurrence of a supremum over the nuisance space H in condition (19), a uniformity that appears a bit coarse. It can be refined somewhat by imposing a uniform bound on the likelihood
ratios in a sequence of model subsets that capture most of the full nonparametric posterior
mass. Reservations aside, it appears from the examples of Section 2 that the lemma is also
useful in the form stated.
Lemma 4.2. Let the sequence of maps θ ↦ Sn(θ) be P0^n-a.s. continuous on (−∞, Δn] and exhibit the stochastic integral LAE property. Furthermore, assume that there exists a constant C > 0 such that for any (Mn), Mn → ∞, Mn ≤ n for n ≥ 1, and Mn = o(n),
\[ P_0^n\Bigl( \sup_{\eta \in H} \sup_{\theta \in \Theta_n^c} P_n \log \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}} \le -\frac{C M_n}{n} \Bigr) \to 1. \tag{19} \]
Then, for any nuisance prior ΠH and ΠΘ that is thick at θ0,
\[ \Pi_n\bigl( n|\theta - \theta_0| > M_n \bigm| X_1,\ldots,X_n \bigr) \xrightarrow{\;P_0\;} 0, \]
for any (Mn), Mn → ∞.
Proof Let us first note that if marginal consistency holds for a sequence Mn, then it also holds for any sequence Mn′ that diverges faster (i.e. if Mn = O(Mn′)). Without loss of generality, we therefore assume that Mn diverges more slowly than n, i.e. Mn = o(n). We can also assume Mn ≤ n for n ≥ 1. Define Fn to be the events in (19), so that P0^n(Fn^c) = o(1) by assumption. In addition, let,
\[ G_n = \Bigl\{ (X_1,\ldots,X_n) : \int_\Theta S_n(\theta)\, d\Pi_\Theta(\theta) \ge e^{-C M_n/2}\, S_n(\theta_0) \Bigr\}. \]
By Lemma 4.3, P0^n(Gn^c) = o(1) as well. Hence,
\[ P_0^n\, \Pi_n\bigl( n|\theta - \theta_0| > M_n \bigm| X_1,\ldots,X_n \bigr) \le P_0^n\, \Pi_n\bigl( n|\theta - \theta_0| > M_n \bigm| X^n \bigr)\, 1_{F_n \cap G_n}(X^n) + o(1) \]
\[ \le e^{C M_n/2}\, P_0^n\, \frac{1}{S_n(\theta_0)} \int_H \int_{\Theta_n^c} \prod_{i=1}^n \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}}(X_i) \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(X_i)\, d\Pi_\Theta\, d\Pi_H\, 1_{F_n}(X^n) + o(1). \]
On the events Fn we have,
\[ \int_H \int_{\Theta_n^c} \prod_{i=1}^n \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}}(X_i) \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(X_i)\, d\Pi_\Theta\, d\Pi_H = \int_H \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(X_i) \int_{\Theta_n^c} \exp\Bigl( n P_n \log \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}} \Bigr)\, d\Pi_\Theta\, d\Pi_H \]
\[ \le \int_H \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(X_i)\, d\Pi_H\, \sup_{\eta \in H} \sup_{\theta \in \Theta_n^c} \exp\Bigl( n P_n \log \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}} \Bigr) \le S_n(\theta_0)\, \exp\Bigl( \sup_{\eta \in H} \sup_{\theta \in \Theta_n^c} n P_n \log \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}} \Bigr) \le S_n(\theta_0)\, e^{-C M_n}. \]
Substituting this bound in the previous display yields an upper bound of order e^{C Mn/2} e^{−C Mn} = e^{−C Mn/2} = o(1), which ultimately proves marginal consistency at rate n^{-1}.
In the proof of Lemma 4.2 the lower bound for the denominator of the marginal posterior
comes from the following lemma. (Let Πn denote the prior ΠΘ in the local parametrization
in terms of h = n(θ − θ0 ).)
Lemma 4.3. Let the sequence of maps θ ↦ sn(θ) exhibit the LAE property of (16). Assume that the prior ΠΘ is thick at θ0 (and denoted by Πn in the local parametrization in terms of h). Then,
\[ P_0^n\Bigl( \int s_n(h)\, d\Pi_n(h) < a_n\, s_n(0) \Bigr) \to 0, \]
for every sequence (an), an ↓ 0.
5 Proofs
In this section, several longer proofs from the main text have been collected.
5.1 Proof of Lemma 4.1
Proof
Let C > 0, ρ > 0, and n ≥ 1 be given. If ΠH (Kn (ρ, M )) = 0, the assertion
holds trivially, so we assume ΠH (Kn (ρ, M )) > 0 without loss of generality and consider the
conditional prior Πn (A) = ΠH (A|Kn (ρ, M )) (for measurable A ⊂ H). Since,
\[ \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \ge \Pi_H\bigl( K_n(\rho, M) \bigr) \int \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_n(\eta), \]
we may choose to consider only the neighbourhoods Kn . Restricting attention to the event
{hn ≤ ∆n }, we obtain,
\[ \log \int \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_n(\eta) \ge n P_n \int \log\Bigl( 1_{A_{\theta_n(h_n),\eta}} \frac{p_{\theta_n(h_n),\eta}}{p_0} \Bigr)\, d\Pi_n(\eta) \]
\[ \ge \inf_{|h| \le M} n P_n \int 1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_0}\, d\Pi_n(\eta) \ge n P_n \int \inf_{|h| \le M} 1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_0}\, d\Pi_n(\eta) \]
\[ \ge -\sqrt{n}\, \mathbb{G}_n \int \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_0} \Bigr)\, d\Pi_n(\eta) - n\rho^2, \]
using the definition of Kn in the last step (see (11)). Then,
\[ P_0^n\biggl( \Bigl\{ \int \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_n(\eta) < e^{-(1+C)n\rho^2} \Bigr\} \cap \{ h_n \le \Delta_n \} \biggr) \]
\[ \le P_0^n\biggl( -\mathbb{G}_n \int \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_0} \Bigr)\, d\Pi_n(\eta) < -\sqrt{n}\, C\rho^2 \biggr). \]
By Chebyshev's inequality, Jensen's inequality, Fubini's theorem and the fact that P0^n(𝔾n Zn)² ≤ P0 Zn² for any P0-square-integrable random variable Zn,
\[ P_0^n\biggl( -\mathbb{G}_n \int \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_0} \Bigr)\, d\Pi_n(\eta) < -\sqrt{n}\, C\rho^2 \biggr) \]
\[ \le \frac{1}{n C^2 \rho^4}\, P_0^n \biggl( \mathbb{G}_n \int \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_0} \Bigr)\, d\Pi_n(\eta) \biggr)^2 \le \frac{1}{C^2 n \rho^2}, \]
where the last step follows again from definition (11).
5.2 Proofs of Theorems 4.2 and 4.3, and Lemma 4.3
Proof (of Theorem 4.2)
Let (hn) be bounded in P0-probability. Throughout this proof we write θn(hn) = θ0 + n^{-1}hn. Let δ, ε > 0 be given. There exists a constant M > 0 such that P0^n(|hn| > M) < δ for all n ≥ 1. Since the posterior for η is consistent under perturbation, for large enough n,
\[ P_0^n\Bigl( \log \Pi_n\bigl( D(\rho_n) \bigm| \theta = \theta_n(h_n);\, X_1,\ldots,X_n \bigr) \ge -\varepsilon \Bigr) > 1 - \delta. \]
This implies that the posterior's numerator and denominator are related through,
\[ P_0^n\biggl( \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \le e^{\varepsilon} \int_{D(\rho_n)} \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta),\ |h_n| \le M \biggr) > 1 - 2\delta, \]
for this M and all n large enough. To this we may add that,
\[ \int_{D(\rho_n)} \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \le \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta). \]
We continue with the integrals over D(ρn) under the restriction |hn| ≤ M. Recall that the model is uniformly stochastically locally asymptotically exponential, that is, P0^n(Rn(hn) > ε) < δ for large enough n, where,
\[ R_n(h_n) = \sup_{\eta \in D(\rho_n)} \Bigl| \log \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i) - \log \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i) - h_n \gamma_{\theta_0,\eta} \Bigr|. \]
Therefore,
\[ P_0^n\biggl( e^{-\varepsilon} \int_{D(\rho_n)} \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, e^{h_n \gamma_{\theta_0,\eta}}\, d\Pi_H(\eta) \le \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \]
\[ \le e^{2\varepsilon} \int_{D(\rho_n)} \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, e^{h_n \gamma_{\theta_0,\eta}}\, d\Pi_H(\eta) \biggr) > 1 - 3\delta. \]
According to (iii) of Theorem 3.1, for large enough n, P0^n(R′n(hn) > ε) < δ, so we have,
\[ P_0^n\biggl( e^{-2\varepsilon + h_n \gamma_{\theta_0,\eta_0}} \int_{D(\rho_n)} \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \le \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \]
\[ \le e^{3\varepsilon + h_n \gamma_{\theta_0,\eta_0}} \int_{D(\rho_n)} \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \biggr) > 1 - 4\delta. \]
Since the posterior for η is consistent under perturbation, also,
\[ P_0^n\Bigl( \log \Pi_n\bigl( D(\rho_n) \bigm| \theta = \theta_0 = \theta_n(0);\, X_1,\ldots,X_n \bigr) < -\varepsilon \Bigr) < \delta, \]
for large enough n. We use this as in the beginning of the proof and arrive at,
\[ P_0^n\biggl( e^{-3\varepsilon + h_n \gamma_{\theta_0,\eta_0}} \int_H \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \le \int_H \prod_{i=1}^n \frac{p_{\theta_n(h_n),\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \]
\[ \le e^{3\varepsilon + h_n \gamma_{\theta_0,\eta_0}} \int_H \prod_{i=1}^n \frac{p_{\theta_0,\eta}}{p_0}(X_i)\, d\Pi_H(\eta) \biggr) > 1 - 5\delta. \]
Since this holds for arbitrarily small ε > 0, this proves the desired result.
Proof (of Theorem 4.3)
Let C be an arbitrary compact subset of ℝ containing an open neighbourhood of the origin. Denote the (randomly located) distribution Exp⁻_{Δn,γθ0,η0} by Ξn. The marginal posterior for the local parameter h is denoted Πn. Conditioned on C ⊂ ℝ, these measures are denoted Ξn^C and Πn^C respectively. Define the functions ξn, ξn* : ℝ → ℝ as,
\[ \xi_n^*(x) = \gamma_{\theta_0,\eta_0}\, e^{\gamma_{\theta_0,\eta_0}(x - \Delta_n)}, \qquad \xi_n(x) = \xi_n^*(x)\, 1_{\{x \le \Delta_n\}}, \]
noting that ξn is the Lebesgue density for Ξn. Let dn = log sn(Δn) − log sn(0) − Δn γθ0,η0. Define sn*(h) = sn(h) on (−∞, Δn] and sn*(h) = sn(0) exp(hγθ0,η0 + dn) elsewhere. Finally, define, for every g, h ∈ C and large enough n,
\[ f_n(g,h) = \Bigl( 1 - \frac{\xi_n(h)\, s_n(g)\, \pi_n(g)}{\xi_n(g)\, s_n(h)\, \pi_n(h)} \Bigr)_+ 1_{\{g \le \Delta_n\}}\, 1_{\{h \le \Delta_n\}}, \]
and
\[ f_n^*(g,h) = \Bigl( 1 - \frac{\xi_n^*(h)\, s_n^*(g)\, \pi_n(g)}{\xi_n^*(g)\, s_n^*(h)\, \pi_n(h)} \Bigr)_+, \]
where πn is the Lebesgue density of the prior for the centered and rescaled parameter h. By (16) we know that dn = oP0(1). Furthermore, for every stochastic sequence (hn) in C,
\[ \log s_n^*(h_n) = \log s_n^*(0) + h_n \gamma_{\theta_0,\eta_0} + o_{P_0}(1), \qquad \log \xi_n^*(h_n) = (h_n - \Delta_n)\gamma_{\theta_0,\eta_0} + \log \gamma_{\theta_0,\eta_0}. \]
Since ξn*(h) and ξn(h) (sn*(h) and sn(h), respectively) coincide on {h ≤ Δn}, fn(g,h) ≤ fn*(g,h). For any two stochastic sequences (hn), (gn) in C, πn(gn)/πn(hn) → 1 as n → ∞, since πn is continuous and non-zero at 0. Combination with the above display leads to,
\[ \log \frac{\xi_n^*(h_n)\, s_n^*(g_n)\, \pi_n(g_n)}{\xi_n^*(g_n)\, s_n^*(h_n)\, \pi_n(h_n)} = (h_n - \Delta_n)\gamma_{\theta_0,\eta_0} - (g_n - \Delta_n)\gamma_{\theta_0,\eta_0} + g_n \gamma_{\theta_0,\eta_0} - h_n \gamma_{\theta_0,\eta_0} + o_{P_0}(1) = o_{P_0}(1). \]
Since x ↦ (1 − e^x)_+ is continuous on (−∞, ∞), we conclude that for any stochastic sequence (gn, hn) in C × C, fn*(gn, hn) →^{P0} 0. To render this limit uniform over C × C, continuity is enough: (g, h) ↦ πn(g)/πn(h) is continuous since the prior is thick. Note that ξn*(h)/sn*(h) is of the form γθ0,η0 exp(γθ0,η0(Δn + Rn(h))) for all h, n ≥ 1, and Rn(hn) = oP0(1). Tightness of Δn and Rn implies that ξn*(h)/sn*(h) ∈ (0, ∞), P0^n-almost-surely. Continuity of h ↦ sn(h) and h ↦ ξn*(h) then implies continuity of (g, h) ↦ (ξn*(h) sn*(g))/(ξn*(g) sn*(h)), P0^n-almost-surely. Hence we conclude that,
\[ \sup_{(g,h) \in C \times C} f_n(g,h) \le \sup_{(g,h) \in C \times C} f_n^*(g,h) \xrightarrow{\;P_0\;} 0. \tag{20} \]
Since sn(h) is supported on (−∞, Δn], since C contains a neighbourhood of the origin and since Δn is tight and positive, Ξn(C) > 0 and Πn(C) > 0, P0^n-almost-surely. So conditioning on C is well-defined (for the relevant cases where h ≤ Δn). Let δ > 0 be given and define events,
\[ \Omega_n = \Bigl\{ X^n : \sup_{(g,h) \in C \times C} f_n(g,h) \le \delta \Bigr\}. \]
Based on Ωn and (20), write,
\[ P_0^n \sup_A \bigl| \Pi_n^C(h \in A) - \Xi_n^C(A) \bigr| \le P_0^n \sup_A \bigl| \Pi_n^C(h \in A) - \Xi_n^C(A) \bigr|\, 1_{\Omega_n} + o(1). \]
Note that both Ξn^C and Πn^C have strictly positive densities on C. Therefore, Ξn^C is dominated by Πn^C for all n large enough. With that observation, the first term on the right-hand side of the above display is calculated to be,
\[ \tfrac{1}{2}\, P_0^n \sup_A \bigl| \Pi_n^C(h \in A) - \Xi_n^C(A) \bigr|\, 1_{\Omega_n}(X^n) = P_0^n \int_C \Bigl( 1 - \frac{d\Xi_n^C}{d\Pi_n^C} \Bigr)_+ 1_{\{h \le \Delta_n\}}\, d\Pi_n^C(h)\, 1_{\Omega_n}(X^n) \]
\[ = P_0^n \int_C \Bigl( 1 - \xi_n^C(h)\, \frac{\int_C s_n(g)\, \pi_n(g)\, 1_{\{g \le \Delta_n\}}\, dg}{s_n(h)\, \pi_n(h)} \Bigr)_+ 1_{\{h \le \Delta_n\}}\, d\Pi_n^C(h)\, 1_{\Omega_n}(X^n) \]
\[ = P_0^n \int_C \biggl( \int_C \Bigl( 1 - \frac{s_n(g)\, \pi_n(g)\, \xi_n(h)}{s_n(h)\, \pi_n(h)\, \xi_n(g)} \Bigr)\, 1_{\{g \le \Delta_n\}}\, d\Xi_n^C(g) \biggr)_+ 1_{\{h \le \Delta_n\}}\, d\Pi_n^C(h)\, 1_{\Omega_n}(X^n), \]
for large enough n. Jensen's inequality leads to,
\[ \tfrac{1}{2}\, P_0^n \sup_A \bigl| \Pi_n^C(h \in A) - \Xi_n^C(A) \bigr|\, 1_{\Omega_n}(X^n) \le P_0^n \int \int \Bigl( 1 - \frac{s_n(g)\, \pi_n(g)\, \xi_n(h)}{s_n(h)\, \pi_n(h)\, \xi_n(g)} \Bigr)_+ 1_{\{h \le \Delta_n\}}\, 1_{\{g \le \Delta_n\}}\, d\Xi_n^C(g)\, d\Pi_n^C(h)\, 1_{\Omega_n}(X^n) \]
\[ \le P_0^n \int \int \sup_{(g,h) \in C \times C} f_n(g,h)\, d\Xi_n^C(g)\, d\Pi_n^C(h)\, 1_{\Omega_n}(X^n) \le \delta. \]
We conclude that for all compact C ⊂ ℝ containing a neighbourhood of the origin, we obtain P0^n ‖Πn^C − Ξn^C‖ → 0. To finish the argument, let (Cm) be a sequence of closed balls centred at the origin with radii Mm → ∞. For each fixed m ≥ 1 the above display holds with C = Cm, so if we traverse the sequence (Cm) slowly enough, convergence to zero can still be guaranteed, i.e. there exists a subsequence mn → ∞ such that for Mn := M_{mn} and a sequence of closed balls (Bn) centred at the origin with radii Mn, we observe P0^n ‖Πn^{Bn} − Ξn^{Bn}‖ → 0. Since Πn(ℝ \ Bn) →^{P0} 0 by (17), the following lemma shows that (18) holds.
Lemma 5.1. Let (Bn) be a sequence of balls centred at the origin with radii Mn → ∞, and let Πn, Ξn be as in the proof of Theorem 4.3. Then,
\[ \Xi_n(\mathbb{R} \setminus B_n) \xrightarrow{\;P_0\;} 0, \qquad \text{and} \qquad \Bigl| \| \Pi_n - \Xi_n \| - \| \Pi_n^{B_n} - \Xi_n^{B_n} \| \Bigr| \xrightarrow{\;P_0\;} 0. \]
Proof Fix δ > 0. Recall that Ξn denotes the negative exponential distribution with scale γθ0,η0, located at Δn. Uniform tightness of Δn implies the existence of L > 0 such that,
\[ \sup_{n \ge 1} P_0^n\bigl( |\Delta_n| \ge L \bigr) \le \delta. \]
Define An = {ω : |Δn(ω)| ≥ L}. Let λ ∈ ℝ be given. Denote by Ξλ the negative exponential distribution with location parameter λ and scale parameter γθ0,η0. Since Ξλ is tight, for every ε > 0 there exists a constant L′ > 0 such that Ξλ(B(λ, L′)) ≥ 1 − ε. If |λ| ≤ L, then B(λ, L′) ⊂ B(0, L′ + L). Therefore, with M = L′ + L, Ξλ(B(0, M)) ≥ 1 − ε for all λ such that |λ| ≤ L. Choose N ≥ 1 such that Mn ≥ M for all n ≥ N. Let n ≥ N be given. Then,
\[ P_0^n\bigl( \Xi_n(\mathbb{R} \setminus B(0, M_n)) > \varepsilon \bigr) \le P_0^n(A_n) + P_0^n\bigl( \{ \Xi_n(\mathbb{R} \setminus B(0, M_n)) > \varepsilon \} \cap A_n^c \bigr) \le \delta + P_0^n\bigl( \{ \Xi_n(B(0, M_n)^c) > \varepsilon \} \cap A_n^c \bigr). \tag{21} \]
Note that on the complement of An, we observe |Δn| < L, so,
\[ \Xi_n\bigl( B(0, M_n)^c \bigr) \le 1 - \Xi_n\bigl( B(0, M) \bigr) \le 1 - \inf_{|\lambda| < L} \Xi_\lambda\bigl( B(0, M) \bigr) \le \varepsilon, \]
and we conclude that the last term on the right-hand side of (21) equals zero, so the first assertion of the lemma holds. Let B be a measurable subset of ℝ and let Φ be any probability measure on ℝ such that Φ(B) > 0. Then for any measurable A ⊂ ℝ we have,
\[ \Phi_B(A) = \frac{\Phi(A \cap B)}{\Phi(B)} = \frac{\Phi(A \cap B)\bigl( \Phi(B) + \Phi(\mathbb{R} \setminus B) \bigr)}{\Phi(B)} = \Phi(A \cap B) + \Phi(\mathbb{R} \setminus B)\, \Phi_B(A), \]
and hence,
\[ \bigl| \Phi(A) - \Phi_B(A) \bigr| = \bigl| \Phi\bigl( A \cap (\mathbb{R} \setminus B) \bigr) - \Phi(\mathbb{R} \setminus B)\, \Phi_B(A) \bigr| \le 2\, \Phi(\mathbb{R} \setminus B). \]
Therefore,
\[ \Bigl| \bigl( \Pi_n(A) - \Pi_n^B(A) \bigr) - \bigl( \Xi_n(A) - \Xi_n^B(A) \bigr) \Bigr| \le 2\bigl( \Pi_n(\mathbb{R} \setminus B) + \Xi_n(\mathbb{R} \setminus B) \bigr). \]
As a result of the triangle inequality, we then find that the difference in total-variation distances between Πn and Ξn on the one hand, and Πn^B and Ξn^B on the other, is bounded above by the expression on the right in the above display (which is independent of A).
Define Fn and Gn to be the events that Πn(Bn) > 0 and Ξn(Bn) > 0, respectively. On Ωn = Fn ∩ Gn, Πn^{Bn} and Ξn^{Bn} are well-defined probability measures. The first assertion of this lemma and condition (17) of Theorem 4.3 guarantee that P0^n(Ωn) converges to 1. Restricting attention to the event Ωn in the above upon substitution of the sequence (Bn), and using the first assertion and (17) again, we prove the second assertion of the lemma.
Proof (of Lemma 4.3)
Let M > 0 be given and define the set C = {h : −M ≤ h ≤ 0}. Denote the oP0(1) rest-term in the integral LAE expansion (16) by h ↦ Rn(h). By continuity of θ ↦ Sn(θ), the expansion holds uniformly over compacta for large enough n; in particular, sup_{h∈C} |Rn(h)| converges to zero in P0-probability. Let (Kn), Kn → ∞ be given. The events Bn = {sup_{h∈C} |Rn(h)| ≤ Kn/2} satisfy P0^n(Bn) → 1. Since ΠΘ is thick at θ0, there exists a π > 0 such that inf_{h∈C} dΠn/dh ≥ π, for large enough n. Therefore,
\[ P_0^n\Bigl( \int \frac{s_n(h)}{s_n(0)}\, d\Pi_n(h) \le e^{-K_n} \Bigr) \le P_0^n\Bigl( \Bigl\{ \int_C \frac{s_n(h)}{s_n(0)}\, dh \le \pi^{-1} e^{-K_n} \Bigr\} \cap B_n \Bigr) + o(1). \]
On Bn, the integral LAE expansion is lower bounded, so that, for large enough n,
\[ P_0^n\Bigl( \Bigl\{ \int_C \frac{s_n(h)}{s_n(0)}\, dh \le \pi^{-1} e^{-K_n} \Bigr\} \cap B_n \Bigr) \le P_0^n\Bigl( \int_C e^{h \gamma_{\theta_0,\eta_0}}\, dh \le \pi^{-1} e^{-K_n/2} \Bigr). \]
Since ∫_C e^{hγθ0,η0} dh ≥ M e^{−Mγθ0,η0} and Kn → ∞, we have e^{−Kn/2} ≤ π M e^{−Mγθ0,η0} for large enough n, so that the last probability vanishes. Combination of the above with Kn = −log an proves the desired result.
5.3 Proofs of Subsection 2.1
We first present properties of the map defining the nuisance space.
Lemma 5.2. Let α > S be fixed. Define H as the image of L under the map that takes ℓ̇ ∈ L into the densities η_ℓ̇ defined by (5) for x ≥ 0. This map is uniform-to-Hellinger continuous and the space H is a collection of probability densities that (i) are monotone decreasing with sub-exponential tails, (ii) are continuously differentiable on (0, ∞) and (iii) satisfy α − S ≤ η_ℓ̇(0) ≤ α + S. Moreover, the densities in H are (iv) Lipschitz with constant (α + S)² and (v) log-Lipschitz with constant α + S.
Proof One easily shows that ℓ̇ ↦ exp(−αx + ∫₀ˣ ℓ̇) is uniform-to-uniform continuous and that exp(−αx + ∫₀ˣ ℓ̇) > 0, which implies uniform-to-Hellinger continuity of the Esscher transform. For the properties of η_ℓ̇, note that |∫₀ˣ ℓ̇(y) dy| ≤ Sx < αx, so that x ↦ exp(−αx + ∫₀ˣ ℓ̇(t) dt) is sub-exponential, which implies that ℓ̇ ↦ η_ℓ̇ gives rise to a probability density. The density η_ℓ̇ is differentiable and monotone decreasing. Note that,
\[ \int_0^\infty \exp(-\alpha x - S x)\, dx \le \int_0^\infty \exp\Bigl( -\alpha x + \int_0^x \dot\ell(t)\, dt \Bigr)\, dx \le \int_0^\infty \exp(-\alpha x + S x)\, dx, \]
and the middle term equals η_ℓ̇(0)^{-1}. Moreover, η′_ℓ̇(x) = η_ℓ̇(x)(−α + ℓ̇(x)), so |η′_ℓ̇(x)| ≤ (α + S)², which proves the Lipschitz property. Furthermore, for all θ, θ0 ∈ Θ and all x ≥ θ0,
\[ \frac{\eta_{\dot\ell}(x - \theta)}{\eta_{\dot\ell}(x - \theta_0)} = \exp\Bigl( \alpha(\theta - \theta_0) + \int_{x-\theta_0}^{x-\theta} \dot\ell(t)\, dt \Bigr) \le e^{(\alpha+S)|\theta - \theta_0|}, \]
proving the log-Lipschitz property.
The proof of Theorem 2.2 consists of a verification of the conditions of Corollary 3.1. The
following lemmas make the most elaborate steps explicit.
Lemma 5.3. Hellinger covering numbers for H are finite, i.e. for all ρ > 0, N (ρ, H, dH ) < ∞.
Proof Given 0 < S < α, we define ρ0² = α − S > 0. Consider the distribution Q with Lebesgue density q > 0 given by q(x) = ρ0² e^{−ρ0² x} for x ≥ 0. Then the family F = {x ↦ (η_ℓ̇/q)^{1/2}(x) : ℓ̇ ∈ L} forms a subset of the collection of all monotone functions ℝ → [0, C], where C is fixed and depends on α and S. Referring to Theorem 2.7.5 in van der Vaart and Wellner (1996) [40], we conclude that the L2(Q)-bracketing entropy N[](ε, F, L2(Q)) of F is finite for all ε > 0. Noting that,
\[ d_H^2(\eta, \eta') = d_H^2\bigl( \eta_{\dot\ell}, \eta_{\dot\ell'} \bigr) = \int_{\mathbb{R}} \biggl( \sqrt{\frac{\eta_{\dot\ell}}{q}}(x) - \sqrt{\frac{\eta_{\dot\ell'}}{q}}(x) \biggr)^2\, dQ(x), \]
it follows that N(ρ, H, dH) = N(ρ, F, L2(Q)) ≤ N[](2ρ, F, L2(Q)) < ∞.
The following lemma establishes that condition (ii) of Corollary 3.1 is satisfied. Moreover,
assuming that the nuisance prior is such that L ⊂ supp(ΠL ), this lemma establishes that
ΠH (K(ρ)) > 0. This, together with the assertion of the previous lemma, verifies condition (i)
of Corollary 3.1.
Lemma 5.4. For every M > 0 there exist constants L1, L2 > 0 such that for small enough ρ > 0,
\[ \bigl\{ \eta_{\dot\ell} \in H : \| \dot\ell - \dot\ell_0 \|_\infty \le \rho^2 \bigr\} \subset K(L_1 \rho) \subset K_n(L_2 \rho, M). \]
Proof Let ρ, 0 < ρ < ρ0, and ℓ̇ ∈ L such that ‖ℓ̇ − ℓ̇0‖∞ ≤ ρ² be given. Then,
\[ \Bigl| \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(x) - \int_0^{x-\theta_0} (\dot\ell - \dot\ell_0)(t)\, dt \Bigr| \le \rho^2\, P_0(X - \theta_0) + O(\rho^4), \tag{22} \]
for all x ≥ θ0. Define, for all α > S and ℓ̇ ∈ L, the logarithm z of the normalising factor in (5). Then the relevant log-density-ratio can be written as,
\[ \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(x) = \int_0^{x-\theta_0} (\dot\ell - \dot\ell_0)(t)\, dt - z(\alpha, \dot\ell) + z(\alpha, \dot\ell_0), \]
where only the first term is x-dependent. Assume that ℓ̇ ∈ L is such that ‖ℓ̇ − ℓ̇0‖∞ < ρ². Then |∫_0^{y−θ0} (ℓ̇ − ℓ̇0)(t) dt| ≤ ρ²(y − θ0), so that z(α − ρ², ℓ̇0) ≤ z(α, ℓ̇) ≤ z(α + ρ², ℓ̇0). Noting that d^k z/dα^k(α, ℓ̇0) = (−1)^k P0(X − θ0)^k < ∞ and using the first-order Taylor expansion of z in α, we find, z(α ± ρ², ℓ̇0) = z(α, ℓ̇0) ∓ ρ² P0(X − θ0) + O(ρ⁴), and (22) follows. Next note
that, for every k ≥ 1,
\[ P_0 \Bigl| \int_0^{X-\theta_0} (\dot\ell - \dot\ell_0)(t)\, dt \Bigr|^k \le \rho^{2k} \int_{\theta_0}^\infty (x - \theta_0)^k\, dP_0 = \rho^{2k}\, P_0(X - \theta_0)^k. \tag{23} \]
Using (22) we bound the differences between KL divergences and integrals of scores as follows:
\[ \Bigl| \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(x) - \int_0^{x-\theta_0} (\dot\ell - \dot\ell_0)(t)\, dt \Bigr| \le \rho^2 \bigl( P_0(X - \theta_0) + O(\rho^2) \bigr), \]
\[ \biggl| \Bigl( \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} \Bigr)^2(x) - \Bigl( \int_0^{x-\theta_0} (\dot\ell - \dot\ell_0)(t)\, dt \Bigr)^2 \biggr| \le \rho^2 \bigl( P_0(X - \theta_0) + O(\rho^2) \bigr) \Bigl( 2 \Bigl| \int_0^{x-\theta_0} (\dot\ell - \dot\ell_0)(t)\, dt \Bigr| + \rho^2 \bigl( P_0(X - \theta_0) + O(\rho^2) \bigr) \Bigr), \]
and, combining with the bounds (23), we see that,
\[ -P_0 \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} \le 2\rho^2 \bigl( P_0(X - \theta_0) + O(\rho^2) \bigr), \qquad P_0 \Bigl( \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} \Bigr)^2 \le \rho^4 \bigl( P_0(X - \theta_0)^2 + 3 P_0(X - \theta_0) + O(\rho^2) \bigr), \]
which proves the first inclusion. Let M > 0. Note that Aθ,η = [θ, ∞) for every η, and that,
\[ \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}} \Bigr) = \sup_{|h| \le M} 1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_0,\eta_0}}{p_{\theta_n(h),\eta}} = \sup_{|h| \le M} 1_{A_{\theta_n(h),\eta}} \Bigl( \log \frac{p_{\theta_0,\eta}}{p_{\theta_n(h),\eta}} + \log \frac{p_{\theta_0,\eta_0}}{p_{\theta_0,\eta}} \Bigr) \le \frac{(\alpha+S)M}{n} + \log \frac{p_{\theta_0,\eta_0}}{p_{\theta_0,\eta}}, \]
where in the last step we use assertion (v) of Lemma 5.2, so that,
\[ P_0 \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}} \Bigr) \le -P_0 \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} + \frac{(\alpha+S)M}{n}, \]
\[ P_0 \Bigl( \sup_{|h| \le M} \Bigl( -1_{A_{\theta_n(h),\eta}} \log \frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}} \Bigr) \Bigr)^2 \le P_0 \Bigl( \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} \Bigr)^2 + \frac{2(\alpha+S)M}{n} \Bigl[ P_0 \Bigl( \log \frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} \Bigr)^2 \Bigr]^{1/2} + \frac{(\alpha+S)^2 M^2}{n^2}, \]
implying the existence of a constant L2.
We verify condition (iii) of Corollary 3.1 with the help of the following lemma.
Lemma 5.5. For a stochastic (hn), |hn| ≤ M, and r > 0,
\[ \sup_{\eta \in D(r)} \bigl| h_n \gamma_{\theta_0,\eta} - h_n \gamma_{\theta_0,\eta_0} \bigr| \le 2M(\alpha+S)\sqrt{r}. \]
Proof Recall that γθ0,η = η(0). Suppose |η(0) − η0(0)| > ρ for some ρ > 0. By Lemma 2.15 in [38], with C > 0,
\[ d_H(\eta, \eta_0) \ge \int_{\theta_0}^\infty \bigl| \eta(x - \theta_0) - \eta_0(x - \theta_0) \bigr|\, dx = \int_0^\infty \bigl| \eta(x) - \eta_0(x) \bigr|\, dx \ge \int_0^{C\rho} \bigl| \eta(x) - \eta_0(x) \bigr|\, dx. \]
By assertion (iv) of Lemma 5.2,
\[ \int_0^{C\rho} \bigl| \eta(x) - \eta_0(x) \bigr|\, dx \ge \int_0^{C\rho} \Bigl( \bigl| \eta(0) - \eta_0(0) \bigr| - \bigl| \eta(x) - \eta(0) \bigr| - \bigl| \eta_0(x) - \eta_0(0) \bigr| \Bigr)\, dx \]
\[ \ge \int_0^{C\rho} \Bigl( \bigl| \eta(0) - \eta_0(0) \bigr| - 2(\alpha+S)^2 x \Bigr)\, dx \ge C\rho^2 - (\alpha+S)^2 C^2 \rho^2. \]
Choosing C = 1/(2(α + S)²) and ρ = 2(α + S)√r proves the assertion of the lemma.
The following lemma shows that condition (iv) of Corollary 3.1 is also satisfied.
Lemma 5.6. For all bounded, stochastic sequences (hn), Hellinger distances between Pθn(hn),η and Pθ0,η are of order n^{−1/2} uniformly in η, i.e. sup_{η∈H} n^{1/2} d(P_{θn(hn),η}, P_{θ0,η}) = O(1).
Proof Fix n and ω; write hn for hn(ω). First we consider the case hn ≥ 0: for x ≥ θ0,
\[ \bigl( \eta^{1/2}(x - \theta_n(h_n)) - \eta^{1/2}(x - \theta_0) \bigr)^2 = \eta(x - \theta_0)\, 1_{[\theta_0, \theta_n(h_n)]}(x) + \bigl( \eta^{1/2}(x - \theta_n(h_n)) - \eta^{1/2}(x - \theta_0) \bigr)^2\, 1_{[\theta_n(h_n), \infty)}(x). \]
To upper bound the second term, we use the absolute continuity of η^{1/2}:
\[ \bigl| \eta^{1/2}(x - \theta_0) - \eta^{1/2}(x - \theta_n(h_n)) \bigr| = \frac{1}{2} \Bigl| \int_{x - \theta_0 - h_n/n}^{x - \theta_0} \frac{\eta'}{\eta^{1/2}}(y)\, dy \Bigr| \le \frac{1}{2} \int_0^{M/n} \Bigl| \frac{\eta'}{\eta^{1/2}} \Bigr|(z + x - \theta_n(h_n))\, dz, \]
and then by Jensen's inequality,
\[ \bigl( \eta^{1/2}(x - \theta_0) - \eta^{1/2}(x - \theta_n(h_n)) \bigr)^2 \le \frac{M}{4n} \int_0^{M/n} \frac{(\eta')^2}{\eta}(z + x - \theta_n(h_n))\, dz. \]
Similarly, for hn < 0 and x ≥ θn(hn),
\[ \bigl( \eta^{1/2}(x - \theta_0) - \eta^{1/2}(x - \theta_n(h_n)) \bigr)^2 \le \eta(x - \theta_n(h_n))\, 1_{[\theta_n(h_n), \theta_0]}(x) - \eta(x - \theta_n(-M))\, 1_{[\theta_n(-M), \theta_0]}(x) \]
\[ + \eta(x - \theta_n(-M))\, 1_{[\theta_n(-M), \theta_0]}(x) + \frac{M}{4n} \int_0^{M/n} \frac{(\eta')^2}{\eta}(z + x - \theta_0)\, dz\, 1_{[\theta_0, \infty)}(x). \]
Combining these results, we obtain a bound for the squared Hellinger distance:
\[ d^2(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) \le \int_{\theta_0}^{\theta_n(M)} \eta(x - \theta_0)\, dx + \int_{\theta_n(-M)}^{\theta_0} \eta(x - \theta_n(-M))\, dx \]
\[ + 1_{\{h_n < 0\}} \int_{\theta_n(h_n)}^{\theta_0} \eta(x - \theta_n(h_n))\, dx - 1_{\{h_n < 0\}} \int_{\theta_n(-M)}^{\theta_0} \eta(x - \theta_n(-M))\, dx \tag{24} \]
\[ + 1_{\{h_n \ge 0\}} \int_{\theta_n(h_n)}^{\infty} \frac{M}{4n} \int_0^{M/n} \frac{(\eta')^2}{\eta}(z + x - \theta_n(h_n))\, dz\, dx + 1_{\{h_n < 0\}} \int_{\theta_0}^{\infty} \frac{M}{4n} \int_0^{M/n} \frac{(\eta')^2}{\eta}(z + x - \theta_0)\, dz\, dx. \]
As for the first two terms on the right-hand side of (24), we note the following inequality:
\[ \int_{\theta_0}^{\theta_n(M)} \eta(x - \theta_0)\, dx + \int_{\theta_n(-M)}^{\theta_0} \eta(x - \theta_n(-M))\, dx \le 2\gamma_{\theta_0,\eta} \frac{M}{n} + \frac{M^2}{n^2} \int_0^\infty |\eta'(y)|\, dy, \]
by Lemma 5.16. Furthermore, by shifting appropriately, we find that the third and fourth terms of (24) satisfy the bound,
\[ 1_{\{h_n < 0\}} \biggl( \int_{\theta_n(h_n)}^{\theta_0} \eta(x - \theta_n(h_n))\, dx - \int_{\theta_n(-M)}^{\theta_0} \eta(x - \theta_n(-M))\, dx \biggr) = 1_{\{h_n < 0\}} \biggl( \int_0^{-h_n/n} \eta(y)\, dy - \int_0^{M/n} \eta(y)\, dy \biggr) = -1_{\{h_n < 0\}} \int_{-h_n/n}^{M/n} \eta(y)\, dy \le 0, \]
(where it is noted that the hn-dependent integral in the above display is well defined for any hn). Finally, the fifth and sixth terms of (24) are bounded by the Fisher information for location associated with η:
\[ \int_0^\infty \int_0^{M/n} \frac{(\eta')^2}{\eta}(z + x)\, dz\, dx = \int_0^{M/n} \int_z^\infty \frac{(\eta')^2}{\eta}(x)\, dx\, dz \le \frac{M}{n} \int_0^\infty \frac{(\eta')^2}{\eta}(x)\, dx. \]
Combining, we obtain the following upper bound for the relevant Hellinger distance,
\[ d^2(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) \le 2\gamma_{\theta_0,\eta} \frac{M}{n} + \frac{M^2}{n^2} \int_0^\infty \Bigl| \frac{\eta'(x)}{\eta(x)} \Bigr| \eta(x)\, dx + 2\frac{M^2}{n^2} \int_0^\infty \Bigl( \frac{\eta'(x)}{\eta(x)} \Bigr)^2 \eta(x)\, dx, \]
which proves the lemma upon noting that |η'(x)| = η(x)|ℓ̇(x) − α| ≤ η(x)(α + S).
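For a concrete density the n^{−1/2} rate of Lemma 5.6 can be seen in closed form. The sketch below uses η(x) = e^{−x} (formally ℓ̇ = 0, α = 1 — an illustrative choice of ours, not one singled out in the paper), for which the squared Hellinger distance between the density shifted by δ and the unshifted one equals 2 − 2e^{−δ/2}:

```python
import numpy as np

# Squared Hellinger distance between Exp(1) shifted by delta and Exp(1):
# 2 - 2*integral sqrt(p*q) = 2 - 2*exp(-delta/2), so d = O(sqrt(delta)).
M = 3.0
for n in [10**2, 10**4, 10**6]:
    delta = M / n                            # shift of size M/n, as with theta_n(h_n)
    d = np.sqrt(2 - 2 * np.exp(-delta / 2))
    print(n, np.sqrt(n) * d)                 # stabilizes near sqrt(M) as n grows
```

The rescaled distance n^{1/2} d tends to √M, in accordance with the O(1/n) bound on d² above.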
To verify condition (v) of Corollary 3.1 we now check condition (19) of Lemma 4.2.
Lemma 5.7. Let (Mn ), Mn → ∞, Mn ≤ n for n ≥ 1, Mn = o(n) be given. Then there exists
a constant C > 0 such that the condition of Lemma 4.2 is satisfied.
Proof Note first that for fixed x and η, the map θ ↦ pθ,η(x) is monotone increasing. Therefore,
\[ \sup_{\theta \in \Theta_n^c} \frac{1}{n} \log \prod_{i=1}^n \frac{p_{\theta,\eta}}{p_{\theta_0,\eta}}(X_i) \le \frac{1}{n} \log \prod_{i=1}^n \frac{\eta(X_i - \theta^*)}{\eta(X_i - \theta_0)}\, 1_{\{X_{(1)} \ge \theta^*\}}(X^n), \]
where θ* = X(1) if X(1) ≥ θ0 + Mn/n, or θ* = θ0 − Mn/n otherwise. We first note that X(1) < θ0 + Mn/n with probability tending to one. Indeed, shifting the distribution to θ = 0, we calculate,
\[ P_{0,\eta_0}^n\Bigl( X_{(1)} \ge \frac{M_n}{n} \Bigr) = \Bigl( 1 - \int_0^{M_n/n} \eta_0(x)\, dx \Bigr)^n \le \exp\Bigl( -n \int_0^{M_n/n} \eta_0(x)\, dx \Bigr). \]
By Lemma 5.16, the right-hand side of the above display is bounded further as follows,
\[ \exp\Bigl( -\gamma_{\theta_0,\eta_0} M_n + M_n \int_0^{M_n/n} |\eta_0'(x)|\, dx \Bigr) \le \exp\Bigl( -\frac{\gamma_{\theta_0,\eta_0}}{2} M_n \Bigr), \]
for large enough n. We continue with θ* = θ0 − Mn/n. By absolute continuity of η we have,
\[ \eta(X_i - \theta^*) = \eta(X_i - \theta_0) + \int_{X_i - \theta_0}^{X_i - \theta^*} \eta'(y)\, dy, \]
and the conditions on the nuisance η yield the following bound,
\[ \int_{X_i - \theta_0}^{X_i - \theta^*} \eta'(y)\, dy \le (\theta_0 - \theta^*)(S - \alpha)\, \eta(X_i - \theta_0). \]
Therefore,
\[ \frac{1}{n} \log \prod_{i=1}^n \frac{\eta(X_i - \theta^*)}{\eta(X_i - \theta_0)}\, 1_{\{X_{(1)} \ge \theta^*\}}(X^n) \le \log\Bigl( 1 - \frac{(\alpha - S)M_n}{n} \Bigr) \le -\frac{(\alpha - S)M_n}{n}. \]
If C < α − S, the condition of Lemma 4.2 is clearly satisfied.
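The tail bound for X(1) used at the start of this proof is easy to check by simulation. In the sketch below (an illustration with η0(x) = e^{−x}, so γθ0,η0 = 1; the constants n, Mn and the number of replications are arbitrary choices of ours), the exact tail probability P(X(1) ≥ Mn/n) = e^{−Mn} is compared with an empirical frequency:

```python
import numpy as np

rng = np.random.default_rng(2)
n, Mn, reps = 200, 5.0, 50000

# The minimum of n i.i.d. Exp(1) variables is itself Exp(n)-distributed,
# so we can sample X_(1) directly rather than simulating full samples.
mins = rng.exponential(1 / n, size=reps)
freq = np.mean(mins >= Mn / n)
print(freq, np.exp(-Mn))   # empirical frequency vs exact tail e^{-M_n}
```

The exact value e^{−Mn} is well below the displayed bound e^{−γθ0,η0 Mn/2} = e^{−Mn/2}, as the proof requires.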
To demonstrate that priors exist such that L ⊂ supp(ΠL ), an explicit construction based
on the distribution of Brownian sample paths is provided in the following lemma.
Lemma 5.8. Let S > 0 be given. Let {Wt : t ∈ [0, 1]} be Brownian motion on [0, 1] and let Z be independent and distributed N(0, 1). We define the prior ΠL on L as the distribution of the process,
\[ \dot\ell(t) = S\, \Psi\bigl( Z + W_{\Psi(t)} \bigr), \]
where Ψ : [−∞, ∞] → [−1, 1] : x ↦ 2 arctan(x)/π. Then L ⊂ supp(ΠL).
Proof Consider C[0, 1] with the uniform norm and its Borel σ-algebra, equipped with the law Π of t ↦ Z + Wt, as a probability space. Since Ψ is Lipschitz, the map f that takes C[0, 1] into C[0, ∞], Z + W· ↦ Z + WΨ(·), is continuous, norm-preserving, and Borel-to-Borel measurable. This enables the view of C[0, ∞] with its Borel σ-algebra as a probability space, with probability measure Π0(B) = Π(f^{-1}(B)). Similarly, the map g that takes C[0, ∞] into L, Z + WΨ(·) ↦ SΨ(Z + WΨ(·)), is continuous and Borel-to-Borel measurable. We view L with its Borel σ-algebra as a probability space, with probability measure ΠL(C) = Π0(g^{-1}(C)). Let T denote a closed set in L such that ΠL(T) = 1. Note that f^{-1}(g^{-1}(T)) is closed and Π(f^{-1}(g^{-1}(T))) = 1, so that supp(Π) ⊂ f^{-1}(g^{-1}(T)). Since the support of ΠL equals the intersection of all such T, supp(Π) ⊂ ∩_T f^{-1}(g^{-1}(T)) = f^{-1}(g^{-1}(supp(ΠL))). Since supp(Π) = C[0, 1], we find g(f(y)) ∈ supp(ΠL) for every y ∈ C[0, 1]. The image of C[0, 1] under g ∘ f lies dense in L and supp(ΠL) is closed, so supp(ΠL) includes L.
It is noted that many alternative constructions/priors would achieve similar results, for
instance SΨ(Z + e^{−t}W̃t), with {W̃t : t ∈ [0, ∞)} representing Brownian motion on [0, ∞).
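For intuition, the construction of Lemma 5.8 is easy to simulate. The sketch below (a simulation aid only; the grid, seed and value of S are our own arbitrary choices, not quantities from the paper) draws a path of ℓ̇(t) = SΨ(Z + W_{Ψ(t)}) on a grid in [0, 1] and checks that it takes values in (−S, S):

```python
import numpy as np

def sample_prior_path(S, t_grid, rng):
    """Draw one path of l_dot(t) = S * Psi(Z + W_{Psi(t)}) on a grid,
    where Psi(x) = 2*arctan(x)/pi and W is Brownian motion on [0, 1]."""
    psi = lambda x: 2.0 * np.arctan(x) / np.pi
    z = rng.standard_normal()                 # Z ~ N(0, 1), independent of W
    s = psi(t_grid)                           # time change: Psi(t) in [0, 1/2]
    # Brownian increments over the (increasing) time-changed grid:
    incr = rng.standard_normal(len(s)) * np.sqrt(np.diff(s, prepend=0.0))
    w = np.cumsum(incr)                       # W evaluated at Psi(t_0), ..., Psi(t_k)
    return S * psi(z + w)

rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 101)
path = sample_prior_path(2.0, t, rng)
assert path.shape == t.shape and np.all(np.abs(path) < 2.0)
```

Since Ψ maps into (−1, 1) on finite arguments, every sampled path is uniformly bounded by S, as the lemma requires of elements of L.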
5.4 Proofs of Subsection 2.2
Lemma 5.9. Define H as the image of L under the map that takes ℓ̇ ∈ L into densities η_ℓ̇ defined by (8) for x ∈ [0, 1]. This map is uniform-to-Hellinger continuous and the space H is a collection of probability densities that are (i) monotone increasing, (ii) continuously differentiable on [0, 1], (iii) satisfy 2S/(e^{2S} − 1) ≤ η_ℓ̇(x) ≤ e^{2S} for x ∈ [0, 1] and (iv) are Lipschitz with constant 2Se^{2S}. Moreover, (v) the resulting densities pθ,η satisfy the log-Lipschitz condition (in θ) in an ε-neighbourhood (ε < θ0/2) with constant (2 + 8S)/θ0.
Proof The proof is similar to the proof of Lemma 5.2 and is therefore omitted.
The proof of Theorem 2.3 consists of a verification of the conditions of Corollary 3.1 (after
the aforementioned modification to comply with the positive version of the LAE expansion).
The following lemmas make the most elaborate steps explicit, as in the proof of Theorem 2.2.
Lemma 5.10. Hellinger covering numbers for H are finite, i.e. for all ρ > 0, N(ρ, H, dH) < ∞.
Proof Denote by Q the distribution with density η0 = η_ℓ̇0. Then the family F = {x 7→ (η_ℓ̇/η0)^{1/2}(x) : ℓ̇ ∈ L} forms a subset of the collection C¹_M([0, 1]), where M is fixed and depends on S. Referring to Corollary 2.7.2 in [40], we conclude that the L2(Q)-bracketing entropy N_{[ ]}(ε, F, L2(Q)) of F is finite for all ε > 0. Similarly as before, it follows that N(ρ, H, dH) = N(ρ, F, L2(Q)) ≤ N_{[ ]}(2ρ, F, L2(Q)) < ∞.
The previous lemma together with the following one verifies conditions (i) and (ii) of Corollary 3.1.
Lemma 5.11. For every M > 0 there exist constants L1, L2 > 0 such that for small enough ρ > 0, {η_ℓ̇ ∈ H : ‖ℓ̇ − ℓ̇0‖∞ ≤ ρ²} ⊂ K(L1ρ) ⊂ Kn(L2ρ, M).
Proof Let ρ > 0 and ℓ̇ ∈ L such that ‖ℓ̇ − ℓ̇0‖∞ ≤ ρ² be given. Then,
$$ \Bigl|\log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(x) - \int_0^{x/\theta_0}(\dot\ell-\dot\ell_0)(t)\,dt\Bigr| \leq \rho^2\, P_0(X/\theta_0) + O(\rho^4), \tag{25} $$
for all x ∈ [0, θ0]. Define, for all α ∈ R and ℓ̇ ∈ L,
$$ z(\alpha,\dot\ell) = \log \int_0^1 e^{\alpha y + \int_0^y \dot\ell(t)\,dt}\,dy. $$
Then the relevant log-density-ratio can be written as,
$$ \log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}(x) = \int_0^{x/\theta_0}(\dot\ell-\dot\ell_0)(t)\,dt - z(S,\dot\ell) + z(S,\dot\ell_0), $$
where only the first term is x-dependent. Assume that ℓ̇ ∈ L is such that ‖ℓ̇ − ℓ̇0‖∞ < ρ². Then |∫₀ʸ (ℓ̇ − ℓ̇0)(t) dt| ≤ ρ²y, so that z(S − ρ², ℓ̇0) ≤ z(S, ℓ̇) ≤ z(S + ρ², ℓ̇0). Noting that dᵏz/dαᵏ(S, ℓ̇0) = P0(X/θ0)ᵏ < ∞ and using the first-order Taylor expansion of z in α, we find z(S ± ρ², ℓ̇0) = z(S, ℓ̇0) ± ρ²P0(X/θ0) + O(ρ⁴), and (25) follows. Next note that, for every k ≥ 1,
$$ P_0\Bigl(\int_0^{X/\theta_0}(\dot\ell-\dot\ell_0)(t)\,dt\Bigr)^{\!k} \leq \rho^{2k}\int_0^{\theta_0}\Bigl(\int_0^{x/\theta_0} dy\Bigr)^{\!k}\,dP_0 = \rho^{2k}\,P_0(X/\theta_0)^k. \tag{26} $$
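The derivative identity dz/dα(S, ℓ̇0) = P0(X/θ0) used in the Taylor expansion above can be checked numerically in the simplest case ℓ̇0 ≡ 0, assuming (as the normalising constant z suggests) that the corresponding density of Y = X/θ0 is proportional to e^{Sy} on [0, 1]; the value of S and the grid below are arbitrary illustrative choices:

```python
import numpy as np

def trap(fx, x):
    """Plain trapezoid rule (avoids naming differences across numpy versions)."""
    return float(np.sum((fx[1:] + fx[:-1]) * np.diff(x)) / 2.0)

S = 1.5
y = np.linspace(0.0, 1.0, 200_001)

def z(alpha):
    # z(alpha, 0) = log of int_0^1 exp(alpha * y) dy
    return np.log(trap(np.exp(alpha * y), y))

h = 1e-5
dz = (z(S + h) - z(S - h)) / (2 * h)      # central difference for dz/dalpha at S

w = np.exp(S * y)                          # unnormalised density of Y on [0, 1]
mean_y = trap(y * w, y) / trap(w, y)       # first moment, i.e. P0(X/theta_0)

assert abs(dz - mean_y) < 1e-6
```

Here z(α, 0) = log((e^α − 1)/α), so the numerical derivative can also be compared with the closed form e^S/(e^S − 1) − 1/S.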
Using (25) we bound the differences between KL divergences and integrals of scores and, combining with the bounds (26), we see that,
$$ -P_0 \log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} \leq 2\rho^2\bigl(P_0(X/\theta_0) + O(\rho^2)\bigr), $$
$$ P_0\Bigl(\log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}\Bigr)^{\!2} \leq \rho^4\bigl(P_0(X/\theta_0)^2 + 3P_0(X/\theta_0) + O(\rho^2)\bigr), $$
which proves the first inclusion. Let M > 0. Similarly as in the proof of Lemma 5.4 we can
show that,
$$ P_0 \sup_{|h|\leq M}\Bigl(-1_{A_{\theta_n(h),\eta}}\log\frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}}\Bigr) \leq -P_0\log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}} + \frac{2+8S}{\theta_0}\,\frac{M}{n}, $$
$$ P_0 \sup_{|h|\leq M}\Bigl(-1_{A_{\theta_n(h),\eta}}\log\frac{p_{\theta_n(h),\eta}}{p_{\theta_0,\eta_0}}\Bigr)^{\!2} \leq P_0\Bigl(\log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}\Bigr)^{\!2} + \frac{4+16S}{\theta_0}\,\frac{M}{n}\Bigl[P_0\Bigl(\log\frac{p_{\theta_0,\eta}}{p_{\theta_0,\eta_0}}\Bigr)^{\!2}\Bigr]^{1/2} + \frac{(2+8S)^2}{\theta_0^2}\,\frac{M^2}{n^2}, $$
implying the existence of a constant L2.
We verify condition (iii) of Corollary 3.1 with the help of the following lemma.
Lemma 5.12. For a stochastic (hn), |hn| ≤ M, and r > 0,
$$ \sup_{\eta\in D(r)}\bigl|h_n\gamma_{\theta_0,\eta} - h_n\gamma_{\theta_0,\eta_0}\bigr| \leq \frac{3Me^S\sqrt{Sr}}{\theta_0}. $$
Proof Recall γθ0,η = η(1)/θ0. Suppose |η(1) − η0(1)| > ρ for some ρ > 0. By Lemma 2.15 in [38], with C > 0,
$$ d_H(\eta,\eta_0) \geq \int_0^{\theta_0}\frac{1}{\theta_0}\bigl|\eta(x/\theta_0)-\eta_0(x/\theta_0)\bigr|\,dx = \int_0^1\bigl|\eta(x)-\eta_0(x)\bigr|\,dx \geq \int_{1-C\rho}^1\bigl|\eta(x)-\eta_0(x)\bigr|\,dx. $$
By assertion (iv) of Lemma 5.9,
$$ \int_{1-C\rho}^1 \bigl|\eta(x)-\eta_0(x)\bigr|\,dx \geq \int_{1-C\rho}^1 \Bigl(\bigl|\eta(1)-\eta_0(1)\bigr| - \bigl|\eta(x)-\eta(1)\bigr| - \bigl|\eta_0(x)-\eta_0(1)\bigr|\Bigr)\,dx $$
$$ \geq \int_{1-C\rho}^1 \Bigl(\bigl|\eta(1)-\eta_0(1)\bigr| - 4Se^{2S}(1-x)\Bigr)\,dx \geq C\rho^2 - 2Se^{2S}C^2\rho^2. $$
Choosing C = 1/(3Se^{2S}) and ρ = 3e^S√(Sr) proves the assertion of the lemma.
The following lemma shows that condition (iv) of Corollary 3.1 is also satisfied.
Lemma 5.13. For all bounded, stochastic (hn), Hellinger distances between Pθn(hn),η and Pθ0,η are of order n^{−1/2} uniformly in η, i.e.,
$$ \sup_{\eta\in H}\, n^{1/2}\, d(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) = O(1). $$
Proof Note that the elements of the nuisance space H are uniformly bounded by e^{2S}. Fix n and ω; write hn for hn(ω). First we consider the case that hn ≥ 0,
$$ \Bigl(\frac{\eta^{1/2}(x/\theta_n(h_n))}{\theta_n(h_n)^{1/2}} - \frac{\eta^{1/2}(x/\theta_0)}{\theta_0^{1/2}}\Bigr)^{\!2} = \frac{\eta(x/\theta_n(h_n))}{\theta_n(h_n)}\,1_{[\theta_0,\theta_n(h_n)]}(x) + \Bigl(\frac{\eta^{1/2}(x/\theta_n(h_n))}{\theta_n(h_n)^{1/2}} - \frac{\eta^{1/2}(x/\theta_0)}{\theta_0^{1/2}}\Bigr)^{\!2} 1_{[0,\theta_0]}(x). $$
Note that the first term is bounded from above by (e^{2S}/θ0) 1_{[θ0,θn(M)]}(x). To upper bound the second term, we use the absolute continuity of η^{1/2}. Let g(y) = (η(x/y)/y)^{1/2}; then
$$ \Bigl|\frac{\eta^{1/2}(x/\theta_n(h_n))}{\theta_n(h_n)^{1/2}} - \frac{\eta^{1/2}(x/\theta_0)}{\theta_0^{1/2}}\Bigr| = \Bigl|\int_{\theta_0}^{\theta_n(h_n)} g'(y)\,dy\Bigr| \leq \int_{\theta_0}^{\theta_n(M)} \bigl|g'(y)\bigr|\,dy. $$
By the definition of the nuisance space, for y ∈ [θ0, θn(M)] and x ≤ θ0,
$$ |g'(y)| \leq \frac{e^S}{\theta_0^{3/2}}(S+1), $$
and then,
$$ \Bigl(\frac{\eta^{1/2}(x/\theta_n(h_n))}{\theta_n(h_n)^{1/2}} - \frac{\eta^{1/2}(x/\theta_0)}{\theta_0^{1/2}}\Bigr)^{\!2} \leq \frac{M^2 e^{2S}}{n^2\theta_0^3}(S+1)^2. $$
Similarly, for hn < 0,
$$ \Bigl(\frac{\eta^{1/2}(x/\theta_n(h_n))}{\theta_n(h_n)^{1/2}} - \frac{\eta^{1/2}(x/\theta_0)}{\theta_0^{1/2}}\Bigr)^{\!2} \leq \frac{e^{2S}}{\theta_0 - M/n}\,1_{[\theta_n(-M),\theta_0]}(x) + \frac{M^2}{n^2}\,\frac{e^{2S}}{(\theta_0 - M/n)^3}\Bigl(\frac{S\theta_0}{\theta_0 - M/n} + 1\Bigr)^{\!2} 1_{[0,\theta_0]}(x). $$
Combining these results, we obtain a bound for the squared Hellinger distance:
$$ d^2(P_{\theta_n(h_n),\eta}, P_{\theta_0,\eta}) \leq \frac{Me^{2S}}{n\theta_0} + \frac{Me^{2S}}{n\theta_0 - M} + \frac{M^2 e^{2S}}{n^2\theta_0^2}(S+1)^2 + \frac{M^2}{n^2}\,\frac{e^{2S}\theta_0}{(\theta_0 - M/n)^3}\Bigl(\frac{S\theta_0}{\theta_0 - M/n} + 1\Bigr)^{\!2}. $$
To verify condition (v) of Corollary 3.1 we now check condition (19) of Lemma 4.2.
Lemma 5.14. Let (Mn ), Mn → ∞, Mn ≤ n for n ≥ 1, Mn = o(n) be given. Then there
exists a constant C > 0 such that the condition of Lemma 4.2 is satisfied.
Proof The proof of this lemma is similar to the proof of Lemma 5.7. We only note that by absolute continuity of η we have,
$$ \frac{\eta(X_i/\theta^*)}{\theta^*} = \frac{\eta(X_i/\theta_0)}{\theta_0} + \int_{\theta_0}^{\theta^*} g'(y)\,dy, $$
where g(y) = η(Xi/y)/y, and,
$$ g'(y) = \frac{X_i}{y}\,\eta'(X_i/y)\Bigl(-\frac{1}{y^2}\Bigr) + \eta(X_i/y)\Bigl(-\frac{1}{y^2}\Bigr) \leq \eta(X_i/y)\Bigl(-\frac{1}{y^2}\Bigr). $$
To demonstrate that priors exist such that L ⊂ supp(ΠL ), an explicit construction based
on the distribution of Brownian sample paths is provided in the following simplified version
of Lemma 5.8.
Lemma 5.15. Let S > 0 be given. Let {Wt : t ∈ [0, 1]} be Brownian motion on [0, 1] and let
Z be independent and distributed N (0, 1). We define the prior ΠL on L as the distribution
of the process,
$$ \dot\ell(t) = S\,\Psi(Z + W_t), $$
where Ψ : [−∞, ∞] → [−1, 1] : x 7→ 2 arctan(x)/π. Then L ⊂ supp ΠL .
Proof The proof is similar to the proof of Lemma 5.8 and is omitted.
Lemma 5.16. For every differentiable η and ε > 0 the following inequalities hold:
$$ \varepsilon\Bigl(\eta(0) - \int_0^\varepsilon |\eta'(x)|\,dx\Bigr) \leq \int_0^\varepsilon \eta(x)\,dx \leq \varepsilon\Bigl(\eta(0) + \int_0^\varepsilon |\eta'(x)|\,dx\Bigr). $$
Proof Integration by parts yields,
$$ \int_0^\varepsilon \eta(x)\,dx = \varepsilon\,\eta(0) + \int_0^\varepsilon (\varepsilon - x)\,\eta'(x)\,dx. $$
Since −ε|η′(x)| ≤ (ε − x)η′(x) ≤ ε|η′(x)| for x ∈ [0, ε], the assertion holds.
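The two-sided bound of Lemma 5.16 is easy to verify numerically for a concrete differentiable function; the choice η(x) = 1 + sin(3x)/2 below is an arbitrary smooth stand-in for illustration, not a density from the paper:

```python
import numpy as np

# Check: eps*(eta(0) - int_0^eps |eta'|) <= int_0^eps eta <= eps*(eta(0) + int_0^eps |eta'|)
eta = lambda x: 1.0 + 0.5 * np.sin(3.0 * x)
abs_deta = lambda x: np.abs(1.5 * np.cos(3.0 * x))   # |eta'(x)|

def integrate(f, a, b, k=100_001):
    """Trapezoid rule on a uniform grid of k points."""
    x = np.linspace(a, b, k)
    fx = f(x)
    return float(np.sum((fx[1:] + fx[:-1]) * np.diff(x)) / 2.0)

for eps in [0.01, 0.1, 0.5, 1.0]:
    middle = integrate(eta, 0.0, eps)
    slack = integrate(abs_deta, 0.0, eps)
    assert eps * (eta(0.0) - slack) <= middle <= eps * (eta(0.0) + slack)
```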
Acknowledgments
B. J. K. Kleijn would like to thank the Statistics Department of Seoul National University,
South Korea and the Dipartimento di Scienze delle Decisioni, Università Bocconi, Milano,
Italy for their kind hospitality. B. T. Knapik was supported by the Netherlands Organization
for Scientific Research NWO and ANR Grant ‘Banhdits’ ANR–2010–BLAN–0113–03. He
would also like to acknowledge the support of CEREMADE, Université Paris-Dauphine and
ENSAE–CREST, and thank I. Castillo, J. Rousseau and A. Tsybakov for insightful comments.
References
[1] P. Bickel and B. Kleijn, The semiparametric Bernstein–von Mises theorem, Ann. Statist. 40 (2012),
206–237.
[2] D. Bontemps, Bernstein-von Mises theorems for Gaussian regression with increasing number of regressors, Ann. Statist. 39 (2011), 2557–2584.
[3] P. De Blasi and N. Hjort, The Bernstein-von Mises theorem in semiparametric competing risk models,
J. Statist. Plann. Inference 139 (2009), 2316–2328.
[4] S. Boucheron and E. Gassiat, A Bernstein–von Mises theorem for discrete probability distributions,
Electron. J. Statist. 3 (2009), 114–148.
[5] I. Castillo, A semiparametric Bernstein–von Mises theorem for Gaussian process priors, Prob. Theory
Rel. Fields 152 (2012), 53-99.
[6] I. Castillo, Semiparametric Bernstein–von Mises theorem and bias, illustrated with Gaussian process
priors, Sankhya A 74 (2012), 194–221.
[7] G. Cheng and M. Kosorok, General frequentist properties of the posterior profile distribution, Ann.
Statist. 36 (2008), 1819–1853.
[8] V. Chernozhukov and H. Hong, Likelihood estimation and inference in a class of nonregular econometric models, Econometrica 72 (2004), 1445–1480.
[9] S. Ghosal, Asymptotic normality of posterior distributions in high-dimensional linear models, Bernoulli
5 (1999), 315–331.
[10] S. Ghosal, Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity, J. Multivariate Anal. 74 (2000), 49–68.
[11] S. Ghosal, J. Ghosh and A. van der Vaart, Convergence rates of posterior distributions, Ann. Statist.
28 (2000), 500–531.
[12] J. Hájek, A characterization of limiting distributions of regular estimates, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14 (1970), 323–330.
[13] J. Hájek, Local asymptotic minimax and admissibility in estimation, Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability 1, 175–194. University of California Press,
Berkeley (1972).
[14] P. Hall and I. van Keilegom, Nonparametric “regression” when errors are positioned at end-points,
Bernoulli 15 (2009), 614–633.
[15] K. Hirano and J. Porter, Asymptotic efficiency in parametric structural models with parameter-dependent support, Econometrica 71 (2003), 1307–1338.
[16] I. Ibragimov and R. Has’minskii, Statistical estimation: asymptotic theory, Springer, New York (1981).
[17] I. Johnstone, High dimensional Bernstein–von Mises: simple examples, in Borrowing Strength: Theory
Powering Applications – A Festschrift for Lawrence D. Brown, (eds. J. Berger, T. Cai and I. Johnstone)
IMS, Beachwood (2010), 87–98.
[18] R. de Jonge and H. van Zanten, Semiparametric Bernstein-von Mises for the error standard deviation,
Electron. J. Stat. 7 (2013), 217-243.
[19] Y. Kim and J. Lee, A Bernstein–von Mises theorem in the nonparametric right-censoring model, Ann. Statist. 32 (2004), 1492–1512.
[20] Y. Kim, The Bernstein–von Mises theorem for the proportional hazard model, Ann. Statist. 34 (2006), 1678–1700.
[21] Y. Kim, A Bernstein-von Mises theorem for doubly censored data, Statistica Sinica 19 (2009), 581–595.
[22] B. Kleijn, Bayesian asymptotics under misspecification. PhD. Thesis, Free University Amsterdam
(2003).
[23] B. Kleijn and A. van der Vaart, The Bernstein-Von-Mises theorem under misspecification. Electron.
J. Statist. 6 (2012), 354–381.
[24] B.T. Knapik, A.W. van der Vaart and J.H. van Zanten, Bayesian inverse problems with Gaussian
priors, Ann. Statist. 39 (2011), 2626–2657.
[25] W. Kruijer and J. Rousseau, Bayesian semi-parametric estimation of the long-memory parameter
under FEXP-priors, Electron. J. Stat. 7 (2013), 2947–2969.
[26] H. Leahu, On the Bernstein-von Mises phenomenon in the Gaussian white noise model, Electron. J.
Statist. 5 (2011), 373-404.
[27] L. Le Cam, On some asymptotic properties of maximum-likelihood estimates and related Bayes estimates,
University of California Publications in Statistics, 1 (1953), 277–330.
[28] L. Le Cam, Limits of Experiments, Proceedings of the Sixth Berkeley Symposium on Mathematical
Statistics and Probability 1, 245–261. University of California Press, Berkeley (1972).
[29] L. Le Cam, Convergence of Estimates Under Dimensionality Restrictions, Ann. Statist. 1 (1973), 38–53.
[30] L. Le Cam, Asymptotic methods in statistical decision theory, Springer, New York (1986).
[31] L. Le Cam and G. Yang, Asymptotics in Statistics: some basic concepts, Springer, New York (1990).
[32] L. Le Cam, Maximum Likelihood: An Introduction, preprint, U.C. Berkeley, Dept. of Statistics (1990).
[33] B. U. Park, Efficient estimation in the two-sample semiparametric location-scale models, Probab. Th.
Rel. Fields 86 (1990), 21–39.
[34] V. Rivoirard and J. Rousseau, Bernstein–von Mises theorem for linear functionals of the density, Ann.
Statist. 40 (2012), 1489–1523.
[35] L. Schwartz, On Bayes procedures, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 4
(1965), 10–26.
[36] X. Shen, Asymptotic normality of semiparametric and nonparametric posterior distributions, Journal of
the American Statistical Association 97 (2002), 222–235.
[37] C. Stein, Efficient non-parametric testing and estimation, Proceedings of the Third Berkeley Symposium
on Mathematical Statistics and Probability 1, 187–196. University of California Press, Berkeley (1956).
[38] H. Strasser, Mathematical Theory of Statistics: Statistical Experiments and Asymptotic Decision Theory, Walter de Gruyter, Berlin (1985).
[39] A. van der Vaart, Asymptotic Statistics, Cambridge University Press, Cambridge (1998).
[40] A. van der Vaart and J. Wellner, Weak convergence and empirical processes, Springer Series in
Statistics, Springer-Verlag, New York (1996).
[41] J. Wolfowitz, Asymptotically Efficient Non-parametric Estimators of Location and Scale Parameters.
II, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 30 (1974), 117–128.