Università degli Studi di Milano
Facoltà di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Magistrale in Matematica
A novel class of priors for
edge-preserving methods in
Bayesian Inversion
Advisor: Prof. Vincenzo CAPASSO
Co-advisor: Prof. Martin BURGER
Working Group Imaging
Institute for Computational and Applied Mathematics
University of Münster (D)
Thesis by
Silvia COMELLI
Student ID 771763
Academic Year 2010-2011
Acknowledgements
I would like to thank all the people who made this thesis project possible,
especially:
Prof. Vincenzo Capasso, for helping me to find my way in the math research world. His generous and fascinating way of advising students
and helping them in finding their own way is something unique and
precious;
Prof. Martin Burger, for giving me the opportunity to work with him, as learning from a great researcher is the best way of approaching and solving a math problem, and for his more than friendly welcome into his working group;
Felix Lucka, for his help and assistance on the computational side;
Eric Edwards, for having proofread the thesis, checking in particular my
English;
all the people in the Imaging Workgroup of the University of Münster for being like a family to me during my stay at their university.
Contents

Introduction

1 Inverse Problems and Bayesian Inversion
  1.1 Basics on Bayesian Inversion
    1.1.1 The prior
    1.1.2 The joint probability density
    1.1.3 The likelihood function
    1.1.4 The posterior distribution
    1.1.5 Bayesian formulation of inverse problems
    1.1.6 The MAP and CM estimates

2 Discretization Invariance in Bayesian inversion
  2.1 The problem of discretization
  2.2 Discretization invariance
  2.3 Can one use total variation prior for edge-preserving Bayesian inversion?
    2.3.1 Convergence of MAP estimates
    2.3.2 Convergence of CM estimate

3 New proposal for the priors
  3.1 A new setting for our problem
  3.2 Behavior of MAP estimates
  3.3 Behavior of CM estimates
    3.3.1 Case q = 2
    3.3.2 Case general q

4 Computational results
  4.1 The model problem
  4.2 Computation of the MAP estimates
    4.2.1 The TV MAP estimates
    4.2.2 The q-TV MAP estimates
  4.3 Computation of the CM estimates
  4.4 Results
    4.4.1 Results for the MAP estimates
    4.4.2 Results for the CM estimates
    4.4.3 Summary

Conclusions

Bibliography
Introduction
This thesis addresses Bayesian modeling and discretization in inverse problems related to imaging.
More in detail, Chapter 1 contains some basics on Bayesian inversion. The main idea behind this particular approach is that the uncertainty about the unknown of an inverse problem introduces a certain component of randomness. This randomness can be expressed by modeling all the variables as random variables, and it is managed in the following way: all the qualitative features of the unknown that are considered essential are encoded in the quantitative language of the so called prior distribution. In other words, this distribution encodes all the important information we have about the unknown before performing the measurement. On the other hand, the so called posterior distribution expresses the knowledge we have about the unknown after the measurement data have been collected. Bayes' formula is the link between the prior and the posterior distribution. By applying this formula, we are able, starting from the posterior distribution, to produce some estimates of the unknown that represent the most probable values of the unknown given the data and the prior information. In particular, in this work we use the maximum a posteriori (MAP) and conditional mean (CM) estimates in order to reconstruct our images, exploiting image restoration methods that are focused on good reconstruction of the edges, i.e. edge-preserving methods. This property, modeled in the prior distribution, has been obtained through the use of total variation in the priors.
In Chapter 2 we present the results described in [Lassas and Siltanen, 2004]¹. In this paper, the authors contributed to the ongoing discussion about discretization invariance in the Bayesian approach to inverse problems. In fact, as already mentioned, the Bayesian solution of an inverse problem involves discretizing the problem and expressing a priori information in the form of a prior distribution in a finite dimensional space. In [Lassas and Siltanen, 2004] the authors study the convergence properties for the MAP and CM estimates of the following class of edge-preserving priors: given a random variable U from a probability space (Ω, F, P) to the space of continuous functions on the unit interval
\[
Y = C_0([0,1]) = \{u \in C([0,1]) : u(0) = u(1) = 0\},
\]
they consider its discretizations U_n as the Y_n-valued random variables
\[
U_n(\omega, t) = \sum_{j=1}^{n} U_j^n(\omega)\, \psi_j^n(t),
\]
where Y_n is the n-dimensional space of piecewise linear functions and {ψ_j^n}_{j=1}^n is the roof-top basis for Y_n, i.e. ψ_j^n ∈ Y_n satisfies ψ_j^n(k/(n+1)) = δ_jk for j = 1, ..., n and k = 0, ..., n+1. They claim U_n to have total variation prior distributions if the random vector [U_1^n, ..., U_n^n]^T is distributed as
\[
\pi_{pr}^{(n)}(u_1^n, \ldots, u_n^n) = C_n \exp\left(-\alpha_n \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|\right),
\]
where u_0^n = u_{n+1}^n = 0, C_n is a normalization constant and α_n is a parameter. The main results obtained in [Lassas and Siltanen, 2004] are: if α_n is constant, then the MAP estimates converge to some function ũ of bounded variation on the unit interval (i.e. in BV[0,1]) such that ũ(0) = ũ(1) = 0, in the weak*-topology of BV[0,1]. On the other hand, if α_n is proportional to √(n+1), then the CM estimates converge to a Gaussian random variable, thus losing their edge-preserving property with very fine discretizations of the model space; in other words they are not discretization invariant. In all other cases we have either convergence to a trivial limit or even no convergence.

¹ M. Lassas and S. Siltanen, Can One Use Total Variation Prior for Edge-Preserving Bayesian Inversion?, Inverse Problems, 20, 2004.
With the goal of addressing the open questions formulated in [Lassas and Siltanen, 2004], in Chapter 3 we look for a novel class of edge-preserving priors that are also discretization invariant. We begin by introducing a class of distributions with smaller tails, expected to be discretization invariant. We thus introduce the following novel class of priors: given the Y-valued random variable U and its discretizations U_n, which are Y_n-valued random variables, we claim U_n to have q-total variation prior distributions, q ∈ N, if the random vector [U_1^n, ..., U_n^n]^T is distributed as
\[
\pi_{pr}^{(n,q)}(u_1^n, \ldots, u_n^n) = C_{q,n} \exp\left(-\alpha_n \left(\sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|\right)^{\!q}\,\right),
\]
where u_0^n = u_{n+1}^n = 0 and C_{q,n} is a normalization constant.
Regarding the MAP estimates, we obtained the same results derived in [Lassas and Siltanen, 2004]. In fact, for the MAP estimates we are always able, without loss of generality, to reduce the case of q-total variation priors to the case of total variation priors, with some restrictions on the parameters used. Hence, all the results valid in the case considered in [Lassas and Siltanen, 2004] are confirmed in the novel case. On the other hand, we studied the problem of convergence of the CM estimates. In the case q = 2, α_n constant, we proved that the CM estimates converge in the weak topology of L^p([0,1]) for 1 < p < ∞ to some uncharacterized limit function. This is a reasonable result for a BV limit, since it should be recalled that Y_n ⊂ L^p([0,1]) and that BV([0,1]) ⊂ L^p([0,1]) for any 1 ≤ p ≤ ∞. In further detail, we proved that for any 1 ≤ p ≤ ∞,
\[
E\big(\| U_n(\omega,\cdot) \|_{L^p([0,1])}\big) \le E\big(\| U_n(\omega,\cdot) \|_{L^\infty([0,1])}\big) < K,
\]
where K is a constant independent of n. Thus, using Markov's inequality and the Banach-Alaoglu theorem, we can prove that the family of random variables {U_n}_{n∈N} is tight in the weak topology of L^p([0,1]) for any 1 < p < ∞. Thus we obtain weak L^p([0,1])-convergence of a subsequence of {U_n}_{n∈N} for any 1 < p < ∞, and therefore of a subsequence of the CM estimates. The interesting result is that there is no reason to expect the limit function to be Gaussian, and thus the 2-total variation priors with α_n constant might be effectively discretization invariant. For all the other choices of α_n, in the case q = 2, we obtain either convergence to a trivial limit or no convergence.
In the case q ≠ 2, the right scaling for α_n was not thoroughly characterized, though we were able to produce some estimates limiting the range for the critical value of α_n.

Finally, in Chapter 4, we illustrate our theoretical findings by numerical examples with computer simulated data. In particular, from the numerical analysis we found that in the case q = 10, the CM estimates converge to some limit function if we choose α_n constant. This might suggest that the right scaling is α_n constant even in the case q ≠ 2.
Chapter 1
Inverse Problems and Bayesian Inversion
This work focuses on the discretization problems that occur in Bayesian inversion. To achieve a thorough understanding of what will follow, it is therefore
important to have a background knowledge of Bayesian techniques in solving
inverse problems. This chapter will thus recall the main concepts of inverse
problems and their Bayesian formulation.
Inverse problems are typically encountered in situations where one has to
deal with directly observable quantities alongside other quantities that cannot be observed. In inverse problems, some of the unobservable quantities are
of primary interest. These quantities depend on each other through models.
Trying to invert the model while ensuring existence, uniqueness and well-posedness of the solution is the main goal of the study of inverse problems.
Historically, there are two different frameworks in which inverse problems are
studied.
The classical framework deals with the solution of an inverse problem primarily concentrating on commonplace analytic techniques. In particular,
since many inverse problems are characterized by ill-posedness in the sense
of Hadamard (i.e. they might not have a solution in the strict sense, solutions might not be unique and/or might not depend continuously on the
data), this approach focuses on regularization techniques. These techniques
aim at producing a reasonable estimate of the solution. They try to obtain
this result by introducing a slight modification of the original problem to try
to make it stable. This approach introduces some error to the solution and
therefore the key point of these methods is to try to keep the modification as
small as possible. Also, a main trait of these methods is that the magnitude
of the noise, i.e. the noise level, can always be accurately estimated, and in
this sense we can say that standard regularization methods are always known
to converge with respect to the noise level.
A completely different approach to inverse problems is the Bayesian framework. This work will exclusively deal with inverse problems by using this particular approach. Namely, statistical inversion theory, through Bayesian statistics, reformulates inverse problems as problems of statistical inference based on prior knowledge. The new idea of this approach is that it does not, as usually happens with regularization methods, just produce a solution or a single point estimate for the problem by an ad hoc removal of the ill-posedness of the problem, but it solves the inverse problem systematically in such a way that all the information available is properly incorporated into the model. In this sense the Bayesian solution to an inverse problem is a probability distribution. This distribution is then used to assess the reliability of any value of the solution, and this is in fact the main difference between the Bayesian and other approaches.
1.1 Basics on Bayesian Inversion
We would now like to introduce some basic concepts used in Bayesian inversion; we will follow the presentation contained in Chapter 3 of [18]. As we have already said, the Bayesian inversion setup reformulates inverse problems as problems of statistical inference based on prior knowledge. Thus, to summarize, the main features
of this new formulation are:
- All the variables involved in the inverse problem are modelled as random variables defined on a common complete probability space (Ω, F, P);
- By describing them as random variables, we assign to the variables
of our model a certain amount of randomness, which reflects our lack
of information concerning their realizations;
- We encode all the information we have about these random variables
in the probability distributions;
- The solution of an inverse problem is a posterior probability distribution.
These are obviously only the main guidelines of the Bayesian formulation of inverse problems.
Coming to details, let us introduce an inverse problem and reformulate it
in Bayesian terms. Assume that we are measuring a quantity y ∈ RM but
that we are actually interested in another quantity x ∈ RN . Since perfect
measurements do not exist, there always exists a certain component of noise.
It is also possible that some parameters involved in the problem are not
completely known. All the quantities just mentioned are involved in the
inverse problem and tied together by a specific model.
Let us suppose that their relation can be expressed as follows:
y = f (x, e)
where f : RN × RK → RM represents the mutual dependence existing between the quantities and e ∈ RK is a vector containing all the poorly known
parameters as well as the measurement noise. To reformulate this inverse
problem as a statistical problem, let us replace all the involved quantities by
the appropriate random variables, which we will indicate by capital letters.
Thus we obtain the following stochastic equation:
Y = f (X, E).
(1.1)
This relation among the random variables X, Y and E enforces a relation
among the corresponding distributions.
Notation. As already anticipated, we henceforth distinguish random variables from their realizations by denoting the former with capital letters and
the latter with small letters.
Let us now introduce some new terms that will help us in characterizing our
model.
Definition 1.1.1 (Measurements, data, unknown, parameters, noise).
We call the directly observable random variable Y the measurement, and its realization y in the actual measurement process the data. The non-observable random variable X that is of primary interest is called the unknown. Those random variables that are neither observable nor of primary interest are called parameters or noise, depending on the setting.
Since we decided to represent all the quantities as random variables, it now
seems obvious to assign to each of them a probability distribution. Since the
knowledge we have about X is strictly related to our knowledge about Y ,
we can define different probability densities which may take into account the
mutual relation among the random variables.
1.1.1 The prior
In most cases, before performing the measurement, we already have some
knowledge about the random variable X.
Definition 1.1.2 (Prior density).
The information we have about the variable X before performing the measurement can be encoded in a probability density x → πpr (x) called the prior
density.
Remark. In the Bayesian theory of inverse problems, choosing the prior is
one of the most crucial steps and also the most challenging part of the solution. This is due to the fact that often we only have a qualitative notion
of the unknown’s principal features. This qualitative notion can be gathered
in many different ways. Any relevant visual characteristic of the image we are taking into consideration is suitable for the prior, including but not limited to texture, light intensity, and boundary structure. For example, in medical
imaging a researcher could be interested in looking for a cancer tumor, where
the position and main characteristics of the tumor may be known. The hardest task of the researcher is translating this qualitative information into the
quantitative language of the distributions. The general guideline in designing
priors is that the prior itself should be concentrated on those values of x that
we expect, assigning a higher probability to them than to those that we do
not expect to see.
In this work we will exclusively be concerned with methods that are focused
on good reconstruction of the edges, i.e. edge-preserving methods. This
property has to be modeled, of course, in the prior distribution. As we said
before, the prior should be constructed in such a way that the solutions
which have high probability with respect to the posterior distribution are
piecewise smooth and have rapidly changing values only in a set of small
measure. In finite dimensional Bayesian inversion theory a large number of
edge-preserving methods have been introduced, but we will mainly deal with
the ones using total variation priors. In fact, in [18], it has been proven that in a fixed discrete model space, the use of a total variation prior in restoring images not only preserves the edges but is also particularly suitable for representing "blocky" images, that is, images consisting of piecewise smooth simple shapes; both features are extremely important in medical imaging.
For a better understanding of how this choice for the prior results in an
edge-preserving and blocky restoration of images, let us analyze it in further
detail. First, let us define the concept of total variation of a function.
Definition 1.1.3 (Total variation of a function).
Let f : D → R be a function in L^1(D), the space of integrable functions on D ⊂ R^d, where d ∈ N. We define the total variation of f as
\[
TV(f) = \int_D |Df| = \sup\left\{ \int_D f\,(\nabla\!\cdot g)\, dx \;:\; g = (g_1, \ldots, g_d) \in C_0^1(D, \mathbb{R}^d),\ \| g(x) \|_{L^\infty(D)} \le 1 \right\},
\]
where the test function space C_0^1(D, R^d) consists of continuously differentiable vector-valued functions on D that vanish at the boundary and ‖·‖_{L^∞(D)} is the essential supremum norm.
A function f is said to have bounded variation if TV(f) < ∞.
At this point we would like to define the discrete analogue. Let us thus
consider the two dimensional problem. Assume that D ⊂ R^2 is bounded and divided into N pixels.
Definition 1.1.4 (The discrete total variation density).
The total variation of the discrete image x = [x_1, x_2, ..., x_N]^T is defined as
\[
TV(x) = \sum_{j=1}^{N} V_j(x), \qquad V_j(x) = \frac{1}{2}\sum_{i \in N_j} \ell_{ij}\, |x_i - x_j|,
\]
where ℓ_ij is the length of the edge between the neighboring pixels and N_j is the index set of the neighbors of pixel x_j (two pixels are defined to be neighbors if they share a common edge).
The discrete total variation density is then defined as
\[
\pi(x) \propto \exp(-\alpha\, TV(x)).
\]
Example 1.1.1
For instance, if we consider the three images contained in Figure 1.1, we can
easily calculate their total variation. Suppose that the pixels in these 7 × 7
images take values 0 for black, 1 for grey and 2 for white. Then the blocky
image on the left has the smallest total variation, the highest value of the
total variation prior and hence the highest probability to occur. The opposite
is true for the second and the third image.
Figure 1.1: Three 7 × 7 pixel images with different total variation, from left
to right 18, 28 and 40, respectively
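As a small illustration of Definition 1.1.4, the following sketch computes the discrete total variation of a pixel image on a uniform grid, where every interior edge has length ℓ_ij = 1, and evaluates the unnormalized prior density. The 7 × 7 images of Figure 1.1 are not reproduced here, so the small test image below is purely hypothetical.

```python
import numpy as np

def discrete_tv(x):
    """Discrete total variation of a 2D pixel image on a uniform grid.

    Each pair of horizontally or vertically neighboring pixels shares an
    edge of length 1, so TV(x) is the sum of |x_i - x_j| over neighboring pairs.
    """
    dv = np.abs(np.diff(x, axis=0)).sum()   # jumps between vertical neighbors
    dh = np.abs(np.diff(x, axis=1)).sum()   # jumps between horizontal neighbors
    return dv + dh

def tv_prior_density(x, alpha=1.0):
    """Unnormalized discrete total variation prior pi(x) ~ exp(-alpha * TV(x))."""
    return np.exp(-alpha * discrete_tv(x))

# Hypothetical 4x4 "blocky" image with values 0 (black), 1 (grey), 2 (white)
x_blocky = np.array([[0, 0, 2, 2],
                     [0, 0, 2, 2],
                     [1, 1, 1, 1],
                     [1, 1, 1, 1]])
print(discrete_tv(x_blocky), tv_prior_density(x_blocky, alpha=0.1))
```

A blockier image yields a smaller TV value and hence a larger prior density, in line with the example above.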
All the priors used in this work will be based on total variation, being just a slight modification of the total variation prior. This modification will be introduced in order to obtain all the benefits of total variation priors while avoiding their limitations related to discretization invariance.
1.1.2 The joint probability density
Returning to the list of possible probability densities we can introduce while
treating the inverse problem (1.1), since we have at least two random variables involved, it is always possible to consider their joint distribution.
Definition 1.1.5 (Joint probability density).
Let us denote the joint probability density of X and Y by π(x, y).
Remark. Obviously the two probabilities, i.e. the joint and the prior probability, are connected by the following relation:
\[
\int_{\mathbb{R}^M} \pi(x, y)\, dy = \pi_{pr}(x).
\]
1.1.3 The likelihood function
Besides the joint probability density, we can always also consider the probability density of one variable, given the value assumed by the other one.
Definition 1.1.6 (Likelihood function).
The conditional probability distribution of Y given the information that X = x is called the likelihood function and can be expressed as
\[
\pi(y|x) = \frac{\pi(x, y)}{\pi_{pr}(x)}, \qquad \text{if } \pi_{pr}(x) \neq 0.
\]
The reason for its name is the fact that this probability density expresses the likelihood of different measurement outcomes when X = x is given.
Remark. One could ask whether there is a simple way to calculate the likelihood distribution. The answer to this question is that it obviously depends on which model has been chosen to describe the noise. In this work, we decided on a particular model for the noise, choosing it to be additive and independent of the unknown X. Thus, finding the likelihood function becomes extremely simple. In fact, the noise being additive, the stochastic model we introduced before in (1.1) becomes
\[
Y = f(X) + E,
\]
where X ∈ R^N, while Y and E are both in R^M (so K = M), and X and E are mutually independent. Denote the probability distribution of the noise by
\[
\mu_E(B) = P(E \in B) = \int_B \pi_{noise}(e)\, de.
\]
Since X and E are mutually independent, the probability density of E is not affected when we make the assumption that X = x. Therefore we obtain the following equalities:
\[
\pi(y|x) = \pi(f(x) + e \,|\, x) = \pi_{noise}(e) = \pi_{noise}(y - f(x)).
\]
Hence, if the distribution of the noise is known (and, as we will see in the following chapters, in this work we will choose the noise to be Gaussian distributed), the likelihood function is completely determined.
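To make the remark concrete, here is a minimal sketch (not taken from the thesis code) of how an additive, X-independent Gaussian noise model determines the likelihood via π(y|x) = π_noise(y − f(x)); the forward map used in the usage example is a hypothetical linear one.

```python
import numpy as np

def log_likelihood(y, x, forward, sigma):
    """Log-likelihood log pi(y|x) for the model Y = f(X) + E, E ~ N(0, sigma^2 I).

    Because the noise is additive and independent of X, pi(y|x) equals the
    noise density evaluated at the residual y - f(x).
    """
    r = y - forward(x)                      # residual e = y - f(x)
    M = r.size
    return -0.5 * np.sum(r**2) / sigma**2 - 0.5 * M * np.log(2 * np.pi * sigma**2)

# Toy usage with a hypothetical linear forward map f(x) = A @ x
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true + 0.1 * rng.standard_normal(5)
print(log_likelihood(y, x_true, lambda x: A @ x, sigma=0.1))
```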
1.1.4 The posterior distribution
Going back to conditional distributions, we can obviously also define:
Definition 1.1.7 (Posterior distribution).
The conditional probability distribution of X given the measurement data
Y = y is called the posterior distribution of X and can be expressed as
\[
\pi_{post}(x) = \pi(x|y) = \frac{\pi(x, y)}{\pi(y)}, \qquad \text{if } \pi(y) = \int_{\mathbb{R}^N} \pi(x, y)\, dx \neq 0.
\]
It expresses the knowledge we have about X after the realized observation
Y = y.
1.1.5 Bayesian formulation of inverse problems
We are now ready to reformulate in Bayesian terms the inverse problem we
began with:
Given the data Y = y and the information encoded about X in its
prior probability distribution, find the conditional probability distribution πpost (x) = π(x|y).
Fortunately, there exists a theorem which ties together all the probability
densities just introduced in a single formula.
Theorem 1.1.1 (Bayes' theorem).
Assume that the R^N-valued random vector X has a known prior probability density π_pr(x) and the data consist of the observed value y of an observable R^M-valued random variable Y such that π(y) > 0. Then the posterior probability density of X, given the data y, is
\[
\pi_{post}(x) = \pi(x|y) = \frac{\pi_{pr}(x)\, \pi(y|x)}{\pi(y)}.
\]
Remark. The marginal density π(y) = ∫_{R^N} π(x, y) dx, which appears in Bayes' formula, is just a normalization constant and therefore does not have a fundamental influence on the posterior distribution. However, it is always possible that π(y) = 0, which can obviously create a technical problem. Since π(y) = 0 would imply that our measurement data have zero probability, this means that the model we chose for the prior is not consistent with reality, and therefore we can easily exclude this case.
Since solving an inverse problem in the Bayesian setup basically means finding the posterior distribution of the unknown, due to Bayes' formula it is clear that the simplest way to find the posterior distribution consists of finding the proper prior distribution for X, calculating the likelihood function and finally developing methods to explore the posterior distribution obtained with the previous steps.
1.1.6 The MAP and CM estimates
Finally, recall that, even in Bayesian inversion, finding single point estimates
is always possible. These are in fact based on the posterior distribution of
the unknown and they represent the most probable values of the unknown
X given the data y and the prior information.
The most common estimates are the maximum a posteriori estimate and the
conditional mean estimate which are defined as follows:
Definition 1.1.8 (Maximum a posteriori estimate).
Given the posterior probability density π(x|y) of the unknown R^N-valued random vector X, the maximum a posteriori (MAP) estimate satisfies
\[
x_{MAP} = \arg\max_{x \in \mathbb{R}^N} \pi(x|y),
\]
provided that such a maximizer exists.
Remark. Note that such a maximizer, even when it exists, is not necessarily unique. Note also that finding this point estimate is an optimization problem.
Definition 1.1.9 (Conditional mean estimate).
The conditional mean (CM) estimate of the unknown X conditioned on
the data y is defined as
\[
x_{CM} = E[X \,|\, Y = y] = \int_{\mathbb{R}^N} x\, \pi(x|y)\, dx,
\]
provided that the integral converges.
Remark. Note that finding the CM estimate consists of solving an integration problem.
In our work we will use these particular estimates in order to reconstruct our
images.
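As a toy illustration of the two point estimates (my own, not part of the thesis), the sketch below computes the MAP estimate by maximizing an unnormalized one-dimensional posterior on a grid and the CM estimate by numerical integration; the posterior used here is a hypothetical Gaussian-likelihood-times-Laplace-prior density.

```python
import numpy as np

# Hypothetical unnormalized 1D posterior: Gaussian likelihood times a Laplace (TV-like) prior
def unnormalized_posterior(x, y=1.5, sigma=0.5, alpha=2.0):
    return np.exp(-0.5 * (y - x)**2 / sigma**2 - alpha * np.abs(x))

x = np.linspace(-5, 5, 20001)
dx = x[1] - x[0]
p = unnormalized_posterior(x)

# MAP estimate: maximizer of the posterior (an optimization problem)
x_map = x[np.argmax(p)]

# CM estimate: posterior mean (an integration problem), via a simple Riemann sum
Z = p.sum() * dx                    # normalization constant pi(y)
x_cm = (x * p).sum() * dx / Z

print(f"MAP = {x_map:.3f}, CM = {x_cm:.3f}")
```

The two estimates generally differ whenever the posterior is not symmetric around its mode, which is exactly the situation produced by edge-preserving priors.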
Chapter 2
Discretization Invariance in Bayesian inversion
Bayesian methods have been widely applied to practical inverse problems. Nevertheless, several crucial questions remain open. In particular, it is still under discussion whether these methods can be successfully applied to infinite dimensional problems. In fact, in practical situations, we actually work with a discretization of our continuous models. Thus, asymptotic questions about how the results obtained in finite dimensional spaces should be translated into the infinite dimensional limit space have naturally arisen in the literature. In particular, in this chapter we will summarize the inquiry presented in [22], underlining the critical points and considering the limits of choosing total variation priors.
2.1 The problem of discretization
Consider a physical quantity u that can be observed through an indirect
noisy measurement
\[
m = Au + e, \tag{2.1}
\]
where the random noise e is supposed to be additive and the linear operator A models the measurement.
The inverse problem can be formulated as the problem of finding u given m.
We assume further that the unknown u is known to be piecewise regular (i.e.
we will look for an edge-preserving method).
In the Bayesian setup all the quantities are modelled as random variables
defined on a common probability space (Ω, F, P ). Let us assume that U is
a priori known to take values in a function space Y , that is U : Ω → Y ,
while M and E take values in another function space W , that is M : Ω → W
and E : Ω → W . Suppose Y and W to be Hilbert spaces. As mentioned,
A : Y → W is a linear operator. Hence, the problem becomes
M = AU + E,
(2.2)
that is, the problem of finding what we know about the random variable U ,
given a realization of the random variable M , let us say m = M (ω), where ω
is a fixed element of Ω. We will refer to (2.2) as the continuum model. To
deal with this problem we need to introduce two different independent levels
of discretization.
The first discretization we introduce is justified by the fact that, in practical
situations, the measurements are always finite dimensional. We will refer to
the original problem, discretized through this first kind of discretization, as the practical measurement model. It takes the form
Mk = Pk M = Pk AU + Pk E = Ak U + Ek ,
(2.3)
where Pk is a linear operator related to the device that, for simplicity, will
be assumed to be an orthogonal projection Pk : W → W with k-dimensional
range Wk . Hence, Ak : Y → Wk , Mk : Ω → Wk and Ek : Ω → Wk .
Thus, the inverse problem has become the problem of estimating U , given a
realization mk = Mk (ω).
At this point we need a second discretization step since the computational solution of the inverse problem using Bayesian inversion involves discretization
of the unknown quantity U . If we choose a linear projection Tn : Y → Y with
n-dimensional range Yn and define a random variable Un = Tn U : Ω → Yn
taking values in Yn , this leads to the computational model
Mkn = Pk ATn U + Pk E = Ak Un + Ek ,
(2.4)
where Mkn : Ω → Wk . Hence, the inverse problem has become the problem
of estimating U , given a realization mkn = Mkn (ω). It is clear that the two
discretizations are independent: Pk is related to the measurement device
and has k-dimensional range and Tn is related to finite representation of the
unknown and has n-dimensional range.
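The two independent discretization levels can be illustrated with a short sketch (a simplified stand-in for the thesis setting, not its actual code): T_n is realized as piecewise linear interpolation of u on an n-point interior grid, while the role of P_k A is played by a hypothetical smoothing forward operator sampled at k measurement locations.

```python
import numpy as np

def T_n(u, n):
    """Computational discretization: nodal values of the piecewise linear
    interpolant of u on the interior grid points t_j = j/(n+1)."""
    t = np.arange(1, n + 1) / (n + 1)
    return t, u(t)

def A_k(u_vals, t, k, kernel_width=0.1):
    """Practical measurement model: k local weighted averages of the
    discretized unknown (a hypothetical forward operator playing the role of P_k A)."""
    s = np.arange(1, k + 1) / (k + 1)                       # measurement locations
    W = np.exp(-0.5 * ((s[:, None] - t[None, :]) / kernel_width) ** 2)
    W /= W.sum(axis=1, keepdims=True)                       # row-normalized averaging weights
    return W @ u_vals

u = lambda t: np.where((t > 0.3) & (t < 0.7), 1.0, 0.0)     # piecewise constant "blocky" unknown
t, u_n = T_n(u, n=63)                                       # n interior nodes (computational level)
m_k = A_k(u_n, t, k=20) + 0.01 * np.random.default_rng(1).standard_normal(20)  # k noisy data (measurement level)
```

Refining n changes only the representation of the unknown, while refining k changes only the measurement device, mirroring the independence of the two levels described above.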
Let us give an example that will help us to understand how this double discretization works. In computerized tomography, X-ray attenuation data are collected from 180 directions separately for each two dimensional slice. Using a reconstruction algorithm, the inner structure of the slice is revealed.
Figure 2.1: Computerized tomography.
In this simple example we can distinguish the three models just described.
Figure 2.2: From the left to the right we have the continuum model, the
practical measurement model, and the computational model for tomography.
Notation. Henceforth, index k always refers to the level of discretization
related to the practical measurement model and index n always refers to the
level of discretization related to the computational model.
At this point, it is useful to reformulate all the probability densities related
to the computational model, which is the model we use. To do so, we need
to translate the model into the real setting in order to be able to describe
the probability densities with respect to the Lebesgue measure.
Consider two arbitrary isometries I_k : W_k → R^k and I_n : Y_n → R^n. Through these operators the equation (2.4) is mapped to a matrix equation
\[
\mathsf{M}_{kn} = \mathsf{A}_{kn} \mathsf{U}_n + \mathsf{E}_k, \tag{2.5}
\]
where \mathsf{M}_{kn} = I_k M_{kn} : Ω → R^k, \mathsf{A}_{kn} = I_k A_k I_n^{-1} ∈ R^{k×n}, \mathsf{U}_n = I_n U_n : Ω → R^n and \mathsf{E}_k = I_k E_k : Ω → R^k.
Now, if all the random variables are absolutely continuous with respect to the Lebesgue measure, by Bayes' formula the posterior density π_kn can be expressed as
\[
\pi_{kn}(u_n | m_{kn}) = \frac{\pi_{pr}^{(n)}(u_n)\, \Gamma_{kn}(m_{kn} | u_n)}{\gamma_{kn}(m_{kn})}, \tag{2.6}
\]
where we emphasize that u_n = \mathsf{U}_n(ω) = I_n T_n U(ω) ∈ R^n and m_{kn} = \mathsf{M}_{kn}(ω) = I_k M_{kn}(ω) ∈ R^k; π_pr^{(n)} and Γ_kn represent the density of the prior distribution of \mathsf{U}_n and the likelihood distribution of \mathsf{M}_{kn} given \mathsf{U}_n = u_n, and γ_kn is the density of \mathsf{M}_{kn}.
Remark. Note that the computational model does not exist in nature. It
is something that we introduce to make the computational part feasible.
Hence we will never find m_{kn} = \mathsf{M}_{kn}(ω) as given data. Thus, we will use m_k = \mathsf{M}_k(ω) = I_k M_k(ω) ∈ R^k instead of m_{kn} = \mathsf{M}_{kn}(ω) = I_k M_{kn}(ω) ∈ R^k.
Thus our problem becomes the problem of estimating U given a realization
mk = Mk (ω). We will do it by using the common estimates (MAP or CM
estimates) based on the posterior distribution πkn (un |mk ).
Notation. From here on, we do not distinguish anymore between the random functions and their respective random variables obtained through the
use of the isometries Ik and In , denoting them in each case with non-italic
capital letters.
2.2 Discretization invariance
The scheme just presented becomes meaningful as soon as we consider it in
a finite dimensional space. The problem is that our aim is to resolve problem (2.2) which is typically set in an infinite dimensional space. We must
therefore consider the limits of the posterior distributions relative to the discretized models and of the estimates we can produce starting from them.
This is the central idea of studying discretization invariance. Since, as we
said, the discretizations are independent, n and k are independent, and thus,
if we want the Bayesian inversion strategy to work, posterior estimates must
converge as n or k or both tend to infinity.
Hence, choosing the right T_n and π_pr^(n) (where right means discretization invariant) is extremely important and is the most difficult part of Bayesian inversion problems. In particular, we would like to choose them in such a way that the following phenomena are avoided:
- The estimates ukn diverge as n → ∞. This means that investing
more computational efforts will not necessarily result in improved reconstructions.
- The estimates ukn diverge as k → ∞. This means that refining the
measurement device(s) will not necessarily result in improved reconstructions.
- Representation of a priori knowledge is incompatible with discretization. This means that the a priori information we have could change
for different values of n.
In other words, we want the discretized model and the continuum model to
be coherent for every choice of n and k.
In this work we will exclusively deal with the problem of discretization invariance when n goes to infinity, not considering what happens when k goes to infinity. In particular, we would like to construct T_n and π_pr^(n) in such a way that if we construct prior distributions for U in the infinite dimensional space Y, then the random variable U_n = T_n U, taking values in the finite dimensional subspace Y_n ⊂ Y, represents the same a priori knowledge as U. Also, we would like U_n to converge to U in distribution as n goes to infinity, since this guarantees that the representation of a priori information becomes more accurate when n grows. Finally, we would also like the point estimates of U_n to converge in the limit to the corresponding point estimate of U.
Let us thus define in further detail what constitutes a "good" choice for T_n and π_pr^(n).
Definition 2.2.1 (Linear discretization of random functions).
Let U_{n(ℓ)}(t, ω) be Y_{n(ℓ)}-valued random variables with ℓ ∈ N and 1 < n(1) < n(2) < ···. Assume that
- There is a Y-valued random variable V = V(ω, t) such that for any t ∈ [0, 1]
\[
\lim_{\ell \to \infty} U_{n(\ell)}(t) = V(t) \quad \text{in distribution.}
\]
- There are T_{n(ℓ)} ∈ L(Y), where
\[
L(Y) = \{T : Y \to Y;\ T \text{ linear and bounded}\},
\]
such that for any t ∈ [0, 1]
\[
U_{n(\ell)}(t) = (T_{n(\ell)} V)(t).
\]
Then the U_{n(ℓ)} are said to be linear discretizations of the random function V. Further, we say that V can be approximated by finite dimensional random variables in a discretization invariant manner and the U_{n(ℓ)} are proper linear discretizations of V.
Definition 2.2.2 (Discretization invariant choice of Y_n and π_pr^(n)).
Assume given Y_{n(ℓ)} and π_pr^(n(ℓ)) for (2.4) with ℓ ∈ N and 1 < n(1) < n(2) < ···. Let U_{n(ℓ)} be random functions taking values in Y_{n(ℓ)} and having distribution π_pr^(n(ℓ)). We say that the choice of Y_n and π_pr^(n) is discretization invariant if the U_{n(ℓ)} are linear discretizations of a random function in the sense of Definition 2.2.1.
2.3 Can one use total variation prior for edge-preserving Bayesian inversion?
In [22], the authors consider the case of total variation priors for Un and
analyze their discretization invariance as n goes to infinity. To be more
precise, in [22] the authors consider a family of priors distributed according
to the so called p-distributions, which are slight modifications of the total
variation priors. However, since we are mainly interested in the limit behavior
of the total variation priors, we will recall the results and the theory presented
in [22] just in the case p = 1.
Notation. Recall that, for all the quantities we define, index k refers to the level of discretization related to the practical measurement model, which remains fixed in our considerations but whose convergence could be further studied in the future.
We restrict ourselves in this work to a class of one dimensional inverse problems. Recall that we are interested in reconstructing a function u which is a
priori known to be piecewise regular. It is therefore natural to define Y and
Yn as follows:
Definition 2.3.1 (Function spaces Y and Yn ).
Let Y be the space of continuous functions on the interval [0, 1] vanishing at
the endpoints
Y = C0 ([0, 1]) = {u ∈ C([0, 1]) : u(0) = u(1) = 0}.
For any integer n > 0, let Yn ⊂ Y be the following set of piecewise linear
functions on the interval [0, 1]
\[
Y_n = \{u \in Y : u|_{[x_j^n, x_{j+1}^n]} \text{ is linear for } j = 0, \ldots, n\},
\]
where
\[
x_j^n = \frac{j}{n+1} \quad \text{for } j = 0, \ldots, n+1.
\]
Definition 2.3.2 (Roof-top basis for Yn ).
Define the roof-top basis {ψ_j^n}_{j=1}^n for Y_n, where ψ_j^n ∈ Y_n is such that ψ_j^n(x_k^n) = δ_jk for j = 1, ..., n and k = 0, ..., n + 1.
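A minimal sketch (assuming the grid x_j^n = j/(n+1) of Definition 2.3.1) of the roof-top (hat) basis functions and of evaluating a piecewise linear function in Y_n from its coefficients:

```python
import numpy as np

def rooftop(j, n, t):
    """Roof-top basis function psi_j^n: piecewise linear, equal to 1 at
    x_j^n = j/(n+1) and 0 at all other grid points (and at the endpoints)."""
    h = 1.0 / (n + 1)
    xj = j * h
    return np.clip(1.0 - np.abs(t - xj) / h, 0.0, None)

def evaluate(coeffs, t):
    """Evaluate u_n(t) = sum_j u_j^n psi_j^n(t) for coefficients u_1^n, ..., u_n^n."""
    n = len(coeffs)
    return sum(c * rooftop(j, n, t) for j, c in enumerate(coeffs, start=1))

t = np.linspace(0, 1, 9)
print(evaluate([0.5, 1.0, 0.25], t))   # a piecewise linear function in Y_3, vanishing at 0 and 1
```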
We choose U_n to be distributed according to the total variation prior in Y_n.
Definition 2.3.3 (Prior distributions for U_n).
Let n > 0 be an integer and α_n > 0 a parameter. The Y_n-valued random function U_n : Ω → Y_n such that, for every ω ∈ Ω, U_n(ω) ∈ Y_n with
\[
U_n(\omega, t) = (U_n(\omega))(t) = \sum_{j=1}^{n} U_j^n(\omega)\, \psi_j^n(t), \qquad U_j^n(\omega) \in \mathbb{R},
\]
has the total variation (TV) probability distribution in Y_n if the R^n-valued random vector [U_1^n(ω), ..., U_n^n(ω)]^T has the probability density function
\[
\pi_{pr}^{(n)}(u_1^n, \ldots, u_n^n) = C_n \exp\left(-\alpha_n \Big\| \frac{\partial}{\partial t} \sum_{j=1}^{n} u_j^n \psi_j^n(t) \Big\|_{L^1(0,1)}\right) \tag{2.7}
\]
\[
= C_n \exp\left(-\alpha_n \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|\right) \tag{2.8}
\]
\[
= C_n \exp\left(-\alpha_n\, TV(u_n)\right), \tag{2.9}
\]
where u_0^n = u_{n+1}^n = 0, C_n is a normalization constant and u_n = U_n(ω) is a function in Y_n.
Remark. We require that u_0^n = u_{n+1}^n = 0 for the total variation distribution to be a probability density function.
Remark. Note that a function in Y_n is completely determined by the values it assumes at x_j^n for j = 1, ..., n. Thus, it is clear that for every ω in Ω, u_n = U_n(ω) ∈ Y_n is completely identified by its components u_1^n = U_1^n(ω), ..., u_n^n = U_n^n(ω).
Remark. Note that in general, since u_n is a real-valued function defined on the real interval [0, 1],
\[
TV(u_n) = \sup_{P \in \mathcal{P}} \left( \sum_{i=0}^{n_P - 1} |u_n(x_{i+1}) - u_n(x_i)| \right),
\]
where the supremum runs over the set
\[
\mathcal{P} = \{P = \{x_0, \ldots, x_{n_P}\} \text{ s.t. } P \text{ is a partition of } [0, 1]\}
\]
of all partitions of the given interval. In this particular case u_n ∈ Y_n and, due to the definition of Y_n, we have that
\[
TV(u_n) = \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|.
\]
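As a sketch (under the convention u_0^n = u_{n+1}^n = 0), the TV seminorm and the unnormalized total variation prior (2.8) can be evaluated directly from the coefficient vector:

```python
import numpy as np

def tv_seminorm(u):
    """TV(u_n) = sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n| with u_0^n = u_{n+1}^n = 0."""
    u_ext = np.concatenate(([0.0], np.asarray(u, dtype=float), [0.0]))  # pad boundary zeros
    return np.abs(np.diff(u_ext)).sum()

def log_tv_prior(u, alpha):
    """Unnormalized log prior: log pi_pr^(n)(u) = -alpha * TV(u_n) + const."""
    return -alpha * tv_seminorm(u)

print(tv_seminorm([0.5, 1.0, 0.25]))   # 0.5 + 0.5 + 0.75 + 0.25 = 2.0
```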
At this point we are ready to introduce our model for measurement device
and for noise distribution.
Definition 2.3.4 (Borel measures with finite variation).
Denote by Z the set of Borel measures P(dt) with finite variation that are supported on a compact subset of (0, 1). Recall that P(dt) has finite variation if there are finite non-negative measures P_1(dt) and P_2(dt) such that
\[
P(dt) = P_1(dt) - P_2(dt).
\]
The norm on Z is given by
\[
\| P \|_Z = \inf\{P_1([0, 1]) + P_2([0, 1]) : P(dt) = P_1(dt) - P_2(dt)\}.
\]
Definition 2.3.5 (Measurement model).
Let Akj ∈ Z for j = 1, ..., k. Given a realization of U , that is u ∈ Y , let
the measurement mk be the corresponding realization of the random vector
Mk = Ak U + Ek in Rk . Then, mk ∈ Rk has components
\[
m_j^k = (A_k u)_j + e_j^k = \int_0^1 u(t)\, A_j^k(dt) + e_j^k, \qquad j = 1, \ldots, k, \tag{2.10}
\]
where e_j^k, for j = 1, ..., k, are the components of the realization e_k of E_k.
Definition 2.3.6 (Noise distribution model).
Let σ > 0. We suppose the noise to have independent Gaussian components, i.e. the components E_j^k ∼ N(0, σ²) of the random vector E_k are independent Gaussian errors for j = 1, ..., k.
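Putting Definitions 2.3.5 and 2.3.6 together, a sketch of generating synthetic data m_k = A_k u + e_k; the measurement functionals A_j^k are chosen here, purely for illustration, as point evaluations (Dirac measures at interior points), which belong to Z. This is a hypothetical choice, not the thesis's model problem.

```python
import numpy as np

def simulate_measurement(u, k, sigma, rng=None):
    """Return m_k with components m_j^k = int u(t) A_j^k(dt) + e_j^k,
    where A_j^k = delta_{s_j} (a hypothetical choice of measurement functionals)
    and the e_j^k ~ N(0, sigma^2) are independent."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.arange(1, k + 1) / (k + 1)          # measurement points in (0, 1)
    e = sigma * rng.standard_normal(k)         # independent Gaussian noise
    return u(s) + e

u = lambda t: np.where((t > 0.3) & (t < 0.7), 1.0, 0.0)   # piecewise constant unknown
m_k = simulate_measurement(u, k=20, sigma=0.05, rng=np.random.default_rng(0))
```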
Finally we are ready to describe the posterior distribution relative to our
model.
Definition 2.3.7 (Posterior distribution).
Modelling the prior knowledge about U with the total variation distribution in the space Y_n leads to the following posterior distribution
\[
\pi_{kn}(u_n | m_k) = \frac{\pi_{pr}^{(n)}(u_n)\, \Gamma_{kn}(m_k | u_n)}{\gamma_{kn}(m_k)}
= \frac{\pi_{pr}^{(n)}(u_n)\, \pi_{noise}(m_k - A_k u_n)}{\gamma_{kn}(m_k)}
= C_{kn} \exp\left(-\frac{1}{2\sigma^2} \| A_k u_n - m_k \|_{\mathbb{R}^k}^2 - \alpha_n \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|\right), \tag{2.11}
\]
where C_kn is a normalization constant.
2.3.1 Convergence of MAP estimates
For each n ∈ N, we obtained a posterior distribution, i.e. πkn (un |mk ) for
the unknown. We can therefore construct the MAP estimates related to
these posterior distributions and obtain in this manner a sequence of MAP
estimates of u_n, one for each n. These are defined as
\[
u_n^{MAP}(\cdot\,; \alpha_n) \in \arg\max_{u_n \in Y_n \simeq \mathbb{R}^n} \exp\left(-\frac{1}{2\sigma^2} \| A_k u_n - m_k \|_{\mathbb{R}^k}^2 - \alpha_n \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|\right), \tag{2.12}
\]
where α_n is considered as a parameter.
Since we utilize this sequence, it seems natural to analyze its convergence for n = n(ℓ) = 2^ℓ − 1 and ℓ → ∞. In fact, we would like to obtain a limit function for this sequence that will be considered as the right point estimate for u ∈ Y.
Remark. We chose n(ℓ) in this way since it allows us to obtain
\[
Y_{n(\ell)} \subset Y_{n(\ell+1)},
\]
which will be needed in the next steps.
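A sketch (not the thesis implementation, whose actual optimization scheme is described in Chapter 4) of computing a TV MAP estimate by minimizing the negative log-posterior ‖A_k u_n − m_k‖² + β_n |u_n|_BV over the coefficients, with β_n = 2σ²α_n. To use a gradient-based optimizer, the absolute value in the TV term is smoothed, which is my own simplification; the matrix A and data m are assumed given, e.g. from the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

def tv_map(A, m, beta, eps=1e-6, u0=None):
    """Approximate MAP estimate for the TV prior: minimize
    ||A u - m||^2 + beta * sum_j sqrt((u_j - u_{j-1})^2 + eps),
    i.e. a smoothed version of the TV term so that L-BFGS-B applies.
    Zero boundary values u_0 = u_{n+1} = 0 are enforced by padding."""
    n = A.shape[1]

    def objective(u):
        u_ext = np.concatenate(([0.0], u, [0.0]))
        jumps = np.diff(u_ext)
        return np.sum((A @ u - m) ** 2) + beta * np.sum(np.sqrt(jumps**2 + eps))

    u0 = np.zeros(n) if u0 is None else u0
    res = minimize(objective, u0, method="L-BFGS-B")
    return res.x
```

Because the (unsmoothed) objective is convex but not strictly convex, different starting points u0 may return different minimizers, consistent with the remark that follows.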
Remark. The problem of constructing the MAP estimates is an optimization problem. Note that the functional we are trying to maximize in (2.12)
is not strictly convex. Hence the solution of our maximization problem may
not be unique.
Mathematical preliminaries
Let us recall some notions that will be used in the next sections.
Definition 2.3.8 (Lower-semicontinuous functions).
Suppose X is a topological space and x_0 ∈ X. A function
\[
u : X \to \overline{\mathbb{R}} = \mathbb{R} \cup \{-\infty, +\infty\}
\]
is called lower-semicontinuous (l-sc) at x_0 if
\[
u(x_0) \le \liminf_{x \to x_0} u(x).
\]
The function u is called lower semicontinuous if it is lower semicontinuous at every point of its domain.
Definition 2.3.9 (Γ-convergence).
Let {u_n}_{n∈N}, u be l-sc functions. We say that {u_n}_{n∈N} Γ-converges to u if at each x ∈ X we have that
\[
\liminf_{n \to \infty} u_n(x_n) \ge u(x) \quad \text{for every sequence } x_n \to x, \tag{2.13}
\]
\[
\limsup_{n \to \infty} u_n(x_n) \le u(x) \quad \text{for some sequence } x_n \to x. \tag{2.14}
\]
Let us also recall the definitions of some function spaces and their topologies.
Definition 2.3.10 (Weak derivatives).
Consider u ∈ L¹_loc(0, 1), i.e. the space of functions which are Lebesgue integrable over every compact subset of (0, 1). It is possible to associate to u a distribution T_u, i.e. a continuous linear functional over the set of infinitely differentiable functions φ : (0, 1) → R with compact support. Let us denote this space by D(0, 1). T_u is then defined by
\[
\langle T_u, \varphi \rangle = \int_0^1 u(x)\, \varphi(x)\, dx, \qquad \forall \varphi \in D(0, 1).
\]
In this sense we can define the derivative of u in the sense of distributions (or weak derivative) as the derivative of the distribution T_u associated to u. The derivative of the distribution T_u is defined as the continuous linear functional ∂_s T_u over D(0, 1) such that
\[
\langle \partial_s T_u, \varphi \rangle = -\langle T_u, \partial_s \varphi \rangle, \qquad \forall \varphi \in D(0, 1),
\]
where ⟨f, g⟩ = ∫_0^1 f(x) g(x) dx.
Definition 2.3.11 (Sobolev spaces W^{k,p}(0, 1) and W_0^{k,p}(0, 1)).
The Sobolev space W^{k,p}(0, 1) for k, p ≥ 1 is defined to be the subset of L^p(0, 1) such that u and its weak derivatives up to some order k have a finite L^p(0, 1) norm. Equipped with the norm
\[
\| u \|_{W^{k,p}(0,1)} = \left( \sum_{i=0}^{k} \int_0^1 |u^{(i)}(t)|^p\, dt \right)^{1/p}, \tag{2.15}
\]
it is a Banach space.
The Sobolev space W_0^{k,p}(0, 1) for k, p ≥ 1 is defined as the closure of C_0^∞(0, 1), i.e. the set of the infinitely differentiable functions on (0, 1) with compact support, in the norm of W^{k,p}(0, 1).
Remark. Note that the norm defined in (2.15) is equivalent to the following norm
\[
\| u \|_{W^{k,p}(0,1)} = \| u^{(k)} \|_{L^p(0,1)} + \| u \|_{L^p(0,1)}. \tag{2.16}
\]
Let us now define an extension of the spaces W^{1,1}(0, 1) and W_0^{1,1}(0, 1).
Definition 2.3.12 (Spaces of functions of bounded variation).
The spaces of functions of bounded variation are defined as
\[
BV(0, 1) = \{u \in L^1(\mathbb{R}) : |u|_{BV} < \infty,\ \mathrm{supp}(u) \subset [0, 1]\},
\]
\[
BV_0(0, 1) = \{u \in BV(0, 1) : u(0) = u(1) = 0\},
\]
where
\[
|u|_{BV} = \sup\left\{ \int_0^1 u(s)\, \partial_s \varphi(s)\, ds : \varphi \in C_0^1(0, 1),\ \| \varphi \|_{L^\infty(0,1)} \le 1 \right\}
\]
and C_0^1(0, 1) is the space of all the functions with continuous first derivatives and compact support in (0, 1).
These spaces are Banach spaces with respect to the norm defined by
\[
\| u \|_{BV} = \| u \|_{L^1(0,1)} + |u|_{BV}, \tag{2.17}
\]
which extends the classical norm in W^{1,1}(0, 1) as defined in (2.16).
Remark. Note that an equivalent definition of BV(0, 1) is as the set of all functions in L^1(0, 1) whose derivative in the sense of distributions ∂_s u is a measure with finite total variation in (0, 1) (i.e. a Radon measure).
We need to characterize the topology defined on BV(0, 1). We can choose from three different topologies on this space:
- The strong topology of the BV-norm;
- The strict-topology;
- The weak*-topology;
where the last two are defined as follows:
Definition 2.3.13 (Strict-topology of BV-spaces).
We say that u_n ∈ BV(0, 1) converge in the strict-topology of BV to a function u ∈ BV(0, 1) if
\[
\| u_n - u \|_{L^1(0,1)} \to 0 \quad \text{and} \quad |u_n|_{BV} \to |u|_{BV}.
\]
Definition 2.3.14 (Weak*-topology of BV-spaces).
We say that u_n ∈ BV(0, 1) converge in the weak*-topology of BV to a function u ∈ BV(0, 1) if
\[
\| u_n - u \|_{L^1(0,1)} \to 0 \quad \text{and} \quad \int_{\mathbb{R}} \varphi\, (\partial_s u_n - \partial_s u)(ds) \to 0 \ \text{ for any } \varphi \in C_0([0, 1]),
\]
i.e. the measures ∂_s u_n converge weakly to ∂_s u. We will write in this case u_n \xrightarrow{BV\text{-}w*} u.
Remark. According to their names, the three topologies on BV(0, 1) are ordered from the finest to the coarsest as strong, strict and weak*.
Finally, let us define the following operator.
Definition 2.3.15 (The trace operator).
Let p ≥ 1. Consider the linear operator
\[
T : C^1([0, 1]) \to \mathbb{R}^2, \qquad u \mapsto u|_{\partial(0,1)}.
\]
Recall that C^1([0, 1]) ⊂ W^{1,p}(0, 1). It can be proven that there exists a constant C_{p,(0,1)} such that
\[
\| T u \|_{L^p(\partial(0,1))} \le C_{p,(0,1)} \| u \|_{W^{1,1}(0,1)} \qquad \forall u \in C^1([0, 1]).
\]
Then, since it can also be proven that C^1([0, 1]) is dense in W^{1,1}(0, 1), it is clear that T admits a continuous extension
\[
T : W^{1,p}(0, 1) \to \mathbb{R}^2
\]
that is called the trace operator.
Remark. Let us recall that the trace operator defined in Definition 2.3.15 is continuous from W^{1,p}(0, 1) into R^2, p ≥ 1, when these two spaces are equipped with their weak topologies. In the case p = 1, the trace operator can be naturally extended to BV(0, 1) ⊃ W^{1,1}(0, 1). In this case, the extended trace operator
\[
T : BV(0, 1) \to \mathbb{R}^2, \qquad u \mapsto (u(0), u(1)),
\]
is not continuous from BV(0, 1) into R^2 when BV(0, 1) is equipped with its weak*-topology. But it becomes continuous if we endow BV(0, 1) with its strict-topology.
Note also that, on the other hand, for fixed 0 ≤ s_0 ≤ 1, the trace operator
\[
T : BV(0, 1) \to \mathbb{R}, \qquad u \mapsto u(s_0),
\]
is continuous from the weak*-topology of BV(0, 1) to R.
Main results
We are now ready to present the main result of this section.
Theorem 2.3.1 (Limit behavior of MAP estimates).
Let σ > 0, m_k ∈ R^k and n = n(ℓ) = 2^ℓ − 1 for ℓ = 2, 3, ···.
Assume A_j^k ∈ L^1(0, 1) ∩ Z for j = 1, ..., k, where the A_j^k are the ones introduced in (2.10).
We obtain the following convergence results for the MAP estimates, depending on the behavior of α_n.
Case α_n constant. Let α_n = α̃/(2σ²) with some α̃ > 0. Then for any sequence {u_{n(ℓ)}^{MAP}}_{ℓ=2}^∞ of maximizers of (2.12) there is a subsequence that converges in the weak*-topology of BV(0, 1) to some ũ ∈ BV_0(0, 1).
Case α_n going to infinity. Let the parameters α_n > 0 be defined in such a way that they satisfy lim_{ℓ→∞} α_{n(ℓ)} = ∞. Then
\[
\lim_{\ell \to \infty} u_{n(\ell)}^{MAP} = 0
\]
in the norm of BV(0, 1).
Proof.
Reformulation of the sequence whose limit we are considering
First of all, note that
\[
\begin{aligned}
u_n^{MAP}(\cdot\,; \alpha_n) &\in \arg\max_{u_n \in Y_n} \exp\left(-\frac{1}{2\sigma^2} \| A_k u_n - m_k \|_{\mathbb{R}^k}^2 - \alpha_n \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n|\right) \\
&= \arg\min_{u_n \in Y_n} \| A_k u_n - m_k \|_{\mathbb{R}^k}^2 + 2\sigma^2 \alpha_n |u_n|_{BV} \\
&= \arg\min_{u_n \in Y_n} \| A_k u_n - m_k \|_{\mathbb{R}^k}^2 + \beta_n |u_n|_{BV},
\end{aligned} \tag{2.18}
\]
with β_n = 2σ²α_n = α̃.
Continuum and discrete setting
Now we consider the following optimization problem, the limit of the optimization problems in finite dimensional spaces:
\[
\tilde{u}(\cdot\,; \beta) \in \arg\min_{u \in BV_0(0,1)} S(u), \tag{2.19}
\]
\[
S(u) = \| A_k u - m_k \|_{\mathbb{R}^k}^2 + \beta |u|_{BV}. \tag{2.20}
\]
Next we use the methods of convex analysis and approximate the problem with a discrete minimization problem
\[
\tilde{u}_n(\cdot\,; \beta_n) \in \arg\min_{u \in BV_0(0,1)} S_n(u), \tag{2.21}
\]
\[
S_n(u) = \| A_k u - m_k \|_{\mathbb{R}^k}^2 + \beta_n |u|_{BV} + 1_{Y_n}(u), \tag{2.22}
\]
where S_n : BV_0(0, 1) → \overline{\mathbb{R}}_+ = [0, ∞] is a convex non-linear function, the β_n > 0 are such that β = lim_{n→∞} β_n, and 1_{Y_n} is the convex indicator function, i.e. the function defined by 1_{Y_n}(u) = 0 if u ∈ Y_n and 1_{Y_n}(u) = ∞ if u ∉ Y_n.
Thus, noting that Y_n ⊂ BV_0(0, 1), it is clear that the discretized problem (2.21) is equivalent to the following one, which is in turn the same as the problem in (2.18):
\[
\tilde{u}_n(\cdot\,; \beta_n) \in \arg\min_{u_n \in Y_n} S_n(u_n), \tag{2.23}
\]
\[
S_n(u_n) = \| A_k u_n - m_k \|_{\mathbb{R}^k}^2 + \beta_n |u_n|_{BV}. \tag{2.24}
\]
Case αn constant
Approximations of functions in BV_0(0, 1)
Now we are going to show that each function in BV_0(0, 1) can be approximated by a sequence of functions in W_0^{1,1}(0, 1), where W_0^{1,1}(0, 1) ⊂ BV_0(0, 1).
Let h ∈ L^1(0, 1). Consider the σ-algebras
\[
N_\ell = \Sigma\big( \{(0, x_j^{n(\ell)})\}_{j=0,1,\ldots,n(\ell)+1} \big).
\]
Then the σ-algebra
\[
N = \Sigma\left( \bigcup_{\ell=2}^{\infty} N_\ell \right)
\]
is the standard Borel σ-algebra of (0, 1).
Since N_ℓ ⊂ N_{ℓ+1}, by Doob's second martingale theorem it follows that
\[
\lim_{\ell \to \infty} \| E(h | N_\ell) - h \|_{L^1(0,1)} = 0, \tag{2.25}
\]
where, of course, E(h|N_ℓ) is the conditional expectation with respect to the σ-algebra N_ℓ.
Consider now the operators T_n : Y → Y_n defined by
\[
(T_n u)(t) = \sum_{j=1}^{n} u(x_j^n)\, \psi_j^n(t),
\]
where x_j^n and ψ_j^n are the same as in Definitions 2.3.1 and 2.3.2.
Note that for u ∈ W_0^{1,1}(0, 1) the function (T_n u)' is piecewise constant and by definition
\[
(T_{n(\ell)} u)' = E(u' \,|\, N_\ell). \tag{2.26}
\]
Hence, for every u ∈ W_0^{1,1}(0, 1) we get
\[
\begin{aligned}
\lim_{\ell \to \infty} \| T_{n(\ell)} u - u \|_{W_0^{1,1}(0,1)}
&\overset{(2.16)}{=} \lim_{\ell \to \infty} \| T_{n(\ell)} u - u \|_{L^1(0,1)} + \lim_{\ell \to \infty} \| (T_{n(\ell)} u - u)' \|_{L^1(0,1)} \\
&= \lim_{\ell \to \infty} \| T_{n(\ell)} u - u \|_{L^1(0,1)} + \lim_{\ell \to \infty} \| (T_{n(\ell)} u)' - u' \|_{L^1(0,1)} \\
&\overset{(2.26)}{=} \lim_{\ell \to \infty} \| T_{n(\ell)} u - u \|_{L^1(0,1)} + \lim_{\ell \to \infty} \| E(u' | N_\ell) - u' \|_{L^1(0,1)} \\
&\overset{(2.25)}{=} 0.
\end{aligned} \tag{2.27}
\]
Moreover, recall that the Poincaré inequality states that for every p such that 1 ≤ p < ∞ and E ⊂ R^n a bounded open set, there exists a constant C_{p,E} such that for all u ∈ W_0^{1,p}(E)
\[
\| u \|_{L^p(E)} \le C_{p,E} \| \nabla u \|_{L^p(E)}, \tag{2.28}
\]
where ∇ is the weak gradient. An important consequence of (2.28) is that for every u ∈ W_0^{1,p}(E) it holds that
\[
\| \nabla u \|_{L^p(E)}^p \le \| u \|_{W^{1,p}(E)}^p \overset{(2.16)}{=} \| u \|_{L^p(E)}^p + \| \nabla u \|_{L^p(E)}^p \overset{(2.28)}{\le} (1 + C_{p,E}) \| \nabla u \|_{L^p(E)}^p.
\]
Hence, in our case (i.e. p = 1 and E = (0, 1)), it follows that
\[
\| T_{n(\ell)} u \|_{W_0^{1,1}(0,1)} \le c\, \| (T_{n(\ell)} u)' \|_{L^1(0,1)} \le c\, \| u' \|_{L^1(0,1)} \le c\, \| u \|_{W_0^{1,1}(0,1)},
\]
where c is a constant which does not depend on n. Thus, the T_n are continuous from W_0^{1,1}(0, 1) to W_0^{1,1}(0, 1) in the norm topology and their norms are uniformly bounded.
We have just proven that every function u ∈ W_0^{1,1}(0, 1) can be approximated in the norm of W_0^{1,1}(0, 1) by the sequence {T_{n(ℓ)} u}. Now, we apply the properties of the operators T_n to approximate functions in BV_0(0, 1).
Note that for any u ∈ BV_0(0, 1) and any u_r \xrightarrow{BV\text{-}w*} u we have
\[
|u|_{BV} \le \liminf_{r \to \infty} |u_r|_{BV}. \tag{2.29}
\]
In fact, this follows from the fact that if u_r \xrightarrow{BV\text{-}w*} u, then in particular we have that ‖u_r − u‖_{L^1(0,1)} → 0. Hence, if φ ∈ C_0^1(0, 1) and ‖φ‖_{L^∞(0,1)} ≤ 1, then
\[
\int_0^1 u(s)\, \partial_s \varphi(s)\, ds = \lim_{r \to \infty} \int_0^1 u_r(s)\, \partial_s \varphi(s)\, ds \le \liminf_{r \to \infty} |u_r|_{BV}.
\]
Thus, the result follows by taking the supremum over all such φ.
At this point, from the theory related to the regularization of BV functions (see for instance [13, Theorem 5.2.1]), we know that for every u ∈ BV_0(0, 1) there exists a sequence {Φ_r}_{r∈N} ⊂ W^{1,1}(0, 1) ∩ C^∞(0, 1) such that
\[
\Phi_r \xrightarrow{BV\text{-}w*} u \quad \text{and} \quad \lim_{r \to \infty} |\Phi_r|_{BV} = |u|_{BV}; \tag{2.30}
\]
in other words, the sequence {Φ_r}_{r∈N} ⊂ W^{1,1}(0, 1) ∩ C^∞(0, 1) converges to u in the strict-topology of BV(0, 1).
Now, note that actually we can directly choose the Φ_r such that
\[
\Phi_r \in W_0^{1,1}(0, 1) \cap C^\infty(0, 1).
\]
In fact, the Φ_r are not assumed to vanish at the boundary points and so they are not in W_0^{1,1}(0, 1). However, as anticipated in the previous section, since
the trace operator is continuous in the strict topology, we can consider the sequence {φ_j}_{j∈N} defined as
\[
\varphi_j(s) = \Phi_j(s) - (1 - s)\Phi_j(0) - s\,\Phi_j(1) \in W_0^{1,1}(0, 1).
\]
It is clear that this sequence converges to u in the weak*-topology of BV_0(0, 1) and moreover that, by (2.30), it is still true that
\[
\lim_{j \to \infty} |\varphi_j|_{BV} = |u|_{BV}.
\]
Now, since φ_j ∈ W_0^{1,1}(0, 1), by (2.27) we get that
\[
\lim_{\ell \to \infty} \| T_{n(\ell)} \varphi_j - \varphi_j \|_{W_0^{1,1}(0,1)} = 0.
\]
Hence for any j it is possible to find an ℓ_j such that for any ℓ ≥ ℓ_j
\[
\| T_{n(\ell)} \varphi_j - \varphi_j \|_{W_0^{1,1}(0,1)} \le \frac{1}{j}. \tag{2.31}
\]
This implies that the sequence {T_{n(ℓ_j)} φ_j} converges to u in the weak*-topology. In fact, we know that φ_j converges to u in the weak*-topology. Hence, if we consider all the following limits in the weak*-topology, we obtain
\[
0 = \lim_{j \to \infty} (T_{n(\ell_j)} \varphi_j - \varphi_j)
= \lim_{j \to \infty} T_{n(\ell_j)} \varphi_j - \lim_{j \to \infty} \varphi_j
= \lim_{j \to \infty} T_{n(\ell_j)} \varphi_j - u. \tag{2.32}
\]
In addition,
\[
\begin{aligned}
\lim_{j \to \infty} |T_{n(\ell_j)} \varphi_j|_{BV}
&= \lim_{j \to \infty} |T_{n(\ell_j)} \varphi_j - \varphi_j + \varphi_j|_{BV} \\
&\le \lim_{j \to \infty} |T_{n(\ell_j)} \varphi_j - \varphi_j|_{BV} + \lim_{j \to \infty} |\varphi_j|_{BV} \\
&\overset{(2.16)}{\le} \lim_{j \to \infty} \| T_{n(\ell_j)} \varphi_j - \varphi_j \|_{W_0^{1,1}(0,1)} + \lim_{j \to \infty} |\varphi_j|_{BV} \\
&\overset{(2.31)}{\le} \lim_{j \to \infty} \frac{1}{j} + \lim_{j \to \infty} |\varphi_j|_{BV} \\
&= |u|_{BV}.
\end{aligned} \tag{2.33}
\]
In the following we will denote n_j = n(ℓ_j) and u_{n_j} = T_{n_j} φ_j.
In summary, we have just shown that for any u ∈ BV_0(0, 1) there exists a sequence u_{n_j} in Y_{n_j} which converges to u in the strict-topology of BV_0(0, 1).
Γ-convergence
The proof of the convergence of the MAP estimates for the posterior distribution defined in (2.11) is based on the Γ-convergence of S_n to S.
Recall that we chose A_r^k in L^1(0, 1) ∩ Z for r = 1, ..., k.
It can be found in the literature (see for instance [5]) that the embedding BV_0(0, 1) → L^∞(0, 1) is continuous if we endow BV_0(0, 1) with the weak*-topology and L^∞(0, 1) with the norm topology.
Hence, since u_{n_j} \xrightarrow{BV\text{-}w*} u, it follows that, if we consider them as elements of L^∞(0, 1), ‖u_{n_j} − u‖_{L^∞(0,1)} → 0. Thus it clearly follows that
\[
\lim_{j \to \infty} \langle A_r^k, u_{n_j} \rangle = \langle A_r^k, u \rangle
\]
for r = 1, ..., k, where we would like to recall that ⟨A_r^k, u⟩ = ∫_0^1 u(t) A_r^k(dt).
Finally, also note that in this case β_{n_j} = 2σ²α_{n_j} = α̃ for any j and that, by (2.33), lim_{j→∞} |T_{n(ℓ_j)} φ_j|_{BV} = |u|_{BV}.
Thus, for any u ∈ BV_0(0, 1) there is a sequence {u_{n_j}} ⊂ Y_{n_j} such that
\[
\begin{aligned}
\limsup_{j \to \infty} S_{n_j}(u_{n_j}) &= \limsup_{j \to \infty} \| A_k u_{n_j} - m_k \|_{\mathbb{R}^k}^2 + \beta_{n_j} |u_{n_j}|_{BV} \\
&= \| A_k u - m_k \|_{\mathbb{R}^k}^2 + \tilde{\alpha}\, |u|_{BV} \\
&= S(u).
\end{aligned} \tag{2.34}
\]
Now consider u in BV_0(0, 1). Then, for any sequence {u_j} ⊂ BV_0(0, 1) such that u_j \xrightarrow{BV\text{-}w*} u, as before we get lim_{j→∞} ⟨A_r^k, u_j⟩ = ⟨A_r^k, u⟩, but we also know by (2.29) that |u|_{BV} ≤ liminf_{j→∞} |u_j|_{BV} is always true. Thus, with the same calculations as in (2.34) we obtain
\[
S(u) \le \liminf_{j \to \infty} S_j(u_j). \tag{2.35}
\]
Hence, by definition, (2.34) and (2.35) mean that the S_n epiconverge to S.
Then, by [3, Proposition 2.9] about Γ-convergence, it follows that
\[
\limsup_{n \to \infty} \arg\min(S_n) \subset B_\epsilon = \{u \in BV_0(0, 1) : S(u) \le s_0 + \epsilon\}, \tag{2.36}
\]
where s_0 = inf(S).
Thus, if u_ℓ = u_{n(ℓ)}^{MAP} ∈ arg min(S_{n(ℓ)}), we see that lim_{ℓ→∞} S(u_ℓ) = inf(S). Moreover, the |u_ℓ|_{BV} are uniformly bounded.
Hence, by the compactness properties of the weak*-topology of BV_0(0, 1) (see for example [5]), there exists a subsequence that converges in the weak*-topology. If ũ is a limit of such a subsequence, we have
\[
S(\tilde{u}) = \lim_{\ell \to \infty} S(u_\ell) = \inf(S).
\]
Finally, by (2.35), for any converging subsequence the limit is a minimizer of S.
Case α_n going to infinity
If lim_{n→∞} α_n = ∞, then lim_{n→∞} β_n = ∞. Hence lim_{n→∞} |u_n^{MAP}|_{BV} = 0, proving the assertion.
Remark (Critical value of α_n for convergence of the MAP estimates).
Note that with the previous theorem we have obtained a critical value for the choice of α_n if we want our MAP estimates to converge to a non-trivial limit. In fact, for α_n constant we have convergence to a function, but for α_n of greater order, i.e. for α_n going to infinity as n goes to infinity, we have convergence of the MAP estimates to zero.
2.3.2 Convergence of CM estimate
We are now interested in analyzing the convergence of the posterior distribution (2.11), depending on the parameter α_n, as the discretization is refined arbitrarily, i.e. as n → ∞. In particular, we are interested in the convergence of the CM estimates
\[
u_n^{CM}(\cdot\,; \alpha_n) = E(U_n \,|\, M_k = m_k)
= \int_{\mathbb{R}^n} \left( \sum_{j=1}^{n} u_j^n\, \psi_j^n(\cdot) \right) \pi_{kn}(u_1^n, \ldots, u_n^n \,|\, m_k)\, du_1^n \cdots du_n^n, \tag{2.37}
\]
where (u_1^n, ..., u_n^n) are the coefficients of u_n in the basis {ψ_j^n}.
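In practice the high-dimensional integral (2.37) is usually approximated by sampling; the computation actually used in this thesis is discussed in Chapter 4. The sketch below (my own, not the thesis code) uses a plain random-walk Metropolis chain on the log-posterior of (2.11) and averages the samples to approximate the CM estimate of the coefficient vector.

```python
import numpy as np

def log_post(u, A, m, alpha, sigma):
    """Unnormalized log posterior (2.11) in the coefficients u_1^n, ..., u_n^n."""
    u_ext = np.concatenate(([0.0], u, [0.0]))
    return (-0.5 / sigma**2) * np.sum((A @ u - m) ** 2) - alpha * np.abs(np.diff(u_ext)).sum()

def cm_estimate(A, m, alpha, sigma, n_samples=20000, step=0.05, rng=None):
    """Approximate E(U_n | M_k = m_k) by a random-walk Metropolis chain
    (no burn-in removal, for brevity)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    u = np.zeros(n)
    lp = log_post(u, A, m, alpha, sigma)
    mean = np.zeros(n)
    for _ in range(n_samples):
        prop = u + step * rng.standard_normal(n)      # random-walk proposal
        lp_prop = log_post(prop, A, m, alpha, sigma)
        if np.log(rng.uniform()) < lp_prop - lp:      # Metropolis accept/reject
            u, lp = prop, lp_prop
        mean += u
    return mean / n_samples    # Monte Carlo approximation of the CM estimate
```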
Mathematical preliminaries
We need to introduce some mathematical concepts that will be useful in the
next part.
Definition 2.3.16 (Convergence weakly in distribution).
Let V_n and V be C(0, 1)-valued random variables. We say that V_n converges to V weakly in distribution if for any A ∈ Z the random sequence
\[
\langle A, V_n \rangle = \int_0^1 V_n(\cdot, t)\, A(dt)
\]
converges to the random variable
\[
\langle A, V \rangle = \int_0^1 V(\cdot, t)\, A(dt)
\]
weakly in distribution.
Definition 2.3.17 (Convergence in probability).
Let V_n and V be C(0, 1)-valued random variables. Fix t ∈ (0, 1); then V_n(t) and V(t) are real random variables. We say that V_n(t) converges to V(t) in probability if for every ε > 0 we have
\[
\lim_{n \to \infty} P(\{|V_n(t) - V(t)| > \epsilon\}) = 0.
\]
Definition 2.3.18 (Uniform integrability).
Let V_n be C(0, 1)-valued random variables for n ∈ N. Fix t ∈ (0, 1); then the V_n(t) are real random variables. We say that the family V_n(t) is uniformly integrable if
\[
\lim_{c \to \infty} \left( \sup_n E\big(|V_n(t)|\, 1_{\{|V_n(t)| > c\}}\big) \right) = 0,
\]
where, as usual, 1_{{|V_n(t)|>c}} is the indicator function of the set {|V_n(t)| > c}.
Definition 2.3.19 (Versions of a random variable).
Let A(ω, t) and B(ω, t) be random variables defined for any t ∈ (0,1), ω ∈ Ω. We say that the random variable A(ω, t) is a version of B(ω, t) if the distributions of (A(t_1), ..., A(t_ℓ)) and (B(t_1), ..., B(t_ℓ)) in R^ℓ coincide for any t_1, ..., t_ℓ in (0,1) and ℓ > 0.
Let us now define some quantities that can generally be constructed from the
random variables Vn and V . When we apply the following general results to
our particular inverse problem, these quantities will represent the measurements in the practical measurement model and in the computational model
respectively.
Definition 2.3.20 (Measurement σ-algebras M and M_n).
Consider A^k_j ∈ Z for j = 1, ..., k and independent errors E^k_j ∼ N(0,1) for j = 1, ..., k.
Define the random variables M^k and M^{kn} with components, for j = 1, ..., k,
M^k_j = <A^k_j, V> + E^k_j = ∫_0^1 V(·, t) A^k_j(dt) + E^k_j,
M^{kn}_j = <A^k_j, V_n> + E^k_j = ∫_0^1 V_n(·, t) A^k_j(dt) + E^k_j.
Let the σ-algebras M and M_n in F be generated by the sets
M = Σ( ∩_j {ω ∈ Ω : M^k_j(ω) < λ_j}, λ_j ∈ R ),
M_n = Σ( ∩_j {ω ∈ Ω : M^{kn}_j(ω) < λ_j}, λ_j ∈ R ),
respectively.
Remark. Note that, for the sake of simplicity, in this section we are taking
σ = 1 and we are considering the same noise processes for Mk and Mkn .
For fixed t ∈ (0, 1), we will in the following consider the conditional expectations of the random variables V (t) and Vn (t), given the σ-algebras M and
Mn .
Remark. Note that, by definition, M = (M^k)^{−1}(B_{R^k}), where M^k : Ω → R^k is a random variable and B_{R^k} is the Borel σ-algebra on R^k.
Let us recall that the conditional expectation of V(t) given M, that is
E(V(t)|M) = E(V(t)|M^k),
is an M-measurable random variable E(V(t)|M)(·) : Ω → R. Hence (see for example [14][Theorem 4.2.8]), there exists a deterministic function
Ẽ(V(t)|·) : R^k → R,  m_k ↦ Ẽ(V(t)|m_k),
such that Ẽ(V(t)|·) ∘ M^k = Ẽ(V(t)|M^k(·)) and E(V(t)|M)(·) : Ω → R are almost surely equal. We will call Ẽ(V(t)|m_k) the conditional mean with measurement m_k and occasionally denote it Ẽ(V(t)|M^k = m_k).
Remark. Let B(m_k, r) ⊂ R^k be the ball with center m_k and radius r. Then, for any fixed ω in Ω and m_k in R^k such that m_k = M^k(ω),
Ẽ(V(t)|m_k) = E(V(t)|M)(ω) = lim_{r→0} ( 1 / P({M^k(ω) ∈ B(m_k, r)}) ) ∫_{{M^k(ω) ∈ B(m_k, r)}} V(t, ω) P(dω),    (2.38)
for almost every ω ∈ Ω.
Remark. If a conditional probability density function π_{V(t)}(·|m_k) exists and E^k has independent normally distributed components, then the Radon–Nikodym derivative of the density of M^k with respect to the Lebesgue measure is smooth and we can write
Ẽ(V(t)|m_k) = ∫_R s · π_{V(t)}(s|m_k) ds.
In the same way, we can define the conditional expectation Ẽ(Vn (t)|mk ).
We are now ready to introduce the central theorem of this section.
Theorem 2.3.2.
Let V_n and V be C(0,1)-valued random variables. Suppose that
- V_n → V weakly in distribution when n goes to infinity;
- for any t ∈ (0,1), the family of random variables {V_n(t)}_{n∈N} is uniformly integrable;
- V_n(t, ω), V(t, ω) ∈ L¹(Ω).
Then
- lim_{n→∞} Ẽ(V_n(t)|M^k) = Ẽ(V(t)|M^k) a.s.;
- lim_{n→∞} Ẽ(F(V_n(t))|M^k) = Ẽ(F(V(t))|M^k) a.s.;
where F = 1_{(−∞,λ]}, λ ∈ R.
Proof.
Construction of almost surely convergent versions
First of all, note that, by applying Skorohod's representation theorem and enlarging the probability space Ω if necessary, we have:
Lemma 2.3.1 (Almost surely convergent versions).
Fix t ∈ (0,1). The random variables V_n(t), V(t) and P^{kn}, P^k, whose components are P^{kn}_j = <V_n, A^k_j> and P^k_j = <V, A^k_j> for j = 1, ..., k, have versions Ṽ_n(t), Ṽ(t), P̃^{kn} and P̃^k, with components P̃^{kn}_j and P̃^k_j for j = 1, ..., k, such that almost surely
lim_{n→∞} Ṽ_n(t) = Ṽ(t),
lim_{n→∞} P̃^{kn}_j = P̃^k_j,  j = 1, ..., k.
Now, using the versions P̃^{kn} and P̃^k, we can define the random variables M̃^{kn} and M̃^k whose components are
M̃^{kn}_j = P̃^{kn}_j + E^k_j,  j = 1, ..., k,
M̃^k_j = P̃^k_j + E^k_j,  j = 1, ..., k.    (2.39)
Here, E^k_j for j = 1, ..., k are the components of the random vector E^k defined as in (2.3.6).
L1 (Ω)-convergence
Fix t ∈ (0, 1).
Let F be either the identity map or the characteristic function
F(s) = 1_{(−∞,λ]}(s),
where λ ∈ R. Now, the random variables F(Ṽ_n(t)) are clearly in L¹(Ω), uniformly integrable by hypothesis, converge almost surely and thus in probability, and the limiting variable F(Ṽ(t)) ∈ L¹(Ω). Hence, by Vitali's convergence theorem, it follows that
lim_{n→∞} ‖F(Ṽ_n(t)) − F(Ṽ(t))‖_{L¹(Ω)} = 0.
Different expression of the conditional expected values
Let us group all the variables together in the same random vector. Define thus the random vectors
Z_{kn} = (Ṽ_n(t), P̃^{kn}_1, ..., P̃^{kn}_k) = (Z^0_n, Z'_{kn}) ∈ R^{k+1},
Z_k = (Ṽ(t), P̃^k_1, ..., P̃^k_k) = (Z^0, Z'_k) ∈ R^{k+1}.
Assume given a realization m̃_k ∈ R^k of the random variables defined in (2.39) and define the function
g : R^k × R^k → R,  (y', m̃_k) ↦ π_{E^k}(y' − m̃_k),    (2.40)
where E^k has independent normally distributed components E^k_j, j = 1, ..., k. Define also y = (y^0, y') ∈ R^{1+k}.
Z_{kn} and Z_k absolutely continuous
Let us first consider the case in which the random variables Z_{kn} and Z_k just defined have laws P_{Z_{kn}} and P_{Z_k} absolutely continuous with respect to the Lebesgue measure in R^{k+1}, with continuous probability density functions π_{Z_{kn}}(y) and π_{Z_k}(y).
Note that M̃^{kn} has a smooth positive probability density function in R^k given by
π(M̃^{kn} = m̃_k) = ∫_{R^{k+1}} π_{E^k}(y' − m̃_k) π_{Z_{kn}}(y^0, y') dy = ∫_{R^{k+1}} g(y', m̃_k) π_{Z_{kn}}(y^0, y') dy.    (2.41)
The same can be proven for π(M̃^k = m̃_k), and thus we have that
π(M̃^{kn} = m̃_k) = E( g(Z'_{kn}, m̃_k) ),
π(M̃^k = m̃_k) = E( g(Z'_k, m̃_k) ).
Since g is clearly a bounded continuous function and, by hypothesis, V_n converges to V weakly in distribution, then, by definition, Z'_{kn} converges to Z'_k in distribution and we see that
lim_{n→∞} π(M̃^{kn} = m̃_k) = π(M̃^k = m̃_k).    (2.42)
Moreover, since by (2.41) g is smooth, π_{M̃^{kn}} is a smooth function and we obtain by Bayes' formula
π(Ṽ_n(t) = y^0 | M̃^{kn} = m̃_k) = (1 / π(M̃^{kn} = m̃_k)) ∫_{R^k} π_{(Z_{kn}, M̃^{kn})}(y^0, y', m̃_k) dy'
= (1 / π(M̃^{kn} = m̃_k)) ∫_{R^k} g(y', m̃_k) π_{Z_{kn}}(y^0, y') dy'.    (2.43)
By (2.38):
Ẽ(F(Ṽ_n(t)) | M̃^{kn} = m̃_k) = lim_{r→0} [ ∫_{B(m̃_k,r)} ∫_R F(y^0) π(Ṽ_n(t) = y^0, M̃^{kn} = w) dy^0 P(dw) ] / [ ∫_{B(m̃_k,r)} π(M̃^{kn} = w) P(dw) ]
= ∫_R F(y^0) [ π(Ṽ_n(t) = y^0, M̃^{kn} = m̃_k) / π(M̃^{kn} = m̃_k) ] dy^0
= (1 / π(M̃^{kn} = m̃_k)) ∫_{R^{k+1}} F(y^0) g(y', m̃_k) π_{Z_{kn}}(y^0, y') dy
= (1 / π(M̃^{kn} = m̃_k)) E( F(Z^0_n) g(Z'_{kn}, m̃_k) ).    (2.44)
It can be proven that a similar formula is valid for Ẽ(F(Ṽ(t)) | M̃^k = m̃_k), that is
Ẽ(F(Ṽ(t)) | M̃^k = m̃_k) = (1 / π(M̃^k = m̃_k)) E( F(Z^0) g(Z'_k, m̃_k) ).    (2.45)
Z_{kn} and Z_k not absolutely continuous
In the general case where the laws P_{Z_{kn}} of Z_{kn} and P_{Z_k} of Z_k are not absolutely continuous, we cannot use formula (2.41). Instead we can write
π(M̃^{kn} = m̃_k) = ∫_{R^{k+1}} g(y', m̃_k) P_{Z_{kn}}(dy).    (2.46)
Since g is smooth, it is clear by (2.46) that π_{M̃^{kn}} is smooth. Hence, in place of formula (2.44) we obtain
Ẽ(F(Ṽ_n(t)) | M̃^{kn} = m̃_k) = lim_{r→0} [ ∫_{B(m̃_k,r)} ∫_{R^k×R} F(y^0) π_{E^k}(y' − w) P_{Z_{kn}}(dy) P(dw) ] / [ ∫_{B(m̃_k,r)} π(M̃^{kn} = w) P(dw) ].    (2.47)
Since π(M̃^{kn} = w) is smooth, there is a constant C > 0 such that
[ ∫_{B(m̃_k,r)} π_{E^k}(y' − w) P(dw) ] / [ ∫_{B(m̃_k,r)} π(M̃^{kn} = w) P(dw) ] ≤ C  for any y' ∈ R^k.
Thus, by Fubini's theorem and Lebesgue's theorem of dominated convergence we obtain a formula analogous to the one of the former case (2.44):
Ẽ(F(Ṽ_n(t)) | M̃^{kn} = m̃_k) = (1 / π(M̃^{kn} = m̃_k)) ∫_{R^{k+1}} F(y^0) g(y', m̃_k) P_{Z_{kn}}(dy).
These formulas imply (2.42), (2.44) and (2.45) also in the general case.
Convergence of the conditional expected values
Let H(y) = F(y^0) g(y', m̃_k). Since we have proven that F(Z^0_n) converges to F(Z^0) in L¹(Ω) and, by definition, |g| ≤ 1, we have
lim_{n→∞} H(Z_{kn}) = H(Z_k)  in L¹(Ω).    (2.48)
Hence it follows that
lim_{n→∞} Ẽ(F(Ṽ_n(t)) | M̃^{kn} = m̃_k) = lim_{n→∞} (1 / π(M̃^{kn} = m̃_k)) E( F(Z^0_n) g(Z'_{kn}, m̃_k) ) = Ẽ(F(Ṽ(t)) | M̃^k = m̃_k),    (2.49)
where the last equality follows from (2.42) and (2.48).
Since, by construction, the distributions of V_n(t) and Ṽ_n(t), of V(t) and Ṽ(t), of M^{kn} and M̃^{kn}, and of M^k and M̃^k coincide, by (2.38) we have
Ẽ(F(V_n(t)) | M^{kn} = m̃_k) = Ẽ(F(Ṽ_n(t)) | M̃^{kn} = m̃_k),
Ẽ(F(V(t)) | M^k = m̃_k) = Ẽ(F(Ṽ(t)) | M̃^k = m̃_k).    (2.50)
The proposition clearly follows.
Main results
At this point we are ready to present the results of [22] about the convergence of CM estimates.
Definition 2.3.21 (Brownian bridge U_B).
Define the Brownian bridge U_B as the stochastic process U_B(t) = U_B(t, ω), defined for any t ∈ [0,1], ω ∈ Ω, such that U_B(t) is Gaussian for any t ∈ [0,1] and
E(U_B(t)) = 0  ∀t ∈ [0,1],
E(U_B(t_1) U_B(t_2)) = σ_p² |t_1| |1 − t_2|  ∀t_1, t_2 ∈ [0,1], t_1 ≤ t_2,    (2.51)
where σ_p² = ∫_R x² e^{−|x|^p} dx.
Remark. By Kolmogorov’s theorem, whose hypotheses are satisfied thanks
to (2.51), we can choose UB such that its realizations t → UB (t) are almost
surely continuous.
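For intuition, realizations of the Brownian bridge U_B can be simulated on a grid. The following sketch assumes the covariance in (2.51), written as C(t_1, t_2) = σ_p² min(t_1, t_2)(1 − max(t_1, t_2)), which coincides with (2.51) on [0,1]; the value chosen for σ_p² below is only a placeholder.

import numpy as np

def sample_brownian_bridge(n_grid=200, sigma2=2.0, n_paths=5, rng=None):
    """Draw discretized realizations of a Brownian bridge on (0, 1).

    sigma2 plays the role of sigma_p^2 in (2.51); the covariance used is
    C(t1, t2) = sigma2 * min(t1, t2) * (1 - max(t1, t2)).
    """
    rng = np.random.default_rng(rng)
    t = np.linspace(0.0, 1.0, n_grid + 2)[1:-1]            # interior grid points
    cov = sigma2 * np.minimum.outer(t, t) * (1.0 - np.maximum.outer(t, t))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n_grid))   # small jitter for numerical stability
    return t, L @ rng.standard_normal((n_grid, n_paths))   # columns are sample paths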
Let us now introduce a lemma that will be useful in proving the main theorem
of this section.
Lemma 2.3.2.
Let α̃ > 0. Let U_n, n ∈ N, be random variables with TV probability distributions in Y_n and parameter α_n = α̃(n+1)^{1/2}.
1. Fix t in (0,1). Then:
(i) lim_{n→∞} U_n(t) = U_B(t) in distribution;
(ii) the family of random variables U_n(t), n ∈ N, is uniformly integrable.
2. Let 0 < s_1 < t_1 < ... < t_Q ≤ s_2 < 1. Then:
(i) lim_{n→∞} (U_n(t_1), U_n(t_2), ..., U_n(t_Q)) = (U_B(t_1), U_B(t_2), ..., U_B(t_Q)) in distribution;
(ii) if A^k_j ∈ Z, j = 1, 2, ..., k, are measures supported on [s_1, s_2] and we define the random variables
P^{kn}_j = ∫_{s_1}^{s_2} U_n(·, t) A^k_j(dt),  P^k_j = ∫_{s_1}^{s_2} U_B(·, t) A^k_j(dt),
then lim_{n→∞} (P^{kn}_1, P^{kn}_2, ..., P^{kn}_k) = (P^k_1, ..., P^k_k) in distribution.
Proof.
Definition of a new Y-valued random sequence H_n
Let (H^n_1, H^n_2, ..., H^n_{n+1}) be a random vector in R^{n+1} with density
π̃_n(h^n_1, ..., h^n_{n+1}) = Ĉ_n exp( −α̃(n+1)^{1/2} ( |h^n_1| + Σ_{j=1}^n |h^n_{j+1} − h^n_j| ) ),    (2.52)
where Ĉ_n is a normalization constant, Ĉ_n = ( α̃(n+1)^{1/2} / 2 )^{n+1}.
By definition, it is easy to show that π̃_n(y_1, ..., y_n, 0) = C̃_n π^{(n)}_{pr}(y_1, ..., y_n), where C̃_n = Ĉ_n / C_n and C_n is the same as in Definition 2.3.3.
Define a Y-valued random function H_n such that for any ω ∈ Ω,
h_n = H_n(ω) ∈ Y
is the piecewise linear function defined by
H_n(ω, t) = h_n(t) = Σ_{j=1}^{n+1} h^n_j ψ^n_j(t),  h^n_j = H^n_j(ω),  t ∈ [0,1],    (2.53)
where the ψ^n_j ∈ Y_n for j = 1, ..., n are the same as in Definition 2.3.2 and ψ^n_{n+1} is the piecewise linear function satisfying ψ^n_{n+1}(1) = 1 and ψ^n_{n+1}(t) = 0 for 0 ≤ t ≤ n/(n+1).
Convergence in distribution of H_n(t) for n going to infinity and fixed t ∈ (0,1)
Define, for j = 1, ..., n+1,
X^n_j = α̃(n+1)^{1/2} (H^n_j − H^n_{j−1}),    (2.54)
where H^n_0 = 0. It clearly follows that, for j = 1, ..., n+1,
H^n_j = (1 / (α̃(n+1)^{1/2})) Σ_{ℓ=1}^j X^n_ℓ.    (2.55)
Now we are able to calculate the joint probability density of the random vector X_n = (X^n_1, ..., X^n_{n+1}), since this random vector is just a transformation of the random vector H_n = (H^n_1, ..., H^n_{n+1}), whose joint probability density is known. The joint probability density of X_n is then
π̃_{n,X}(x^n_1, ..., x^n_{n+1}) = (1/2)^{n+1} exp( − Σ_{j=1}^{n+1} |x^n_j| ).    (2.56)
Since the joint probability density of X_n can be factorized into the product of the probability densities of its single components, we see that the X^n_j for j = 1, ..., n+1 are identically distributed independent variables, that is
X^n_j ∼ X,  with  π_X(t) = (1/2) exp(−|t|),  j = 1, ..., n+1,    (2.57)
and that their distribution does not depend on n.
We are thus able to define a sequence of identically distributed independent random variables X_ℓ ∼ X, ℓ = 1, 2, 3, ..., and to define the random variables
S_j = (1/√j) Σ_{ℓ=1}^j X_ℓ,  j = 1, 2, 3, ...    (2.58)
Fix t ∈ (0,1). Note that, by construction, the random variables H_n(·, t) = H_n(t) can be expressed in the following way:
H_n(ω, t) = H_n(t) = ( θ(n,t)^{1/2} / (α̃(n+1)^{1/2}) ) S_{θ(n,t)}(ω) + ( ((n+1)t − θ(n,t)) / (α̃(n+1)^{1/2}) ) X_{θ(n,t)+1}(ω),    (2.59)
where θ(n,t) is the largest j such that j/(n+1) ≤ t.
Denote
k_{n,t} = θ(n,t)^{1/2} / (α̃(n+1)^{1/2}),  r_{n,t} = ((n+1)t − θ(n,t)) / (α̃(n+1)^{1/2}).
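The representation (2.55)–(2.59) translates directly into a sampler for H_n: one draws i.i.d. variables with the density (2.57) and forms their normalized partial sums. The following sketch (NumPy, with the scaling α̃(n+1)^{1/2} of (2.54)) is only meant to illustrate the construction; for large n the sampled paths behave like the Brownian motion obtained below.

import numpy as np

def sample_Hn(n, alpha_tilde=1.0, n_paths=3, rng=None):
    """Sample the nodal values of Hn on the grid j/(n+1), j = 1,...,n+1, via (2.55).

    The increments X_j are i.i.d. with density (1/2)exp(-|x|), cf. (2.57), and
    H^n_j = (alpha_tilde*(n+1)**0.5)**(-1) * sum_{l<=j} X_l, cf. (2.55).
    """
    rng = np.random.default_rng(rng)
    X = rng.laplace(loc=0.0, scale=1.0, size=(n_paths, n + 1))  # Laplace(0,1) increments
    H = np.cumsum(X, axis=1) / (alpha_tilde * np.sqrt(n + 1))   # normalized partial sums
    grid = np.arange(1, n + 2) / (n + 1)                        # nodes j/(n+1)
    return grid, H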
Now we are interested in the limit in distribution of the random sequence H_n(t), n ∈ N.
To that end, note that the distribution of the random variables X_{θ(n,t)+1} for n ∈ N is given by (2.57) and hence does not depend on n. Thus the limit in distribution of the sequence of random variables X_{θ(n,t)+1} is just a random variable X with distribution (2.57), independent of n. On the other hand, by definition we have that
0 ≤ t − θ(n,t)/(n+1) ≤ 1/(n+1)    (2.60)
and thus
0 ≤ t(n+1) − θ(n,t) ≤ 1,
so t(n+1) − θ(n,t) remains bounded. Hence
0 ≤ lim_{n→∞} ((n+1)t − θ(n,t)) / (α̃(n+1)^{1/2}) ≤ lim_{n→∞} 1 / (α̃(n+1)^{1/2}) = 0,    (2.61)
and the second term on the right-hand side of (2.59) goes to zero in distribution.
As regards the first term, we can use the central limit theorem. In fact,
( θ(n,t)^{1/2} / (α̃(n+1)^{1/2}) ) S_{θ(n,t)} = (1 / (α̃(n+1)^{1/2})) Σ_{ℓ=1}^{θ(n,t)} X_ℓ,    (2.62)
where the X_ℓ are independent identically distributed random variables with zero mean. Now since, by (2.60),
lim_{n→∞} ( t − θ(n,t)/(n+1) ) = 0,
we can say that, as n goes to infinity, θ(n,t) is asymptotic to n. Hence, by the central limit theorem, we finally get that for n going to infinity H_n(t) goes in distribution to a Gaussian random variable H(t) with zero mean such that
E[(H(t) − H(s))²] = |t − s| σ_1²
when t ≥ s, that is, the stochastic process H(ω, t) is a Brownian motion.
Convergence in distribution of Un (t), for n going to infinity and
fixed t ∈ (0, 1)
What we have already proven is still not enough for our purposes, since we
are interested in the convergence in distribution of Un (t), rather than Hn (t).
Let us prove some properties of the characteristic functions of Hn (t) and X
which will be useful in the following, where X is a random variable distributed
as in (2.57).
Denote the characteristic function of X by
φ(s) = E(e^{isX}) = ∫_R e^{isx} (1/2) e^{−|x|} dx,
which is the Fourier transform of the function (1/2)exp(−|t|) and therefore in C^∞(R). It therefore makes sense to calculate φ''(0) = (i)² E(X²) = −σ_1². Thus, it follows that there exist σ_0² < σ_1² and ε > 0 such that
|φ(s)| ≤ exp( −σ_0² s² / 2 )  for |s| < ε.    (2.63)
Now decompose φ into the sum of two functions φ_1 and φ_2, such that the former vanishes outside of the interval (−ε, ε) whereas the latter is zero inside (−ε, ε), i.e.
φ(s) = φ_1(s) + φ_2(s),  s ∈ R,
φ_1(s) = 0  for |s| ≥ ε,
φ_2(s) = 0  for |s| < ε.
Further define a = sup_s |φ_2(s)|, which is smaller than 1 since any characteristic function, by definition, is bounded by 1.
Also note that, S_j being a normalized sum of independent identically distributed random variables, its characteristic function is easy to calculate:
Ψ_j(s) = ( φ(s/√j) )^j.
By the central limit theorem and the fact that convergence in distribution implies pointwise convergence of the respective characteristic functions, we have that
lim_{j→∞} Ψ_j(s) = exp( −σ_1² s² / 2 ),  s ∈ R,    (2.64)
where the right-hand side is just the characteristic function of a normally distributed random variable with zero mean and variance σ_1², which, by the central limit theorem, is the limit in distribution of S_j.
Since, by definition, the supports of φ_1 and φ_2 are disjoint, it holds that
Ψ_j(s) = ( φ(s/√j) )^j = ( φ_1(s/√j) + φ_2(s/√j) )^j = ( φ_1(s/√j) )^j + ( φ_2(s/√j) )^j = Ψ^1_j(s) + Ψ^2_j(s).    (2.65)
Now for any q ≥ 1 we see that
‖Ψ^2_j‖^q_{L^q(R)} = ∫_R |φ_2(s/√j)|^{jq} ds ≤ a^{(j−1)q} ∫_R |φ_2(s/√j)|^q ds ≤ a^{(j−1)q} ∫_R |φ_2(t)|^q √j dt ≤ a^{(j−1)q} j^{1/2} k → 0  as j → ∞ (since a < 1),    (2.66)
where k is a constant. It follows that for any q ≥ 1 the sequence ‖Ψ^2_j‖_{L^q(R)}, j = 1, 2, ..., converges to zero as j goes to infinity, and in particular it is bounded.
Moreover we have the following estimate for Ψ^1_j, using (2.63):
|Ψ^1_j(s)| ≤ ( e^{−(σ_0²/2)(s/√j)²} )^j = exp( −σ_0² s² / 2 ).
Now, since exp(−σ_0²|·|²/2) ∈ L^q(R) for any q ≥ 1, we have just proven that the sequence Ψ_j, j = 1, 2, ..., converges pointwise to the function exp(−σ_1²|·|²/2) and is dominated by a function in L^q(R) for any q ≥ 1. Hence, by the Lebesgue theorem of dominated convergence,
lim_{j→∞} ‖Ψ_j(·) − exp(−σ_1²|·|²/2)‖_{L^q(R)} = 0,  1 ≤ q < ∞.    (2.67)
Also note that, as previously mentioned, φ ∈ C^∞(R). Thus, due to its smoothness and the facts that r_{n,t} goes to 0 when n goes to infinity, φ(0) = 1 and φ'(0) = 0, we obtain
lim_{n→∞} ‖φ(r_{n,t} s) − 1‖_{C¹(−L,L)} = 0    (2.68)
for any L, where for any f in C¹(−L,L),
‖f‖_{C¹(−L,L)} = ‖f‖_∞ + ‖f'‖_∞  and  ‖f‖_∞ = sup_x |f(x)|.
Also, by definition, |φ(r_{n,t} s)| ≤ 1.
We will use properties of the characteristic functions in order to prove convergence in distribution of U_n(t). Note that, by definition, the following formula ties together U_n(t) and H_n(t):
U_n(t) = E(H_n(t) | H_n(1) = 0)  a.s.    (2.69)
Let us try to simplify the expression of the conditional probability density used in (2.69). The Lipschitz continuity and positivity of the densities of U_n(t) and H_n(t) justify the use of Bayes' formula for probability density functions, and we obtain, for a ∈ R,
π(H_n(t) = a | H_n(1) = 0) = π((H_n(t) = a) ∩ (H_n(1) = 0)) / π(H_n(1) = 0)
= π(H_n(t) = a) π(H_n(1) = 0 | H_n(t) = a) / π(H_n(1) = 0).    (2.70)
Now, since H_n(1) − H_n(t) has the same distribution as H_n(1−t), by (2.70) it follows that
π(H_n(t) = a | H_n(1) = 0) = d_n π(H_n(t) = a) π(H_n(1−t) = −a),    (2.71)
where d_n = 1 / π(H_n(1) = 0).
At this point, note that, by (2.59), H_n(t) is a sum of independent random variables, hence its characteristic function is
V_n(s) = Ψ_{θ(n,t)}(k_{n,t} s) φ(r_{n,t} s).    (2.72)
The characteristic function of H_n(1−t), denoted by G_n(s), has a similar expression. Hence, by (2.71), the characteristic function of U_n(t) has the form
Φ_n(s) = d_n (V_n ∗ G_n)(s).    (2.73)
Thus, by (2.67), (2.68), (2.72) and the fact that |φ(r_{n,t} s)| ≤ 1,
lim_{n→∞} (V_n ∗ G_n)(s) = (V ∗ G)(s),  s ∈ R,    (2.74)
where V(s) = lim_{n→∞} V_n(s) and G(s) = lim_{n→∞} G_n(s); V and G are Gaussian functions with zero mean and variances σ_V² = tσ_1² and σ_G² = (1−t)σ_1², respectively. Also, Φ_n being a characteristic function, Φ_n(0) = 1, and, V and G being Gaussian, (V ∗ G)(0) is a constant. Thus d_n has to converge to a positive constant when n goes to infinity. Finally, it is clear from properties of Gaussian random variables that
lim_{n→∞} Φ_n(s) = Φ(s; t) := e^{−σ²(t)s²/2},  with  1/σ²(t) = 1/(tσ_1²) + 1/((1−t)σ_1²).    (2.75)
Since the limit in (2.75) exists at every s and the limit function Φ(s; t) is continuous at s = 0, it follows from Levy's continuity theorem that there exists a random variable U(t) such that
lim_{n→∞} U_n(t) = lim_{n→∞} E(H_n(t) | H_n(1) = 0) = U(t)  in distribution,
and the characteristic function of U(t) is Φ(s; t). Since a random variable is completely identified by its characteristic function and Φ(s; t) is also the characteristic function of U_B(t), it follows that U(t) is U_B(t).
Uniform integrability of U_n(t) for n ∈ N and fixed t ∈ (0,1)
Consider L²-bounds. We see that
E(|U_n(t)|²) = ∫_R s² π(U_n(t) = s) ds = −∂²_s Φ_n(s)|_{s=0} = −d_n (∂_s V_n) ∗ (∂_s G_n)(s)|_{s=0}.    (2.76)
We would like to estimate the right-hand side of (2.76) with a constant which does not depend on n.
To this end, let us first consider ∂_s Ψ_j:
∂_s Ψ_j(s) = ∂_s (φ(s/√j))^j = j (1/√j) (∂_s φ)(s/√j) (φ(s/√j))^{j−1}.    (2.77)
Note that, since φ(s) = ∫_R (1/2) exp(isx − |x|) dx, it follows that
|φ(s)| ≤ K / (1+|s|)²,  s ∈ R,
φ(s) = 1 − (σ_1²/2) s² + O(s³)  for s near 0.    (2.78)
Thus |∂_s φ(s) / s| ≤ K' for s ∈ R and we get the estimate
|∂_s Ψ_j(s)| = | (∂_s φ)(s/√j) / (s/√j) | · | s (φ(s/√j))^{j−1} | ≤ K' | s (φ(s/√j))^{j−1} |.
If we use the decomposition of φ we used before, i.e. φ = φ_1 + φ_2, we see that for 1 ≤ q < ∞
‖ s (φ_2(s/√j))^{j−1} ‖^q_{L^q(R)} ≤ a^{q(j−3)} ∫_R s^q |φ_2(s/√j)|^{2q} ds ≤ K'' a^{q(j−3)} j^{(1+q)/2} → 0  as j → ∞ (since a < 1).    (2.79)
Moreover, we see that
| s (φ_1(s/√j))^{j−1} | ≤ s ( exp(−σ_0²(s/√j)²/2) )^{j−1} ≤ s exp(−σ_0² s²/4),
which is, again, an integrable bound. Thus, as before, using the Lebesgue theorem of dominated convergence, we see that
|V_n(s)| + |∂_s V_n(s)| ≤ C_2 s exp(−σ_0² s²/8)  for |s| < √n,
lim_{n→∞} ∫_{R∖[−√n,√n]} ( |V_n(s)|^q + |∂_s V_n(s)|^q ) ds = 0,  1 ≤ q < ∞.    (2.80)
Thus we have obtained a uniformly integrable bound on the interval [−√n, √n], and on the complement of this interval we can estimate the L^q(R)-norms uniformly. Using (2.80), we see for V_n(s) and G_n(s) that
| (∂_s V_n) ∗ (∂_s G_n)(0) | ≤ ‖∂_s V_n‖_{L²(R)} ‖∂_s G_n‖_{L²(R)} ≤ C_3.
Thus E(|U_n(t)|²) ≤ C_4, and in particular the family U_n(t), n ∈ N, is uniformly integrable.
Convergence in distribution for n going to infinity of (<A^k_1, H_n>, ..., <A^k_k, H_n>, H_n(t_1), ..., H_n(t_Q))
We still have to prove the convergence in distribution of the joint distributions of the U_n(t_r) for r = 1, ..., Q and of the joint distributions of the P^{kn}_j for j = 1, ..., k.
We prove them together by considering the joint distribution of U_n(t_r) for r = 1, ..., Q and P^{kn}_j = <A^k_j, U_n> for j = 1, ..., k. To do so, let us standardize the notation and denote A^k_{k+r} = δ(t − t_r) for r = 1, ..., Q. In this way we are able to represent U_n(t_r) as U_n(t_r) = <A^k_{k+r}, U_n> for r = 1, ..., Q.
For j = 1, ..., k+Q and ℓ, r = 1, ..., n+1, denote
c^j_{nℓ} = ∫_0^1 ψ^n_ℓ(t) A^k_j(dt),  b^j_{nr} = Σ_{ℓ=r}^∞ c^j_{nℓ}.    (2.81)
Using (2.53), (2.55) and (2.81) and changing the order of summation we obtain
<A^k_j, H_n> = ∫_0^1 H_n(t) A^k_j(dt)
= ∫_0^1 ( Σ_{ℓ=1}^{n+1} H^n_ℓ ψ^n_ℓ(t) ) A^k_j(dt)
= Σ_{ℓ=1}^{n+1} H^n_ℓ ∫_0^1 ψ^n_ℓ(t) A^k_j(dt)
= Σ_{ℓ=1}^{n+1} H^n_ℓ c^j_{nℓ}
= Σ_{ℓ=1}^{n+1} ( (1/(α̃(n+1)^{1/2})) Σ_{i=1}^ℓ X^n_i ) c^j_{nℓ}
= (1/(α̃(n+1)^{1/2})) Σ_{i=1}^{n+1} ( Σ_{ℓ=i}^{n+1} c^j_{nℓ} ) X^n_i
= (1/(α̃(n+1)^{1/2})) Σ_{i=1}^{n+1} b^j_{ni} X^n_i,    (2.82)
for j = 1, ..., k+Q. Now, for i = 1, ..., n+1 and j = 1, ..., k+Q define
Y^j_{ni} = (1/(α̃(n+1)^{1/2})) b^j_{ni} X^n_i.
It is clear that, for j = 1, ..., k+Q, <A^k_j, H_n> = Σ_{i=1}^{n+1} Y^j_{ni}. If we define n+1 random vectors in R^{k+Q} whose components are the Y^j_{ni} and denote them Y_{ni} for i = 1, ..., n+1, we see that, thanks to the properties of the X^n_i, the Y_{ni}, i = 1, ..., n+1, are independent. Thus we have just expressed our random vector (<A^k_1, H_n>, ..., <A^k_{k+Q}, H_n>) in an extremely convenient manner as
(<A^k_j, H_n>)_{j=1}^{k+Q} = (<A^k_1, H_n>, ..., <A^k_{k+Q}, H_n>) = Σ_{i=1}^{n+1} Y_{ni} = ( Σ_{i=1}^{n+1} Y^1_{ni}, ..., Σ_{i=1}^{n+1} Y^{k+Q}_{ni} ).    (2.83)
We will prove that (2.83) converges in distribution to a Gaussian random variable when n goes to infinity. By the Fabian–Hannan version of the Lindeberg central limit theorem it is enough to show that
lim_{n→∞} Σ_{i=1}^{n+1} E( |β · Y_{ni}|² 1_{{|β·Y_{ni}|>ε}} ) = 0,    (2.84)
for arbitrary β ∈ R^{k+Q} and ε > 0.
Let us prove (2.84). Let β ∈ R^{k+Q} and ε > 0. For i = 1, ..., n+1, set γ_{ni} = Σ_{j=1}^{k+Q} β_j b^j_{ni}. Note that it is easy to show that |b^j_{ni}| ≤ ‖A^k_j‖_Z and thus |γ_{ni}| ≤ C|β|, where |β| = ( Σ_{j=1}^{k+Q} β_j² )^{1/2}.
Note also that for i = 1, ..., n+1 we have
β · Y_{ni} = Σ_{j=1}^{k+Q} β_j Y^j_{ni} = ( Σ_{j=1}^{k+Q} β_j b^j_{ni} ) X^n_i / (α̃(n+1)^{1/2}) = γ_{ni} X^n_i / (α̃(n+1)^{1/2}).    (2.85)
Estimate
Σ_{i=1}^{n+1} E( |β·Y_{ni}|² 1_{{|β·Y_{ni}|>ε}} ) ≤ Σ_{i=1}^{n+1} E( | γ_{ni} X_i / (α̃(n+1)^{1/2}) |² 1_{{|γ_{ni}X_i| > εα̃(n+1)^{1/2}}} )
≤ ( c / (α̃²(n+1)) ) Σ_{i=1}^{n+1} γ_{ni}² ∫_{{|γ_{ni}x| > εα̃(n+1)^{1/2}}} x² (1/2) e^{−|x|} dx
≤ ( c / (α̃²(n+1)) ) Σ_{i=1}^{n+1} γ_{ni}² exp( − εα̃(n+1)^{1/2} / (3|γ_{ni}|²) )
≤ ( c / (α̃²(n+1)) ) Σ_{i=1}^{n+1} C²|β|² exp( − εα̃(n+1)^{1/2} / (3C²|β|²) )
= ( c / (α̃²(n+1)) ) (n+1) C²|β|² exp( −C'(n+1)^{1/2} )
= C'' exp( −C'(n+1)^{1/2} ) → 0  as n → ∞.
Hence, by (2.84), (<A^k_j, H_n>)_{j=1}^{k+Q} converges in distribution to a Gaussian variable (<A^k_j, H>)_{j=1}^{k+Q}, where H is a Brownian motion on [0,1] with H(0) = 0.
Convergence in distribution for n going to infinity of (<A^k_1, U_n>, ..., <A^k_k, U_n>, U_n(t_1), ..., U_n(t_Q))
We have already proven convergence in distribution of
(<A^k_1, H_n>, ..., <A^k_k, H_n>, H_n(t_1), ..., H_n(t_Q))
for n going to infinity, but we are actually interested in the convergence in distribution of the same sequence with U_n instead of H_n. Thus, as we did before, we first prove convergence in L^q(R), for every 1 ≤ q < ∞, of the corresponding characteristic functions when n goes to infinity.
To simplify the notation we replace s_2 by the approximation
s_n = r_n / (n+1),  r_n = θ(n, s_2) + 1.
The characteristic function of the random vector (X^n_r, b^1_{nr} X^n_r, ..., b^{k+Q}_{nr} X^n_r) is
E( exp( i ( ζ X^n_r + Σ_{j=1}^{k+Q} η^j b^j_{nr} X^n_r ) ) ) = φ( ζ + Σ_{j=1}^{k+Q} b^j_{nr} η^j ),
where we recall that φ is the characteristic function of X distributed as in (2.57). Thus, using (2.83), we see that the characteristic function of (H_n(s_n), <A^k_1, H_n>, ..., <A^k_{k+Q}, H_n>) is
Φ_n(ζ, η^1, ..., η^{k+Q}) = E( exp( i ( ζ H_n(s_n) + Σ_{j=1}^{k+Q} η^j <A^k_j, H_n> ) ) )
= ∏_{r=1}^{r_n} φ( ( ζ + Σ_{j=1}^{k+Q} b^j_{nr} η^j ) / (α̃(n+1)^{1/2}) ),    (2.86)
where we used b^j_{nr} = 0 for r > r_n. Since in the previous section we have shown that (<A^k_j, H_n>)_j converges in distribution to a Gaussian random variable when n goes to infinity, there exists a Gaussian function Φ(ζ, η) such that
lim_{n→∞} Φ_n(ζ, η) = Φ(ζ, η)    (2.87)
for any ζ ∈ R and η ∈ R^{k+Q}.
Now, let ε > 0, σ_ε² and a < 1 be such that
|φ(t + s)| ≤ exp( −σ_ε² |t + s|² / 2 )  for |s| < ε, |t| ≤ 2ε,
|φ(t + s)| ≤ a  for |s| < ε, |t| > 2ε.
Fix η = (η^1, ..., η^{k+Q}) and set β = sup_{n,r} | Σ_{j=1}^{k+Q} b^j_{nr} η^j |. Take n sufficiently large that β / (α̃(n+1)^{1/2}) < ε. Then
1_{{|ζ| < 2εα̃(n+1)^{1/2}}} ∏_{r=1}^{r_n} | φ( (ζ + Σ_{j=1}^{k+Q} b^j_{nr} η^j) / (α̃(n+1)^{1/2}) ) |
≤ 1_{{|ζ| < 2εα̃(n+1)^{1/2}}} exp( −(σ_ε²/2) Σ_{r=1}^{r_n} | (ζ + Σ_{j=1}^{k+Q} b^j_{nr} η^j) / (α̃(n+1)^{1/2}) |² )
≤ exp( (σ_ε²/α̃²) ( β² − |ζ|²/16 ) ),    (2.88)
and moreover
‖ 1_{{|ζ| ≥ 2εα̃(n+1)^{1/2}}} ∏_{r=1}^{r_n} φ( (ζ + Σ_{j=1}^{k+Q} b^j_{nr} η^j) / (α̃(n+1)^{1/2}) ) ‖^q_{L^q(R)}
≤ a^{q(r_n−1)} ∫_R | φ( (t + Σ_{j=1}^{k+Q} b^j_{nr} η^j) / (α̃(n+1)^{1/2}) ) |^q dt
≤ c a^{q(r_n−1)} α̃(n+1)^{1/2} → 0  as n → ∞.    (2.89)
This and (2.87) show that for any fixed η
lim_{n→∞} ‖Φ_n(·, η) − Φ(·, η)‖_{L^q(R)} = 0,  1 ≤ q < ∞.    (2.90)
Consider next a⃗_n = (<A^k_j, U_n>)_{j=1}^{k+Q} and b⃗ = (b_1, ..., b_{k+Q}) ∈ R^{k+Q}. Then
R_n(f, b_1, ..., b_{k+Q}) = π(H_n(s_n) = f, a⃗_n = b⃗ | H_n(1) = 0)
= (1 / π(H_n(1) = 0)) π(H_n(s_n) = f, a⃗_n = b⃗, H_n(1) = 0)
= d_n π(H_n(1) = 0 | H_n(s_n) = f, a⃗_n = b⃗) π(H_n(s_n) = f, a⃗_n = b⃗)
= d_n π(H_n(1) = 0 | H_n(s_n) = f) π(H_n(s_n) = f, a⃗_n = b⃗).
Note that the last equality was obtained by using the fact that H^n_j is a Markov sequence for j = 1, ..., n and the A^k_j are supported on [s_1, s_2] with s_2 < s_n, so that
π(H_n(1) = 0 | H_n(s_n) = f, a⃗_n = b⃗) = π(H_n(1) = 0 | H_n(s_n) = f).
Let us now introduce an auxiliary variable d and a function
R^1_n(d, f, b_1, ..., b_{k+Q}) = d_n π(H_n(1) = 0 | H_n(s_n) = f − d) π(H_n(s_n) = f, a⃗_n = b⃗),    (2.91)
for which R_n(f, b_1, ..., b_{k+Q}) = R^1_n(0, f, b_1, ..., b_{k+Q}). Define functions
W^n : R^{1+k+Q} → R  and  K^n : R → R
by
W^n(f, b⃗) = π(H_n(s_n) = f, a⃗_n = b⃗),
K^n(f) = π(H_n(1) = 0 | H_n(s_n) = f) = π(H_n(1 − s_n) = −f),
and denote their Fourier transforms by Ŵ^n(ξ, η) and K̂^n(ξ). If ζ is the Fourier variable corresponding to d, the Fourier transform of R^1_n(d, f, b_1, ..., b_{k+Q}) is, by (2.91),
R̂^1_n(ζ, ξ, η^1, ..., η^{k+Q}) = d_n Ŵ^n(ξ, η^1, ..., η^{k+Q}) K̂^n(ξ + ζ).    (2.92)
Then the characteristic function of the variable (U_n(s_n), a⃗_n) is
Φ_n(ξ, η^1, ..., η^{k+Q}) = d_n ∫_R R̂^1_n(ζ, ξ, η^1, ..., η^{k+Q}) dζ.
Since |Ŵ^n(ξ, η^1, ..., η^{k+Q})| ≤ 1 and K̂^n converges to a Gaussian function in L¹(R) by (2.64) and (2.90), we see that
lim_{n→∞} Φ_n(ξ, η^1, ..., η^{k+Q}) = ( lim_{n→∞} d_n ) ∫_R ( lim_{n→∞} R̂^1_n(ζ, ξ, η^1, ..., η^{k+Q}) ) dζ.
Here, R̂^1_n converges to a Gaussian function. Thus the characteristic function Φ_n(ξ, η) with fixed ξ and η converges to the Gaussian function Φ(ξ, η), which implies convergence in distribution of the joint distributions.
At this point, we can finally formulate the main result of this section.
Theorem 2.3.3 (Limit behavior of CM estimates).
Let α̃ > 0 and let U_n, n ∈ N, be random variables with TV probability density in Y_n.
Case α_n = α̃(n+1)^{1/2}. If α_n = α̃(n+1)^{1/2}, then for any t ∈ [0,1]:
(i) the prior distributions converge, i.e.
lim_{n→∞} U_n(t) = U_B(t)  in distribution;
(ii) the posterior distributions converge in distribution, i.e.
lim_{n→∞} Ẽ(F(U_n(t)) | m̃_k) = Ẽ(F(U_B(t)) | m̃_k)
almost surely, where F = 1_{(−∞,λ]} and λ ∈ R;
(iii) the CM estimates converge, i.e.
lim_{n→∞} Ẽ(U_n(t) | m̃_k) = Ẽ(U_B(t) | m̃_k)
almost surely.
Case α_n = α̃(n+1)^q, q > 1/2. If α_n = α̃(n+1)^q where q > 1/2, then for any t ∈ [0,1], lim_{n→∞} U_n(t) = 0.
Case α_n = α̃(n+1)^q, q < 1/2. If α_n = α̃(n+1)^q where q < 1/2, then for any t ∈ [0,1], the random variables U_n(t) do not converge even in distribution.
Proof.
Case α_n = α̃(1+n)^{1/2}
By Lemma 2.3.2 we know that U_n → U_B weakly in distribution and that, for fixed t ∈ (0,1), U_n(t) → U_B(t) as n goes to infinity, the U_n(t), n ∈ N, are uniformly integrable and U_B(t) ∈ L¹(Ω). Hence all the hypotheses of Theorem 2.3.2 are satisfied, and Theorem 2.3.3 is just Theorem 2.3.2 with U_n and U_B in place of V_n and V.
Case α_n = α̃(1+n)^q with q ≠ 1/2
In this case we see that W_n(t) = α̃(n+1)^{q−3/2} Ũ_n(t) converges to the Brownian bridge. Thus, when q > 1/2, U_n(t) converges to zero in L¹(Ω), while when q < 1/2 the random variables U_n(t) cannot converge even in distribution.
There are two important consequences of this theorem:
Remark (Total variation priors in Y_n are not discretization-invariant).
Note that we have just proved that, for the critical choice α_n = α̃(n+1)^{1/2} and for any fixed t ∈ [0,1], U_n(t) converges to U_B(t), which is Gaussian distributed. Since any linear discretization of a Gaussian distribution is also Gaussian, we see that it is not possible to consider total variation priors, which are not Gaussian at all, as discretizations of their limit random variable, i.e. of the Brownian bridge. Thus, we conclude that total variation priors are not discretization invariant.
Remark (Critical value of αn ).
Note that we have just proven that the critical value for αn in the case of
CM estimates is αn = (1 + n)1/2 . In fact, this is the only value of αn for
which we have convergence in distribution of our priors to a non-trivial limit.
In all the other cases, i.e. for values of αn of greater or smaller order than
(1 + n)1/2 , we have convergence in distribution to 0 or no convergence at all.
Chapter 3
New proposal for the priors
In this work, our aim is to resolve the inconsistency shown by total variation priors when the limit for n going to infinity is considered, which we presented in the previous chapter. Motivated by the desire to find an alternative prior for the unknown that is both edge-preserving and discretization invariant, we consider, as priors, slight modifications of the total variation priors. In fact, our idea is that a prior based on total variation, but decaying faster than the simple total variation prior, should lead to the desired discretization invariance. It seems then natural to consider a new class of priors for the unknown, defined as follows.
Notation. In this chapter we will refer to all the quantities introduced in
the previous chapter, using them exactly in the same way as in Chapter 2.
3.1
A new setting for our problem
We are going to introduce the new class of priors for random functions Un .
Simplifying what has been done in [22], we would like to suppose the elements of
Yn to be piecewise constant functions (instead of piecewise linear functions)
on [0, 1] such that their value in 0 and 1 is zero. Thus, taking a Y -valued
random function U , where Y = C0 ([0, 1]), we can write its discretizations as
the Yn -valued random functions:
U_n(ω, t) = 0  for t ∈ {1} ∪ [0, 1/(n+1)),
U_n(ω, t) = Σ_{j=1}^n U^n_j(ω) 1_{[j/(n+1), (j+1)/(n+1))}(t)  for t ∈ [1/(n+1), 1).
At this point we choose the random functions U_n to be distributed according to the following probability density function:
Definition 3.1.1 (q-total variation priors).
Let n ∈ N and q ≥ 1 be integers. The Y_n-valued random function U_n : Ω → Y_n has the q-total variation (q-TV) probability distribution in Y_n if the R^n-valued random vector [U^n_1(ω), ..., U^n_n(ω)]^T has the probability density function
π^{(n,q)}_{pr}(u^n_1, ..., u^n_n) = C_{q,n} exp( −α_n ( Σ_{j=1}^{n+1} |u^n_j − u^n_{j−1}| )^q ) = C_{q,n} exp( −α_n (TV(u^n))^q ),    (3.1)
where u^n_0 = u^n_{n+1} = 0 and C_{q,n} is a normalization constant.
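For concreteness, the (unnormalized) logarithm of the q-TV density (3.1) can be evaluated as in the following sketch; the normalization constant C_{q,n} is omitted and the function name is ours, not part of any library.

import numpy as np

def qtv_log_prior(u, alpha_n, q):
    """Unnormalized log of the q-TV prior density (3.1).

    u : array of coefficients (u_1, ..., u_n); the boundary values u_0 = u_{n+1} = 0
    are appended internally, so TV(u) includes the jumps at both endpoints.
    """
    u_ext = np.concatenate(([0.0], np.asarray(u, dtype=float), [0.0]))
    tv = np.sum(np.abs(np.diff(u_ext)))   # total variation of the piecewise constant function
    return -alpha_n * tv ** q             # log C_{q,n} is omitted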
Remark. Note that if we choose q = 1, we simply obtain the total variation
priors treated in the previous chapter.
3.2
Behavior of MAP estimates
As regards the behavior of MAP estimates, our new priors do not produce
a significant modification of what we have already obtained in the case of
total variation priors. In fact, note that modeling prior knowledge about U
with the q-total variation distribution in the space Yn leads to the following
posterior distribution:
π^q_{kn}(u^n | m_k) = C_{q,kn} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} − α_n ( Σ_{j=1}^{n+1} |u^n_j − u^n_{j−1}| )^q ),    (3.2)
where C_{q,kn} is a constant. Also, recall that the MAP estimates in this case are defined as
u^{MAP}_{n,q}(·; β_n) ∈ arg max_{u^n ∈ Y_n} π^q_{kn}(u^n | m_k)
= arg max_{u^n ∈ Y_n} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} − β_n ( Σ_{j=1}^{n+1} |u^n_j − u^n_{j−1}| )^q ).    (3.3)
Hence, if we have u^{MAP}_{n,q}(·; β_n) as in (3.3), then it also solves the maximization problem
arg max_{u^n ∈ Y_n} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} − α_n Σ_{j=1}^{n+1} |u^n_j − u^n_{j−1}| ),
where
α_n = β_n TV(u^{MAP}_{n,q})^{q−1}.    (3.4)
It follows that it is just a MAP estimate relative to the problem with total variation priors and a different parameter. In other words,
u^{MAP}_{n,q}(·; β_n) = u^{MAP}_n(·; α_n),  where  α_n = β_n TV(u^{MAP}_{n,q})^{q−1}.
It is easy to show that the opposite is also true: if we have a MAP estimate relative to the problem with total variation priors, i.e. u^{MAP}_n(·; α_n), and TV(u^{MAP}_n) ≠ 0, then it also solves the following problem:
arg max_{u^n ∈ Y_n} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} − β_n (TV(u^n))^q ),
where β_n = α_n / (TV(u^{MAP}_n))^{q−1}.
Hence, it follows that it is just a MAP estimate relative to the problem with q-total variation priors and a different parameter, that is
u^{MAP}_n(·; α_n) = u^{MAP}_{n,q}(·; β_n),  where  β_n = α_n / (TV(u^{MAP}_n))^{q−1}.    (3.5)
In the case that TV(u^{MAP}_n) = 0,
u^{MAP}_n(·; α_n) = arg max_{u^n ∈ Y_n} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} − α_n TV(u^n) )
= arg max_{u^n ∈ Y_n} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} )
= arg max_{u^n ∈ Y_n} exp( −(1/(2σ²)) ‖A^k u^n − m_k‖²_{R^k} − α_n (TV(u^n))^q )
= u^{MAP}_{n,q}(·; β_n),    (3.6)
with β_n = α_n.
Note that this last case occurs when α_n is sufficiently large. Hence we can conclude that there exist α^{max}_n and β^{max}_n such that
β_n > β^{max}_n ⇒ TV(u^{MAP}_{n,q}) = 0 ⇒ u^{MAP}_{n,q}(·; β_n) = 0,
α_n > α^{max}_n ⇒ TV(u^{MAP}_n) = 0 ⇒ u^{MAP}_n(·; α_n) = 0.
We can therefore state that there exists a bijection
[0, α^{max}_n] → [0, β^{max}_n]
that, if needed, can be inverted.
It is now straightforward to claim that all the results we proved for u^{MAP}_n(·; α_n) are still valid for u^{MAP}_{n,q}(·; β_n). Further, recall that the only case in which we obtained convergence of the MAP estimates with total variation priors was for α_n bounded. It is clear that α_n bounded implies β_n bounded, and hence we effectively obtain the same convergence results as before for the MAP estimates in the new case with q-total variation priors.
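In practice this correspondence is just a relabeling of the regularization parameter: a TV MAP estimate computed with parameter α_n can be read as a q-TV MAP estimate with the parameter β_n of (3.5), and conversely via (3.4). A small illustrative sketch of this bookkeeping (the helper tv below is our own notation for the discrete total variation with zero boundary values, not code from the thesis computations):

import numpy as np

def tv(u):
    """Discrete total variation of (u_1,...,u_n) with u_0 = u_{n+1} = 0."""
    u_ext = np.concatenate(([0.0], np.asarray(u, float), [0.0]))
    return np.sum(np.abs(np.diff(u_ext)))

def beta_from_alpha(u_map_tv, alpha_n, q):
    """Parameter beta_n of (3.5) for which the TV MAP estimate is also a q-TV MAP estimate."""
    t = tv(u_map_tv)
    return alpha_n if t == 0.0 else alpha_n / t ** (q - 1)

def alpha_from_beta(u_map_qtv, beta_n, q):
    """Parameter alpha_n of (3.4) for which the q-TV MAP estimate is also a TV MAP estimate."""
    return beta_n * tv(u_map_qtv) ** (q - 1)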
3.3
Behavior of CM estimates
Dealing with the CM estimates is much more difficult in the case in which we
choose for Un the q-total variation probability distribution in Yn with q 6= 1.
In fact, the argument presented in [22] basically shows that it is possible
to express the random functions Hn , which are slight modifications of our
random functions Un , as a sum of independent and identically distributed
random variables with zero mean. Thanks to this formulation, the authors
could appeal to the central limit theorem or to the law of large numbers for
proving convergence in distribution to a smooth random variable or convergence in distribution to zero, respectively.
Unfortunately, in our case, even if we are still able to express Un as a sum
of random functions, the assumption of mutual independence of these random variables is no longer valid. This fact has actually been seen as a good
sign. In fact, since the central limit theorem is no longer applicable, we
avoid the problem of convergence to a smooth limit which in the case of
total variation priors had implied no discretization invariance. Hence, we
tried to find an alternative to the central limit theorem for proving convergence in distribution of Un , which would imply convergence of the priors of
Un . In the end, we were able to show convergence in the weak topology of
Lp ([0, 1]) = Lp ([0, 1], B[0,1] , λ|[0,1] ) (where B[0,1] is the σ-algebra of Borel sets
of [0, 1] and λ|[0,1] is the Lebesgue measure in [0, 1]) for 1 < p < ∞ of a subsequence of {Un }n∈N . In fact, note that Yn ⊂ Lp ([0, 1]) for every 1 ≤ p ≤ ∞
and thus the random functions Un can be considered as Lp ([0, 1]) valued random functions for every 1 ≤ p ≤ ∞.
Let us analyze in further detail how we obtained these convergence results.
Recall that we are supposing q-total variation probability distribution for Un ,
that is we are supposing the random vector (U1n , ..., Unn ) to have probability
density function
π^{(n,q)}_{pr}(u^n_1, ..., u^n_n) = C_{q,n} exp( −α_n ( Σ_{j=1}^{n+1} |u^n_j − u^n_{j−1}| )^q ),
where u^n_0 = u^n_{n+1} = 0.
We further define, for any n ∈ N, the random vector X_n = (X^n_1, ..., X^n_{n+1}) such that
X^n_j = U^n_j − U^n_{j−1}  for j = 1, ..., n+1,
where we recall that U^n_0 and U^n_{n+1} are random variables almost surely equal to zero.
Note that, as mentioned in the previous case, for j = 1, ..., n+1
U^n_j = Σ_{k=1}^j X^n_k.
Note also that, by definition of U_n, we have that when U_n(ω, t) ≠ 0,
U_n(ω, t) = U^n_{θ(n,t)}(ω) = Σ_{j=1}^{θ(n,t)} X^n_j(ω),
where θ(n,t) is again the biggest index j such that j/(n+1) ≤ t.
We have also already noted in the previous chapter that, for n going to infinity, θ(n,t) ∼ (n+1). Thus, we can conclude that when U_n(ω, t) is non-zero, for n going to infinity, it can be expressed as
U_n(ω, t) ∼ Σ_{j=1}^{n+1} X^n_j.
Our next step consists in investigating the properties of the random variables X^n_j for j = 1, ..., n+1. First of all, from theorems about transformations of random variables, we know that X_n = (X^n_1, ..., X^n_{n+1}) has the distribution
π̃_{n,q,X}(x^n_1, ..., x^n_{n+1}) = C_{q,n} exp( −α_n ( Σ_{j=1}^{n+1} |x^n_j| )^q ).
Using the above formula, it is particularly easy to show that, for any fixed n, the X^n_j for j = 1, ..., n+1 are identically distributed.
In fact, we have that for j = 1, ..., n+1 the probability density function of X^n_j is
π_{X^n_j}(x^n_j) = [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} e^{−α_n (Σ_{i=1}^{n+1} |x^n_i|)^q} dx^n_1 ⋯ dx^n_{j−1} dx^n_{j+1} ⋯ dx^n_{n+1} ] / [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} e^{−α_n (Σ_{i=1}^{n+1} |x^n_i|)^q} dx^n_1 ⋯ dx^n_{n+1} ].    (3.7)
It is clear from the definition above that there is no difference among the distributions of the X^n_j for j = 1, ..., n+1, since the distribution in (3.7) is completely symmetric in the x^n_j.
On the other hand, we note that there is indeed a difference between the distributions of X^n_{j_n} and X^m_{j_m} when n ≠ m. This is due to the fact that we introduced in the joint distribution of the X^n_j, j = 1, ..., n+1, an exponentiation to the q-th power of the total variation. This fact does not allow us to obtain a common distribution, independent of n, for the random variables X^n_j. The same argument can be used to prove that, for any fixed n ∈ N, the random variables X^n_j, j = 1, ..., n+1, are not mutually independent. In fact, their joint distribution cannot be expressed as the product of the single distributions, again because of the presence of the exponentiation to the q-th power of the total variation.
Since the assumption of independence is no longer applicable, we tried to show the validity of weaker assumptions. In fact, it is possible to prove that, for any fixed n ∈ N, the random variables X^n_j, j = 1, ..., n+1, are uncorrelated, i.e. that for any fixed n and for any 1 ≤ i, k ≤ n+1 with i ≠ k we have E(X^n_i X^n_k) = 0 = E(X^n_i) E(X^n_k). This follows from the following computation:
E(X^n_i X^n_k) = [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} x^n_k x^n_i e^{−α_n (Σ_{j=1}^{n+1} |x^n_j|)^q} dx^n_1 ⋯ dx^n_{n+1} ] / [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} e^{−α_n (Σ_{j=1}^{n+1} |x^n_j|)^q} dx^n_1 ⋯ dx^n_{n+1} ]
= [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} x^n_k I_i(x^n_1, ⋯, x^n_{i−1}, x^n_{i+1}, ⋯, x^n_{n+1}) dx^n_1 ⋯ dx^n_{i−1} dx^n_{i+1} ⋯ dx^n_{n+1} ] / [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} e^{−α_n (Σ_{j=1}^{n+1} |x^n_j|)^q} dx^n_1 ⋯ dx^n_{n+1} ],
where
I_i(x^n_1, ⋯, x^n_{i−1}, x^n_{i+1}, ⋯, x^n_{n+1}) = ∫_{−∞}^{+∞} x^n_i e^{−α_n (Σ_{j=1}^{n+1} |x^n_j|)^q} dx^n_i
= ∫_0^{+∞} x^n_i e^{−α_n ( Σ_{j≠i} |x^n_j| + x^n_i )^q} dx^n_i + ∫_{−∞}^0 x^n_i e^{−α_n ( Σ_{j≠i} |x^n_j| − x^n_i )^q} dx^n_i.
If we make the change of variables
t = Σ_{j≠i} |x^n_j| + x^n_i  and  u = Σ_{j≠i} |x^n_j| − x^n_i
in the first and in the second integral defining I_i, respectively, we obtain
I_i(x^n_1, ⋯, x^n_{i−1}, x^n_{i+1}, ⋯, x^n_{n+1}) = ∫_{Σ_{j≠i}|x^n_j|}^{+∞} ( t − Σ_{j≠i} |x^n_j| ) e^{−α_n t^q} dt + ∫_{Σ_{j≠i}|x^n_j|}^{+∞} ( Σ_{j≠i} |x^n_j| − u ) e^{−α_n u^q} du = 0.    (3.8)
Hence it clearly follows that E(X^n_i X^n_k) = 0. With the same argument we can prove that E(X^n_j) = 0 for every n and j.
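This behavior can also be probed numerically: sampling from the joint density of (X^n_1, ..., X^n_{n+1}) (for instance with a simple Metropolis chain, which is only an illustrative choice and not part of the computations of this thesis) one finds sample correlations of X^n_i and X^n_k close to zero for i ≠ k, while the correlation of |X^n_i| and |X^n_k| does not vanish, reflecting the lack of independence. A rough sketch of such a check:

import numpy as np

def sample_increments(n, alpha_n, q, n_samples=50000, step=0.3, rng=None):
    """Metropolis samples from the density proportional to exp(-alpha_n * (sum_j |x_j|)^q)."""
    rng = np.random.default_rng(rng)
    x = np.zeros(n + 1)
    log_p = lambda v: -alpha_n * np.sum(np.abs(v)) ** q
    lp, out = log_p(x), []
    for _ in range(n_samples):
        prop = x + step * rng.standard_normal(n + 1)
        lp_prop = log_p(prop)
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out.append(x.copy())
    return np.array(out[n_samples // 2:])        # discard burn-in

# Usage sketch: correlations of X_i, X_k are near zero, those of |X_i|, |X_k| are not.
# S = sample_increments(n=15, alpha_n=1.0, q=2)
# print(np.corrcoef(S[:, 0], S[:, 1])[0, 1], np.corrcoef(np.abs(S[:, 0]), np.abs(S[:, 1]))[0, 1])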
As we mentioned, we are interested in the convergence in distribution of
Un . Based on what has been proven in [22], we think that there is a critical
value for αn that allows us to obtain convergence in distribution to a random
variable which is not almost surely zero. In particular in the case q = 1
studied in [22] this critical value for αn has been shown to be αn = (n + 1)1/2
and the convergence in distribution has been shown to be to a Gaussian
random variable. We start by showing the results we obtained in the case
q = 2 and we will then present some ideas of how one might try to extend
the results to the case of general q.
3.3.1
Case q = 2
As we have already said, we would like to prove convergence in distribution
of the random variables Un for n going to infinity. To reach this goal, we
started from the idea that if Un converges in distribution to some non-trivial limit, then it seems reasonable for all the moments of Un to have finite limits for n going to infinity (this is not a necessary condition in general, but it seemed a reasonable starting point). We therefore tried to search
for the value of αn that allows us to obtain finite limits for the moments.
Since the first moment is clearly zero, we immediately start studying the
character of the second moment. Actually, an exact computation of the
second moment is not possible, but we are able to estimate it in a satisfactory
way.
Proposition 3.3.1 (Character of the second moment of U_n(t)).
For any fixed t ∈ [0,1], we obtain the following estimate for the second moment of the random variables U_n(t), n ∈ N, as n goes to infinity:
1/(2α_n) ≲ E(|U_n(t)|²) ≲ (1/(2α_n)) ( 2 − 1/(n+1) ).    (3.9)
Proof.
The first inequality in (3.9) is quite easy to prove. In fact, it follows from the following chain:
E(|U_n(t)|²) = E( ( Σ_{j=1}^{θ(n,t)} X^n_j )² )
= E( Σ_{j=1}^{θ(n,t)} (X^n_j)² ) + E( Σ_{i,j=1, i≠j}^{θ(n,t)} X^n_i X^n_j )
= E( Σ_{j=1}^{θ(n,t)} (X^n_j)² ) = E( Σ_{j=1}^{θ(n,t)} |X^n_j|² )
≥ (1/θ(n,t)) E( ( Σ_{j=1}^{θ(n,t)} |X^n_j| )² ),    (3.10)
where we used the fact that E( Σ_{i,j=1, i≠j}^{θ(n,t)} X^n_i X^n_j ) = 0, since the random variables are uncorrelated, and the sum of squares inequality, which states that if x_i ∈ R, then Σ_{i=1}^n x_i² ≥ (1/n) ( Σ_{i=1}^n x_i )².
Now we try to compute the last expected value obtained in (3.10). Recalling that, for n going to infinity and for any t ∈ [0,1],
Σ_{j=1}^{θ(n,t)} |X^n_j| ∼ Σ_{j=1}^{n+1} |X^n_j|,
we can directly calculate the expected value involving Σ_{j=1}^{n+1} |X^n_j|. It is possible to compute this expected value by using a particular change of variables. Having denoted
ℓ¹ = { x = {x_n}_{n∈N}, x_n ∈ R : Σ_{n=1}^∞ |x_n| < ∞ },
we use a polar decomposition in the ℓ¹-norm sphere, i.e. the change of variables
R^{n+1} ∋ x^{n+1} = (x^n_1, ..., x^n_{n+1}) = r · z,
where r ∈ R_+ and z ∈ S^n, the n-dimensional sphere. It follows that
E( ( Σ_{j=1}^{n+1} |X^n_j| )² ) = [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} ( Σ_{j=1}^{n+1} |x^n_j| )² e^{−α_n (Σ_{j=1}^{n+1} |x^n_j|)²} dx^n_1 ⋯ dx^n_{n+1} ] / [ ∫_{−∞}^{+∞} ⋯ ∫_{−∞}^{+∞} e^{−α_n (Σ_{j=1}^{n+1} |x^n_j|)²} dx^n_1 ⋯ dx^n_{n+1} ]
= [ ∫_0^{+∞} r² e^{−α_n r²} r^n dr ∫_{S^n} f(z) dσ(z) ] / [ ∫_0^{+∞} e^{−α_n r²} r^n dr ∫_{S^n} f(z) dσ(z) ]
= [ ∫_0^{+∞} r^{n+2} e^{−α_n r²} dr ] / [ ∫_0^{+∞} r^n e^{−α_n r²} dr ],    (3.11)
where f(z) is a function that does not depend on r and represents the Jacobian factor of the change of variables.
Using integration by parts, we obtain
E( ( Σ_{j=1}^{n+1} |X^n_j| )² ) = [ ∫_0^{+∞} r^{n+2} e^{−α_n r²} dr ] / [ ∫_0^{+∞} r^n e^{−α_n r²} dr ] = (n+1) / (2α_n).
Hence, it follows from (3.10) that for n going to infinity
E(|U_n(t)|²) ≥ (1/θ(n,t)) E( ( Σ_{j=1}^{θ(n,t)} |X^n_j| )² ) ∼ (1/θ(n,t)) · (n+1)/(2α_n) ∼ 1/(2α_n),
and the first inequality is proven.
Proving the second inequality is more difficult. We start by noting that
E(|U_n(t)|²) = E( ( Σ_{j=1}^{θ(n,t)} X^n_j )² ) = E( Σ_{j=1}^{θ(n,t)} (X^n_j)² ) = θ(n,t) E( (X^n_j)² ),
where the second equality is due to the uncorrelatedness of the X^n_j and the third equality is due to the fact that the X^n_j are identically distributed. Now, having chosen any index 1 ≤ i ≤ n+1, we would like to calculate E((X^n_i)²).
Note that it is in fact not possible to calculate E((X^n_i)²) exactly, but we are able to produce an estimate from above.
As before, we use the polar decomposition in the ℓ¹-norm sphere, but this time we use this change of variables only for the variables x^n_j with j ≠ i. We thus obtain
E((X^n_i)²) = [ ∫_0^{+∞} ∫_0^{+∞} (x^n_i)² e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ] / [ ∫_0^{+∞} ∫_0^{+∞} e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ]
= [ ∫_0^{+∞} ∫_0^{+∞} ( (x^n_i)² / (x^n_i + r) ) (x^n_i + r) e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ] / [ ∫_0^{+∞} ∫_0^{+∞} e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ].    (3.12)
By integrating by parts with respect to x^n_i and using the equality
(x^n_i)² / (x^n_i + r) = x^n_i − r + r² / (x^n_i + r),
we obtain
E((X^n_i)²) = (1/(2α_n)) [ ∫_0^{+∞} ∫_0^{+∞} ( 1 − r²/(x^n_i + r)² ) e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ] / [ ∫_0^{+∞} ∫_0^{+∞} e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ]
= 1/(2α_n) − (1/(2α_n)) [ ∫_0^{+∞} ∫_0^{+∞} ( r²/(x^n_i + r)² ) e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ] / [ ∫_0^{+∞} ∫_0^{+∞} e^{−α_n (x^n_i + r)²} r^{n−1} dx^n_i dr ].    (3.13)
Now, using as before the fact that the X^n_i are identically distributed and returning to the original coordinates, we obtain
θ(n,t) E((X^n_i)²) = Σ_{i=1}^{θ(n,t)} E((X^n_i)²)
= θ(n,t)/(2α_n) − (1/(2α_n)) Σ_{i=1}^{θ(n,t)} [ ∫_0^{+∞} ⋯ ∫_0^{+∞} ( ( Σ_{j=1, j≠i}^{n+1} x^n_j )² / ( Σ_{j=1}^{n+1} x^n_j )² ) e^{−α_n (Σ_{j=1}^{n+1} x^n_j)²} dx^n_1 ⋯ dx^n_{n+1} ] / [ ∫_0^{+∞} ⋯ ∫_0^{+∞} e^{−α_n (Σ_{j=1}^{n+1} x^n_j)²} dx^n_1 ⋯ dx^n_{n+1} ].
Now, using the sum of squares inequality we can obtain the following estimate from below:
Σ_{i=1}^{θ(n,t)} ∫_0^{+∞} ⋯ ∫_0^{+∞} ( ( Σ_{j=1, j≠i}^{n+1} x^n_j )² / ( Σ_{j=1}^{n+1} x^n_j )² ) e^{−α_n (Σ_{j=1}^{n+1} x^n_j)²} dx^n_1 ⋯ dx^n_{n+1}
≥ (1/θ(n,t)) ∫_0^{+∞} ⋯ ∫_0^{+∞} ( ( Σ_{i=1}^{θ(n,t)} Σ_{j=1, j≠i}^{n+1} x^n_j )² / ( Σ_{j=1}^{n+1} x^n_j )² ) e^{−α_n (Σ_{j=1}^{n+1} x^n_j)²} dx^n_1 ⋯ dx^n_{n+1}
= (1/θ(n,t)) ∫_0^{+∞} ⋯ ∫_0^{+∞} ( (θ(n,t) − 1)² ( Σ_{j=1}^{n+1} x^n_j )² / ( Σ_{j=1}^{n+1} x^n_j )² ) e^{−α_n (Σ_{j=1}^{n+1} x^n_j)²} dx^n_1 ⋯ dx^n_{n+1}
= ( (θ(n,t) − 1)² / θ(n,t) ) ∫_0^{+∞} ⋯ ∫_0^{+∞} e^{−α_n (Σ_{j=1}^{n+1} x^n_j)²} dx^n_1 ⋯ dx^n_{n+1},
where we have used the equality
Σ_{i=1}^{θ(n,t)} ( Σ_{j=1, j≠i}^{n+1} x^n_j ) = (θ(n,t) − 1) · Σ_{j=1}^{n+1} x^n_j.
It follows that for n going to infinity
E(|U_n(t)|²) = θ(n,t) E((X^n_i)²) ≤ (1/(2α_n)) ( θ(n,t) − (θ(n,t) − 1)²/θ(n,t) ) = (1/(2α_n)) ( 2 − 1/θ(n,t) ) ∼ (1/(2α_n)) ( 2 − 1/(n+1) ),
and hence the proposition is proven.
Remark. The previous proposition is really important since it implies that
the sequence {E(|Un (t)|2 )}n∈N of the second moments of Un (t) for any fixed
t ∈ [0, 1] is bounded in R and does not converge to zero if and only if αn is
bounded and does not converge to zero. These values for αn seem to be the
critical ones we were looking for in the case q = 2.
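The two-sided bound (3.9) also lends itself to an empirical sanity check: reusing the Metropolis sampler sample_increments sketched above (an assumption of this illustration, not part of the proof or of the thesis computations), one can form U_n(t) as the sum of the first θ(n,t) increments and compare its sample second moment with the bounds in (3.9).

import numpy as np

def second_moment_check(n, alpha_n, t=0.5, q=2, n_samples=50000, rng=None):
    """Monte Carlo estimate of E(|Un(t)|^2) for the q-TV prior, to be compared with (3.9).

    Reuses the Metropolis sampler sample_increments sketched earlier.
    """
    S = sample_increments(n, alpha_n, q, n_samples=n_samples, rng=rng)
    theta = int(np.floor(t * (n + 1)))      # largest j with j/(n+1) <= t
    un_t = S[:, :theta].sum(axis=1)         # Un(t) = sum of the first theta increments
    lower = 1.0 / (2.0 * alpha_n)
    upper = (2.0 - 1.0 / (n + 1)) / (2.0 * alpha_n)
    return lower, float(np.mean(un_t ** 2)), upper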
Remark. Note that in the case q = 1, if we compute E(|U_n(t)|²), we obtain
E(|U_n(t)|²) = E( ( Σ_{j=1}^{θ(n,t)} X^n_j )² ) = Σ_{j=1}^{θ(n,t)} E( (X^n_j)² ) = θ(n,t) E( (X^n_j)² )
= θ(n,t) [ ∫_{−∞}^{+∞} x² e^{−α_n |x|} dx ] / [ ∫_{−∞}^{+∞} e^{−α_n |x|} dx ]
= θ(n,t) [ ∫_0^{+∞} x² e^{−α_n x} dx ] / [ ∫_0^{+∞} e^{−α_n x} dx ]
= 2 θ(n,t) / α_n² ∼ 2(n+1) / α_n².    (3.14)
Thus, we have that the second moments of U_n(t) remain bounded if, for n going to infinity, α_n ∼ (n+1)^{1/2}, which are exactly the values of α_n for which the authors of [22] obtained convergence to a Gaussian random variable.
At this point, we would like to distinguish the different behaviors of the CM
estimates, depending on the choice of αn . The most significant case will be
the case in which the αn are bounded and do not converge to zero.
Case αn bounded and not converging to zero.
In this case, Proposition 3.3.1 has some important consequences. In particular, note that if α_n is bounded and does not converge to zero, then by Proposition 3.3.1 it follows that for any t ∈ [0,1]
sup_{n∈N} E(|U_n(t)|²) = K < ∞,    (3.15)
where K is independent of n and t.
This property allows us to prove the following proposition:
Proposition 3.3.2 (Uniform boundedness of the expected values of
the Lp ([0, 1])-norms for any 1 ≤ p ≤ ∞).
Consider our random functions Un and note that for any fixed t ∈ [0, 1],
Un (·, t) is a random variable, while for any fixed ω ∈ Ω, Un (ω, ·) is a function
in Lp ([0, 1]) for any 1 ≤ p ≤ ∞.
Then for any 1 ≤ p ≤ ∞ the expected values of the Lp ([0, 1])-norms of
Un (ω, ·) are uniformly bounded, i.e.
E(k Un (ω, ·) kLp ([0,1]) ) ≤ 1 + K,
where K is a constant independent of n.
Proof.
First of all recall that, for any 1 ≤ p < ∞,
‖U_n(ω, ·)‖_{L^p([0,1])} ≤ ‖U_n(ω, ·)‖_{L^∞([0,1])} = max_{t∈[0,1]} |U_n(ω, t)|.
In fact this is an immediate consequence of the following inequalities:
‖U_n(ω, ·)‖_{L^p([0,1])} = ( ∫_0^1 |U_n(ω, t)|^p dt )^{1/p} ≤ ( ∫_0^1 ( max_{t∈[0,1]} |U_n(ω, t)| )^p dt )^{1/p} = max_{t∈[0,1]} |U_n(ω, t)| = ‖U_n(ω, ·)‖_{L^∞([0,1])}.    (3.16)
For any 1 ≤ p < ∞, by monotonicity of the expected value, we obtain that
E(k Un (ω, ·) kLp ([0,1]) ) ≤ E(k Un (ω, ·) kL∞ ([0,1]) ).
Now note that, using (3.15), we can effectively find a uniform bound from
above for E(k Un (ω, ·) kL∞ ([0,1]) ) and hence for E(k Un (ω, ·) kLp ([0,1]) ).
In fact, for fixed ω ∈ Ω, recall that the random variable U_n(ω, ·) lies in Y_n and is therefore piecewise constant. In addition, we know that all the different values it can take are {0, U_n(ω, 1/(n+1)), ⋯, U_n(ω, n/(n+1))}.
Thus we can always find
M = max_{0 ≤ j ≤ n+1} |U_n(ω, j/(n+1))|.
Define θ as (one of) the index (indices) j such that M = |U_n(ω, θ/(n+1))|. At this point it is clear that for every t ∈ [θ/(n+1), (θ+1)/(n+1)),
max_{s∈[0,1]} |U_n(ω, s)| = |U_n(ω, t)|.
Hence, for every t ∈ [θ/(n+1), (θ+1)/(n+1)), we have
‖U_n(ω, ·)‖_{L^∞([0,1])} = |U_n(ω, t)|.
It then follows by (3.15) that
E(‖U_n(ω, ·)‖²_{L^∞([0,1])}) = E(|U_n(ω, t)|²) ≤ K,
and thus that
E(‖U_n(ω, ·)‖_{L^p([0,1])}) ≤ E(‖U_n(ω, ·)‖_{L^∞([0,1])}) ≤ E(1 + ‖U_n(ω, ·)‖²_{L^∞([0,1])}) ≤ 1 + K.
At this point, we would like to use the uniform boundedness of the expected
values of the Lp ([0, 1])-norms (1 ≤ p ≤ ∞) of Un in order to prove tightness of
Un with respect to the weak topology of Lp ([0, 1]). We will see that tightness
in the weak topology of Lp ([0, 1]) can be proven only in the case 1 < p < ∞.
Let us first recall when a family of random functions is tight.
Definition 3.3.1 (Tightness of a sequence of probability measures).
Consider the measurable metric space (S, S). A sequence of probability measures {P_n}_{n∈N} on (S, S) is called tight if for any ε > 0 there exists a compact set K_ε such that
sup_{n∈N} P_n(K_ε^c) < ε.
Definition 3.3.2 (Tightness of a sequence of random elements).
Consider the measurable metric space (S, S). A sequence of (S, S)-valued random functions {X_n}_{n∈N} is called tight if the sequence of probability distributions of {X_n}_{n∈N} is tight, i.e., given ε > 0, there exists a compact set K_ε such that
sup_{n∈N} P(X_n ∈ K_ε^c) < ε.
Let us try to prove tightness of U_n.
Note that, for any 1 ≤ p ≤ ∞, we can use Markov's inequality for the random variables ‖U_n(ω, ·)‖_{L^p([0,1])} in order to obtain the following estimate: for all M ∈ R,
P({ω ∈ Ω : ‖U_n(ω, ·)‖_{L^p([0,1])} ≥ M}) ≤ E(‖U_n(ω, ·)‖_{L^p([0,1])}) / M ≤ (1+K)/M.    (3.17)
At this point, for every ε > 0, if we choose M_ε = (1+K)/ε + 1, then we obtain that for any 1 ≤ p ≤ ∞:
P({‖U_n(ω, ·)‖_{L^p([0,1])} > M_ε}) ≤ P({‖U_n(ω, ·)‖_{L^p([0,1])} ≥ M_ε}) < ε.
Now, for any ε > 0, we define the set
K_ε = {u_n ∈ L^p([0,1]) : ‖u_n‖_{L^p([0,1])} ≤ M_ε}.
Now, recall that from the Banach–Alaoglu theorem we know that closed balls are compact in the weak topology of a normed space S if and only if the space is reflexive. We also know that the spaces L^p([0,1]) are reflexive if and only if 1 < p < ∞. Hence we can conclude that in the spaces L^p([0,1]) with 1 < p < ∞ all closed balls are compact in the weak topology, and thus, for any ε > 0, K_ε is a compact set if and only if 1 < p < ∞. Therefore, for any 1 < p < ∞ we have just shown that for every ε > 0 there exists a set K_ε, compact in the weak topology of L^p([0,1]), such that P(U_n(ω, ·) ∈ K_ε^c) < ε. But this is exactly the definition of tightness in the weak L^p([0,1]) topology (1 < p < ∞) for the random functions U_n.
Now, tightness is a very important property because it allows us to obtain convergence in distribution of a subsequence of U_n. The link is given by Prohorov's theorem:
Theorem 3.3.1 (Prohorov's theorem).
Suppose {P_n : n ≥ 1} is a tight collection of probability measures on a complete, separable metric space (S, S).
Then there exist a subsequence n' and a probability measure P on S such that P_{n'} is weakly convergent to P as n goes to infinity.
We can therefore conclude that there exists a subsequence U_{n_k} of U_n which is weakly convergent in L^p([0,1]) for every 1 < p < ∞. This means, of course, that we have obtained convergence of the priors; but further, recalling that U^{CM}_n = E(U_n|Y), we have also obtained that for any φ ∈ (L^p([0,1]))^*, i.e. in the dual space of L^p([0,1]) with respect to the norm topology,
E(<U^{CM}_{n_k}, φ>) → E(<U^{CM}, φ>)  as k → ∞,
that is, convergence in distribution of a subsequence of the CM estimates.
Remark. Note that we did not obtain convergence in L∞ ([0, 1]). Actually,
we are not able to show that in effect there is not convergence in L∞ ([0, 1]),
but the fact that we could not prove convergence seems to be a good sign. In
fact, we should recall that the random function we are trying to approximate
is the random function U which takes values in the space of the continuous
function. A convergence in L∞ ([0, 1]) would imply a convergence to some
smooth limit which is exactly what we are trying to avoid in order to obtain
discretization invariance of our random functions Un , which have been built
in order to underline the discontinuities rather than smoothen them.
Remark. Note that BV ([0, 1]) can be embedded in any Lp ([0, 1]) space for
p < ∞. Hence, since we are actually looking for a limit in BV ([0, 1]), a result
of convergence in Lp ([0, 1]) seems to be very reasonable.
Case α_n going to infinity
Note that, by Proposition 3.3.1, for every t ∈ [0,1],
lim_{n→∞} α_n = ∞ ⇒ lim_{n→∞} E(|U_n(t)|²) = 0.
Hence, by the inequalities we have already used, for any 1 ≤ p ≤ ∞,
0 ≤ E(‖U_n(ω, ·)‖²_{L^p([0,1])}) ≤ E(‖U_n(ω, ·)‖²_{L^∞([0,1])}) ≤ E(|U_n(ω, t)|²),
and thus it follows that
0 ≤ lim_{n→∞} E(‖U_n(ω, ·)‖²_{L^p([0,1])}) ≤ 0.
By Fatou's lemma we then have that
E( lim_{n→∞} ‖U_n(ω, ·)‖²_{L^p([0,1])} ) = 0,
and thus
lim_{n→∞} ‖U_n(ω, ·)‖_{L^p([0,1])} = 0  for a.e. ω ∈ Ω.
Using Fatou's lemma again we obtain that
‖ lim_{n→∞} U_n(ω, t) ‖_{L^p([0,1])} = 0,
and finally that
lim_{n→∞} U_n(ω, t) = 0
for a.e. ω ∈ Ω and for a.e. t ∈ [0,1].
Thus, we have obtained almost sure convergence of U_n to a random function which is almost surely equal to zero, i.e. to a trivial limit.
Case α_n going to zero
Again, by Proposition 3.3.1,
lim_{n→∞} α_n = 0 ⇒ lim_{n→∞} E(|U_n(t)|²) = ∞,
and thus we cannot have convergence of U_n in our setting.
3.3.2
Case general q
Regarding the general case, we are not able to find exactly the critical value for α_n, but we can at least make some useful remarks. In fact, the following inequalities can be proven:
(1/θ(n,t)) E( ( Σ_{j=1}^{θ(n,t)} |X^n_j| )² ) ≤ E(|U_n(t)|²) ≤ E( ( Σ_{j=1}^{θ(n,t)} |X^n_j| )² ),
where the first inequality is obtained with the same argument we used in (3.10) and the second one using the inequality
| Σ_{j=1}^{θ(n,t)} X^n_j | ≤ Σ_{j=1}^{θ(n,t)} |X^n_j|.
As before, considering the fact that for n going to infinity and for any t ∈ [0,1]
Σ_{j=1}^{θ(n,t)} |X^n_j| ∼ Σ_{j=1}^{n+1} |X^n_j|,
we are able to calculate, using the same change of variables with polar coordinates in the ℓ¹-norm sphere, the expected value involving Σ_{j=1}^{n+1} |X^n_j|, obtaining the following result:
E( ( Σ_{j=1}^{n+1} |X^n_j| )² ) = ( C_{[n+2]_q} / C_{[n]_q} ) (1/(qα_n)^{2/q}) ∏_{i=1}^{(n−[n]_q)/q} ( 1 + 2/(n+1−qi) ),
where [n]_q is the unique remainder of the division of n by q and
C_{[n]_q} = ∫_0^∞ t^{[n]_q} e^{−t^q} dt.
Hence it follows that we can bound E(|U_n(t)|²) in the following way:
(1/θ(n,t)) ( C_{[n+2]_q} / C_{[n]_q} ) (1/(qα_n)^{2/q}) ∏_{i=1}^{(n−[n]_q)/q} ( 1 + 2/(n+1−qi) ) ≤ E(|U_n(t)|²) ≤ ( C_{[n+2]_q} / C_{[n]_q} ) (1/(qα_n)^{2/q}) ∏_{i=1}^{(n−[n]_q)/q} ( 1 + 2/(n+1−qi) ).    (3.18)
Thus, we cannot find the exact value of α_n for which we get convergence, but we can at least conclude that if
lim_{n→∞} [ ∏_{i=1}^{(n−[n]_q)/q} ( 1 + 2/(n+1−qi) ) ] / α_n^{2/q} = 0,
then we have convergence to zero, while if
lim_{n→∞} (1/(n+1)) [ ∏_{i=1}^{(n−[n]_q)/q} ( 1 + 2/(n+1−qi) ) ] / α_n^{2/q} = ∞,
then we do not have convergence.
= ∞,
Chapter 4
Computational results
In this chapter we report some of the computational results we have obtained. Although computational analysis is certainly not the main focus of this thesis, we would like to investigate whether the computational results agree with the theoretical findings of Chapters 2 and 3.
4.1 The model problem
In order to perform the computations, we choose a particular model for the theoretical inverse problem introduced in (2.1). In particular, we try to reproduce the results described in [22] in the case of total variation priors, i.e. in the case q = 1, and we produce new results for q-total variation priors in the cases q = 2 and q = 10. Since we would like to keep the analogy with the case presented in [22] as close as possible, we have chosen the same model problem used in [22].
In particular, in [22] the authors reproduce a rough model for the measurement of the intensity distribution of light on one row of charge-coupled device (CCD) pixels. As a reminder, recall that CCDs are commonly used in digital cameras and medical X-ray imaging devices, and that they typically consist of a two-dimensional array of pixels, each capable of measuring the amount of visible light illuminating the area of the pixel over a period of time.
In further detail, we choose the function u in (2.1) to be a real-valued function on the unit interval [0, 1]. Now, given k > 1, we divide the subinterval [1/(k+2), (k+1)/(k+2)] ⊂ [0, 1] into k pixels [x_j^{k+1}, x_{j+1}^{k+1}], where as usual x_j^{k+1} = j/(k+2) for j = 1, ..., k + 1. Thus, for j = 1, ..., k the measurement of the j-th pixel is

m_j^k = \langle A_j^k, u \rangle + e_j^k = \int_{x_j^{k+1}}^{x_{j+1}^{k+1}} u(t)\, dt + e_j^k,

where A_j^k = 1_{(x_j^{k+1}, x_{j+1}^{k+1})} and the e_j^k are independent normally distributed random numbers with standard deviation σ > 0. The numbers e_j^k model measurement errors resulting from quantum and electronic noise of the CCD. The model we have just described is sketched in the following figure:
Figure 4.1: One dimensional model. The function represents a distribution
of light on the interval [0, 1]. Pixels are represented by intervals and measurements are integrals of light intensity over those intervals. No measurement
is made on the leftmost and rightmost interval.
As previously mentioned, we use the q-total variation distribution in Y_n with q = 1, q = 2 or q = 10 for representing a priori information on the unknown U. Also, for convenience, we take the number of pixels to be of the form k = 2^L − 2 with L > 1, and the dimension n = 2^ℓ − 1 is chosen to be greater than the number of measurements, i.e. ℓ > L. Note that it follows that the grid x_1^{k+1}, ..., x_k^{k+1} is a subset of the grid x_0^n, ..., x_{n+1}^n.
Now, if we denote u^n = [u_1^n, ..., u_n^n]^T, the posterior distribution for the q-total variation priors in Y_n can be formulated as
\pi_{kn}^q(u^n \mid m^k) = C_{q,kn}\, \exp\!\left( -\frac{1}{2\sigma^2}\, \| A^k u^n - m^k \|_{\mathbb{R}^k}^2 - \alpha_n \left( \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n| \right)^{\!q} \right)
  = C_{q,kn}\, \exp\!\left( -\frac{1}{2\sigma^2}\, \| A^k u^n - m^k \|_{\mathbb{R}^k}^2 - \alpha_n \left( \sum_{j=1}^{n+1} |(Du^n)_j| \right)^{\!q} \right),     (4.1)
where the matrices Ak and D are defined below.
The k × n matrix Ak implements the measurement: for j = 1, ..., k
(A^k u^n)_j = \Big\langle A_j^k,\ \sum_{j=1}^{n} u_j^n \psi_j^n \Big\rangle = \int_{x_j^{k+1}}^{x_{j+1}^{k+1}} u_n(t)\, dt = \int_{x_j^{k+1}}^{x_{j+1}^{k+1}} \left( \sum_{j=1}^{n} u_j^n \psi_j^n(t) \right) dt     (4.2)

where the roof-top basis functions \psi_j^n for the space Y_n are as in Definition 2.3.2.
Integration of the piecewise linear functions in (4.2) over the intervals [x_j^{k+1}, x_{j+1}^{k+1}] is implemented by the classical trapezoidal rule. Thus we can explicitly write out the matrix A^k. In fact, the j-th row of A^k takes the form

\big[\,\underbrace{0, 0, \ldots, 0}_{j\,2^{\ell-L}-1},\ \tfrac{1}{2}\Delta x_n,\ \underbrace{\Delta x_n, \ldots, \Delta x_n}_{2^{\ell-L}-1},\ \tfrac{1}{2}\Delta x_n,\ 0, 0, \ldots, 0\,\big],

where \Delta x_n = \frac{1}{n+1}.
On the other hand, prior information is encoded into D, the (n + 1) × n
matrix defined by
(Du^n)_j = u_j^n - u_{j-1}^n, \qquad j = 1, \ldots, n+1,

where, as before, u_0^n = u_{n+1}^n = 0.
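For concreteness, the two matrices can be assembled directly; the following NumPy sketch is ours (the function name and the 0-based index conventions are assumptions, not taken from the thesis) and only illustrates the structure derived above.

```python
import numpy as np

def build_operators(L, ell):
    """Assemble the measurement matrix A^k (trapezoidal weights) and the
    difference matrix D for k = 2^L - 2 pixels and n = 2^ell - 1 unknowns."""
    k, n = 2**L - 2, 2**ell - 1
    dx = 1.0 / (n + 1)
    step = 2**(ell - L)                     # fine-grid cells per pixel

    A = np.zeros((k, n))
    for j in range(1, k + 1):               # pixel j covers grid indices j*step ... (j+1)*step
        lo, hi = j * step, (j + 1) * step
        A[j - 1, lo - 1] = 0.5 * dx         # half weight at the left endpoint
        A[j - 1, lo:hi - 1] = dx            # full weight at the 2^(ell-L)-1 interior nodes
        A[j - 1, hi - 1] = 0.5 * dx         # half weight at the right endpoint

    # (D u)_j = u_j - u_{j-1}, j = 1, ..., n+1, with u_0 = u_{n+1} = 0
    D = np.zeros((n + 1, n))
    for j in range(n + 1):
        if j < n:
            D[j, j] = 1.0
        if j > 0:
            D[j, j - 1] = -1.0
    return A, D
```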
4.2 Computation of the MAP estimates
Since in the previous chapter we have proven that, in effect, the MAP estimates relative to the case of q-total variation priors (q ≠ 1) can be treated as the case of total variation priors with some restrictions on the parameters, we describe the computational method we use solely in the case of total variation priors and then we show how to choose the right parameter in the case in which q ≠ 1.
4.2.1 The TV MAP estimates
Computing the MAP estimate consists of solving the following optimization
problem:
u_n^{MAP}(\cdot\,; \alpha_n) \;\in\; \operatorname*{arg\,min}_{u^n \in \mathbb{R}^n} \left( \frac{1}{2\sigma^2}\, \| A^k u^n - m^k \|_{\mathbb{R}^k}^2 + \alpha_n \sum_{j=1}^{n+1} |(Du^n)_j| \right).     (4.3)
In [22], this problem is recast as a quadratic programming problem, which can be solved with standard tools. However, we found that this approach is not very robust for practical use. Instead, we implemented the Split Bregman algorithm described in [16], since it seems to produce better results.
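To give an idea of the kind of iteration involved, here is a minimal Split Bregman sketch for problem (4.3). It is not the implementation used for the experiments; the penalty weight lam, the fixed number of iterations and the function names are our own choices.

```python
import numpy as np

def shrink(x, t):
    # soft-thresholding, the proximal map of t * |.|
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tv_map_split_bregman(A, D, m, sigma, alpha, lam=1.0, n_iter=200):
    """Minimal sketch of Split Bregman for
    min_u 1/(2 sigma^2) ||A u - m||^2 + alpha ||D u||_1  (problem (4.3))."""
    d = np.zeros(D.shape[0])
    b = np.zeros(D.shape[0])
    H = A.T @ A / sigma**2 + lam * D.T @ D      # system matrix of the quadratic subproblem
    rhs0 = A.T @ m / sigma**2
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        u = np.linalg.solve(H, rhs0 + lam * D.T @ (d - b))   # quadratic update
        Du = D @ u
        d = shrink(Du + b, alpha / lam)                      # shrinkage update
        b = b + Du - d                                       # Bregman update
    return u
```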
4.2.2 The q-TV MAP estimates
Without loss of generality, we can reduce the case of a q-total variation prior with q ≠ 1 to the case q = 1, with some caution in the choice of the parameters. The standard procedure we use is as follows: suppose we would like to calculate u_{n,q}^{MAP}(·; βn). We know that u_{n,q}^{MAP}(·; βn) = u_n^{MAP}(·; αn), where the relationship between βn and αn is as in formulas (3.4) or (3.5). Since we are only able to reconstruct u_n^{MAP}(·; αn), we numerically invert formula (3.5). It is possible to invert such a function easily, since the relationship between αn and βn happens to be almost linear: by simply fitting a linear function to the last two tested αn, the αn leading to the desired βn is found within a few steps. Thus we are able to find a reconstruction of the MAP estimates in the case of q-total variation priors for general q, starting from any parameter βn, finding the corresponding αn through the inverse function of (3.5), and finally calculating u_n^{MAP}(·; αn) with the method described in the previous section.
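A minimal sketch of this fitting step is given below; formula (3.5) is not restated here, so the relation is passed in as a generic map beta_of_alpha, and the starting values, tolerance and function names are our own assumptions.

```python
def invert_beta_relation(beta_of_alpha, beta_target, alpha0, alpha1,
                         tol=1e-6, max_iter=50):
    """Find alpha such that beta_of_alpha(alpha) is close to beta_target by
    fitting a line through the last two tested points (a secant-type step),
    exploiting the nearly linear relationship between alpha_n and beta_n."""
    a0, a1 = alpha0, alpha1
    b0, b1 = beta_of_alpha(a0), beta_of_alpha(a1)
    for _ in range(max_iter):
        if abs(b1 - beta_target) < tol:
            break
        a0, a1 = a1, a1 + (beta_target - b1) * (a1 - a0) / (b1 - b0)  # linear fit step
        b0, b1 = b1, beta_of_alpha(a1)
    return a1
```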
4.3 Computation of the CM estimates
Recall that, by definition, if we suppose Un to have q-total variation priors,
then:
u_{n,q}^{CM}(\cdot\,; \alpha_n) = \int_{\mathbb{R}^n} \left( \sum_{j=1}^{n} u_j^n\, \psi_j^n(\cdot) \right) \pi_{kn}^q(u_1^n, \ldots, u_n^n \mid m^k)\, du_1^n \cdots du_n^n,     (4.4)
where \pi_{kn}^q(u^n \mid m^k) is defined as in (3.2), i.e.

\pi_{kn}^q(u^n \mid m^k) = C_{q,kn}\, \exp\!\left( -\frac{1}{2\sigma^2}\, \| A^k u^n - m^k \|_{\mathbb{R}^k}^2 - \alpha_n \left( \sum_{j=1}^{n+1} |u_j^n - u_{j-1}^n| \right)^{\!q} \right)
  = C_{q,kn}\, \exp\!\left( -\frac{1}{2\sigma^2}\, \| A^k u^n - m^k \|_{\mathbb{R}^k}^2 - \alpha_n \left( \sum_{j=1}^{n+1} |(Du^n)_j| \right)^{\!q} \right),     (4.5)

where D is the same matrix as in the previous section.
Hence, finding the CM estimates is clearly an integration problem and we
have to compute the integral in (4.4) numerically. Due to the high dimensionality of the source space, this is intractable by means of traditional quadratures. Integration by Monte Carlo methods can avoid these difficulties, because the rate of convergence does, in principle, not depend on the dimension n. For our application, the best would be if one could find a sequence
u^(1), ..., u^(N) ∈ R^n of samples independently drawn from π_{kn}^q(u^n | m^k), because in this case the law of large numbers would guarantee that

\frac{1}{N}\sum_{i=1}^{N} u^{(i)} \;\xrightarrow{\;N\to\infty\;}\; u_{n,q}^{CM}(\cdot\,;\alpha_n) = \int_{\mathbb{R}^n}\left(\sum_{j=1}^{n} u_j^n\,\psi_j^n(\cdot)\right)\pi_{kn}^q(u_1^n,\ldots,u_n^n \mid m^k)\,du_1^n\cdots du_n^n,

almost surely and in \ell^1 with rate O(N^{-1/2}), i.e. the empirical mean of the sequence converges to the expected value of the posterior [19]. A difficulty in
our setting is that the posterior is not given in a form that allows for drawing
independent samples, since it is only known up to a normalizing constant and
does not belong to a class of distributions for which such sampling schemes
are known. However, due to the strong ergodic theorem, the above convergence and its rate still hold if the sequence is dependent, but originates from an ergodic Markov chain that has π_{kn}^q(u^n | m^k) as its equilibrium distribution,
[19]. Techniques to construct such chains are called Markov chain Monte
Carlo (MCMC) methods. Some of them are able to sample the posterior
without knowing the normalizing constant. They either work with ratios of
probabilities only, or sample along some coordinates in one step while keeping
the others fixed, such that the posterior conditioned on the fixed coordinates
takes a simple form. For details on the theory of Markov chains and MCMC
techniques, we refer to [24], [18] and [19]. Practically, the dimension n affects
the time to derive a new sample, which does not affect the rate of convergence, but can render the method too slow to be a practical alternative to
MAP estimation. The speed of convergence relies heavily on the method
used to construct the chain, and its mixing properties [24]. Furthermore, the
generation of each sample point usually requires one or more evaluations of
the forward mapping.
In summary, we will compute an approximation of the CM estimate by generating a sequence u^(1), ..., u^(N) ∈ R^n such that

\frac{1}{N}\sum_{i=1}^{N} u^{(i)} \;\xrightarrow{\;N\to\infty\;}\; u_{n,q}^{CM}(\cdot\,;\alpha_n) = \int_{\mathbb{R}^n}\left(\sum_{j=1}^{n} u_j^n\,\psi_j^n(\cdot)\right)\pi_{kn}^q(u_1^n,\ldots,u_n^n \mid m^k)\,du_1^n\cdots du_n^n.
Since we rely on a Markov chain for the generation of the sequence, the Markov property ensures that within this scheme the generation of the subsequent state u^(i+1) only relies on the current state u^(i).
Note that, if we use an MCMC method for obtaining samples of the posterior distribution, then our approximation of the CM estimates becomes

u_{n,q}^{CM}(\cdot\,; \alpha_n) \approx \frac{1}{N - N_0} \sum_{i=N_0+1}^{N} u^{(i)},

where the first N_0 > 0 samples have been discarded, because MCMC algorithms typically need a burn-in period before the samples start to explore the posterior distribution representatively.
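In code, the resulting estimator is simply a sample mean over the chain after the burn-in; the sketch below is ours (function names are assumptions), and the second helper evaluates the corresponding piecewise linear function Σ_j u_j ψ_j^n, which for hat functions reduces to linear interpolation of the nodal values.

```python
import numpy as np

def cm_estimate(samples, n_burn):
    """Empirical mean of the chain after discarding the first n_burn samples;
    samples has shape (N, n), one state u^(i) per row."""
    return samples[n_burn:].mean(axis=0)

def evaluate_on_grid(coeffs, t):
    """Evaluate sum_j u_j psi_j^n(t) for the roof-top basis on the uniform grid
    x_j = j/(n+1), with the boundary values u_0 = u_{n+1} = 0."""
    n = coeffs.size
    x = np.linspace(0.0, 1.0, n + 2)                   # grid x_0, ..., x_{n+1}
    vals = np.concatenate(([0.0], coeffs, [0.0]))
    return np.interp(t, x, vals)                       # piecewise linear interpolation
```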
In particular, among all the MCMC methods, we use the Metropolis-Hastings (MH) algorithm, as was done in [22]. A detailed description of the
MCMC methods in general and of the MH method in particular can be found
in Section 3.6 of [18].
To implement the MH algorithm we define the proposal distribution Q(v, ·)
on Rn , where v ∈ Rn is the current state, as follows.
Fix 1 ≤ Nupdate ≤ n and ν > 0. Pick randomly Nupdate distinct numbers from
the set 1, 2, ..., n according to a uniform probability distribution. Order the
numbers and denote them j_1, ..., j_{N_update}. Then a candidate vector u ∈ R^n is picked according to Q(v, ·) if u = v + ν̃, where

ν̃ = [0, \ldots, 0, ν̃_{j_1}, 0, \ldots, 0, ν̃_{j_2}, 0, \ldots, 0, ν̃_{j_{N_update}}, 0, \ldots, 0]^T

with ν̃_{j_ℓ} ∼ N(0, ν) independent random numbers. Note that if π_Q(v, u) denotes the density of Q(v, ·), the transition probabilities are symmetric, i.e. π_Q(v, u) = π_Q(u, v).
Due to the above symmetry, the MH algorithm takes the simple form:
1. Set i = 0 and initialize u^(0) := [0, 0, ..., 0]^T;

2. Set u = u^(i) + ν̃;

3. If π_{kn}^q(u | m^k) ≥ π_{kn}^q(u^(i) | m^k), then set u^(i+1) = u and go to 5;

4. If π_{kn}^q(u | m^k) < π_{kn}^q(u^(i) | m^k), draw a random number s from the uniform distribution on [0, 1]. If s ≤ π_{kn}^q(u | m^k) / π_{kn}^q(u^(i) | m^k), then set u^(i+1) := u, otherwise set u^(i+1) = u^(i);

5. If i = N then stop; otherwise set i ← i + 1 and go to 2.
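A compact sketch of this sampler is given below. It follows the steps above, except that it works with the logarithm of the unnormalized posterior (4.5) instead of the probability ratio, which is our choice for numerical stability; the function names, the default tuning values, and the interpretation of ν as the variance of the Gaussian perturbation are assumptions.

```python
import numpy as np

def log_posterior(u, A, D, m, sigma, alpha, q):
    # logarithm of the unnormalized q-TV posterior (4.5); C_{q,kn} cancels in the ratio
    residual = A @ u - m
    return -np.sum(residual**2) / (2.0 * sigma**2) - alpha * np.sum(np.abs(D @ u)) ** q

def metropolis_hastings(A, D, m, sigma, alpha, q, N, n_update=1, nu=1e-4, rng=None):
    """Sketch of the MH sampler with the random-coordinate Gaussian proposal."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    u = np.zeros(n)                                     # step 1: start from the zero vector
    lp = log_posterior(u, A, D, m, sigma, alpha, q)
    samples = np.empty((N, n))
    accepted = 0
    for i in range(N):
        proposal = u.copy()                             # step 2: perturb n_update coordinates
        idx = rng.choice(n, size=n_update, replace=False)
        proposal[idx] += rng.normal(0.0, np.sqrt(nu), size=n_update)  # nu treated as a variance
        lp_prop = log_posterior(proposal, A, D, m, sigma, alpha, q)
        # steps 3-4: accept with probability min(1, pi(proposal)/pi(current))
        if np.log(rng.uniform()) <= lp_prop - lp:
            u, lp = proposal, lp_prop
            accepted += 1
        samples[i] = u
    return samples, accepted / N
```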
Finally, define the acceptance rate of the Markov chain produced by the MH
algorithm (discarding the first N0 samples) as
r = \frac{\text{number of accepted candidates}}{N - N_0}.
A high acceptance rate close to 1 means that the chain is moving very slowly, which in turn leads to a very slow exploration of the posterior and thus to a slow convergence of the mean. A low acceptance rate close to 0 leads to the opposite, as the chain does not move at all for a large number of samples. Both extremes lead to a high autocorrelation of the chain, and thus to slow convergence. To attain an acceptance rate around 0.35, ν is tuned during the burn-in period: we start with a very small ν (which leads to a high acceptance rate, since the current state and the proposal will not be very different). Every 10000 samples, the acceptance rate over these samples is computed, and ν is multiplied by 1.1 if the rate is above 0.35.
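The tuning rule just described amounts to a one-line update applied after each block of samples during burn-in; the sketch below is ours, and the symmetric shrinking step for acceptance rates that are too low is not described in the text and is therefore left out.

```python
def tune_proposal_variance(nu, block_acceptance_rate, target=0.35, factor=1.1):
    """Enlarge the proposal variance nu after a block of samples if the chain
    accepted too often (rate above the target), as described above."""
    if block_acceptance_rate > target:
        nu *= factor
    return nu
```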
4.4 Results
In our numerical examples we take u to be the simplest step function possible,
i.e.
u(t) = \begin{cases} 1 & t \in [1/3, 2/3] \\ 0 & \text{otherwise.} \end{cases}
Figure 4.2: Simulated intensity distribution u(t).
We then consider a measurement with k = 2^5 − 2 = 30 pixels and random errors E_j^k ∼ N(0, σ^2) for j = 1, ..., k, where σ = 0.001. In Figure 4.3 we reproduce the simulated noisy measurements.
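For reference, the simulated data of this section can be generated as follows; the fixed random seed and the helper names are our own, and the pixel integrals of the step function are computed analytically rather than through the matrix A^k.

```python
import numpy as np

k = 2**5 - 2                                    # 30 pixels
sigma = 0.001
edges = np.arange(1, k + 2) / (k + 2)           # pixel boundaries x_1^{k+1}, ..., x_{k+1}^{k+1}

def integral_of_step(a, b, lo=1.0/3.0, hi=2.0/3.0):
    # exact integral over [a, b] of the indicator function of [lo, hi]
    return max(0.0, min(b, hi) - max(a, lo))

rng = np.random.default_rng(0)                  # fixed seed, our choice
exact = np.array([integral_of_step(edges[j], edges[j + 1]) for j in range(k)])
m = exact + rng.normal(0.0, sigma, size=k)      # noisy measurements m^k (cf. Figure 4.3)
```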
4.4.1 Results for the MAP estimates
For the MAP estimates, we perform all the computations relative to the cases
q = 1 and q = 2.
Figure 4.3: Simulated noisy measurements m^k.
TV MAP reconstruction
We first reconstruct the MAP estimates for the case q = 1. We perform the reconstructions for two different choices of the parameters αn, i.e. for αn constant and αn ∼ √(n + 1), and for four different choices of n, i.e. n = 63, 255, 1023, 4095, or equivalently ℓ = 6, 8, 10, 12. From the theory presented in Section 2.3.1, we expect convergence to some limit in the case in which αn is constant and convergence to zero in the case in which αn ∼ √(n + 1).
In the case in which αn is constant, we choose it to be equal to the same constant used in [22], i.e. 135. This value has been determined by numerical experimentation as the most suitable parameter for MAP estimates with TV priors at fixed discretization level ℓ = 6, i.e. n = 63.
In the case in which αn ∼ √(n + 1), we choose the constant in front of the square root in such a way that α_63 = 135, i.e. αn = (135/8)√(n + 1). The result is shown in Figure 4.4, from which we can see that the TV MAP estimates converge to some function in the case in which αn is constant, and that there is a steady loss of contrast, hence convergence to zero, in the case in which αn ∼ √(n + 1). The theoretical expectations are thus confirmed.
2-TV MAP reconstruction
In the case of 2-TV priors, we reconstruct the MAP estimates u_{n,2}^{MAP}(·, βn) with βn constant and βn ∼ √(n + 1) for n = 63, 255, 1023 and 4095 as before. From the theory presented in Section 3.2, we still expect convergence to some limit in the first case and convergence to zero in the second case.
Figure 4.4: TV MAP estimates; (a) αn = 135, (b) αn = (135/8)√(n + 1). Panels show n = 63, 255, 1023, 4095.
Figure 4.5: 2-TV MAP estimates; (a) βn = 135/TV(u_63^{MAP}(·, 135)) ≈ 70, (b) βn = (70/8)√(n + 1). Panels show n = 63, 255, 1023, 4095.
We choose βn in such a way that, using formula (3.5) with q = 2, we obtain an αn which agrees with the one chosen for the TV MAP estimates. In further detail, in the case βn constant we choose βn = 135/TV(u_63^{MAP}(·, 135)) ≈ 70, and in the case βn ∼ √(n + 1) we choose βn ≈ (70/8)√(n + 1). The result is shown in Figure 4.5, which once more confirms our theoretical expectations, i.e. convergence to some limit function in the case in which βn is constant and loss of contrast in the case in which βn ∼ √(n + 1).
4.4.2 Results for the CM estimates
As regards the CM estimates, we perform the computations in the cases
q = 1, 2 and 10.
Figure 4.6: TV CM estimates; (a) αn = 135, (b) αn = (135/8)√(n + 1), (c) αn = (135/64)(n + 1). Panels show n = 63, 255, 1023, 4095.
TV CM estimates
In the TV case, we produce the CM estimates with the parameters αn constant, αn ∼ √(n + 1), and αn ∼ n + 1. From the theory presented in Section 2.3.2, we expect the CM estimates not to converge, to converge to a smooth limit, and to converge to zero, respectively. We calculate the CM estimates for n = 63, 255, 1023 and 4095 as before. We choose the parameters in the following way: for αn constant, we simply choose αn = 135; for αn ∼ √(n + 1), we choose αn = (135/8)√(n + 1); for αn ∼ n + 1, we choose αn = (135/64)(n + 1). In this way, we obtain the same value for αn when n = 63. The results obtained are shown in Figure 4.6. These results do not seem to be very satisfactory, since in the case αn ∼ √(n + 1) we should obtain convergence to a smooth limit, but we do not. Actually, in [22] it has already been noted that the CM estimate for n = 4095 was not of the best possible quality; the reason has been attributed to the very slow convergence of the chain. In fact, it is not completely clear whether the results we have obtained really represent the CM estimates or whether they are just a sign that the chain is not converging, and thus that the method we are using is not the most efficient possible.
Figure 4.7: 2-TV CM estimates; (a) βn = 135/TV(u_63^{MAP}(·, 135)) ≈ 70, (b) βn = (70/8)√(n + 1). Panels show n = 63, 255, 1023, 4095.
2-TV CM estimates
In the 2-TV case, we reconstruct the CM estimates u_{n,2}^{CM}(·, βn) in the cases βn constant and βn ∼ √(n + 1). From the theory presented in Section 3.3.1, we expect the CM estimates to converge to some limit and to converge to zero, respectively. We calculate the CM estimates for n = 63, 255, 1023 and 4095 as before. The parameters are chosen in the following way: for the case in which βn is constant, we choose βn = 135/TV(u_63^{MAP}(·, 135)) ≈ 70, so that βn is obtained from αn = 135 through the inverse of formula (3.5) with q = 2; for the case in which βn ∼ √(n + 1), we choose βn ≈ (70/8)√(n + 1). In this way, we obtain the same value for βn when n = 63, which is the value corresponding to αn = 135, where αn is obtained from βn through inversion of formula (3.5) with q = 2. The results we have obtained are shown in Figure 4.7. Again, the CM estimates in the case n = 4095 are probably not reliable.
Figure 4.8: 10-TV CM estimates; (a) βn = 135/(TV(u_{63,10}^{MAP}(·, 135)))^9 ≈ 0.33, (b) βn ≈ (0.33/8)√(n + 1). Panels show n = 63, 255, 1023, 4095.
10-TV CM estimates
In the 10-TV case, we produce CM estimates u_{n,10}^{CM}(·, βn) for βn constant and βn ∼ √(n + 1). As mentioned in Section 3.3.2, we are not able to find the right scaling for βn which would allow us to obtain convergence of the CM estimates in the case q = 10; hence, we do not have a precise idea of what we should expect from 10-TV CM estimates. We calculate the CM estimates for n = 63, 255, 1023 and 4095 as before. We choose the parameters in the following way: for βn constant, we choose βn = 135/(TV(u_{63,10}^{MAP}(·, 135)))^9 ≈ 0.33; for βn ∼ √(n + 1), we choose βn ≈ (0.33/8)√(n + 1). In this way, we obtain the same value for βn when n = 63, which is the value corresponding to αn = 135, where αn is obtained from βn through the inverse of formula (3.5) with q = 10. The results we have obtained are shown in Figure 4.8.
Figure 4.9: 10-TV CM estimates; βn = 0.01. Panels show n = 63, 255, 1023, 4095.
They look quite interesting. In fact, in the case in which βn ∼ √(n + 1) we have convergence to zero, which could be reasonable. In the case in which βn is constant, the CM estimates seem to converge to some well-defined limit function. We then decided to produce another group of CM estimates with βn constant, this time choosing a smaller value for the constant, i.e. βn = 0.01. The results are shown in Figure 4.9. It seems that we again have convergence of the CM estimates to some limit function. This might mean that the right scaling for βn to obtain convergence is βn constant, even in the case q ≠ 2.
4.4.3 Summary
Finally, we summarize all the results we have obtained in the following tables.
MAP ESTIMATES

q  | αn or βn                                   | n                   | Expected limit | Obtained limit
1  | αn = 135                                   | 63, 255, 1023, 4095 | some limit     | some limit
1  | αn = (135/8)√(n + 1)                       | 63, 255, 1023, 4095 | 0              | 0
2  | βn = 135/TV(u_63^{MAP}(·, 135)) ≈ 70       | 63, 255, 1023, 4095 | some limit     | some limit
2  | βn = (70/8)√(n + 1)                        | 63, 255, 1023, 4095 | 0              | 0

CM ESTIMATES

q  | αn or βn                                        | n                   | Expected limit | Obtained limit
1  | αn = 135                                        | 63, 255, 1023, 4095 | does not exist | does not exist
1  | αn = (135/8)√(n + 1)                            | 63, 255, 1023, 4095 | smooth limit   | inconclusive
1  | αn = (135/64)(n + 1)                            | 63, 255, 1023, 4095 | 0              | 0
2  | βn = 135/TV(u_63^{MAP}(·, 135)) ≈ 70            | 63, 255, 1023, 4095 | some limit     | inconclusive
2  | βn = (70/8)√(n + 1)                             | 63, 255, 1023, 4095 | 0              | 0
10 | βn = 135/(TV(u_{63,10}^{MAP}(·, 135)))^9 ≈ 0.33 | 63, 255, 1023, 4095 | unknown        | some limit
10 | βn = 0.01                                       | 63, 255, 1023, 4095 | unknown        | some limit
10 | βn = (0.33/8)√(n + 1)                           | 63, 255, 1023, 4095 | unknown        | 0
Conclusions
In conclusion, the introduction of the novel class of q-total variation priors has proven significant in the search for an edge-preserving, discretization-invariant model of the a priori knowledge of the unknown. Convergence of the MAP estimates is not changed by the introduction of the new priors and, in addition, we have obtained convergence results for the CM estimates in the case of q-total variation priors with q = 2. These convergence results do not imply convergence of the CM estimates to a smooth limit; this leaves open the possibility of convergence of the CM estimates to a non-smooth limit, which would preserve the discretization invariance of the novel priors. We have also begun the investigation of the convergence of the CM estimates in the case of q-total variation priors with q ≠ 2. The computational results for the case q = 10 suggest that, also in the general case, the correct choice for the parameter scaling is a constant; however, a rigorous proof of convergence must still be carried out.
Bibliography
[1] L. Ambrosio, N. Fusco and D. Pallara, Functions of Bounded Variation
and Free Discontinuity Problems, Oxford Mathematical Monographs,
Oxford University Press, 2000.
[2] A. Araujo and E. Giné, The Central Limit Theorem for Real and Banach
Valued Random Variables, Wiley Series in Probability and Statistics,
John Wiley & Sons, Inc, 1980.
[3] H. Attouch, Variational Convergence for Functions and Operators, Applicable Mathematics Series. Pitman Advanced Publishing Program,
1984.
[4] H. Attouch, G. Buttazzo and G. Michaille, Variational Analysis in
Sobolev and BV Spaces. Applications to PDE's and Optimization, MPS-SIAM Series on Optimization, MPS and SIAM, 2006.
[5] G. Aubert and P. Kornprobst, Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations,
Applied Mathematical Sciences, 147, Springer, 2000.
[6] M. Benning and M. Burger, Error Estimates for General Fidelities, Electronic Transactions on Numerical Analysis, 38, Kent State University,
2011.
[7] P. Billingsley, Weak Convergence of Measures: Applications in Probability, CBMS-NSF Regional Conference Series in Applied Mathematics,
5, SIAM, 1971.
[8] P. Billingsley, Convergence of Probability Measures, Wiley Series in
Probability and Statistics, John Wiley & Sons, Inc, 1999.
[9] M. Burger and S. Osher, Convergence Rates of Convex Variational Regularization, Inverse Problems, 20, 2004.
[10] M. Burger and S. Osher, A Guide to the TV Zoo I: Models and Analysis,
Level Set and PDE-Based Reconstruction, Springer, 2009.
[11] V. Capasso and D. Morale, Una Guida allo Studio della Probabilità e della Statistica Matematica, Progetto Leonardo, Società Editrice Esculapio s.r.l., 2009.
[12] T. F. Chan and J. Shen, Variational Image Deblurring-A Window into
Mathematical Image Processing, WSPC/Lecture Notes Series, 2004.
[13] F. Demengel and R. Temam, Convex functions of a measure and applications, Indiana Univ. Math. J., 33, 1984.
[14] R. M. Dudley, Real Analysis and Probability, Cambridge studies in advanced mathematics, Cambridge University Press, 2004.
[15] P. P. B. Eggermont, V. N. LaRiccia and M. Z. Nashed, On Weakly
Bounded Noise in Ill-Posed Problems, Inverse Problems, 25, 2009.
[16] T. Goldstein and S. Osher, The Split Bregman Method for L1-Regularized Problems, SIAM Journal on Imaging Sciences, 2009.
[17] T. Helin, Discretization and Bayesian Modelling in Inverse Problems and
Imaging, Aalto University School of Science and Technology, 2010.
[18] J. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, Applied Mathematical Sciences, 160, Springer, 2004.
[19] A. Klenke, Probability Theory: A Comprehensive Course, Springer,
2008.
[20] S. Lasanen, Discretizations of Generalized Random Variables with Applications to Inverse Problems, Academia Scientiarum Fennica, 2002.
[21] M. Lassas, E. Saksman and S. Siltanen, Discretization-Invariant
Bayesian Inversion and Besov Space Priors, Inverse Problems and Imaging, 3, 2009.
[22] M. Lassas and S. Siltanen, Can One Use Total Variation Prior for Edge-Preserving Bayesian Inversion?, Inverse Problems, 20, 2004.
[23] M. S. Lehtinen, L. Päivärinta and E. Somersalo, Linear Inverse Problems
for Generalized Random Variables, Inverse Problems, 5, 1989.
[24] D. MacKay, Information Theory, Inference and Learning Algorithms,
Cambridge University Press, 2003.
[25] P. Mathé and U. Tautenhahn, Regularization under General Noise Assumptions, Inverse Problems, 27, 2011.
[26] V. Maz’ya, Sobolev Spaces with Applications to Elliptic Partial Differential Equations, A Series of Comprehensive Studies in Mathematics,
342, Springer, 2011.
[27] A. Neubauer and H. K. Pikkarainen, Convergence Results for the
Bayesian Inversion Theory, Inverse Ill-Posed Problems, 16, 2008.
[28] M. Nikolova, Model Distortions in Bayesian MAP Reconstruction, Inverse Problems and Imaging, 2, 2007.
[29] L. I. Rudin, S. Osher and E. Fatemi, Nonlinear Total Variation Based
Noise Removal Algorithms, Physica D, 60, 1992.
[30] O. Scherzer, Handbook of Mathematical Methods in Imaging, Springer,
2011.
[31] S. Siltanen, Discretization-Invariant Bayesian Inversion and Besov Space
Priors, Inverse and Partial Information Problems: Methodology and
Applications, RICAM, Linz, Austria, 2008.
[32] S. Siltanen, Discretization-Invariant Bayesian Inversion, Inverse Problems and Control Theory Research Group Seminar, Centre de Mathématiques et Informatiques, Marseille, France, 2011.
[33] N. N. Vakhania, V. I. Tarieladze and S. A. Chobanyan, Probability
Distributions on Banach Spaces, Mathematics and Its Applications, D.
Reidel Publishing Company,1987.
[34] W. P. Ziemer, Weakly Differentiable Functions. Sobolev Spaces and
Functions of Bounded Variation, Graduate Texts in Mathematics, 120,
Springer, 1989.