
Probabilities and density functions
Let (Ω, Σ, P) be a probability space and X : Ω → R^n a random vector.
• The pdf is a tool for calculating the probabilities P (X ∈ B).
• This tool has some limitations:
– Not all random vectors have a pdf.
– A pdf is not unique.
Example 33 (Probability distribution without a pdf). Let X be a random variable with pdf
f_X : R → [0, ∞). The random vector (X, X) does not have a pdf.
Proof by contradiction: Assume that there is a pdf f_(X,X)(x, y). Denote
\[
B = \{(x, y) \in \mathbb{R} \times \mathbb{R} : x \neq y\}
\]
(B is a Borel set whose indicator function 1_B(x, y) is Riemann integrable). The probability
distribution assigns the set B the value P((X, X) ∈ B) = 0, since (X, X) ∉ B. From the
existence of the pdf it follows that
\[
0 = P((X, X) \in B) = \int_B f_{(X,X)}(x, y)\,dx\,dy
= \int_{x=-\infty}^{\infty} \int_{y=-\infty}^{x} f_{(X,X)}(x, y)\,dy\,dx
+ \int_{x=-\infty}^{\infty} \int_{y=x}^{\infty} f_{(X,X)}(x, y)\,dy\,dx = 1,
\]
which is impossible. (The last equality holds since, for each fixed x, the two inner integrals
together give the integral over all y ∈ R, so the sum is the integral of the pdf over the whole
plane, which equals 1. The slightly dubious use of Fubini's theorem is not actually necessary
here, since we can also divide the integration domain into the corresponding parts.) Hence,
there is no pdf for (X, X).
Example 34 (Pdf is not unique). Let X : Ω → R be a random variable with pdf
\[
f_X(x) = 1_{[0,1]}(x). \tag{5.1}
\]
For every a < b it holds that
\[
P(X \in [a, b]) = \int_a^b 1_{[0,1]}(x)\,dx = \int_a^b 1_{(0,1)}(x)\,dx.
\]
Hence, also
\[
\tilde{f}_X(x) = 1_{(0,1)}(x)
\]
is a pdf of X. Clearly f̃_X ≢ f_X. For a multidimensional example, take an n-dimensional
random vector X = (X_1, ..., X_n) with statistically independent components, each with pdf
given by (5.1). Then
\[
f_X(x_1, \ldots, x_n) = 1_{[0,1]^n}(x_1, \ldots, x_n)
\quad\text{and}\quad
\tilde{f}_X(x_1, \ldots, x_n) = 1_{(0,1)^n}(x_1, \ldots, x_n)
\]
define the same probability distribution.
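For illustration, a small Python sketch (the test intervals and the grid resolution are arbitrary choices) that checks numerically that the two versions above assign the same probability to intervals [a, b]:

```python
import numpy as np

# Two versions of the pdf of X from Example 34.
def f_closed(x):                      # 1_[0,1]
    return ((x >= 0.0) & (x <= 1.0)).astype(float)

def f_open(x):                        # 1_(0,1)
    return ((x > 0.0) & (x < 1.0)).astype(float)

# P(X in [a,b]) via a midpoint Riemann sum; both versions agree because the
# two indicator functions differ only on the null set {0, 1}.
def prob(f, a, b, n=200_000):
    x = np.linspace(a, b, n, endpoint=False) + (b - a) / (2 * n)
    return np.sum(f(x)) * (b - a) / n

for a, b in [(-0.5, 0.3), (0.2, 0.7), (0.5, 2.0)]:
    print(f"[{a:4.1f},{b:4.1f}]  closed: {prob(f_closed, a, b):.6f}  open: {prob(f_open, a, b):.6f}")
```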
Definition 31. Let X : Ω → R^n be a random vector. The different pdfs f_X : R^n → [0, ∞)
satisfying
\[
P(X \in B) = \int_B f_X(x)\,dx
\]
for all hyperrectangles B ⊂ R^n are called versions of the pdf of X.
Remark 20. Let X be an n-dimensional and Y an m-dimensional random vector such that
the random vector (X, Y) has a (joint) pdf f_(X,Y)(x, y). When the marginal pdf
\[
f_X(x) = \int_{\mathbb{R}^m} f_{(X,Y)}(x, y)\,dy
\]
exists, it is a version of the pdf of X. Indeed,
\[
P(X \in B) = P(\{X \in B\} \cap \{Y \in \mathbb{R}^m\}) = P((X, Y) \in B \times \mathbb{R}^m)
= \int_{B \times \mathbb{R}^m} f_{(X,Y)}(x, y)\,dx\,dy = \int_B f_X(x)\,dx
\]
for every hyperrectangle B ⊂ R^n.
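For illustration, a minimal Python sketch of Remark 20, assuming an arbitrarily chosen bivariate Gaussian joint pdf with correlation ρ: numerically integrating the joint pdf over y recovers the analytic marginal pdf of X.

```python
import numpy as np

rho = 0.6  # assumed correlation; any value in (-1, 1) works

def f_joint(x, y):
    """Joint pdf of a bivariate standard Gaussian with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

# Marginal pdf f_X(x) = integral of f_(X,Y)(x, y) over y, on a finite grid.
y = np.linspace(-8, 8, 4001)
dy = y[1] - y[0]
for x in [-1.0, 0.0, 0.5, 2.0]:
    f_marg = np.sum(f_joint(x, y)) * dy
    f_true = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal pdf
    print(f"x = {x:4.1f}: numerical {f_marg:.6f}, analytic {f_true:.6f}")
```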
The next lemma shows that a continuous version of the pdf is unique.
Lemma 7. Let X : Ω → R^n be a random vector whose pdf has versions f_X and f̃_X. If
there exists an open set O ⊂ R^n such that
\[
\int_O f_X(x)\,dx = 1
\]
and f_X : O → [0, ∞) and f̃_X : O → [0, ∞) are continuous, then
\[
f_X(x) = \tilde{f}_X(x)
\]
for all x ∈ O.
Proof. Let f_X, f̃_X and O satisfy the assumptions. Proof by contradiction: assume that
f_X ≠ f̃_X somewhere in O. Then there exist x_0 ∈ O and δ > 0 such that
\[
f_X(x_0) - \tilde{f}_X(x_0) \geq \delta > 0
\]
(if the difference is negative, just exchange the roles of f_X and f̃_X). By the continuity of
g := f_X − f̃_X there exists r > 0 such that
\[
|g(x_0) - g(x)| < \delta/2
\]
whenever x ∈ B(x_0, r). Then
\[
g(x) = g(x_0) - (g(x_0) - g(x)) \geq g(x_0) - |g(x_0) - g(x)| \geq \delta - \delta/2 = \delta/2
\]
for every x ∈ B(x_0, r). Let B ⊂ B(x_0, r) be a hyperrectangle with positive volume. Then
\[
\int_B g(x)\,dx \geq \int_B \frac{\delta}{2}\,dx = \frac{\delta\,\mathrm{Vol}(B)}{2} > 0.
\]
On the other hand, both versions of the pdf determine the same distribution, so that
\[
\int_B g(x)\,dx = \int_B \big(f_X(x) - \tilde{f}_X(x)\big)\,dx = 0,
\]
which contradicts the positivity shown above. Hence g ≡ 0 in O.
Conditional pdfs
Let (Ω, Σ, P ) be a probability space.
Definition 32. Let X : Ω → R^n and Y : Ω → R^m be random vectors with joint pdf
f_(X,Y) : R^n × R^m → R whose marginal pdf satisfies 0 < f_Y(y_0) < ∞ at y_0 ∈ R^m. The
conditional pdf of X given Y = y_0 is the mapping
\[
\mathbb{R}^n \ni x \mapsto f_X(x \mid Y = y_0) = \frac{f_{(X,Y)}(x, y_0)}{f_Y(y_0)}. \tag{5.2}
\]
Further aspects of the conditional pdf can be treated using abstract measure theory (not
done in this course).
Remark 21. The condition Y = y0 means that a random event has happened
and the random vector Y has attained the sample value y0 = Y (ω0 ), where ω0 ∈
Ω. In practical inverse problems this means that the noisy value of the data is
observed/measured (the noisy data is then available aka given).
• The conditional pdf is a pdf:
\[
\int_{\mathbb{R}^n} f_X(x \mid Y = y_0)\,dx
= \frac{1}{f_Y(y_0)} \int_{\mathbb{R}^n} f_{(X,Y)}(x, y_0)\,dx
\overset{\text{marginal pdf}}{=} \frac{f_Y(y_0)}{f_Y(y_0)} = 1.
\]
• If X and Y are statistically independent, then knowing the value of Y does not
affect the distribution of X, since
\[
f_X(x \mid Y = y_0) = \frac{f_{(X,Y)}(x, y_0)}{f_Y(y_0)}
\overset{\text{independence}}{=} \frac{f_X(x)\, f_Y(y_0)}{f_Y(y_0)} = f_X(x).
\]
The significance of the conditional pdf in statistical inverse problems is based on the
fact that there is dependence between the unknown X and the data Y. When Y = y_0 is
given, this information can change the distribution of the unknown X.
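For illustration, a minimal Python sketch of Definition 32, assuming an arbitrarily chosen bivariate Gaussian joint pdf with correlation ρ and an arbitrary observed value y_0: the conditional pdf is the slice f_(X,Y)(·, y_0) normalised by the marginal f_Y(y_0), and for this joint pdf it should agree with the known conditional N(ρ y_0, 1 − ρ²).

```python
import numpy as np

rho, y0 = 0.8, 1.5          # assumed correlation and observed value Y = y0

def f_joint(x, y):
    """Bivariate standard Gaussian pdf with correlation rho."""
    z = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
    return np.exp(-z / 2) / (2 * np.pi * np.sqrt(1 - rho**2))

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

# Marginal f_Y(y0), approximated by a Riemann sum over the x-grid.
f_Y_y0 = np.sum(f_joint(x, y0)) * dx

# Conditional pdf of X given Y = y0, as in (5.2).
f_cond = f_joint(x, y0) / f_Y_y0

# Compare with the known conditional N(rho*y0, 1 - rho^2) at a few points.
mean, var = rho * y0, 1 - rho**2
f_exact = np.exp(-(x - mean)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
for xi in [0.0, 1.0, 2.0]:
    i = np.argmin(np.abs(x - xi))
    print(f"x = {xi}: grid {f_cond[i]:.6f}, exact {f_exact[i]:.6f}")
print("integrates to", np.sum(f_cond) * dx)   # close to 1: the conditional pdf is a pdf
```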
Conditional pdfs are more easily handled with Bayes' formula.
Theorem 16 (Bayes' formula). Let X : Ω → R^n and Y : Ω → R^m be two random vectors,
let f_X be the pdf of X and f_Y(y | X = x) the conditional pdf of Y given X = x (or let f_Y
be the pdf of Y and f_X(x | Y = y) the conditional pdf of X given Y = y).
If the mapping
\[
(x, y) \mapsto f_Y(y \mid X = x)\, f_X(x)
\qquad \big(\text{or } (x, y) \mapsto f_X(x \mid Y = y)\, f_Y(y)\big)
\]
is integrable, then it is a (Riemann-integrable) version of the joint pdf f_(X,Y)(x, y).
Proof. Skipped. (Not hard with the Lebesgue integral, but it requires considerably more material with the Riemann integral.)
Different versions are tied together with the help of continuity.
Corollary 4. Let X : Ω → R^n and Y : Ω → R^m be two random vectors. If there exist
open sets O_1 ⊂ R^n and O_2 ⊂ R^m satisfying
\[
\int_{O_1} f_X(x)\,dx = 1
\quad\text{and}\quad
\int_{O_2} f_Y(y \mid X = x)\,dy = 1 \ \ \text{for all } x, \tag{5.3}
\]
and, moreover, f_X is a bounded continuous function on O_1 and (x, y) ↦ f_Y(y | X = x) is
a bounded continuous function on O_1 × O_2, then
\[
f_X(x \mid Y = y_0) = \frac{f_Y(y_0 \mid X = x)\, f_X(x)}{\int f_Y(y_0 \mid X = x)\, f_X(x)\,dx} \tag{5.4}
\]
is a conditional pdf of X given Y = y_0 that is uniquely determined and continuous on O_1,
whenever y_0 ∈ O_2 and ∫ f_Y(y_0 | X = x) f_X(x) dx > 0.
Proof. The product of two bounded Riemann-integrable functions is Riemann integrable.
By Theorem 16, the product f_Y(y | X = x) f_X(x) is a version of f_(X,Y)(x, y). Since
\[
\int_{O_1 \times O_2} f_X(x)\, f_Y(y \mid X = x)\,dx\,dy
\overset{\text{Fubini}}{=} \int_{O_1} f_X(x) \left( \int_{O_2} f_Y(y \mid X = x)\,dy \right) dx
\overset{(5.3)}{=} 1,
\]
Lemma 7 applies to f_X(x) f_Y(y | X = x) with O = O_1 × O_2, which proves the uniqueness of f_(X,Y) on O by continuity. By Definition 32,
\[
f_X(x \mid Y = y_0) = \frac{f_{(X,Y)}(x, y_0)}{f_Y(y_0)}
= \frac{f_Y(y_0 \mid X = x)\, f_X(x)}{\int f_Y(y_0 \mid X = x)\, f_X(x)\,dx}
\]
whenever the denominator is positive. The finiteness of the denominator follows from the
boundedness of the pdfs. The value
\[
\int f_Y(y_0 \mid X = x)\, f_X(x)\,dx
= \int_{O_1} f_Y(y_0 \mid X = x)\, f_X(x)\,dx
+ \int_{O_1^{C}} f_Y(y_0 \mid X = x)\, f_X(x)\,dx
\]
does not depend on the second integral, which vanishes:
\[
\int_{O_1^{C}} f_Y(y_0 \mid X = x)\, f_X(x)\,dx
\leq \Big( \sup_x f_Y(y_0 \mid X = x) \Big) \int_{O_1^{C}} f_X(x)\,dx
\overset{(5.3)}{=} \Big( \sup_x f_Y(y_0 \mid X = x) \Big) \int \big(1 - 1_{O_1}(x)\big) f_X(x)\,dx
= \sup_x f_Y(y_0 \mid X = x)\,(1 - 1) = 0.
\]
Definition 33. Let X and Y be as in Def. 32. The conditional expectation of X given
Y = y_0 is
\[
E[X \mid Y = y_0] = \int_{\mathbb{R}^n} x\, f_X(x \mid Y = y_0)\,dx,
\]
if the integral exists.
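For illustration, a minimal Python sketch of formula (5.4) and Definition 33, assuming an arbitrary uniform prior on (0, 1), a Gaussian likelihood with standard deviation σ and an observed value y_0: the posterior on a grid is the normalised product of likelihood and prior, and E[X | Y = y_0] is its first moment.

```python
import numpy as np

sigma, y0 = 0.3, 0.8          # assumed noise std and observed data value

x = np.linspace(0, 1, 20001)  # grid over the prior's support O1 = (0, 1)
dx = x[1] - x[0]

f_prior = np.ones_like(x)                        # f_X(x) = 1_(0,1)(x)
f_like = np.exp(-(y0 - x)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

# Conditional pdf of X given Y = y0, formula (5.4): normalised product.
numer = f_like * f_prior
f_post = numer / (np.sum(numer) * dx)

# Conditional expectation, Definition 33.
cond_mean = np.sum(x * f_post) * dx
print("E[X | Y = y0] ~", cond_mean)
print("posterior integrates to", np.sum(f_post) * dx)
```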
We do not prove the following theorem, since the proof requires measure theoretic
tools.
Theorem 17. Let X be an R^n-valued random vector that is statistically independent of
the R^m-valued random vector Z. Let G : R^n × R^m → R^k be a continuous mapping and
let G(x_0, Z) have a pdf for every x_0 ∈ R^n. Then
\[
f_{G(X,Z)}(y \mid X = x_0) = f_{G(x_0, Z)}(y)
\]
for all y ∈ R^k.
Transformations of random vectors
Theorem 18. Let G : R^n → R^m be a continuous function and let X : Ω → R^n be a random
vector. Then G(X) is a random vector.
Proof. We only need to prove that the preimage G^{-1}(B) of every open set B ⊂ R^m
is open; the same then also holds when the word open is replaced with the word Borel.
But this requirement is exactly the topological definition of continuity.
Example 35. Let X : Ω → R^n and ε : Ω → R^m be random vectors. The following are also
random vectors:
1. aX, a ∈ R
2. X + a, a ∈ R^n
3. ‖X‖ (a random variable)
4. Y = F(X) + ε, where F : R^n → R^m is continuous.
Example 36. Let us determine the pdf of X + a when the pdf of X is f_X and a ∈ R^n is
a constant vector. The distribution of X + a is of the form
\[
P(X + a \in B) = P(X \in B - a) = \int_{B - a} f_X(x)\,dx, \tag{5.5}
\]
where B − a is the translation of the hyperrectangle B, that is,
\[
B - a = \{x - a : x \in B\}.
\]
Let us make the change of variables H(x) = x − a in (5.5). We obtain
\[
P(X + a \in B) = \int_{B - a} f_X(x)\,dx = \int_B f_X(x - a)\,dx
\]
for every hyperrectangle B, so that f_{X+a}(x) = f_X(x − a).
The combination of Example 36 and Theorem 17 leads to the following corollary.
Corollary 5. Let F : R^n → R^m be continuous, let X : Ω → R^n and ε : Ω → R^m be two
statistically independent random vectors, and let f_ε be the pdf of ε. When Y = F(X) + ε, then
\[
f_Y(y \mid X = x_0) = f_\varepsilon\big(y - F(x_0)\big).
\]
Proof. Choose G(x, y) = F(x) + y in Theorem 17. Then
\[
f_Y(y \mid X = x_0)
\overset{\text{Thm. 17}}{=} f_{\varepsilon + F(x_0)}(y)
\overset{\text{Ex. 36}}{=} f_\varepsilon\big(y - F(x_0)\big).
\]
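For illustration, a small Monte Carlo check of Corollary 5 in Python, assuming an arbitrary continuous map F, a fixed x_0 and Gaussian noise: a histogram of samples of Y = F(x_0) + ε should match f_ε(y − F(x_0)).

```python
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.sin(3 * x) + x**2    # an assumed continuous direct theory F
x0, sigma = 0.7, 0.2                  # fixed unknown and noise std (assumptions)

# Sample Y = F(x0) + eps with eps ~ N(0, sigma^2); this is Y given X = x0.
y = F(x0) + sigma * rng.standard_normal(200_000)

# Compare the empirical density with f_eps(y - F(x0)).
edges = np.linspace(F(x0) - 1, F(x0) + 1, 41)
hist, _ = np.histogram(y, bins=edges, density=True)
centers = (edges[:-1] + edges[1:]) / 2
f_shift = np.exp(-(centers - F(x0))**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
print(np.max(np.abs(hist - f_shift)))   # small Monte Carlo/binning error
```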
5.2 Statistical inverse problems
• Consider an inverse problem where we are given the noisy data y_0 = F(x_0) + ε ∈ R^m
about the unknown x_0 ∈ R^n. The direct theory F : R^n → R^m is here continuous.
• We often have statistical information about the noise ε. For example, ε = (ε_1, ..., ε_m)
could consist of statistically independent components with the probability distributions
\[
P(a \leq \varepsilon_i \leq b) = \frac{1}{\sqrt{2\pi\sigma}} \int_a^b \exp\!\Big( -\frac{1}{2\sigma}\, y^2 \Big)\,dy,
\]
where i = 1, ..., m, a < b ∈ R and σ > 0.
• When F is a linear mapping whose matrix is M, then Morozov's discrepancy
principle is unavailable, since
\[
P(|\varepsilon| > e) \geq P(|\varepsilon_i| > e) > 0
\]
for every e ≥ 0. One option is to consider statistical solution methods.
Let (Ω, Σ, P) be a probability space.
Finite dimensional statistical inverse problem
1. The unknown X : Ω → R^n and the noise ε : Ω → R^m are random vectors. The direct
theory F : R^n → R^m is a continuous function.
2. The data Y = F(X) + ε is a random vector Y : Ω → R^m (Example 35).
3. The given data y_0 = F(x_0) + ε_0 ∈ R^m is the value of the sample Y(ω_0) =
F(X(ω_0)) + ε(ω_0).
4. The distribution of the unknown, B ↦ P(X ∈ B), is called the prior distribution
and its density f_X(x) the prior pdf. We denote f_pr(x) = f_X(x).
5. The solution of the statistical inverse problem is the posterior distribution, whose
pdf is
\[
f_{\mathrm{post}}(x) := f_X(x \mid Y = y_0)
= \frac{f_Y(y_0 \mid X = x)\, f_{\mathrm{pr}}(x)}{\int f_Y(y_0 \mid X = x)\, f_{\mathrm{pr}}(x)\,dx}
\qquad \text{(Cor. 4, Cor. 5)}.
\]
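For illustration, a minimal Python sketch of items 1–3 above, with arbitrary dimensions, an arbitrary continuous direct theory F and Gaussian prior and noise (all assumptions made for the sketch): draw one realisation of the unknown and the noise and form the observed data y_0 = F(X(ω_0)) + ε(ω_0).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5                                  # assumed dimensions of unknown and data

def F(x):
    """An assumed continuous direct theory F: R^n -> R^m."""
    A = np.arange(1, m * n + 1).reshape(m, n) / (m * n)
    return np.tanh(A @ x)

# Item 1: the unknown X and the noise eps are random vectors (here Gaussian, an assumption).
X_sample = rng.standard_normal(n)            # X(omega_0), drawn from the prior N(0, I)
eps_sample = 0.05 * rng.standard_normal(m)   # eps(omega_0), drawn from N(0, 0.05^2 I)

# Items 2-3: the data Y = F(X) + eps, and one observed sample value y0.
y0 = F(X_sample) + eps_sample
print("x0 =", X_sample)
print("y0 =", y0)
```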
Example 37. Let the noise ε ∼ N(0, C_ε) and the unknown X ∼ N(0, C_X), let the noise and
the unknown be statistically independent, and let F : R^n → R^m be linear with matrix M.
The given data y_0 = M x_0 + ε_0 is a sample of the random vector Y = M X + ε. Then by
Cor. 5, we have
\[
f_Y(y \mid X = x) = f_\varepsilon(y - Mx)
= \frac{1}{\sqrt{(2\pi)^m \det(C_\varepsilon)}}\, e^{-\frac{1}{2}(y - Mx)^T C_\varepsilon^{-1} (y - Mx)},
\]
which is continuous and bounded. The prior pdf
\[
f_{\mathrm{pr}}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_X)}}\, e^{-\frac{1}{2} x^T C_X^{-1} x}
\]
is also continuous and bounded. By Cor. 4 the posterior pdf is
\[
f_{\mathrm{post}}(x) = C_{y_0}\, e^{-\frac{1}{2}(y_0 - Mx)^T C_\varepsilon^{-1} (y_0 - Mx)}\, e^{-\frac{1}{2} x^T C_X^{-1} x},
\]
where C_{y_0} is a normalising constant. Let us simplify this expression. Consider
\[
\begin{aligned}
-\tfrac{1}{2}(y_0 - Mx)^T C_\varepsilon^{-1}(y_0 - Mx) - \tfrac{1}{2} x^T C_X^{-1} x
&= -\tfrac{1}{2}\, y_0^T C_\varepsilon^{-1} y_0 + \tfrac{1}{2}\, x^T M^T C_\varepsilon^{-1} y_0 \\
&\quad + \tfrac{1}{2}\, y_0^T C_\varepsilon^{-1} M x - \tfrac{1}{2}\, x^T \big(M^T C_\varepsilon^{-1} M + C_X^{-1}\big) x .
\end{aligned}
\]
Denote
\[
C_{\mathrm{post}} = \big(M^T C_\varepsilon^{-1} M + C_X^{-1}\big)^{-1}
\]
and add terms so that we have a quadratic form:
\[
\begin{aligned}
-\tfrac{1}{2}(y_0 - Mx)^T C_\varepsilon^{-1}(y_0 - Mx) - \tfrac{1}{2} x^T C_X^{-1} x
&= -\tfrac{1}{2}\, y_0^T C_\varepsilon^{-1} y_0
+ \tfrac{1}{2}\, x^T C_{\mathrm{post}}^{-1} C_{\mathrm{post}} M^T C_\varepsilon^{-1} y_0 \\
&\quad + \tfrac{1}{2}\, y_0^T C_\varepsilon^{-1} M C_{\mathrm{post}} C_{\mathrm{post}}^{-1} x
- \tfrac{1}{2}\, x^T C_{\mathrm{post}}^{-1} x \\
&= -\tfrac{1}{2}\, y_0^T C_\varepsilon^{-1} y_0
- \tfrac{1}{2}\, (x - m_{\mathrm{post}})^T C_{\mathrm{post}}^{-1} (x - m_{\mathrm{post}})
+ \tfrac{1}{2}\, m_{\mathrm{post}}^T C_{\mathrm{post}}^{-1} m_{\mathrm{post}},
\end{aligned}
\]
where
\[
m_{\mathrm{post}} = C_{\mathrm{post}} M^T C_\varepsilon^{-1} y_0
= \big(M^T C_\varepsilon^{-1} M + C_X^{-1}\big)^{-1} M^T C_\varepsilon^{-1} y_0 .
\]
The posterior pdf is thus (up to a normalising constant) a Gaussian pdf! The normalising
constant C_{y_0} is then well known and we obtain
\[
f_{\mathrm{post}}(x) = \frac{1}{\sqrt{(2\pi)^n \det(C_{\mathrm{post}})}}\, e^{-\frac{1}{2}(x - m_{\mathrm{post}})^T C_{\mathrm{post}}^{-1} (x - m_{\mathrm{post}})}.
\]
The posterior pdf is multinormal having the mean
\[
m_{\mathrm{post}} = \big(M^T C_\varepsilon^{-1} M + C_X^{-1}\big)^{-1} M^T C_\varepsilon^{-1} y_0
\]
and covariance matrix
\[
C_{\mathrm{post}} = \big(M^T C_\varepsilon^{-1} M + C_X^{-1}\big)^{-1}.
\]
In particular, if C_ε = δI and C_X = cI, then
\[
m_{\mathrm{post}} = \Big(M^T M + \frac{\delta}{c}\, I\Big)^{-1} M^T y_0,
\]
which leads to
\[
m_{\mathrm{post}} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \Big( |Mx - y_0|^2 + \frac{\delta}{c}\, |x|^2 \Big).
\]
When the noise is distributed as N(0, δI) and the prior distribution is N(0, cI), the
posterior expectation thus coincides with the Tikhonov-regularised solution with regularisation
parameter α = δ/c.
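For illustration, a minimal Python sketch of Example 37 (the matrix M, the data y_0 and the values of δ and c are arbitrary choices): it computes m_post and C_post from the formulas above and checks that, with C_ε = δI and C_X = cI, the posterior mean coincides with the Tikhonov-regularised solution with α = δ/c.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 4
M = rng.standard_normal((m, n))          # assumed linear direct theory
y0 = rng.standard_normal(m)              # assumed observed data
delta, c = 0.01, 1.0                     # noise variance and prior variance

C_eps_inv = np.eye(m) / delta            # C_eps = delta * I
C_X_inv = np.eye(n) / c                  # C_X   = c * I

# Posterior covariance and mean from Example 37.
C_post = np.linalg.inv(M.T @ C_eps_inv @ M + C_X_inv)
m_post = C_post @ M.T @ C_eps_inv @ y0

# Tikhonov-regularised solution with alpha = delta / c.
alpha = delta / c
x_tik = np.linalg.solve(M.T @ M + alpha * np.eye(n), M.T @ y0)

print(np.allclose(m_post, x_tik))        # True: the two solutions coincide
```

Explicitly forming the inverse C_post is done here only to mirror the formulas; in practice one would solve the corresponding linear systems instead.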
The prior can be interpreted as follows: X_i ∼ N(0, c) represents knowledge of the unknown
that roughly tells us that the values of the unknown are not exactly known, but we trust
that negative and positive values of a component are equally likely and that large values of
a component are quite unlikely. Independence between the components allows large variation
between the values of different components.