Notes on Markov Kernels
Prakash Panangaden
Sherry Shanshan Ruan
June 30, 2014
1 Prelude: binary relations
The whole point of this note is that Markov kernels are a natural probabilistic generalization of
the notion of binary relations. We first introduce some basic definitions and algebra of binary
relations.
Definition A binary relation R ⊆ X × Y is a set of pairs (x, y) where x ∈ X and y ∈ Y , and we
write xRy to indicate (x, y) ∈ R.
We can compose two binary relations or take the converse of a binary relation. The formal definitions are given as follows:
Definition Given R ⊆ X × Y and S ⊆ Y × Z, S ◦ R is defined as
x(S ◦ R)z
if there exists y ∈ Y such that xRy and ySz
Definition Given R ⊆ X × Y , the converse R^c ⊆ Y × X is defined as
y R^c x
if xRy
2 Binary relations on power sets
We proceed by introducing binary relations arising from power-set-valued functions, which can be
regarded as the “discrete analogue” of probabilistic relations; this will ease the presentation of
probabilistic relations later. Such relations are defined as follows:
Definition Let X, Y be sets and f : X → P(Y ). The binary relation F ⊆ X × Y associated with f is defined as
xF y
if y ∈ f (x)
Given a function from one set to another, it is often useful to lift it to mappings defined on
power sets. We have the following two such constructions:
Definition Let X, Y be sets and f a function from X to Y , then function P (f ) : P(X) → P(Y )
is defined as
P(f )(A) = {f (x) | x ∈ A}
for any A ⊆ X
and function f −1 : P(Y ) → P(X) is defined as
f −1 (B) = {x | f (x) ∈ B}
for any B ⊆ Y
Note that P(f ) and f −1 are not inverses of each other in general, since f need not be a bijection,
but it is often helpful to think of them roughly as inverses. Next we introduce two functions which
are essential to composing binary relations: one sends an element to the singleton set containing
it, and the other flattens a set of sets into its union.
Definition For any set X, function {·}X : X → P(X) is defined as
{·}X (x) = {x}
for all x ∈ X
Definition For any set X, function U : P(P(X)) → P(X) is defined as
U ({Ai | i ∈ I}) = ⋃_{i∈I} Ai
for any family of subsets Ai ⊆ X
Now we are ready to compose two power-set-valued functions.
Definition Given f : X → P(Y ) and g : Y → P(Z) together with corresponding binary relations
F and G, we define the composition function g # f : X → P(Z) (thereby G # F ) as follows:
g # f = U ◦ P(g) ◦ f : X → P(Z)
To justify that the composition is well-defined, we first check the consistency of types:
P(g) : P(Y ) → P(P(Z))
P(g) ◦ f : X → P(P(Z))
U ◦ P(g) ◦ f : X → P(Z)
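To make the construction concrete, the following is a minimal Python sketch (the names are ours and purely illustrative, not part of the formal development) of g # f for set-valued functions, computing U ◦ P(g) ◦ f literally:

    def compose(g, f):
        """Kleisli-style composition g # f = U ∘ P(g) ∘ f for set-valued functions."""
        def g_after_f(x):
            family = [g(y) for y in f(x)]   # P(g) applied elementwise to f(x)
            return set().union(*family)     # U takes the union of the family
        return g_after_f

    # Example: f sends a number to its set of divisors, g sends y to {y, y + 1}.
    f = lambda x: {d for d in range(1, x + 1) if x % d == 0}
    g = lambda y: {y, y + 1}
    print(compose(g, f)(6))  # {1, 2, 3, 4, 6, 7}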
Furthermore, we prove that composition is associative.
Proof. Let f : X → P(Y ), g : Y → P(Z), h : Z → P(W ) be three arbitrary functions. Then
h # (g # f ) = h # (U ◦ P(g) ◦ f ) = U ◦ P(h) ◦ (U ◦ P(g) ◦ f ). By the associativity of ◦, this is equal
to (U ◦ P(h) ◦ U ◦ P(g)) ◦ f . Since (h # g) # f = U ◦ P(U ◦ P(h) ◦ g) ◦ f , it suffices to establish the
equality between U ◦ P(h) ◦ U ◦ P(g) and U ◦ P(U ◦ P(h) ◦ g).
Let A ∈ P(Y ) be arbitrary. We want to show that (U ◦ P(h) ◦ U ◦ P(g))(A) = (U ◦ P(U ◦ P(h) ◦ g))(A).
w ∈ (U ◦ P(h) ◦ U ◦ P(g))(A)
⇔ w ∈ U (P(h)(U ({g(x) | x ∈ A})))
⇔ w ∈ U (P(h)(⋃_{x∈A} g(x)))
⇔ w ∈ U ({h(y) | y ∈ ⋃_{x∈A} g(x)})
⇔ ∃x ∈ A, ∃y ∈ g(x), w ∈ h(y)
⇔ w ∈ ⋃_{x∈A} ⋃_{y∈g(x)} h(y)
⇔ w ∈ ⋃_{x∈A} U (P(h)(g(x)))
⇔ w ∈ U ({(U ◦ P(h) ◦ g)(x) | x ∈ A})
⇔ w ∈ (U ◦ P(U ◦ P(h) ◦ g))(A)
Therefore, U ◦ P(h) ◦ U ◦ P(g) = U ◦ P(U ◦ P(h) ◦ g), and hence
h # (g # f ) = (U ◦ P(h) ◦ U ◦ P(g)) ◦ f = (U ◦ P(U ◦ P(h) ◦ g)) ◦ f = (h # g) # f
Also note that this indeed defines a composition of binary relations, as justified below:
x (G # F ) z
⇔ z ∈ (g # f )(x)
⇔ z ∈ U (P(g)(f (x)))
⇔ z ∈ U ({g(y) | y ∈ f (x)})
⇔ z ∈ ⋃_{y∈f(x)} g(y)
⇔ ∃ y, y ∈ f (x) ∧ z ∈ g(y)
⇔ ∃ y, x F y ∧ y G z
Since we have established a well-defined composition operation, we can regard {·}X as the identity
(it corresponds to the identity relation). Given any function f : X → P(X), it is easy to see that
{·} # f = f # {·} = f . We give the proof of the second equality; the first follows by a symmetric
argument. Let x ∈ X be given, then
(f # {·})(x) = (U ◦ P(f ) ◦ {·})(x) = U ({f (x)}) = f (x)
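Continuing the Python sketch from above (compose, f, and g as before), the identity law can be checked directly, with singleton playing the role of {·}:

    singleton = lambda x: {x}  # the map {·}: x ↦ {x}
    assert compose(f, singleton)(6) == compose(singleton, f)(6) == f(6)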
3 Probabilistic relations
Probabilistic relations can be treated as the probabilistic analogue of the set-valued maps above.
Note that a subset in P(X) can be seen as a map X → {0, 1}; analogously, we consider subprobability
measures ν from ΣX to [0, 1]. The probabilistic analogue of P(X), for a measurable space (X, ΣX ),
is given below:
Definition Given a set X, we define
ΠX = {ν | ν : ΣX → [0, 1] and ν is a subprobability measure}
Definition For every A ∈ ΣX , we define PA : ΠX → [0, 1] by
PA (ν) = ν(A)
Definition We define ΣΠX as the smallest σ-algebra on ΠX such that ∀A ∈ ΣX , PA is measurable.
Note that such a σ-algebra exists: taking the σ-algebra on ΠX to be the power set P(ΠX) makes all
the maps PA measurable, and the smallest such σ-algebra is the intersection of all of them.
Recall that uncurrying is the technique of transforming a higher-order function that returns a new
function as output into a function that takes a tuple of arguments. Therefore, given f : X → P(Y ),
i.e. f : X → (Y → {0, 1}), we can uncurry it to f : X × Y → {0, 1}. Analogously, we define
a probabilistic relation to be a measurable function h : X → ΠY , or equivalently, by uncurrying, a
function X × ΣY → [0, 1].
Definition A probabilistic relation h : X × ΣY → [0, 1] is a Markov kernel if
(1) ∀B ∈ ΣY , λx.h(x, B) is a measurable function
(2) ∀x ∈ X, λB.h(x, B) is a subprobability measure
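When X and Y are finite and carry their power-set σ-algebras, both conditions hold automatically, and a Markov kernel is just a table of subprobability distributions. A small illustrative Python sketch (the states and numbers are invented for the example):

    # A discrete Markov kernel h : X × P(Y) → [0, 1], one subprobability
    # distribution over Y for each state x in X.
    kernel = {
        "sunny": {"sunny": 0.8, "rainy": 0.2},
        "rainy": {"sunny": 0.4, "rainy": 0.5},  # total mass 0.9 < 1: subprobability
    }

    def h(x, B):
        # λB.h(x, B) is a subprobability measure; λx.h(x, B) is (trivially) measurable
        return sum(p for y, p in kernel[x].items() if y in B)

    print(h("rainy", {"sunny", "rainy"}))  # 0.9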
It is under the auspices of P(f ), U , and {·} that we successfully composed binary relations on
power sets. Thus, in order to compose probabilistic relations, we need to describe their probabilistic
counterparts first.
Definition Let X and Y be two sets and f a measurable function from (X, ΣX ) to (Y, ΣY ), we
define Πf : ΠX → ΠY as
(Πf )(ν)(B) = ν(f −1 (B))
for any ν ∈ ΠX, B ∈ ΣY
Hence, similar to P(f ) : P(X) → P(Y ), we have Πf : ΠX → ΠY as a probabilistic counterpart.
The following are analogous to {·}X : X → P(X) and U : P(P(X)) → P(X) respectively.
Definition Given (X, ΣX ) we define ηX : X → ΠX by
ηX (x) = δx , where δx (A) = 1 if x ∈ A and δx (A) = 0 if x ∉ A
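In the discrete picture, ηX is simply the Dirac-measure constructor; a one-line Python version (hypothetical, continuing the earlier sketches):

    def eta(x):
        # ηX(x) = δx: all mass sits on x, so δx(A) = 1 iff x ∈ A
        return lambda A: 1.0 if x in A else 0.0

    print(eta(3)({1, 2, 3}), eta(3)({1, 2}))  # 1.0 0.0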
Definition We define ξ : Π²X → ΠX as
ξ(Ω)(A) = ∫_{ΠX} PA dΩ
for Ω ∈ Π²X and A ∈ ΣX
Note that PA is a measurable function from ΠX to [0, 1] and Ω is a measure on ΣΠX .
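In the discrete case, Ω is a finite weighted family of measures and the integral reduces to a weighted sum, as in this Python sketch (names hypothetical; eta as above):

    def xi(omega):
        # omega is a finite list of (weight, measure) pairs, a discrete
        # stand-in for Ω ∈ Π²X; ξ(Ω)(A) integrates PA(ν) = ν(A) against Ω.
        return lambda A: sum(w * nu(A) for w, nu in omega)

    # Flattening a 50/50 mixture of two Dirac measures:
    mixed = xi([(0.5, eta(1)), (0.5, eta(2))])
    print(mixed({1}))  # 0.5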
Having acquired these auxiliary functions, we are finally able to define the composition of two
probabilistic relations:
Definition Let f : X → ΠY and g : Y → ΠZ, we define g # f : X → ΠZ as
(g # f )(x, C) = (ξ ◦ Πg ◦ f )(x, C) = ∫_Y g(y, C) f (x, dy)
for x ∈ X and C ∈ ΣZ
Note that g(y, C) is a measurable function from Y to [0, 1], and f (x, dy) is a measure defined on ΣY .
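Over a finite Y the defining integral is a finite sum, which the following Python sketch (hypothetical names) makes explicit:

    def kernel_compose(g, f, Y):
        # (g # f)(x, C) = Σ_{y∈Y} g(y, C) · f(x, {y}): integrate the function
        # y ↦ g(y, C) against the measure f(x, ·), here a sum of point masses.
        return lambda x, C: sum(g(y, C) * f(x, {y}) for y in Y)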
We need to justify that this is well-defined (the second equality in the above definition).
First we prove a preliminary proposition useful in a variety of contexts. This is called the “change
of variables” formula.
Proposition 3.1. Suppose that (X, ΣX ) and (Y, ΣY ) are measurable spaces and f : X → Y is a
measurable function. Suppose that g : Y → [0, 1] is a measurable function and that τ ∈ ΠX; so
that Π(f )(τ ) is a measure on Y (i.e. is in ΠY ). Then
∫_Y g dΠ(f )(τ ) = ∫_X g ◦ f dτ.
Proof. We prove this for the special case of g being a characteristic function first. Let B be a
measurable subset of Y and consider the special case with g = χB . Then the lhs reduces to
Π(f )(τ )(B) = τ (f −1 (B)). To evaluate the rhs note that χB ◦ f = χf −1 (B) . Thus the rhs is
τ (f −1 (B)), which is the same as the lhs. Since the integral is linear the claimed equality holds
for all simple functions. By the monotone convergence theorem it then holds for all measurable
functions.
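For intuition, the change of variables formula can be checked on a finite example, where all integrals are finite sums; the sets, maps, and numbers below are invented for illustration:

    # Check: ∫_Y g d(Π(f)(τ)) == ∫_X (g ∘ f) dτ on finite spaces.
    X = ["a", "b", "c"]
    Y = [0, 1]
    f = {"a": 0, "b": 1, "c": 1}          # measurable map X → Y
    tau = {"a": 0.2, "b": 0.3, "c": 0.4}  # subprobability on X (point masses)
    g = {0: 0.5, 1: 1.0}                  # measurable g : Y → [0, 1]

    # Pushforward Π(f)(τ)(B) = τ(f⁻¹(B)), again as point masses on Y.
    push = {y: sum(tau[x] for x in X if f[x] == y) for y in Y}

    lhs = sum(g[y] * push[y] for y in Y)
    rhs = sum(g[f[x]] * tau[x] for x in X)
    assert abs(lhs - rhs) < 1e-12  # both sides equal 0.8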
Now we can complete the verification of the formula for composing Markov kernels. Suppose x ∈ X
and D ∈ ΣZ , then
(ξ ◦ Πg ◦ f )(x, D)
= ξ(Πg(f (x)))(D)
= ∫_{ΠZ} PD d(Πg(f (x)))
= ∫_Y (PD ◦ g) d f (x)
= ∫_Y g(y)(D) f (x, dy)
= ∫_Y g(y, D) f (x, dy)
We have used the change of variables formula to convert the integral over ΠZ into an integral over Y .
As in the previous section, we also verify that composition is associative:
Proof. Let f : X → ΠY , g : Y → ΠZ, h : Z → ΠW be three arbitrary Markov kernels. Suppose x ∈ X
and D ∈ ΣW . Then
((h # g) # f )(x, D) = ∫_Y ( ∫_Z h(z, D) g(y, dz) ) f (x, dy)
Since λz.h(z, D) is non-negative, by the Simple Approximation Theorem it can be written as the
limit of a monotone increasing sequence of simple functions hn , i.e.
h(z, D) = lim_{n→∞} hn (z) = lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} χ_{C_{n,i}} (z).
Thus,
((h # g) # f )(x, D)
= ∫_Y ( ∫_Z h(z, D) g(y, dz) ) f (x, dy)    by definition
= ∫_Y ( ∫_Z lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} χ_{C_{n,i}} (z) g(y, dz) ) f (x, dy)    by substitution
= ∫_Y ( lim_{n→∞} ∫_Z Σ_{i=1}^{m_n} a_{n,i} χ_{C_{n,i}} (z) g(y, dz) ) f (x, dy)    by the monotone convergence theorem
= ∫_Y ( lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} ∫_Z χ_{C_{n,i}} (z) g(y, dz) ) f (x, dy)    by linearity
= ∫_Y ( lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} g(y, C_{n,i}) ) f (x, dy)    by evaluation of the inner integral
= lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} ∫_Y g(y, C_{n,i}) f (x, dy)    by the monotone convergence theorem and linearity
= lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} (g # f )(x, C_{n,i})    by definition of g # f
= lim_{n→∞} ∫_Z Σ_{i=1}^{m_n} a_{n,i} χ_{C_{n,i}} (z) (g # f )(x, dz)    by definition of characteristic functions and linearity
= ∫_Z lim_{n→∞} Σ_{i=1}^{m_n} a_{n,i} χ_{C_{n,i}} (z) (g # f )(x, dz)    by the monotone convergence theorem
= ∫_Z h(z, D) (g # f )(x, dz)    by substitution
= (h # (g # f ))(x, D)    by definition
In the discrete case, the composition operation coincides with the operation of matrix multiplication:
(g # f )(x, z) = Σ_{y∈Y} f (x, y) g(y, z)
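Concretely, if f and g are represented as (sub)stochastic matrices, # is exactly matrix multiplication; a quick numpy check with made-up numbers:

    import numpy as np

    F = np.array([[0.5, 0.5],
                  [0.2, 0.8]])  # f(x, y), rows indexed by x
    G = np.array([[1.0, 0.0],
                  [0.3, 0.7]])  # g(y, z), rows indexed by y

    # (g # f)(x, z) = Σ_y f(x, y) g(y, z) is the (x, z) entry of F·G
    print(F @ G)  # [[0.65 0.35], [0.44 0.56]]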
4 Application
We can use probabilistic relations to describe conditional probability in labeled Markov processes
(LMPs). For instance, given a labeled Markov process (S, ΣS , L, {τa : S × ΣS → [0, 1] | a ∈ L}), each
τa : S × ΣS → [0, 1]
can be interpreted as follows: for any x ∈ S and B ∈ ΣS , τa (x, B) is the conditional probability that
the system ends up in B, given that it starts in x and the action a is performed.