Markov Chain
Stochastic Matrix. An $n \times n$ square matrix $P = (P_{i,j})$ is said to be a stochastic matrix if $P_{i,j} \ge 0$ for all $i, j$, and $\sum_j P_{i,j} = 1$ for all $i$. In plain words, $P$ is called a stochastic matrix if every element is nonnegative and every row sums up to one, or more simply, every row is a multinomial distribution.
In our discussion of Markov chains, a small letter in bold, like $x$, represents a row vector, and $x^T$ is a column vector, $x$'s transpose. Also, a big letter in bold, like $P$, represents a matrix. As a convenience in studying Markov chains, we assume letters like $x$ stand for row vectors rather than column vectors.


$P$ is a stochastic matrix iff $P \mathbf{1}^T = \mathbf{1}^T$, i.e. the column vector $\mathbf{1}^T = (1, 1, \dots, 1)^T$ is an eigenvector of $P$ corresponding to eigenvalue 1. This is clearly a consequence of the fact that every row sums up to 1.

Theorem 8. If the row vector $\mu = (\mu_1, \mu_2, \dots, \mu_n)$ is a multinomial distribution, then $\mu P$ is also a multinomial distribution. Recall that $\mu P$ is the row vector derived by applying the elements of $\mu$ to the row vectors of $P$ (see the remarks below):

$\mu P = \mu_1 P_{1,\cdot} + \mu_2 P_{2,\cdot} + \cdots + \mu_n P_{n,\cdot}$

where $P_{1,\cdot}, \dots, P_{n,\cdot}$ are the 1st, ..., $n$th row vectors of $P$. Thus $(\mu P)_j = \mu_1 P_{1,j} + \mu_2 P_{2,j} + \cdots + \mu_n P_{n,j}$, where $(\mu P)_j$ is the $j$th element of $\mu P$ and $P_{i,j}$ is the element of $P$ at the $i$th row and $j$th column. As a convex sum, $(\mu P)_j \le \max\{P_{1,j}, P_{2,j}, \dots, P_{n,j}\} \le 1$. Moreover,

$\sum_j (\mu P)_j = \sum_j \sum_i \mu_i P_{i,j} = \sum_i \mu_i \sum_j P_{i,j} = \sum_i \mu_i = 1$

so $\mu P$ is non-negative with elements summing to 1, i.e. a multinomial distribution.
A stochastic matrix times a column vector can be interpreted as an expectation. A multinomial $\mu$ times a stochastic matrix $P$, i.e. $\mu P$, is still a multinomial, as discussed above. Interestingly, it also makes sense for the stochastic matrix to be multiplied by a column vector $f^T$ on the right. Suppose $X_i$ is the random variable distributed according to the $i$th row of $P$, i.e. $P_{i,\cdot}$, and interpret $f^T = (f(1), f(2), \dots, f(n))^T$ as a vector of the values of a function $f$ on the states $1, 2, \dots, n$; then

$(P f^T)_i = \sum_j P_{i,j} f(j) = E[f(X_i)]$

In particular, if $f$ is non-negative, i.e. $f(j) \ge 0$ for all $j$, then $P f^T$ is non-negative, since the expectation of a non-negative random variable cannot be negative.
Theorem 9. The product of two stochastic matrices is still a stochastic matrix. This simply follows from the previous theorem. Given two stochastic matrices $P$ and $Q$, every row $P_{i,\cdot}$ is a multinomial, thus $(PQ)_{i,\cdot} = P_{i,\cdot} Q$ is a multinomial, which gives that $PQ$ is stochastic.
Theorem 10. The following three statements are equivalent.
1) $P$ is a stochastic matrix.
2) For any multinomial $\mu$, $\mu P$ is a multinomial.
3) For any non-negative column vector $f^T$, $P f^T$ is non-negative; moreover $\mathbf{1}^T$ is an eigenvector of $P$ corresponding to eigenvalue 1.
This theorem is a summary of previous theorems. 1) ⇒ 2) and 1) ⇒ 3) have been well presented above. We now show 2) ⇒ 1) and 3) ⇒ 1) in addition.
For 2) ⇒ 1), consider the special case $\mu = (0, \dots, 0, 1, 0, \dots, 0)$ where $\mu_j = 1$ when $j = i$ and $\mu_j = 0$ for other $j$; then $\mu P = P_{i,\cdot}$ is a multinomial. Since $i$ can be any of $1, 2, \dots, n$, every row of $P$ is a multinomial, and so $P$ is stochastic.
For 3) ⇒ 1), similarly consider the special case $f = (0, \dots, 0, 1, 0, \dots, 0)$ where $f_j = 1$ when $j = i$ and $f_j = 0$ for other $j$; then $P f^T = P_{\cdot,i}$ is non-negative. Since $i$ can be any of $1, 2, \dots, n$, every column of $P$ is non-negative, i.e. $P_{i,j} \ge 0$ for every $i, j$. So $P$ is stochastic once we add the condition that $\mathbf{1}^T$ is an eigenvector of $P$ corresponding to eigenvalue 1, since that implies every row of $P$ sums up to 1.
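The three equivalent statements are easy to check numerically. Below is a minimal sketch, assuming NumPy is available; the matrix is a randomly generated example, not one from the text.

```python
# A minimal numerical check of Theorem 10, assuming NumPy.
import numpy as np

rng = np.random.default_rng(0)

# Build a random 4x4 stochastic matrix: nonnegative rows normalized to sum to 1.
P = rng.random((4, 4))
P /= P.sum(axis=1, keepdims=True)

# 1) every row sums to one, i.e. P @ 1^T = 1^T.
ones = np.ones(4)
assert np.allclose(P @ ones, ones)

# 2) a multinomial mu stays a multinomial after mu @ P.
mu = np.array([0.1, 0.2, 0.3, 0.4])
assert np.all(mu @ P >= 0) and np.isclose((mu @ P).sum(), 1.0)

# 3) a nonnegative column vector f^T stays nonnegative under P @ f^T,
#    and (P @ f)[i] equals E[f(X_i)] for X_i distributed as row i of P.
f = np.array([1.0, 0.0, 2.0, 5.0])
assert np.all(P @ f >= 0)
```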
REMARK
Recall the following useful identities. For a row vector $x = (x_1, \dots, x_n)$ and a matrix $P$ with row vectors $P_{1,\cdot}, \dots, P_{n,\cdot}$,

$x P = x_1 P_{1,\cdot} + x_2 P_{2,\cdot} + \cdots + x_n P_{n,\cdot}$

i.e. $xP$ is a linear combination of the rows of $P$ weighted by the elements of $x$. Dually, for a column vector $y^T = (y_1, \dots, y_n)^T$ and $P$ with column vectors $P_{\cdot,1}, \dots, P_{\cdot,n}$,

$P y^T = y_1 P_{\cdot,1} + y_2 P_{\cdot,2} + \cdots + y_n P_{\cdot,n}$

i.e. $P y^T$ is a linear combination of the columns of $P$. The other two useful matrix identities concern diagonal matrices:

$\mathrm{diag}(d_1, \dots, d_n)\, P = \begin{pmatrix} d_1 P_{1,\cdot} \\ \vdots \\ d_n P_{n,\cdot} \end{pmatrix}, \qquad P\, \mathrm{diag}(d_1, \dots, d_n) = \begin{pmatrix} d_1 P_{\cdot,1} & \cdots & d_n P_{\cdot,n} \end{pmatrix}$

i.e. multiplying by a diagonal matrix on the left scales the rows of $P$, and multiplying on the right scales its columns.
Basics of Markov Chains with Finite State Space. A Markov chain with finite state space can be informally described as a sequence of random variables $X_0, X_1, X_2, \dots$, each of which shares the same sample space $S$ of finite size $n$, named the state space. $X_0$ is distributed according to an $n$-dimensional multinomial $\mu$ named the initial distribution, and $X_1, X_2, \dots, X_t, \dots$ are distributed according to $n \times n$ stochastic matrices $P^{(1)}, P^{(2)}, \dots, P^{(t)}, \dots$ known as transition matrices, in such a way that

$\mathbb{P}(X_t = j \mid X_{t-1} = i) = P^{(t)}_{i,j}$

for $t = 1, 2, \dots$. If $P^{(t)}$ is invariant in time $t$, i.e. $P^{(1)} = P^{(2)} = \cdots = P^{(t)} = \cdots$, then the Markov chain is said to be homogeneous and $P^{(t)}$ is simplified as $P$; otherwise it is said to be inhomogeneous. A homogeneous Markov chain can be represented by a two-tuple $(\mu, P)$.

Intuitively, a Markov chain can be viewed as a series of state transitions. At the beginning a state is chosen based on the initial multinomial, and then, given that the $(t-1)$th state is $i$, the $t$th state is chosen according to the $i$th row of the transition matrix $P^{(t)}$.
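This "series of state transitions" view translates directly into a sampler. The following is a minimal sketch, assuming NumPy; the two-state chain at the bottom is an arbitrary illustrative example, not one from the text.

```python
# Sampling a (possibly inhomogeneous) finite Markov chain, assuming NumPy.
import numpy as np

def sample_chain(mu, Ps, rng):
    """Sample X_0, ..., X_T from initial distribution mu and a list of
    transition matrices Ps = [P^(1), ..., P^(T)]."""
    states = np.arange(len(mu))
    x = rng.choice(states, p=mu)          # X_0 ~ mu
    path = [x]
    for P in Ps:                          # X_t ~ row X_{t-1} of P^(t)
        x = rng.choice(states, p=P[x])
        path.append(x)
    return path

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.5])
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(sample_chain(mu, [P] * 10, rng))    # a homogeneous chain (mu, P)
```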
Obviously, a Markov chain is not a sequence of independent random variables, since by the above description $X_t$ is dependent on $X_{t-1}$ for $t = 1, 2, \dots$. Rather, it is not hard to observe that a Markov chain is a generalization of i.i.d. trials, since an i.i.d. trial is the special case when $\mu = P_{1,\cdot} = P_{2,\cdot} = \cdots = P_{n,\cdot}$.
Here we only formally define a finite Markov chain $X_0, X_1, \dots, X_T$ of finite length $T+1$, as a probability space $(\Omega, 2^\Omega, \mathbb{P})$ where 1) the sample space is $\Omega = S^{T+1}$ and $S$ is the state space, which is exactly the same as the sample space of $(T+1)$-time i.i.d. trials; 2) letting $X = (X_0, X_1, \dots, X_T)$, the probability measure $\mathbb{P}$ is defined and denoted as

$\mathbb{P}\{X_0 = x_0, X_1 = x_1, \dots, X_T = x_T\} = \mathbb{P}(x_0, x_1, \dots, x_T) = \mathbb{P}(x) = \mu_{x_0}\, P^{(1)}_{x_0,x_1}\, P^{(2)}_{x_1,x_2} \cdots P^{(T)}_{x_{T-1},x_T}$

where $\mu$ is the initial multinomial and $P^{(1)}, \dots, P^{(T)}$ are stochastic matrices with both rows and columns indexed by $S$. Here for convenience, we use notation like $\mathbb{P}(x_0, \dots, x_T)$ for $\mathbb{P}(X_0 = x_0, \dots, X_T = x_T)$.

To see $\mathbb{P}$ is indeed a probability measure, note that once $x_{T-1}$ is given, $\sum_{x_T \in S} P^{(T)}_{x_{T-1},x_T} = 1$; thus

$\sum_{x_0,\dots,x_T \in S} \mathbb{P}(x_0, \dots, x_T) = \sum_{x_0,\dots,x_{T-1}} \mu_{x_0} P^{(1)}_{x_0,x_1} \cdots P^{(T-1)}_{x_{T-2},x_{T-1}} \sum_{x_T} P^{(T)}_{x_{T-1},x_T} = \cdots = \sum_{x_0} \mu_{x_0} = 1$

Marginal probability. By a similar argument, we can see the marginal probability of $(X_0, \dots, X_t)$ for $t < T$: summing out $x_T, x_{T-1}, \dots, x_{t+1}$ successively collapses the trailing factors to 1, so

$\mathbb{P}(x_0, \dots, x_t) = \sum_{x_{t+1},\dots,x_T \in S} \mathbb{P}(x_0, \dots, x_T) = \mu_{x_0}\, P^{(1)}_{x_0,x_1} \cdots P^{(t)}_{x_{t-1},x_t}$
Concentrated Initial. We may let the initial distribution be concentrated at a particular state, i.e. $\mu_x = 1$ for some $x \in S$ and $\mu_y = 0$ for $y \ne x$. Then it can be interpreted as the initial state of the Markov chain being known, and

$\mathbb{P}(x_0, x_1, \dots, x_T) = \begin{cases} P^{(1)}_{x_0,x_1} P^{(2)}_{x_1,x_2} \cdots P^{(T)}_{x_{T-1},x_T} & x_0 = x \\ 0 & x_0 \ne x \end{cases}$
Theorem 11 (Markov property). Define a time $t$ as the present time; then times $t+1, t+2, \dots$ are the future and times $t-1, t-2, \dots, 0$ are the past. Time $t-1$ is also called the near past, and time $t+1$ is called the near future. What is well known as the Markov property is that

$\mathbb{P}(x_{t+1} \mid x_0, \dots, x_t) = \mathbb{P}(x_{t+1} \mid x_t)$

since

$\mathbb{P}(x_{t+1} \mid x_0, \dots, x_t) = \frac{\mathbb{P}(x_0, \dots, x_{t+1})}{\mathbb{P}(x_0, \dots, x_t)} = \frac{\mu_{x_0} P^{(1)}_{x_0,x_1} \cdots P^{(t)}_{x_{t-1},x_t} P^{(t+1)}_{x_t,x_{t+1}}}{\mu_{x_0} P^{(1)}_{x_0,x_1} \cdots P^{(t)}_{x_{t-1},x_t}} = P^{(t+1)}_{x_t,x_{t+1}} = \mathbb{P}(x_{t+1} \mid x_t)$

The Markov property says the future of a Markov chain depends only on the present and is independent of the past. Actually, a sequence of random variables $X_0, X_1, \dots, X_T$ is a Markov chain iff the Markov property holds, since conversely if

$\mathbb{P}(x_t \mid x_0, \dots, x_{t-1}) = \mathbb{P}(x_t \mid x_{t-1}) = P^{(t)}_{x_{t-1},x_t}$

holds for every $t = 1, 2, \dots, T$, then

$\mathbb{P}(x_0, \dots, x_T) = \mathbb{P}(x_T \mid x_0, \dots, x_{T-1})\, \mathbb{P}(x_0, \dots, x_{T-1}) = \cdots = \mathbb{P}(x_T \mid x_{T-1})\, \mathbb{P}(x_{T-1} \mid x_{T-2}) \cdots \mathbb{P}(x_1 \mid x_0)\, \mathbb{P}(x_0) = \mu_{x_0} P^{(1)}_{x_0,x_1} \cdots P^{(T)}_{x_{T-1},x_T}$

That's why the Markov property is commonly used as an alternative definition of a Markov chain, besides the one mentioned above.
Theorem 12 (Independence of future from past). First check that

$\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t, x_s, x_{s-1}, \dots, x_0) = \mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$

for any finite $T$ and any $0 \le s < t$. This is simply due to

$\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t, x_s, \dots, x_0) = \frac{\mathbb{P}(x_0, \dots, x_s, x_t, x_{t+1}, \dots, x_T)}{\mathbb{P}(x_0, \dots, x_s, x_t)}$
$= \frac{\mu_{x_0} P^{(1)}_{x_0,x_1} \cdots P^{(s)}_{x_{s-1},x_s} \big[P^{(s+1)} \cdots P^{(t)}\big]_{x_s,x_t}\, P^{(t+1)}_{x_t,x_{t+1}} \cdots P^{(T)}_{x_{T-1},x_T}}{\mu_{x_0} P^{(1)}_{x_0,x_1} \cdots P^{(s)}_{x_{s-1},x_s} \big[P^{(s+1)} \cdots P^{(t)}\big]_{x_s,x_t}}$
$= P^{(t+1)}_{x_t,x_{t+1}} \cdots P^{(T)}_{x_{T-1},x_T} = \mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$

where the marginal over the unspecified intermediate times collapses into the matrix product $\big[P^{(s+1)} \cdots P^{(t)}\big]_{x_s,x_t}$.
Recall that an elementary event is an atom of the sample space (a singleton containing a single outcome) and any event is a disjoint union of elementary events. In a similar fashion, let's define an elementary event determined by times $t_1, t_2, \dots, t_k$ as an event of the form

$\{X_{t_1} = x_1, \dots, X_{t_k} = x_k\}$

and observe that any event determined by $X_{t_1}, X_{t_2}, \dots, X_{t_k}$ is a disjoint union of elementary events determined by $t_1, t_2, \dots, t_k$, because the elementary events are the finest sample-space partition we can achieve with the information of $X_{t_1}, \dots, X_{t_k}$. For example, suppose $k = 4$, $t_i = 2i - 1$ for $i = 1, 2, 3, 4$, and the state space is $S = \{0, 1, 2, 3\}$; then

$\{X_1 \le 1,\ X_3 \ne 1\} = \bigcup_{a \in \{0,1\}}\ \bigcup_{b \in \{0,2,3\}}\ \bigcup_{c, d \in S} \{X_1 = a,\ X_3 = b,\ X_5 = c,\ X_7 = d\}$
Now choose a fixed $t$ as the present time; then $0, \dots, t-1$ are the past times and $t+1, \dots, T$ are the future times. Call the corresponding RV $X_t$ the present, $X_0, \dots, X_{t-1}$ the past, and $X_{t+1}, \dots, X_T$ the future. Let $H$ be any event determined by the past, i.e. an event determined by RVs $X_0, \dots, X_{t-1}$, and let $F$ be any event determined by the future, i.e. an event determined by RVs $X_{t+1}, \dots, X_T$. Claim: $\mathbb{P}(H, F \mid X_t = x) = \mathbb{P}(H \mid X_t = x)\,\mathbb{P}(F \mid X_t = x)$.
We start by letting $H, F$ be elementary events, i.e. $H = \{X_0 = x_0, \dots, X_{t-1} = x_{t-1}\}$ and $F = \{X_{t+1} = x_{t+1}, \dots, X_T = x_T\}$, denoted as $(x_0, \dots, x_{t-1})$ and $(x_{t+1}, \dots, x_T)$ for convenience as usual. Note that if $\mathbb{P}(F \mid H, x_t) = \mathbb{P}(F \mid x_t)$, then $H, F$ are independent given $x_t$, because

$\mathbb{P}(F \mid H, x_t) = \mathbb{P}(F \mid x_t) \Rightarrow \frac{\mathbb{P}(H, x_t, F)}{\mathbb{P}(H, x_t)} = \frac{\mathbb{P}(x_t, F)}{\mathbb{P}(x_t)} \Rightarrow \mathbb{P}(H, x_t, F)\,\mathbb{P}(x_t) = \mathbb{P}(H, x_t)\,\mathbb{P}(x_t, F)$

and then

$\mathbb{P}(H \mid x_t)\,\mathbb{P}(F \mid x_t) = \frac{\mathbb{P}(H, x_t)}{\mathbb{P}(x_t)} \cdot \frac{\mathbb{P}(x_t, F)}{\mathbb{P}(x_t)} = \frac{\mathbb{P}(H, x_t, F)\,\mathbb{P}(x_t)}{\mathbb{P}(x_t)\,\mathbb{P}(x_t)} = \frac{\mathbb{P}(H, x_t, F)}{\mathbb{P}(x_t)} = \mathbb{P}(H, F \mid x_t)$
It is already proved that for any elementary events

$\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t, x_0, \dots, x_{t-1}) = \mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$

Then it immediately follows that

$\mathbb{P}(x_0, \dots, x_{t-1}, x_{t+1}, \dots, x_T \mid x_t) = \mathbb{P}(x_0, \dots, x_{t-1} \mid x_t)\,\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$

Now let $H$ be any history that is a disjoint union of elementary events determined by $X_0, \dots, X_{t-1}$; then we have

$\mathbb{P}(H, (x_{t+1}, \dots, x_T) \mid x_t) = \mathbb{P}\Big(\bigcup_{(x_0,\dots,x_{t-1}) \subseteq H} \{x_0, \dots, x_{t-1}, x_{t+1}, \dots, x_T\} \,\Big|\, x_t\Big)$
$= \sum_{(x_0,\dots,x_{t-1}) \subseteq H} \mathbb{P}(x_0, \dots, x_{t-1}, x_{t+1}, \dots, x_T \mid x_t)$
$= \sum_{(x_0,\dots,x_{t-1}) \subseteq H} \mathbb{P}(x_0, \dots, x_{t-1} \mid x_t)\,\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$
$= \mathbb{P}(H \mid x_t)\,\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$
By a similar argument, let $F$ be any future event that is a disjoint union of elementary events determined by $X_{t+1}, \dots, X_T$; then we have

$\mathbb{P}(H, F \mid x_t) = \mathbb{P}\Big(H \cap \bigcup_{(x_{t+1},\dots,x_T) \subseteq F} \{x_{t+1}, \dots, x_T\} \,\Big|\, x_t\Big)$
$= \sum_{(x_{t+1},\dots,x_T) \subseteq F} \mathbb{P}(H \mid x_t)\,\mathbb{P}(x_{t+1}, \dots, x_T \mid x_t)$
$= \mathbb{P}(H \mid x_t)\,\mathbb{P}(F \mid x_t)$

which completes the proof.
Theorem 13 (T-step transition). The probability $\mathbb{P}(X_{t+T} = j \mid X_t = i)$ is called a T-step transition probability. Obviously when $T = 1$, $\mathbb{P}(X_{t+1} = j \mid X_t = i) = P^{(t+1)}_{i,j}$ by the Markov property. When $T = 2$,

$\mathbb{P}(X_{t+2} = j \mid X_t = i) = \sum_k \mathbb{P}(X_{t+2} = j, X_{t+1} = k \mid X_t = i) = \sum_k \mathbb{P}(X_{t+2} = j \mid X_{t+1} = k)\,\mathbb{P}(X_{t+1} = k \mid X_t = i) = \sum_k P^{(t+1)}_{i,k} P^{(t+2)}_{k,j} = \big[P^{(t+1)} P^{(t+2)}\big]_{i,j}$

Define the matrix $P^{(s,t)}$, with rows and columns indexed by the state space, by $P^{(s,t)}_{i,j} = \mathbb{P}(X_t = j \mid X_s = i)$; $P^{(s,t)}$ is called a T-step transition matrix (with $T = t - s$). Clearly $P^{(t,t+1)} = P^{(t+1)}$ and, by the above, $P^{(t,t+2)} = P^{(t+1)} P^{(t+2)}$. We can guess that $P^{(s,t)} = P^{(s+1)} P^{(s+2)} \cdots P^{(t)}$, and do the following induction:

$\mathbb{P}(X_{t+1} = j \mid X_s = i) = \sum_k \mathbb{P}(X_{t+1} = j \mid X_t = k, X_s = i)\,\mathbb{P}(X_t = k \mid X_s = i) = \sum_k P^{(s,t)}_{i,k} P^{(t+1)}_{k,j} = \big[P^{(s,t)} P^{(t+1)}\big]_{i,j}$

using Theorem 12 to drop $X_s$ from the first conditional. That is, $P^{(s,t+1)} = P^{(s,t)} P^{(t+1)} = P^{(s+1)} P^{(s+2)} \cdots P^{(t+1)}$, which completes the proof. In the special case when the Markov chain is homogeneous, $P^{(s,t)} = P^{t-s}$.
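The homogeneous case $P^{(s,t)} = P^{t-s}$ is easy to verify by simulation. Below is a minimal sketch, assuming NumPy; the 3-state matrix is an arbitrary example of mine, not one from the text.

```python
# Checking the T-step transition identity by comparing a matrix power with
# Monte Carlo estimates, assuming NumPy.
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
T = 4
PT = np.linalg.matrix_power(P, T)       # T-step transition matrix P^T

# Estimate P(X_T = j | X_0 = 0) by simulating many walks from state 0.
n, hits = 100_000, np.zeros(3)
for _ in range(n):
    x = 0
    for _ in range(T):
        x = rng.choice(3, p=P[x])
    hits[x] += 1
print(hits / n)   # should be close to PT[0]
print(PT[0])
```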
EX 5. Suppose the Markov chain is homogeneous; show

$\mathbb{P}\{X_{s+1} = x_1, X_{s+2} = x_2, \dots, X_{s+t} = x_t \mid X_s = x_0\} = \mathbb{P}\{X_1 = x_1, X_2 = x_2, \dots, X_t = x_t \mid X_0 = x_0\}$

KEY. Check the following, using the Markov property and homogeneity:

$\mathbb{P}(X_{s+1} = x_1, \dots, X_{s+t} = x_t \mid X_s = x_0) = P_{x_0,x_1} P_{x_1,x_2} \cdots P_{x_{t-1},x_t} = \mathbb{P}(X_1 = x_1, \dots, X_t = x_t \mid X_0 = x_0)$

Here is an interpretation. Given a homogeneous Markov chain $(\mu, P)$ where $\mu$ is an initial distribution concentrated at state $x$, then by the above result

$\mathbb{P}\{X_{s+1} = x_1, \dots, X_{s+t} = x_t \mid X_s = x\} = \mathbb{P}\{X_1 = x_1, \dots, X_t = x_t \mid X_0 = x\}$

which implies the chain looks as if it "starts afresh" at time $s$ if $X_s = x$.
EX 6. Let $t_0 = 0$ and let $C = \{X_{t_1} = x_1, \dots, X_{t_k} = x_k\}$, with $t_1 < t_2 < \cdots < t_k$, be a set of conditions assumed known. For convenience, let $C$ as well represent these conditions. Let $A_i$ be an event determined by $X_{t_{i-1}}, \dots, X_{t_i}$ for $i = 1, 2, \dots, k$. Show

$\mathbb{P}(A_1 \cap A_2 \cap \cdots \cap A_k \mid C) = \mathbb{P}(A_1 \mid C)\,\mathbb{P}(A_2 \mid C) \cdots \mathbb{P}(A_k \mid C)$

KEY. Notice that $A_1 \cap \cdots \cap A_{k-1}$ is an event determined by the RVs up to $X_{t_{k-1}}$, while $A_k$ is determined by $X_{t_{k-1}}, \dots, X_{t_k}$. Choose $t_{k-1}$ as the present time; given $C$, the value $X_{t_{k-1}} = x_{k-1}$ is given, so by Theorem 12

$\mathbb{P}(A_1, \dots, A_k \mid C) = \mathbb{P}(A_1, \dots, A_{k-1} \mid C)\,\mathbb{P}(A_k \mid C)$

By the same argument applied iteratively, $\mathbb{P}(A_1, \dots, A_{k-1} \mid C) = \mathbb{P}(A_1, \dots, A_{k-2} \mid C)\,\mathbb{P}(A_{k-1} \mid C) = \cdots$, which completes the proof.
This claim implies the known times split the Markov chain into kind of "independent" blocks.
Irreducibility & Aperiodicity. From this point on we only discuss homogeneous Markov chains unless otherwise stated. We now relate a homogeneous Markov chain to its graph representation, which is quite straightforward: take the transition matrix as a weighted adjacency matrix for a directed graph $G = (V, E)$ such that $(i, j) \in E$ with weight $P_{i,j}$ iff $P_{i,j} > 0$. Figure 4 gives two examples of such graphs.

[Figure 4 shows two 4-state transition matrices, with entries in $\{0, \frac13, \frac12, 1\}$, drawn as weighted directed graphs on vertices 1 to 4; the second example is bi-directed.]
Figure 4. Graph representation of a Markov chain, by treating its transition matrix as a weighted adjacency matrix.
An interpretation of a Markov chain as a random walk on a graph now comes forth: $X_0$ is the initial vertex of the random walk, and $X_t$ is the position of the random walk at time $t$. For example, in the first graph above, if $X_0 = 4$, then at time 1 the random walk reaches vertex 1 with probability $P_{4,1}$, reaches vertex 2 with probability $P_{4,2}$, and stays at 4 with probability $P_{4,4}$. Also, we notice that in both examples $P_{i,j} = 1/\deg(i)$ for every edge $(i, j)$, where $\deg(i)$ denotes the out-degree of $i$. Further, the second example presents a bi-directed graph (a bi-directed graph without loops is equivalent to an undirected graph). A random walk on a bi-directed graph (with or without loops) whose transition matrix satisfies $P_{i,j} = 1/\deg(i)$ for any $(i, j) \in E$ is usually called a simple random walk on the graph.

Theorem 14. Recall that $P^t$ is the t-step transition matrix, and $P^t_{i,j} > 0$ for $t \ge 1$ intuitively means it is possible to reach $j$ from $i$ after $t$ transitions. Claim: $P^t_{i,j} > 0$ for $t \ge 1$ iff there is an $i \to j$ path of length $t$ in $G$.

Necessity by induction. The claim trivially holds for $t = 1$ by the definition of $G$, i.e. $P_{i,j} > 0 \Rightarrow (v_i, v_j) \in E$. Suppose the claim holds for time $t$, i.e. if $P^t_{i,j} > 0$ then there is an $i \to j$ path of length $t$. Now for time $t+1$, suppose $P^{t+1}_{i,j} = \sum_k P^t_{i,k} P_{k,j} > 0$; since the vectors $P^t_{i,\cdot}$ and $P_{\cdot,j}$ are both non-negative, there must exist some $k$ st. $P^t_{i,k} > 0$ and $P_{k,j} > 0$. By the induction hypothesis there is an $i \to k$ path of length $t$, and $(k, j) \in E$, so there exists an $i \to j$ path of length $t+1$.
Sufficiency by induction. The claim trivially holds for $t = 1$, i.e. $(i, j) \in E \Rightarrow P_{i,j} > 0$. Suppose the existence of an $i \to j$ path of length $t$ infers $P^t_{i,j} > 0$. If there exists an $i \to j$ path of length $t+1$, let $k$ be the node previous to $j$ in that path. Clearly there is an $i \to k$ path of length $t$, and so $P^t_{i,k} > 0$; also $P_{k,j} > 0$ since $(k, j) \in E$. As a result,

$P^{t+1}_{i,j} = \sum_l P^t_{i,l} P_{l,j} \ge P^t_{i,k} P_{k,j} > 0$
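Theorem 14 can be illustrated numerically: the positivity pattern of $P^t$ must coincide with length-$t$ reachability in the graph. A minimal sketch, assuming NumPy; the matrix is an arbitrary example of mine.

```python
# (P^t)[i, j] > 0 exactly when the graph of P has an i -> j path of length t.
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])

def paths_of_length(A, t):
    """Boolean matrix of 'reachable in exactly t steps' in the graph of A."""
    A = (A > 0).astype(int)
    R = A.copy()
    for _ in range(t - 1):
        R = ((R @ A) > 0).astype(int)
    return R.astype(bool)

for t in range(1, 5):
    assert np.array_equal(np.linalg.matrix_power(P, t) > 0, paths_of_length(P, t))
```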
INTERPRETATION: T-Step Transition Probability
We now have a deeper understanding of the t-step transition probability

$P^t_{i,j} = \sum_{k_1,\dots,k_{t-1} \in S} P_{i,k_1} P_{k_1,k_2} \cdots P_{k_{t-1},j}$

where each particular $(i, k_1, \dots, k_{t-1}, j)$ represents a particular $i \to j$ path of length $t$ (see the remark below), and it is easy to observe $P_{i,k_1} P_{k_1,k_2} \cdots P_{k_{t-1},j} = 0$ iff any of the factors is zero, or in terms of the graph representation $G$, any of $(i, k_1), (k_1, k_2), \dots, (k_{t-1}, j)$ is not in the edge set. Then $P^t_{i,j}$ represents the probability of reaching $j$ from $i$ in $t$ steps, taking all possible $i \to j$ paths into consideration. The following inequality,

$P^t_{i,j} \ge P_{i,k_1} P_{k_1,k_2} \cdots P_{k_{t-1},j}$

for any $k_1, \dots, k_{t-1} \in S$, can be understood as: the probability of reaching state $j$ from $i$ in $t$ steps is no smaller than the probability of taking any particular path of length $t$, since there could be other $i \to j$ paths.
REMARK: Proof of $P^t_{i,j} = \sum_{k_1,\dots,k_{t-1} \in \{1,\dots,n\}} P_{i,k_1} \cdots P_{k_{t-1},j}$
Let $P$ be any $n \times n$ square matrix; then the identity above, for $i, j \in \{1, \dots, n\}$, can be shown by induction. The base case $t = 1$ is trivial. Now suppose

$P^t_{i,j} = \sum_{k_1,\dots,k_{t-1} \in \{1,\dots,n\}} P_{i,k_1} \cdots P_{k_{t-1},j}$

holds; then for $t + 1$,

$P^{t+1}_{i,j} = \sum_{k_t} P^t_{i,k_t} P_{k_t,j} = \sum_{k_t} \Big( \sum_{k_1,\dots,k_{t-1}} P_{i,k_1} \cdots P_{k_{t-1},k_t} \Big) P_{k_t,j} = \sum_{k_1,\dots,k_t} P_{i,k_1} \cdots P_{k_t,j}$

Thus the identity holds for every $t$. Since every term is non-negative,

$P^t_{i,j} \ge P_{i,k_1} P_{k_1,k_2} \cdots P_{k_{t-1},j}$

for any particular choice of $k_1, \dots, k_{t-1}$.
We define the Markov chain, or its transition matrix $P$, to be irreducible iff for any pair $(i, j) \in S \times S$ there exists some $t$, dependent on $i, j$, st. $P^t_{i,j} > 0$. Clearly by the above theorem, an irreducible Markov chain is represented by a strongly connected graph. Let $T_i = \{t \ge 1 : P^t_{i,i} > 0\}$, i.e. $T_i$ is the set of times at which a random walk starting at vertex $i$ can return to vertex $i$, or the set of lengths of all possible $i \sim i$ circuits; then the greatest common divisor of $T_i$ is the period of state $i$. The chain, or $P$, is aperiodic iff all states have period 1, and periodic otherwise.
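The period is easy to compute by scanning the diagonal of the matrix powers. Below is a minimal sketch, assuming NumPy; the scan cutoff t_max and the 3-cycle example are choices of mine for illustration.

```python
# Period of each state as gcd{ t >= 1 : (P^t)[i, i] > 0 }, scanning t up to
# a cutoff, assuming NumPy.
import numpy as np
from math import gcd

def periods(P, t_max=64):
    n = len(P)
    g = [0] * n                      # gcd accumulated per state; 0 = none yet
    Q = np.eye(n)
    for t in range(1, t_max + 1):
        Q = Q @ P
        for i in range(n):
            if Q[i, i] > 0:
                g[i] = gcd(g[i], t)
    return g

# A deterministic 3-cycle: every state has period 3.
C = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
print(periods(C))   # [3, 3, 3]
```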

Theorem 15. If $P$ is irreducible, then every state has the same period, i.e. $\gcd T_u = \gcd T_v$ for any $u, v$. An interpretation is that every state gets a chance to be visited in the long run if the chain is irreducible.

By Theorem 14, if the Markov chain is irreducible, then its graph representation is strongly connected: there exists a $u \sim v$ path of some length $l_1$, and a $v \sim u$ path of some length $l_2$. Let $l = l_1 + l_2$; then there is a $u \sim u$ circuit and a $v \sim v$ circuit, both of length $l$, i.e. $l \in T_u \cap T_v$.

[Diagram: a $u \sim v$ path of length $l_1$, a $v \sim u$ path of length $l_2$, and a $v \sim v$ circuit of length $l_3$.]

Observe that a $u \sim u$ circuit can consist of the $u \sim v$ path of length $l_1$, a $v \sim v$ circuit of length $l_3 \in T_v$, and the $v \sim u$ path of length $l_2$; then for any $l_3 \in T_v$ we have $l_1 + l_3 + l_2 \in T_u$, or simply we can write $T_v \subset T_u - l$ (note this is a proper subset since $l \in T_u$ and $T_v$ contains only positive numbers). Now observe that $\gcd T_u$ divides every element of $T_v$, since $\gcd T_u$ divides $l$ and every element of $T_u$; thus $\gcd T_u$ is a common divisor of $T_v$, and hence $\gcd T_u \le \gcd T_v$. By the symmetric argument showing $T_u \subset T_v - l$, one derives $\gcd T_v \le \gcd T_u$, and therefore $\gcd T_u = \gcd T_v$.
Theorem 16. If $P$ is irreducible and aperiodic, then for any $i, j$ there exists a number $t_{i,j}$, dependent on $i, j$, st. $t \ge t_{i,j}$ implies $P^t_{i,j} > 0$. An interpretation is that once the random walk has run long enough, every state has a chance to be visited at every subsequent transition, if the chain is irreducible and aperiodic. A useful corollary: if $P$ is irreducible and aperiodic, then there exists a $t^*$ st. $P^t_{i,j} > 0$ for all $i, j$ and all $t \ge t^*$; simply let $t^* = \max\{t_{i,j} : i, j \in S\}$, where the subscripts of $t_{i,j}$ emphasize its dependence on $i, j$. Recall that irreducibility guarantees a "$t$" st. $P^t_{i,j} > 0$ dependent on $i, j$, while now aperiodicity makes this "$t$" uniform.

Let $T_j = \{a_1, a_2, \dots\}$. Since $\gcd T_j = 1$, we use the facts (see the remarks below for proofs) that there exists a finite subsequence $a_1, a_2, \dots, a_k$ st. $\gcd(a_1, a_2, \dots, a_k) = 1$, and there also exist $c_1, c_2, \dots, c_k \in \mathbb{Z}$ st.

$1 = c_1 a_1 + c_2 a_2 + \cdots + c_k a_k$

Choose $r$ st. $P^r_{i,j} > 0$ by irreducibility, let $M = |c_1| a_1 + |c_2| a_2 + \cdots + |c_k| a_k$, and claim $t_{i,j} = r + M^2$. For any $t \ge t_{i,j}$, write $t = r + qM + s$, where $q = \lfloor (t - r)/M \rfloor$ (the quotient) and $s = (t - r) \bmod M$ (the remainder); thus $0 \le s < M \le q$. Now consider

$t - r = qM + s = qM + s \times 1 = q(|c_1|a_1 + \cdots + |c_k|a_k) + s(c_1 a_1 + \cdots + c_k a_k) = \sum_{m=1}^{k} (q|c_m| + s c_m)\, a_m$

Note each coefficient $q|c_m| + s c_m \in \mathbb{N}$ since $q > s$, and then

$P^t_{i,j} \ge P^r_{i,j} \prod_{m=1}^{k} \big(P^{a_m}_{j,j}\big)^{\,q|c_m| + s c_m} > 0$

since $P^r_{i,j} > 0$ and $P^{a_m}_{j,j} > 0$ because $a_m \in T_j$, which completes the proof.

REMARK: if $T = \{a_1, a_2, \dots\}$ and $\gcd(T) = 1$, there exists a finite subsequence with GCD 1.
Given a countably infinite sequence $T = \{a_1, a_2, \dots\}$, $\gcd T = 1$ iff no prime number divides all $a_i$. Assume $a_1$ is divisible by the prime numbers $p_1, \dots, p_m$. For each $p_i$ there exists some $a_{j_i} \in T$ st. $p_i \nmid a_{j_i}$; we claim $\gcd(a_1, a_{j_1}, \dots, a_{j_m}) = 1$. For any prime $p$:
1) if $p = p_i$ for some $i = 1, 2, \dots, m$, then $p \nmid a_{j_i}$;
2) otherwise $p \nmid a_1$.
Thus no prime number divides every one of $a_1, a_{j_1}, \dots, a_{j_m}$, and hence $\gcd(a_1, a_{j_1}, \dots, a_{j_m}) = 1$.

REMARK: if $T = \{a_1, a_2, \dots, a_k\}$ and $\gcd T = 1$, then $1 = \sum_i c_i a_i$ for some $c_1, \dots, c_k \in \mathbb{Z}$.
Given $T = \{a_1, a_2, \dots, a_k\}$ where $a_i \in \mathbb{N}$ for $i = 1, 2, \dots, k$, here we show that if $\gcd T = 1$ then 1 can be written as a linear combination of elements of $T$ with integer coefficients, i.e. there exist $c_1, c_2, \dots, c_k$ st. $1 = c_1 a_1 + c_2 a_2 + \cdots + c_k a_k$. Let $I$ be the set of integers that can be written in this form:

$I = \{c_1 a_1 + c_2 a_2 + \cdots + c_k a_k : c_1, \dots, c_k \in \mathbb{Z}\}$

Note $I$ is non-empty and contains positive numbers, since $a_1 + a_2 + \cdots + a_k \in \mathbb{N}$. Let $d$ be the least positive integer in $I$. Note 1 is the only common divisor of $T$ since $\gcd T = 1$. If we can show $d \mid a$ ($d$ divides $a$) for every $a \in T$, then $d$ is a common divisor of $T$, so $d = 1 \in I$, since 1 is the only common divisor of $T$.
Actually we can prove it. Suppose there exists $a \in T$ st. $d \nmid a$ ($d$ does not divide $a$); then write $a = qd + r$ for some $r \ge 1$, where $q = \lfloor a/d \rfloor$ is the quotient and $r = a \bmod d$ is the remainder. Since $r = a - qd$, we have $r \in I$, and $r < d$, which contradicts $d$ being the least positive number in $I$. Thus $d \mid a$ for every $a \in T$, and $d$ is a common divisor of $T$, which completes the proof.
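The Bezout-style fact above is constructive: the extended Euclidean algorithm produces the integer coefficients. A small sketch, assuming nothing beyond the standard library; the numbers 6, 10, 15 are an arbitrary example with gcd 1.

```python
# Integer coefficients with c1*a1 + ... + ck*ak = gcd(a1, ..., ak),
# via the extended Euclidean algorithm.
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y

def bezout(nums):
    """Coefficients c with sum(c_i * nums_i) = gcd(nums), folded pairwise."""
    g, coeffs = nums[0], [1]
    for a in nums[1:]:
        g, x, y = ext_gcd(g, a)
        coeffs = [c * x for c in coeffs] + [y]
    return g, coeffs

g, c = bezout([6, 10, 15])
print(g, c, sum(ci * ai for ci, ai in zip(c, [6, 10, 15])))   # 1 [-14, 7, 1] 1
```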
EX 7. Let $P$ be an irreducible transition matrix of period $d$. Show that the state space $S$ can be partitioned into $d$ sets $S_1, S_2, \dots, S_d$ in such a way that $P_{x,y} > 0$ iff $x \in S_r$ and $y \in S_{(r \bmod d) + 1}$.
Stationary Distribution. Given a homogeneous Markov chain characterized by an initial multinomial $\pi$ and transition matrix $P$, if $\pi$ satisfies $\pi P = \pi$, then $\pi$ is called a stationary distribution. Here another Greek letter, $\pi$, is used to denote the initial multinomial to indicate its being a stationary distribution. It is named "stationary" because $\mathbb{P}\{X_t = x\} = (\pi P^t)_x = \pi_x$ is a constant for any $t$ and any state $x$.
Given a Markov chain $X_t$, $t = 0, 1, \dots$ with finite state space $S$ and its graph representation $G = (V, E)$, define the hitting time for state $x \in S$ as $\tau_x = \min\{t \ge 0 : X_t = x\}$, i.e. the first time the random walk arrives at state $x$. Define the positive hitting time for $x \in S$ as $\tau_x^+ = \min\{t \ge 1 : X_t = x\}$, i.e. the first positive time the random walk arrives at state $x$; if the initial distribution is concentrated at $x$ itself, then $\tau_x^+$ is also called the first return time.
For convenience, let $\mathbb{P}_x$ and $E_x$ denote the probability measure and expectation assuming the initial distribution is concentrated at state $x$ (the random walk starts at state $x$) in the following discussion, i.e. $\mathbb{P}_x\{\cdot\} = \mathbb{P}\{\cdot \mid X_0 = x\}$ and $E_x[\cdot] = E[\cdot \mid X_0 = x]$.

Theorem 17. For any two states $x, y$ of an irreducible chain, the expected positive hitting time is finite, i.e. $E_x[\tau_y^+] < \infty$.

Since the chain is irreducible, for every pair of states $u, v$ there exists a positive integer $t_{u,v}$ st. $P^{t_{u,v}}_{u,v} > 0$. Let $r = \max_{u,v \in S} t_{u,v}$, let $\varepsilon = \min_{u,v \in S} P^{t_{u,v}}_{u,v} > 0$, and fix some $k \in \mathbb{N}$.

Notice the event $\{\tau_y^+ > kr\} \subset \{\tau_y^+ > (k-1)r\}$. Pay attention that this is a proper subset: let $x_0, x_1, \dots$ be a realization, i.e. $X_0 = x_0, X_1 = x_1, \dots$, and suppose $\tau_y^+(x_0, x_1, \dots) > (k-1)r$. Let $u = x_{(k-1)r}$; there is a $u \to y$ path of length $l \le r$. Replacing $x_{(k-1)r+1}, \dots, x_{(k-1)r+l}$ with the states of that $u \to y$ path, we get a new realization $\omega'$ satisfying $(k-1)r < \tau_y^+(\omega') \le kr$. Thus clearly $\omega' \in \{\tau_y^+ > (k-1)r\}$ but $\omega' \notin \{\tau_y^+ > kr\}$, so

$\mathbb{P}_x\{\tau_y^+ > kr\} < \mathbb{P}_x\{\tau_y^+ > (k-1)r\}$

Meanwhile, given $\tau_y^+ > (k-1)r$, no matter what $X_{(k-1)r}$ is, the random walk has probability at least $\varepsilon$ of reaching state $y$ within $r$ additional steps; that is, the random walk has probability less than $(1 - \varepsilon)$ of still missing state $y$ at time $kr$. Thus

$\mathbb{P}_x(\tau_y^+ \le kr \mid \tau_y^+ > (k-1)r) \ge \varepsilon \Rightarrow \mathbb{P}_x(\tau_y^+ > kr \mid \tau_y^+ > (k-1)r) \le 1 - \varepsilon$
$\Rightarrow \mathbb{P}_x\{\tau_y^+ > kr,\ \tau_y^+ > (k-1)r\} \le (1-\varepsilon)\,\mathbb{P}_x\{\tau_y^+ > (k-1)r\}$

Note $\mathbb{P}_x\{\tau_y^+ > kr, \tau_y^+ > (k-1)r\} = \mathbb{P}_x\{\tau_y^+ > kr\}$ since $\{\tau_y^+ > kr\} \subset \{\tau_y^+ > (k-1)r\}$, and $\mathbb{P}_x\{\tau_y^+ > 0\} = 1$; then

$\mathbb{P}_x\{\tau_y^+ > kr\} \le (1-\varepsilon)\,\mathbb{P}_x\{\tau_y^+ > (k-1)r\} \le (1-\varepsilon)^2\,\mathbb{P}_x\{\tau_y^+ > (k-2)r\} \le \cdots \le (1-\varepsilon)^k$

Note $\tau_y^+$ is a non-negative integer-valued RV, so $E_x[\tau_y^+] = \sum_{t \ge 0} \mathbb{P}_x\{\tau_y^+ > t\}$ (see the remark below). Grouping the terms $r$ at a time and using that $\mathbb{P}_x\{\tau_y^+ > t\}$ is non-increasing in $t$, we have

$E_x[\tau_y^+] = \sum_{t \ge 0} \mathbb{P}_x\{\tau_y^+ > t\} = \big(\mathbb{P}_x\{\tau_y^+ > 0\} + \cdots + \mathbb{P}_x\{\tau_y^+ > r-1\}\big) + \big(\mathbb{P}_x\{\tau_y^+ > r\} + \cdots + \mathbb{P}_x\{\tau_y^+ > 2r-1\}\big) + \cdots$
$\le r \sum_{k \ge 0} \mathbb{P}_x\{\tau_y^+ > kr\} \le r \sum_{k \ge 0} (1 - \varepsilon)^k = \frac{r}{\varepsilon} < \infty$
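A quick Monte Carlo sanity check of this finiteness, assuming NumPy; the 3-state irreducible chain and the truncation cap are illustrative choices of mine.

```python
# Empirical mean of the positive hitting time tau_y^+ for an irreducible chain.
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

def hitting_time_plus(x, y, rng, cap=10_000):
    t, s = 0, x
    while t < cap:
        s = rng.choice(3, p=P[s])
        t += 1
        if s == y:
            return t
    return cap   # truncation guard; essentially never reached for this chain

samples = [hitting_time_plus(0, 2, rng) for _ in range(20_000)]
print(np.mean(samples))   # a finite estimate of E_0[tau_2^+]
```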
Theorem 18 (Existence of a stationary distribution for irreducible chains). Let $P$ be the transition matrix of an irreducible Markov chain; then there exists a multinomial $\pi$ st. $\pi P = \pi$. In addition, $\pi_x = 1/E_x[\tau_x^+]$ for all states $x \in S$. We'll have a general discussion for reducible chains later, to show a stationary distribution always exists.

Let $z$ be an arbitrary state and suppose the random walk starts at $z$. Let $N(y)$ be a RV representing the number of visits to state $y$ before the random walk returns to $z$. Note that:
1) $N(z) = 1$, since $N(z)$ means the number of visits to $z$ before the first return to $z$ given the initial state is $z$, which only counts the initial state itself.
2) $N(y)$ counts the times $X_t = y$ for $t \ge 0$ with the restriction $\tau_z^+ > t$, i.e.

$N(y) = \sum_{t \ge 0} \mathbf{1}\{X_t = y,\ \tau_z^+ > t\}$

For example, suppose $S = \{1, 2, 3\}$, $z = 1$ and the realization is $X = (1, 3, 2, 3, 1, \dots)$; then $\tau_1^+ = 4$, and we can check

$N(3) = \mathbf{1}\{X_0 = 3, \tau_1^+ > 0\} + \mathbf{1}\{X_1 = 3, \tau_1^+ > 1\} + \mathbf{1}\{X_2 = 3, \tau_1^+ > 2\} + \mathbf{1}\{X_3 = 3, \tau_1^+ > 3\} + 0 + \cdots = 0 + 1 + 0 + 1 + 0 + \cdots = 2$

Let $\tilde\pi$ be the row vector with $\tilde\pi_y = E_z[N(y)]$. Since the expectation of an indicator random variable is the probability of its event,

$\tilde\pi_y = E_z[N(y)] = \sum_{t \ge 0} \mathbb{P}_z\{X_t = y,\ \tau_z^+ > t\}$

Note $\{X_t = y, \tau_z^+ > t\} \subseteq \{\tau_z^+ > t\}$; then $\tilde\pi_y \le \sum_{t\ge0} \mathbb{P}_z\{\tau_z^+ > t\} = E_z[\tau_z^+] < \infty$, where the equality $\sum_t \mathbb{P}_z\{\tau_z^+ > t\} = E_z[\tau_z^+]$ is from the proof of the previous theorem. Now we show $\tilde\pi$ is a left eigenvector of $P$. For any state $y$,

$(\tilde\pi P)_y = \sum_{x \in S} \tilde\pi_x P_{x,y} = \sum_{x\in S} \sum_{t\ge0} \mathbb{P}_z\{X_t = x, \tau_z^+ > t\}\, P_{x,y} = \sum_{t\ge0} \mathbb{P}_z\{X_{t+1} = y,\ \tau_z^+ \ge t+1\} = \sum_{t\ge1} \mathbb{P}_z\{X_t = y,\ \tau_z^+ \ge t\}$

where the middle equality uses that $\{\tau_z^+ > t\} = \{\tau_z^+ \ge t+1\}$ is determined by $X_0, \dots, X_t$, so given $X_t = x$ the next step is distributed according to $P_{x,\cdot}$. Compare this term by term with $\tilde\pi_y = \sum_{t\ge0} \mathbb{P}_z\{X_t = y, \tau_z^+ > t\}$:

$\mathbb{P}_z\{X_t = y, \tau_z^+ \ge t\} - \mathbb{P}_z\{X_t = y, \tau_z^+ > t\} = \mathbb{P}_z\{X_t = y, \tau_z^+ = t\}$

which is zero unless $y = z$, since $X_{\tau_z^+} = z$. If $y \ne z$, the two series also differ in the $t = 0$ term $\mathbb{P}_z\{X_0 = y\} = 0$, so $(\tilde\pi P)_y = \tilde\pi_y$. If $y = z$, then $(\tilde\pi P)_z - \tilde\pi_z = \sum_{t\ge1} \mathbb{P}_z\{\tau_z^+ = t\} - \mathbb{P}_z\{X_0 = z, \tau_z^+ > 0\} = 1 - 1 = 0$. In either case $(\tilde\pi P)_y = \tilde\pi_y$, and thus $\tilde\pi P = \tilde\pi$.

For normalization, note $\sum_{y\in S} \tilde\pi_y = \sum_{t\ge0} \sum_y \mathbb{P}_z\{X_t = y, \tau_z^+ > t\} = \sum_{t\ge0} \mathbb{P}_z\{\tau_z^+ > t\} = E_z[\tau_z^+]$, so

$\pi_y = \frac{\tilde\pi_y}{E_z[\tau_z^+]} = \frac{E_z[N(y)]}{E_z[\tau_z^+]}$

is the desired stationary distribution. Further, check

$\pi_z = \frac{E_z[N(z)]}{E_z[\tau_z^+]} = \frac{1}{E_z[\tau_z^+]}$

Since the initial state $z$ can be chosen arbitrarily (and, by the uniqueness shown below, the resulting stationary distribution is the same for every choice of $z$), we have $\pi_x = 1/E_x[\tau_x^+]$ for any $x$.
Uniqueness of the stationary distribution for irreducible chains. It has been discussed earlier that when a stochastic matrix is multiplied by a column vector on the right, $P h^T$ can be interpreted as a vector of expectations by viewing $h$ as a function.
Given a transition matrix $P$ with state space $S$, define a column vector $h^T$, and its corresponding function $h$, as being harmonic at a state $x$ iff

$h(x) = \sum_{y \in S} P(x, y)\, h(y)$

where we treat $P$ as a function as well when $h$ is treated as a function. Also say $h^T$ or $h$ is harmonic on a set $B \subseteq S$ if it is harmonic at every state of $B$. If $B = S$, then we have $P h^T = h^T$. The harmonic function here can be viewed as the discrete version of the "continuous" harmonic functions discussed in PDE.

Lemma 6. If $P$ is irreducible, then a function harmonic at every state of $S$ is constant. In other words, every component of $h^T$ is the same if $P h^T = h^T$ and $P$ is irreducible. Let $x = \arg\max_{y\in S} h(y)$, and note $h(x) = \sum_{y\in S} P(x, y) h(y)$ is a convex sum, which can equal the maximum $h(x)$ only if $h(y) = h(x)$ for every $y$ with $P(x, y) > 0$ (a convex sum is strictly smaller than the maximum if a component below the maximum carries positive weight). Propagating this along paths out of $x$, which reach every state by irreducibility, shows $h$ is constant.
This lemma implies the general solution of $P h^T = h^T$, i.e. $(P - I) h^T = 0$, is $h^T = c\,(1, \dots, 1)^T$ for some $c$; the solution space is of dimension one, so the matrix $P - I$ has rank $|S| - 1$. The same then holds for the equation $x(P - I) = 0$, i.e. $(P^T - I) x^T = 0$, since the column rank and the row rank of a square matrix are the same.

Theorem 19. The stationary distribution $\pi$ of an irreducible transition matrix is unique, since the solution space of $\pi P = \pi$, i.e. $\pi(P - I) = 0$, is of dimension one by the previous lemma, and in this solution space there is clearly only one vector with components summing to 1.
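Numerically, the unique stationary distribution can be found by solving the linear system $\pi(P - I) = 0$, $\sum_x \pi_x = 1$, and Theorem 18's identity $\pi_x = 1/E_x[\tau_x^+]$ can be checked by simulation. A minimal sketch, assuming NumPy; the chain is an arbitrary irreducible example of mine.

```python
# Stationary distribution via a linear solve, plus a check of pi_z = 1/E_z[tau_z^+].
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
n = len(P)

# Stack pi (P - I) = 0 with the normalization sum(pi) = 1.
A = np.vstack([(P - np.eye(n)).T, np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print(pi)

rng = np.random.default_rng(3)
def return_time(z):
    t, s = 0, z
    while True:
        s = rng.choice(n, p=P[s]); t += 1
        if s == z:
            return t

est = np.mean([return_time(0) for _ in range(50_000)])
print(1 / est, pi[0])   # the two should be close (Theorem 18)
```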
REMARK: if $X$ is a non-negative integer-valued random variable, then $E[X] = \sum_{t \ge 0} \mathbb{P}\{X > t\}$.
This is actually an interchange of summation order. Note that $E[X] = \sum_k k\,\mathbb{P}\{X = k\}$, i.e. each $\mathbb{P}\{X = k\}$ appears $k$ times:

$E[X] = \mathbb{P}\{X=1\}$
$\;\;+ \mathbb{P}\{X=2\} + \mathbb{P}\{X=2\}$
$\;\;+ \mathbb{P}\{X=3\} + \mathbb{P}\{X=3\} + \mathbb{P}\{X=3\}$
$\;\;+ \mathbb{P}\{X=4\} + \mathbb{P}\{X=4\} + \mathbb{P}\{X=4\} + \mathbb{P}\{X=4\}$
$\;\;+ \cdots$

The original summation is over rows, but we can sum over columns as well. Note $\mathbb{P}\{X > 0\} = \mathbb{P}\{X = 1\} + \mathbb{P}\{X = 2\} + \cdots$ is the sum over the first column, $\mathbb{P}\{X > 1\} = \mathbb{P}\{X = 2\} + \mathbb{P}\{X = 3\} + \cdots$ is the sum over the second column, etc. Clearly

$E[X] = \sum_k k\,\mathbb{P}\{X = k\} = \mathbb{P}\{X > 0\} + \mathbb{P}\{X > 1\} + \cdots = \sum_{t \ge 0} \mathbb{P}\{X > t\}$
REMARK: Convex Sum
$s = \sum_i w_i x_i$ is called a convex sum of the components $x_1, \dots, x_n$ if $w_i \ge 0$ and $\sum_i w_i = 1$. The average value is the special case of a convex sum where $w_1 = \cdots = w_n = 1/n$. If all weights are positive and not all components are identical, then

$\min\{x_1, \dots, x_n\} < s < \max\{x_1, \dots, x_n\}$

Let $M = \max\{x_1, \dots, x_n\}$; then

$s = \sum_{i : x_i = M} w_i x_i + \sum_{i : x_i < M} w_i x_i < \sum_{i : x_i = M} w_i M + \sum_{i : x_i < M} w_i M = M$

where the inequality is strict because the group $\{i : x_i < M\}$ is non-empty and carries positive weight. A similar argument shows $\min\{x_1, \dots, x_n\} < s$. (If some weights are allowed to be zero, the conclusion weakens to $\min \le s \le \max$, strict whenever a non-maximal component carries positive weight.)
EX 8. Show that the uniform distribution is stationary for $P$ iff every column of $P$ sums to 1.
KEY. Sufficiency. If every column of $P$ sums to 1, let $\pi = (1/n, \dots, 1/n)$ be uniform and verify that

$(\pi P)_y = \sum_{x\in S} \pi_x P_{x,y} = \frac{1}{n} \sum_{x\in S} P_{x,y} = \frac{1}{n} = \pi_y$

Necessity. If $\pi$ is uniform and stationary for $P$, then

$\frac{1}{n} = \pi_y = (\pi P)_y = \frac{1}{n} \sum_{x\in S} P_{x,y} \Rightarrow \sum_{x\in S} P_{x,y} = 1$

EX 9. Show that if $P$ is symmetric, then the uniform distribution is stationary for $P$.
KEY. If $P$ is symmetric, then clearly every column of $P$ sums to 1 (each column equals the corresponding row), and the previous exercise implies the uniform distribution is stationary. However, note that a non-symmetric matrix may also have the uniform stationary distribution by the previous exercise, e.g.

$P = \begin{pmatrix} 0.1 & 0.4 & 0.5 \\ 0.2 & 0.5 & 0.3 \\ 0.7 & 0.1 & 0.2 \end{pmatrix}$

every column of which sums to 1; obviously the uniform distribution is its stationary distribution. It is worth mentioning that a special case of this is the simple random walk on a $d$-regular graph (the degree of every vertex is $d$), in which $P_{x,y} = 1/d$ if $(x, y) \in E$ and $P_{x,y} = 0$ otherwise; hence the transition matrix of a simple random walk on a $d$-regular graph is symmetric, and the uniform distribution is its stationary distribution.
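A quick check of EX 8/EX 9 on the non-symmetric matrix from the text, assuming NumPy:

```python
# Columns sum to 1 (doubly stochastic), so the uniform distribution is
# stationary even though the matrix is not symmetric.
import numpy as np

P = np.array([[0.1, 0.4, 0.5],
              [0.2, 0.5, 0.3],
              [0.7, 0.1, 0.2]])
pi = np.full(3, 1/3)
print(P.sum(axis=0))             # columns sum to 1
print(np.allclose(pi @ P, pi))   # True: uniform is stationary
```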
EX 10. Let $P$ be the transition matrix of an irreducible Markov chain with state space $S$. Let $B \subset S$ be a non-empty subset of the state space, and assume $h : S \to \mathbb{R}$ is a function harmonic at all states $x \notin B$. Prove that if $h$ is non-constant, then the maximum $\max_{x \in S} h(x)$ is attained at some state of $B$.
KEY. Suppose the maximum is attained only at states outside $B$, and let $x_0 \notin B$ be such a maximizer; then $h$ is harmonic at $x_0$. As in Lemma 6, harmonicity at a maximizer forces $h(y) = h(x_0)$ for every $y$ with $P(x_0, y) > 0$, and by assumption these maximizers again lie outside $B$, so they are again harmonic and the maximum propagates along every path leaving $x_0$. By irreducibility every state is reachable from $x_0$, so $h$ attains its maximum everywhere, i.e.

$h(y) = \max_{x\in S} h(x)$ for all $y \in S$

contradicting that $h$ is not constant. Thus some maximizer belongs to $B$.
Here the letter of the set, $B$, stands for boundary. This exercise is analogous to the weak maximum principle for "continuous" harmonic functions, which says the maximum is only to be found on the boundary if the function is not constant.
EX 11. Assume an irreducible Markov chain starting at state $z$. Define $\tau_{z,0} = 0$ and $\tau_{z,k} = \min\{t : t > \tau_{z,k-1},\ X_t = z\}$. In plain words, $\tau_{z,k}$ is the $k$th time $z$ is re-visited, and of course $\tau_{z,1} = \tau_z^+$. Let $B_k = \tau_{z,k} - \tau_{z,k-1}$ for $k \ge 1$; then $B_k$ is the $k$th time interval, between the $(k-1)$th visit and the $k$th visit of $z$. Show that $B_1, B_2, \dots, B_k$ are independent RVs for any finite $k$.
KEY. $\tau_{z,k}$ is finite (with probability 1) for every finite $k$, since the chain is assumed irreducible. Let $b_1, \dots, b_k$ be any positive integers, let $t_i = b_1 + \cdots + b_i$ for $i = 1, 2, \dots, k$, and we show

$\mathbb{P}\{B_k = b_k \mid B_1 = b_1, \dots, B_{k-1} = b_{k-1}\} = \mathbb{P}\{B_k = b_k\}$

Note $\{B_1 = b_1, \dots, B_{k-1} = b_{k-1}\}$ is exactly the event that $X_{t_i} = z$ for each $i \le k-1$ and $X_t \ne z$ at the times in between; it is determined by $X_0, \dots, X_{t_{k-1}}$ and contains $X_{t_{k-1}} = z$. Then

$\mathbb{P}\{B_k = b_k \mid B_1 = b_1, \dots, B_{k-1} = b_{k-1}\} = \mathbb{P}\{X_{t_{k-1}+1} \ne z, \dots, X_{t_k - 1} \ne z, X_{t_k} = z \mid B_1 = b_1, \dots, B_{k-1} = b_{k-1}\}$
$= \mathbb{P}_z\{X_1 \ne z, \dots, X_{b_k - 1} \ne z, X_{b_k} = z\} = \mathbb{P}_z\{\tau_z^+ = b_k\}$

where the second equality is due to the Markov property (Theorem 12, with present time $t_{k-1}$ and $X_{t_{k-1}} = z$ given) together with EX 5 (homogeneity shifts time $t_{k-1}$ back to 0).
Meanwhile, note that $\{B_k = b_k\}$ decomposes over the possible values of $\tau_{z,k-1}$, so

$\mathbb{P}\{B_k = b_k\} = \sum_t \mathbb{P}\{X_{t+1} \ne z, \dots, X_{t+b_k-1} \ne z, X_{t+b_k} = z \mid \tau_{z,k-1} = t\}\,\mathbb{P}\{\tau_{z,k-1} = t\} = \sum_t \mathbb{P}_z\{\tau_z^+ = b_k\}\,\mathbb{P}\{\tau_{z,k-1} = t\} = \mathbb{P}_z\{\tau_z^+ = b_k\}$

Here the conditional step is again by Theorem 12 and EX 5 (the event $\{\tau_{z,k-1} = t\}$ is determined by $X_0, \dots, X_t$ and implies $X_t = z$), and the last equality is due to $\sum_t \mathbb{P}\{\tau_{z,k-1} = t\} = 1$, since the summation exhausts all possibilities of $\tau_{z,k-1}$. Hereby

$\mathbb{P}\{B_k = b_k \mid B_1 = b_1, \dots, B_{k-1} = b_{k-1}\} = \mathbb{P}_z\{\tau_z^+ = b_k\} = \mathbb{P}\{B_k = b_k\}$

and the proof is completed. In EX 12 we will prove $B_k$, $k = 1, 2, \dots$ are actually i.i.d.
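The i.i.d. structure of the return-time gaps can be seen empirically. A minimal sketch, assuming NumPy; the chain is an arbitrary irreducible example of mine, and near-zero correlation is only illustrative evidence of independence, not a proof.

```python
# Successive return-time gaps to a state z: similar empirical distributions
# and near-zero correlation between consecutive gaps.
import numpy as np

rng = np.random.default_rng(4)
P = np.array([[0.2, 0.8, 0.0],
              [0.3, 0.0, 0.7],
              [0.6, 0.4, 0.0]])

z, s, gaps, t_last, t = 0, 0, [], 0, 0
for _ in range(200_000):
    s = rng.choice(3, p=P[s]); t += 1
    if s == z:
        gaps.append(t - t_last); t_last = t

g = np.array(gaps)
print(np.mean(g[::2]), np.mean(g[1::2]))   # similar means (identical dist.)
print(np.corrcoef(g[:-1], g[1:])[0, 1])    # near 0 (independence)
```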
Stopping Time & Strong Markov Property. Given a sequence of random variables $X_0, \dots, X_T$, a stopping time is a RV $\tau$ st. the event $\{\tau = t\}$ can be determined by $X_0, \dots, X_t$ for $t = 0, 1, \dots, T$. The stopping time generalizes the idea of hitting times, return times, etc. discussed earlier. Recall that:
1) The first hitting time $\tau_x$ is the first time state $x$ is visited. Any outcome $\omega \in \{\tau_x = t\}$ iff $X_0, \dots, X_{t-1} \ne x$ and $X_t = x$.
2) For a Markov chain $(\mu, P)$, the first return time $\tau_x^+$ is the first time state $x$ is re-visited. Any outcome $\omega \in \{\tau_x^+ = t\}$ iff $X_1, \dots, X_{t-1} \ne x$ and $X_t = x$.
3) For a Markov chain $(\mu, P)$, the $k$th return time $\tau_{x,k}$ is the $k$th time state $x$ is re-visited. Any outcome $\omega \in \{\tau_{x,k} = t\}$ iff exactly $(k-1)$ of the RVs $X_1, \dots, X_{t-1}$ take value $x$, and $X_t = x$.
Apparently $\{\tau_x = t\}$, $\{\tau_x^+ = t\}$ and $\{\tau_{x,k} = t\}$ can be determined by only looking at $X_0, \dots, X_t$, so $\tau_x, \tau_x^+, \tau_{x,k}$ are all stopping times.
Theorem 20 (Strong Markov Property). Let $\tau$ be a stopping time of a homogeneous Markov chain with state space $S$ and transition matrix $P$. The information $\tau < \infty$ plus $X_\tau = x$ is sufficient to decouple any history and future: $X_\tau, X_{\tau+1}, \dots, X_{\tau+l}$ is a Markov chain $(\delta_x, P)$ for any $l \in \mathbb{N}$, and it is independent of $X_0, \dots, X_{\tau-1}$.
More precisely, fix any $l \in \mathbb{N}$, let $F$ be the event $\{X_{\tau+1} = x_1, \dots, X_{\tau+l} = x_l\}$ where $x_1, \dots, x_l$ are arbitrary states, and let $H$ be any event determined by $X_0, \dots, X_{\tau-1}$; then

$\mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = x) = \mathbb{P}(F \mid \tau < \infty, X_\tau = x)\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x) = \mathbb{P}_x\{X_1 = x_1, \dots, X_l = x_l\}\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x)$

where $\mathbb{P}_x\{X_1 = x_1, \dots, X_l = x_l\} = P_{x,x_1} P_{x_1,x_2} \cdots P_{x_{l-1},x_l}$. Since $X_\tau$ is automatically known by the definition of a variety of stopping times like the first hitting time, first return time, etc., the strong Markov property implies that knowing $\tau$ is finite is often sufficient to decouple $F \cap H$.
Let's start by fixing some arbitrary finite $t$ as the present time. Since $H \cap \{\tau = t\}$ is determined by $X_0, \dots, X_t$ and $F$ is determined by the future, Theorem 12 gives

$\mathbb{P}(F \cap H \mid \tau = t, X_t = x) = \mathbb{P}(F \mid \tau = t, X_t = x)\,\mathbb{P}(H \mid \tau = t, X_t = x)$

Further, by EX 5,

$\mathbb{P}(F \mid \tau = t, X_t = x) = \mathbb{P}(X_{t+1} = x_1, \dots, X_{t+l} = x_l \mid \tau = t, X_t = x) = P_{x,x_1} P_{x_1,x_2} \cdots P_{x_{l-1},x_l}$

As a result,

$\mathbb{P}(F \cap H \mid \tau = t, X_t = x) = P_{x,x_1} \cdots P_{x_{l-1},x_l}\;\mathbb{P}(H \mid \tau = t, X_t = x)$

Multiply both this equation and the one for $F$ alone by $\mathbb{P}\{\tau = t, X_t = x\}$:

$\mathbb{P}(F \cap \{\tau = t\} \cap \{X_t = x\}) = P_{x,x_1} \cdots P_{x_{l-1},x_l}\;\mathbb{P}\{\tau = t, X_t = x\}$
$\mathbb{P}(F \cap H \cap \{\tau = t\} \cap \{X_t = x\}) = P_{x,x_1} \cdots P_{x_{l-1},x_l}\;\mathbb{P}(H \cap \{\tau = t\} \cap \{X_t = x\})$

Note the factor $P_{x,x_1} \cdots P_{x_{l-1},x_l}$ is independent of $t$. Summing over all possible $t \in \mathbb{N}$ on both sides of both equations, we have

$\mathbb{P}(F \cap \{\tau < \infty\} \cap \{X_\tau = x\}) = P_{x,x_1} \cdots P_{x_{l-1},x_l}\;\mathbb{P}\{\tau < \infty, X_\tau = x\}$

and similarly

$\mathbb{P}(F \cap H \cap \{\tau < \infty\} \cap \{X_\tau = x\}) = P_{x,x_1} \cdots P_{x_{l-1},x_l}\;\mathbb{P}(H \cap \{\tau < \infty\} \cap \{X_\tau = x\})$

Dividing both sides by $\mathbb{P}\{\tau < \infty, X_\tau = x\}$, the first equation gives

$\mathbb{P}(F \mid \tau < \infty, X_\tau = x) = P_{x,x_1} \cdots P_{x_{l-1},x_l}$

which can be treated as a special case of the strong Markov property when there is no historical event, and interpreted as the chain starting afresh at the stopping time. The second gives

$\mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = x) = P_{x,x_1} \cdots P_{x_{l-1},x_l}\;\mathbb{P}(H \mid \tau < \infty, X_\tau = x) = \mathbb{P}(F \mid \tau < \infty, X_\tau = x)\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x)$

As a summary,

$\mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = x) = \mathbb{P}_x\{X_1 = x_1, \dots, X_l = x_l\}\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x)$
Note that if the Markov chain is not homogeneous, the strong Markov property may fail. In the above proof of the strong Markov property, the crucial line is

$\mathbb{P}(F \cap H \mid \tau = t, X_t = x) = P_{x,x_1} P_{x_1,x_2} \cdots P_{x_{l-1},x_l}\;\mathbb{P}(H \mid \tau = t, X_t = x)$

where the factor $P_{x,x_1} \cdots P_{x_{l-1},x_l}$ is independent of $t$, which leads to

$\sum_{t\in\mathbb{N}} \mathbb{P}(F \cap H \cap \{\tau = t\} \cap \{X_t = x\}) = P_{x,x_1} \cdots P_{x_{l-1},x_l} \sum_{t\in\mathbb{N}} \mathbb{P}(H \cap \{\tau = t\} \cap \{X_t = x\})$
$\Rightarrow \mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = x) = \mathbb{P}(F \mid \tau < \infty, X_\tau = x)\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x)$

However, for an inhomogeneous Markov chain we only have

$\mathbb{P}(F \cap H \mid \tau = t, X_t = x) = \mathbb{P}(F \mid \tau = t, X_t = x)\,\mathbb{P}(H \mid \tau = t, X_t = x)$

where $\mathbb{P}(F \mid \tau = t, X_t = x)$ may depend on $t$, and this does not imply

$\mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = x) = \mathbb{P}(F \mid \tau < \infty, X_\tau = x)\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x)$
Consider a simple Markov chain $X_0, X_1, X_2, \dots$ with state space $S = \{0, 1, 2\}$, a uniform initial distribution, and separate transition matrices for odd times and even times,

$P^{(\mathrm{odd})} = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \qquad P^{(\mathrm{even})} = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$

corresponding to two deterministic random walks on a 3-cycle: the odd-time transition rotates the cycle one way, and the even-time transition rotates it back.
Define the stopping time $\tau = \tau_1^+$, the first positive time the walk hits state 1; let $F$ be the event that the random walk is at state 0 at every even time after $\tau$, and let $H$ be the event that the random walk is at state 0 at all past times before $\tau$. Note that $\tau = 1$ iff $X_0 = 0$, $\tau = 2$ iff $X_0 = 1$, and $\tau = \infty$ iff $X_0 = 2$; moreover $X_\tau = 1$ whenever $\tau < \infty$.
1) It is not possible for $\tau$ to be other than $1, 2, \infty$; by convention $\mathbb{P}(F \cap H \mid \tau \notin \{1, 2, \infty\}) = 0$.
2) When $\tau = 1$: the trajectory is $X_0 = 0, X_1 = 1, X_2 = 0, X_3 = 1, \dots$, so both $F$ and $H$ hold, thus
$\mathbb{P}(F \cap H \mid \tau = 1) = \mathbb{P}(F \mid \tau = 1) = \mathbb{P}(H \mid \tau = 1) = 1$
3) When $\tau = 2$: the trajectory is $X_0 = 1, X_1 = 2, X_2 = 1, X_3 = 2, \dots$, which never visits 0, so neither $F$ nor $H$ holds, thus
$\mathbb{P}(F \cap H \mid \tau = 2) = \mathbb{P}(F \mid \tau = 2) = \mathbb{P}(H \mid \tau = 2) = 0$
Clearly this chain satisfies $\mathbb{P}(F \cap H \mid \tau = t, X_t = 1) = \mathbb{P}(F \mid \tau = t, X_t = 1)\,\mathbb{P}(H \mid \tau = t, X_t = 1)$ for every finite $t = 0, 1, 2, \dots$. However, $\tau < \infty$ iff $X_0 \in \{0, 1\}$, with $\mathbb{P}(\tau = 1) = \mathbb{P}(\tau = 2) = 1/3$, which gives

$\mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = 1) = \frac{\mathbb{P}(F \cap H \cap \{\tau = 1\}) + \mathbb{P}(F \cap H \cap \{\tau = 2\})}{\mathbb{P}(\tau = 1) + \mathbb{P}(\tau = 2)} = \frac{1/3 + 0}{2/3} = \frac12$

and similarly we can compute $\mathbb{P}(F \mid \tau < \infty, X_\tau = 1) = \mathbb{P}(H \mid \tau < \infty, X_\tau = 1) = \frac12$. Thus

$\mathbb{P}(F \cap H \mid \tau < \infty, X_\tau = 1) = \frac12 \ne \frac14 = \mathbb{P}(F \mid \tau < \infty, X_\tau = 1)\,\mathbb{P}(H \mid \tau < \infty, X_\tau = 1)$
Theorem 21 (Extension of the strong Markov property to any future event). Again note the crucial line of the proof,

$\mathbb{P}(F \cap H \mid \tau = t, X_t = x) = \mathbb{P}_x(X_1 = x_1, \dots, X_l = x_l)\,\mathbb{P}(H \mid \tau = t, X_t = x)$

the objective of which is to reduce $\mathbb{P}(F \mid \tau = t, X_t = x)$ to something not containing $t$, so that we can sum over $t$ on both sides. For an event other than $F = \{X_{\tau+1} = x_1, \dots, X_{\tau+l} = x_l\}$, the strong Markov property applies if we can do the same thing.
Denote the event $\{X_{\tau+1} = x_1, \dots, X_{\tau+l} = x_l\}$ by $F_{x_1,\dots,x_l}$ instead of $F$. Let $F^*$ be any future event determined by $X_{\tau+1}, \dots, X_{\tau+l}$; then $F^*$ is a disjoint union of events $F_{x_1,\dots,x_l}$ over some set of tuples $(x_1, \dots, x_l)$, so

$\mathbb{P}(F^* \cap H \mid \tau = t, X_t = x) = \sum_{F_{x_1,\dots,x_l} \subseteq F^*} \mathbb{P}(F_{x_1,\dots,x_l} \cap H \mid \tau = t, X_t = x) = \sum_{F_{x_1,\dots,x_l} \subseteq F^*} \mathbb{P}_x(x_1, \dots, x_l)\,\mathbb{P}(H \mid \tau = t, X_t = x)$

Let $F^{*(0)} = \bigcup \{X_1 = x_1, \dots, X_l = x_l\}$, the disjoint union ranging over the tuples with $F_{x_1,\dots,x_l} \subseteq F^*$; then $\mathbb{P}_x(F^{*(0)}) = \sum_{F_{x_1,\dots,x_l} \subseteq F^*} \mathbb{P}_x(x_1, \dots, x_l)$, and so

$\mathbb{P}(F^* \cap H \mid \tau = t, X_t = x) = \mathbb{P}_x(F^{*(0)})\,\mathbb{P}(H \mid \tau = t, X_t = x)$

Since $\mathbb{P}_x(F^{*(0)})$ is no longer dependent on $\tau = t$, summing over $t$ as before gives

$\mathbb{P}(F^* \cap H \mid \tau < \infty, X_\tau = x) = \mathbb{P}_x(F^{*(0)})\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x) = \mathbb{P}(F^* \mid \tau < \infty, X_\tau = x)\,\mathbb{P}(H \mid \tau < \infty, X_\tau = x)$

A similar argument shows the following holds when there is no historical event:

$\mathbb{P}(F^* \mid \tau < \infty, X_\tau = x) = \mathbb{P}_x(F^{*(0)})$

An example: the sums $Y_H = \sum_{t=0}^{\tau_x} f(X_t)$ and $Y_F = \sum_{t=\tau_x+1}^{\tau_x+l} f(X_t)$ are independent of each other for any function $f$ and any $l \in \mathbb{N}$ if the Markov chain is irreducible. For any $a$, $\{Y_H = a\}$ is an event determined by $X_0, \dots, X_{\tau_x}$, i.e. a historical event (given that $X_{\tau_x} = x$ is known); for any $b$, $\{Y_F = b\}$ is an event determined by $X_{\tau_x+1}, \dots, X_{\tau_x+l}$, i.e. a future event. Irreducibility provides the information $\tau_x < \infty$ with probability 1, and $X_{\tau_x} = x$ holds automatically; then

$\mathbb{P}\{Y_H = a,\ Y_F = b\} = \mathbb{P}\{Y_H = a\}\,\mathbb{P}\{Y_F = b\}$
Extension of the strong Markov property to multiple stopping times. We will see that the strong Markov property can be applied to multiple stopping times.

Lemma 7. First we prove a lemma saying that providing additional future and historical events as conditions does not break independence. As before let $\tau$ be a stopping time, and let $F$ and $H$ be any future event and historical event. In addition, let $F'$, $H'$ be another arbitrary future event and historical event. Abbreviate the condition $\{\tau < \infty, X_\tau = x\}$ as $R$. Since $F \cap F'$ is again a future event and $H \cap H'$ is again a historical event, the strong Markov property factorizes every mixed probability given $R$:

$\mathbb{P}(F, H \mid R, F', H') = \frac{\mathbb{P}(F, F', H, H', R)}{\mathbb{P}(F', H', R)} = \frac{\mathbb{P}(F, F' \mid R)\,\mathbb{P}(H, H' \mid R)}{\mathbb{P}(F' \mid R)\,\mathbb{P}(H' \mid R)} = \mathbb{P}(F \mid R, F')\,\mathbb{P}(H \mid R, H')$

Taking $F = \Omega$ (respectively $H = \Omega$) in this identity gives

$\mathbb{P}(H \mid R, F', H') = \mathbb{P}(H \mid R, H'), \qquad \mathbb{P}(F \mid R, F', H') = \mathbb{P}(F \mid R, F')$

It follows that

$\mathbb{P}(F, H \mid R, F', H') = \mathbb{P}(F \mid R, F', H')\,\mathbb{P}(H \mid R, F', H')$

meaning that providing any additional future event $F'$ and historical event $H'$ in the condition does not break the independence between any future event and any historical event. From another perspective, if the conditions can be broken into $\tau < \infty$, $X_\tau = x$, a future event and a historical event, then any future event and historical event are conditionally independent.
Theorem 22. Now given $\tau_0 = 0$ and multiple stopping times $\tau_1 < \tau_2 < \cdots < \tau_k$, let $A_i$ be an event determined by $X_{\tau_{i-1}}, \dots, X_{\tau_i}$ for $i = 1, 2, \dots, k$. In addition, let $C = \{X_{\tau_1} = x_1, \dots, X_{\tau_k} = x_k\}$. Then

$\mathbb{P}\Big(\bigcap_{i=1}^{k} A_i \,\Big|\, \tau_k < \infty, C\Big) = \prod_{i=1}^{k} \mathbb{P}(A_i \mid \tau_k < \infty, C)$

This is done by an argument similar to Lemma 7. First choose $\tau_{k-1}$ as the present time; then $A_1 \cap \cdots \cap A_{k-1}$ is a historical event and $A_k$ is a future event. Also note that the conditioning event $\{\tau_k < \infty\} \cap C$ can be broken into $\tau_{k-1} < \infty$, $X_{\tau_{k-1}} = x_{k-1}$, the future event $\{\tau_k < \infty, X_{\tau_k} = x_k\}$ and the historical event $\{X_{\tau_1} = x_1, \dots, X_{\tau_{k-2}} = x_{k-2}\}$. By Lemma 7,

$\mathbb{P}(A_1, \dots, A_k \mid \tau_k < \infty, C) = \mathbb{P}(A_1, \dots, A_{k-1} \mid \tau_k < \infty, C)\,\mathbb{P}(A_k \mid \tau_k < \infty, C)$

Likewise, to split $\mathbb{P}(A_1, \dots, A_{k-1} \mid \tau_k < \infty, C)$, choose $\tau_{k-2}$ as the present time, and it is easy to verify that $\{\tau_k < \infty\} \cap C$ can again be broken into $\tau_{k-2} < \infty$, $X_{\tau_{k-2}} = x_{k-2}$, a future event and a historical event. Iterating, we arrive at

$\mathbb{P}\Big(\bigcap_{i=1}^{k} A_i \,\Big|\, \tau_k < \infty, C\Big) = \prod_{i=1}^{k} \mathbb{P}(A_i \mid \tau_k < \infty, C)$
REMARK
Observe that the following inference does not necessarily hold, where $X, Y$ are RVs and $A, B$ are overloaded to represent events determined by $X, Y$ respectively:

$\mathbb{P}(A, B \mid \tau = t) = \mathbb{P}(A \mid \tau = t)\,\mathbb{P}(B \mid \tau = t) \;\nRightarrow\; \mathbb{P}(A, B \mid \tau < \infty) = \mathbb{P}(A \mid \tau < \infty)\,\mathbb{P}(B \mid \tau < \infty)$

A counterexample: $\tau$ is uniform on $\{1, 2\}$, and $(X, Y)$ are distributed according to

given $\tau = 1$:            given $\tau = 2$:
        Y=1    Y=2                  Y=1    Y=2
X=1     0.25   0.25          X=1    0.24   0.36
X=2     0.25   0.25          X=2    0.16   0.24

It is easy to verify $\mathbb{P}(X = a, Y = b \mid \tau = t) = \mathbb{P}(X = a \mid \tau = t)\,\mathbb{P}(Y = b \mid \tau = t)$ for any $a, b, t \in \{1, 2\}$; however

$\mathbb{P}(X = 1, Y = 1 \mid \tau < \infty) = \tfrac12 \times 0.25 + \tfrac12 \times 0.24 = 0.245$
$\mathbb{P}(X = 1 \mid \tau < \infty)\,\mathbb{P}(Y = 1 \mid \tau < \infty) = \big(\tfrac12 \times 0.5 + \tfrac12 \times 0.6\big)\big(\tfrac12 \times 0.5 + \tfrac12 \times 0.4\big) = 0.55 \times 0.45 = 0.2475$

Clearly $\mathbb{P}(X = 1, Y = 1 \mid \tau < \infty) \ne \mathbb{P}(X = 1 \mid \tau < \infty)\,\mathbb{P}(Y = 1 \mid \tau < \infty)$.
EX 12. We have proved in EX 11 that the gaps $B_k = \tau_{z,k} - \tau_{z,k-1}$ are independent. Now show they are identically distributed, using the strong Markov property.
KEY. For every $k = 1, 2, \dots$, simply by the strong Markov property without historical event (with $\tau_{z,k-1} < \infty$ guaranteed by irreducibility and $X_{\tau_{z,k-1}} = z$ known), we have

$\mathbb{P}\{B_k = b \mid \tau_{z,k-1} < \infty\} = \mathbb{P}_z\{X_1 \ne z, \dots, X_{b-1} \ne z, X_b = z\} = \mathbb{P}_z\{\tau_z^+ = b\}$

for any $b \in \mathbb{N}$; the right-hand side does not depend on $k$, so the $B_k$ are identically distributed.
EX 13. Assume the Markov chain is irreducible. Let $\tau_{z,k}$ be the $k$th return time to state $z$, as defined in previous discussions. Let

$Y_k = \sum_{t = \tau_{z,k-1}+1}^{\tau_{z,k}} f(X_t)$

for $k = 1, 2, \dots$, which is the sum between the $(k-1)$th and the $k$th visit of $z$. Show that $Y_1, Y_2, \dots, Y_k$ are i.i.d. for any finite $k$.
KEY. Show they are identically distributed. Similar to the previous exercise, for every $k = 1, 2, \dots$ we have, by the strong Markov property at $\tau_{z,k-1}$,

$\mathbb{P}\{Y_k = y \mid \tau_{z,k-1} < \infty\} = \sum_{b \in \mathbb{N}} \mathbb{P}_z\Big\{\sum_{t=1}^{b} f(X_t) = y,\ X_1 \ne z, \dots, X_{b-1} \ne z,\ X_b = z\Big\}$

for every $y$; the right-hand side does not depend on $k$, thus the $Y_k$ are identically distributed.
Show independence. Fix some arbitrary $y_k$, $k = 1, 2, \dots$; the events $\{Y_k = y_k\}$, $k = 1, 2, \dots$ are determined by the RVs $X_{\tau_{z,k-1}}, \dots, X_{\tau_{z,k}}$ respectively, so those events are independent by the strong Markov property for multiple stopping times (Theorem 22, conditioning on $X_{\tau_{z,i}} = z$ for all $i$, which holds automatically). Since the $Y_k$ are discrete RVs, the independence of the events $\{Y_k = y_k\}$ implies the independence of the $Y_k$.
Time Reversibility. A Markov chain $X_0, X_1, \dots$ with initial distribution $\pi$ and transition matrix $P$ is said to be reversible if the joint distribution of $(X_0, X_1, \dots, X_t)$ is the same as that of $(X_t, X_{t-1}, \dots, X_0)$ for any $t \ge 0$. Recall that if we say the joint distribution of $(X, Y)$ is identical to $(Y, X)$, we mean $\mathbb{P}\{X = a, Y = b\} = \mathbb{P}\{Y = a, X = b\}$ for any realization $a, b$; reversibility is then actually

$\mathbb{P}\{X_0 = x_0, X_1 = x_1, \dots, X_t = x_t\} = \mathbb{P}\{X_0 = x_t, X_1 = x_{t-1}, \dots, X_t = x_0\}$

for any realization $x_0, \dots, x_t$ and any $t$, and equivalently,

$\pi_{x_0} P_{x_0,x_1} \cdots P_{x_{t-1},x_t} = \pi_{x_t} P_{x_t,x_{t-1}} \cdots P_{x_1,x_0}$

From the definition we can easily see a typical case when a chain is reversible: when $\pi$ is uniform and $P$ is symmetric. Thus we can see an irreducible chain is not necessarily reversible, for example when $\pi$ is uniform but $P$ is not symmetric. Also note a reversible chain is not necessarily irreducible; check EX 15 for an example.
We now show it actually only takes a simpler condition for a Markov chain to be reversible. If a multinomial $\pi$ satisfies $\pi_x P_{x,y} = \pi_y P_{y,x}$ for any $x, y \in S$, then we say $\pi, P$ satisfy detailed balance. Detailed balance implies the Markov chain characterized by $\pi, P$ satisfies

$\mathbb{P}\{X_0 = x, X_1 = y\} = \mathbb{P}\{X_0 = y, X_1 = x\}$

for any $x, y \in S$.

Theorem 23. $\pi, P$ satisfy detailed balance iff the Markov chain is reversible. For one direction, repeatedly moving $\pi$ across one factor at a time using detailed balance gives

$\pi_{x_0} P_{x_0,x_1} P_{x_1,x_2} \cdots P_{x_{t-1},x_t} = \pi_{x_1} P_{x_1,x_0} P_{x_1,x_2} \cdots P_{x_{t-1},x_t} = \cdots = \pi_{x_t} P_{x_t,x_{t-1}} \cdots P_{x_1,x_0}$

For the other direction, the definition of a reversible Markov chain already guarantees detailed balance, since it says $\pi_{x_0} P_{x_0,x_1} \cdots P_{x_{t-1},x_t} = \pi_{x_t} P_{x_t,x_{t-1}} \cdots P_{x_1,x_0}$ for any realization and any $t$; simply choose $t = 1$ and we have $\pi_x P_{x,y} = \pi_y P_{y,x}$.
Theorem 24. If $\pi, P$ satisfy detailed balance, then $\pi$ is a stationary distribution, simply due to

$(\pi P)_y = \sum_x \pi_x P_{x,y} = \sum_x \pi_y P_{y,x} = \pi_y \sum_x P_{y,x} = \pi_y$

Let $\mathcal{M}$ be an irreducible Markov chain characterized by $\pi, P$, where $\pi$ is stationary for $P$. Define a new Markov chain $\hat{\mathcal{M}} = \hat X_0, \hat X_1, \dots$ characterized by the same initial $\pi$ and a new transition matrix

$\hat P_{x,y} := \frac{\pi_y P_{y,x}}{\pi_x}$

Then $\hat{\mathcal{M}}$ is said to be the time reversal of $\mathcal{M}$ if

$\mathbb{P}_{\mathcal{M}}\{X_0 = x_0, X_1 = x_1, \dots, X_t = x_t\} = \mathbb{P}_{\hat{\mathcal{M}}}\{X_0 = x_t, X_1 = x_{t-1}, \dots, X_t = x_0\}$

for any realization and any $t$, where the subscripts of $\mathbb{P}$ emphasize that the probability measures are with respect to different chains. Here $\pi$ has to be stationary for $P$ to ensure $\hat P$ is stochastic, since

$\sum_{y\in S} \hat P_{x,y} = \sum_{y\in S} \frac{\pi_y P_{y,x}}{\pi_x} = \frac{(\pi P)_x}{\pi_x} = 1 \text{ iff } (\pi P)_x = \pi_x$

and the following will show the so-defined $\hat{\mathcal{M}}$ is indeed a time reversal with no condition other than $\pi$ being stationary for $P$. Note $\pi$ is stationary for $\hat P$ as well; simply check $(\pi \hat P)_y = \sum_x \pi_x \hat P_{x,y} = \sum_x \pi_y P_{y,x} = \pi_y$.

Theorem 25. $\hat{\mathcal{M}}$ is the time reversal of $\mathcal{M}$. Note that

$\mathbb{P}_{\hat{\mathcal{M}}}\{X_0 = x_t, X_1 = x_{t-1}, \dots, X_t = x_0\} = \pi_{x_t} \hat P_{x_t,x_{t-1}} \hat P_{x_{t-1},x_{t-2}} \cdots \hat P_{x_1,x_0}$
$= \pi_{x_t} \cdot \frac{\pi_{x_{t-1}} P_{x_{t-1},x_t}}{\pi_{x_t}} \cdot \frac{\pi_{x_{t-2}} P_{x_{t-2},x_{t-1}}}{\pi_{x_{t-1}}} \cdots \frac{\pi_{x_0} P_{x_0,x_1}}{\pi_{x_1}}$
$= \pi_{x_0} P_{x_0,x_1} \cdots P_{x_{t-1},x_t} = \mathbb{P}_{\mathcal{M}}\{X_0 = x_0, \dots, X_t = x_t\}$

It is clear that if $\mathcal{M}$ is reversible, or equivalently $\pi, P$ satisfy detailed balance, then

$\hat P_{x,y} = \frac{\pi_y P_{y,x}}{\pi_x} = \frac{\pi_x P_{x,y}}{\pi_x} = P_{x,y} \Rightarrow \hat P = P$

so $\hat{\mathcal{M}} = \mathcal{M}$ and $\mathcal{M}$ is the time reversal of itself, which reinforces what "reversible" means.
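Both detailed balance and the time-reversal construction are mechanical to check numerically. A minimal sketch, assuming NumPy; the chain is an arbitrary irreducible (and, as it happens, non-reversible) example of mine.

```python
# Build the time reversal P_hat[x,y] = pi_y P[y,x] / pi_x and test detailed
# balance pi_x P[x,y] == pi_y P[y,x], assuming NumPy.
import numpy as np

P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# Stationary distribution as the left eigenvector for eigenvalue 1.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

P_hat = (pi[None, :] * P.T) / pi[:, None]
print(np.allclose(P_hat.sum(axis=1), 1.0))   # P_hat is stochastic
print(np.allclose(pi @ P_hat, pi))           # pi stationary for P_hat too

reversible = np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
print(reversible)   # detailed balance holds iff True (False for this P)
```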
EX 14. Check that the random walk on an $n$-cycle is time reversible iff the random walk is unbiased (the random walk has equal chance of going clockwise or counter-clockwise at every state).
KEY. Note that every column of the transition matrix for a random walk on an $n$-cycle sums to 1, i.e. $\sum_{x\in S} P_{x,y} = 1$ for all $y$. Observe that the column sum is the total probability of transition from other states into $y$, and the transition can only be from states adjacent to $y$ on the cycle. Suppose the random walk goes clockwise with probability $p$ at each state; let $y^- = (y - 1) \bmod n$ and $y^+ = (y + 1) \bmod n$; then

$\sum_{x\in S} P_{x,y} = P_{y^-,y} + P_{y^+,y} = p + (1 - p) = 1$

Thus the stationary distribution is uniform, $\pi_x = 1/n$, no matter what $p$ is. (The $n$-cycle is a 2-regular graph, and the random walk on it is irreducible.) For any two states $x, y$, if they are not adjacent on the cycle, then $\pi_x P_{x,y} = \pi_y P_{y,x} = 0$ since $P_{x,y} = P_{y,x} = 0$. If $x, y$ are adjacent and the random walk is unbiased, then

$\pi_x P_{x,y} = \frac{1}{n} \cdot \frac12 = \pi_y P_{y,x}$

However, if the random walk is biased, WLOG suppose $x$ is prior to $y$ when going clockwise, i.e. $x = (y - 1) \bmod n$; then

$\pi_x P_{x,y} = \frac{1}{n}\, p \ne \frac{1}{n}\,(1 - p) = \pi_y P_{y,x}$

which means $\pi, P$ do not satisfy detailed balance, and hence the random walk is not reversible.
Classification of States. Let $P$ be the transition matrix of a Markov chain on a finite state space $S$, and let $G = (V, E)$ be the graph representation of $P$. In previous discussions $P$ was mostly assumed irreducible, while in this section we study $P$ without such a restriction. Given any two states $x, y$, we say $y$ is accessible from $x$, written $x \to y$, if there exists an $x \to y$ path in $G$, and say $x$ and $y$ communicate if both $x \to y$ and $y \to x$, simplified as $x \leftrightarrow y$. In addition we force $x \leftrightarrow x$, i.e. we think every state communicates with itself. Define $x$ as being essential if for any $y \in S$, $x \to y$ implies $y \to x$; otherwise, $x$ is said to be inessential.

Lemma 8. $\leftrightarrow$ is an equivalence relation. Clearly $x \leftrightarrow y \Rightarrow y \leftrightarrow x$, and $x \leftrightarrow y,\ y \leftrightarrow z \Rightarrow x \leftrightarrow z$. In addition, we have forced $x \leftrightarrow x$. The equivalence classes induced by $\leftrightarrow$ are called communication classes. It is not hard to see a communication class is actually a strongly connected component of $G$.

Lemma 9. If $x$ is essential and $x \to y$, then $y$ is essential. For any state $z$ st. $y \to z$, there is a path $x, \dots, y, \dots, z$, and thus $x \to z$. Since $x$ is essential, then $z \to x$, and so there exists a path $z, \dots, x, \dots, y$, i.e. $z \to y$. It immediately follows that the states in a single communication class are either all essential or all inessential. An essential communication class is also called an essential class. It is easy to verify that a Markov chain is irreducible, i.e. $G$ is strongly connected, iff all states are essential and form a single essential class.

Now suppose a communication class $[x]$ contains only one single state, i.e. $[x] = \{x\}$. If it is inessential, then once the random walk leaves $[x]$ it never returns, and $x$ is transient, as defined before, since $\mathbb{P}_x\{\tau_x^+ < \infty\} = 0 < 1$. Actually, since the walk never returns, this case is even stronger than "transient", and we can call such states "absolutely" transient. If $[x]$ is essential, then once the random walk enters $[x]$ it never leaves again, and $x$ is of course recurrent, and also commonly called absorbing. Two examples are given below.
[Figure 5: two directed graphs on states 1 to 4.]
Figure 5. Illustration of communication classes and absorbing states. Note both graphs are not strongly connected, and hence reducible. In the first graph, there are three communication classes, [1] = {1} (inessential), [2] = {2} (inessential) and [3] = {3,4} (essential); states 1, 2 are "absolutely" transient (once the walk leaves, it never returns). In the second graph, the communication classes are the same, with [3] being inessential; state 1 is "absolutely" transient, and state 2 is absorbing (once entered, never left).
Theorem 26. More generally, given an essential class, say $[x]$, once the random walk enters $[x]$, it never leaves $[x]$ again, i.e. there is no edge from any state in $[x]$ to other states. Otherwise, suppose $\exists x' \in [x],\ y \notin [x]$ but $(x', y) \in E$; then by $x'$ being essential we have $(x', y) \in E \Rightarrow x' \to y \Rightarrow y \to x' \Rightarrow y \in [x]$, a contradiction.
A corollary is that a transition matrix restricted to an essential class is an irreducible stochastic matrix, denoted $P^{(C)}$. Let $x_1, x_2, \dots, x_k$ be the states of one essential class $C$, and $x_{k+1}, \dots, x_n$ be the rest. Clearly there are no edges from any state in $C$ to any state in $\bar C$ (but there could be edges from $\bar C$ to $C$). The indexes of the transition matrix can be rearranged to look like

$P = \begin{pmatrix} P^{(C)} & 0 \\ * & * \end{pmatrix}$

where $P^{(C)}$ is the $k \times k$ partial matrix st. $P^{(C)}_{i,j} = P_{x_i,x_j}$ for $1 \le i, j \le k$, and $\sum_j P^{(C)}_{i,j} = 1$ for each $i$; in other words, $P^{(C)}$ is indexed by the states $x_1, \dots, x_k$ in order. Since a communication class is strongly connected, $P^{(C)}$ is irreducible.
A further corollary is that given any essential class $C = \{x_1, x_2, \dots, x_k\}$ of $P$, there exists a stationary distribution $\pi$ st. $\sum_{x \in C} \pi_x = 1$. Let $\pi^{(C)}$ be the stationary distribution of $P^{(C)}$, and let $\pi = (\pi^{(C)}, \mathbf{0})$, i.e. zero on every state outside $C$; then $\pi$ is stationary, since for $x_j \in C$

$(\pi P)_{x_j} = \sum_{i \le k} \pi^{(C)}_i P_{x_i,x_j} = \big(\pi^{(C)} P^{(C)}\big)_j = \pi^{(C)}_j = \pi_{x_j}$

and for $y \notin C$, $(\pi P)_y = \sum_{i \le k} \pi^{(C)}_i P_{x_i,y} = 0 = \pi_y$, since $P_{x_i,y} = 0$.

REMARK
The swap of the components of $\pi$ and of the indexes of $P$ does not matter. Let $S_{ij}$ be the identity matrix of order $n$ with the $i$th and $j$th rows swapped. For example, if $n = 3$,

$S_{12} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Recall that given a matrix $A$ of order $n$, $S_{ij} A$ swaps the $i$th and $j$th rows of $A$, and $A S_{ij}$ swaps the $i$th and $j$th columns of $A$. We call $S_{ij}$ a swap matrix. Note $S_{ij}$ is symmetric and orthogonal: $S_{ij}^T = S_{ij}$, $S_{ij} S_{ij} = I$ and $S_{ij}^{-1} = S_{ij}$.
The rearrangement of $P$ is achieved by finitely many swaps of indexes of $P$. By a swap of indexes $i, j$ we mean a swap of the $i$th row and $j$th row, and of the $i$th column and $j$th column, at the same time; thus at every step a swap matrix is applied to $P$ on the left as well as on the right, $P' = S_{ij} P S_{ij}$. Meanwhile, $S_{ij}$ is applied to $\pi$ on the right to swap its components (for a row vector the components are "columns"), $\pi' = \pi S_{ij}$. So if $\pi P = \pi$, then

$\pi' P' = \pi S_{ij}\, S_{ij} P S_{ij} = \pi P S_{ij} = \pi S_{ij} = \pi'$

i.e. stationarity is preserved under reindexing.
Theorem 27. Every finite Markov chain has at least one essential state and hence one essential class. Given $n$ states, construct a sequence of states $y_1, y_2, \dots$ by the following algorithm (we'll see the sequence is actually finite, since the algorithm terminates).
1) Choose an inessential state $y_1$ in $S$. If there is no inessential state, then the Markov chain is irreducible and all states form a single essential class.
2) Since $y_1$ is inessential, we can choose a second state $y_2$ different from $y_1$ st. $y_1 \to y_2$ but $y_2 \nrightarrow y_1$. If $y_2$ is essential, then $[y_2]$ is our desired essential class; otherwise, continue to choose $y_3$ st. $y_2 \to y_3$ but $y_3 \nrightarrow y_2$. In general, given the latest state $y_i$, choose $y_{i+1}$ st. $y_{i+1} \ne y_i$, $y_i \to y_{i+1}$ but $y_{i+1} \nrightarrow y_i$.
Note no state will appear twice in the sequence; otherwise, suppose some state appears in the sequence as both $y_i$ and $y_j$ with $j - i > 1$; then $y_{i+1} \to y_{i+2} \to \cdots \to y_j = y_i$, contradicting $y_{i+1} \nrightarrow y_i$. As a result, since we only have finitely many states, the sequence is at most of length $n$.
1) If the algorithm terminates at some $y_k$, $k \le n - 1$, because $y_k$ is essential, then $[y_k]$ is an essential class.
2) If it terminates at $k = n$, then clearly $y_n \nrightarrow y_i$ for any $i = 1, 2, \dots, n-1$; otherwise there would be a cycle $y_{i+1} \to \cdots \to y_n \to y_i$, again contradicting $y_{i+1} \nrightarrow y_i$. Since the sequence exhausts all $n$ states, $y_n$ can reach no state other than itself, hence $[y_n] = \{y_n\}$ is an essential class.

Since every Markov chain has at least one essential class, combining the corollaries of Theorem 26, we have that a stationary distribution always exists.
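Communication classes and essential classes are straightforward to compute from the positivity pattern of $P$. A minimal sketch, assuming NumPy; the transitive-closure approach and the 4-state example are illustrative choices of mine.

```python
# Communication classes as strongly connected components of the graph of P;
# a class is essential when no edge leaves it (Theorems 26/27), assuming NumPy.
import numpy as np

def essential_classes(P):
    n = len(P)
    R = (P > 0) | np.eye(n, dtype=bool)   # reach in >= 0 steps
    for k in range(n):                    # Floyd-Warshall style transitive closure
        R |= R[:, k][:, None] & R[k, :][None, :]
    classes, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        cls = {j for j in range(n) if R[i, j] and R[j, i]}
        seen |= cls
        # essential iff every state reachable from the class lies inside it
        if all((not R[i, j]) or (j in cls) for j in range(n)):
            classes.append(sorted(cls))
    return classes

P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])
print(essential_classes(P))   # [[2, 3]]: the only essential class
```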
Theorem 28. If $\pi$ is a stationary distribution of $P$, then $\pi$ is unique iff there is only one essential class. Necessity is obvious by contraposition: suppose there are different essential classes $C_1, C_2, \dots, C_m$; then by the corollaries of Theorem 26, the distributions $\pi^{(i)}$ supported on $C_i$, i.e. with $\sum_{x \in C_i} \pi^{(i)}_x = 1$, for $i = 1, 2, \dots, m$ are all stationary and pairwise distinct. Sufficiency: if there is only one essential class $C$, then $\pi^{(C)}$ is unique by the irreducibility of $P^{(C)}$ and Theorem 19, and by Theorem 29 below any stationary distribution vanishes on the inessential states, so $\pi = (\pi^{(C)}, \mathbf{0})$ is unique as well.
Theorem 29. If $\pi$ is a stationary distribution of $P$, then $\pi_x = 0$ whenever $x$ is an inessential state. In the proof of Theorem 27, given an arbitrary inessential state $y_1$, a finite sequence $y_1 \to y_2 \to \cdots \to y_k$ was constructed, where $y_k$ belongs to some essential class $C$. Then

$\sum_{x\in C} \pi_x = \sum_{x\in C} (\pi P)_x = \sum_{x\in C} \sum_{y\in S} \pi_y P_{y,x} = \sum_{y\in C} \pi_y \sum_{x\in C} P_{y,x} + \sum_{x\in C} \sum_{y\notin C} \pi_y P_{y,x} = \sum_{y\in C} \pi_y + \sum_{x\in C} \sum_{y\notin C} \pi_y P_{y,x}$

The last identity is due to $\sum_{x\in C} P_{y,x} = 1$ for any $y \in C$, which holds by Theorem 26 since no edges leave $C$. Since $\pi$ is assumed stationary, we conclude

$\sum_{x\in C} \sum_{y\notin C} \pi_y P_{y,x} = 0$

Consequently $\pi_y P_{y,x} = 0$ for every $y \notin C$, $x \in C$, i.e. $\pi_y = 0$ whenever $y \notin C$ has an edge into $C$. Now observe a backward induction pattern: if $\pi_w = 0$ for some state $w$, then

$0 = \pi_w = (\pi P)_w = \sum_y \pi_y P_{y,w}$

forces $\pi_y = 0$ for every in-neighbor $y$ of $w$ ($P_{y,w} > 0$), since every term is non-negative. Walking backward along the path $y_1 \to y_2 \to \cdots \to y_k \in C$, the state just before the path enters $C$ gets $\pi = 0$ by the first step, and each earlier state on the path gets $\pi = 0$ in turn, all the way leading to $\pi_{y_1} = 0$, which completes the proof since $y_1$ is an arbitrary inessential state.
REMARK
Recall the fact that the stationary distribution describes the long-run transitions, i.e. $\mu P^t \to \pi$ (for an irreducible aperiodic chain) if $\pi$ is stationary. Together with the theorems in this section, this explains intuitively why a stationary distribution cannot give any chance to inessential states: suppose an inessential state $x$ is given a chance to be the initial state by $\mu$, i.e. $\mu_x > 0$. Since the random walk is going to be absorbed into one of the essential classes sooner or later, $(\mu P^t)_x \to 0 \ne \mu_x$, so $\mu$ cannot be stationary.
Take the following for example: there are two essential classes $[3] = \{3, 4, 5\}$ and $[6] = \{6, 7, 8, 9\}$. Eventually the random walk will be absorbed into either $[3]$ or $[6]$, and will never return to state 1 or 2. As a result, $(\mu P^t)_1 \to 0$ and $(\mu P^t)_2 \to 0$. If $\mu_1 \ne 0$ or $\mu_2 \ne 0$, then $\mu$ is not stationary, since it does not describe the long-term transitions of the Markov chain.

[Figure: a 9-state chain whose inessential states 1 and 2 lead into the essential classes {3,4,5} and {6,7,8,9}.]
EX 15. Show that if a Markov chain has two one-way connected communication classes $A$, $B$ (say there is an edge from $A$ to $B$ but no edge from $B$ to $A$), then the Markov chain cannot be reversible. Give an example of a reducible yet reversible Markov chain.
KEY. Suppose $x \in A$, $y \in B$, and $(x, y) \in E$ for some $x, y$. Then $X_0 = x, X_1 = y$ is a possible realization for $t = 1$, i.e. $\mathbb{P}\{X_0 = x, X_1 = y\} = \pi_x P_{x,y} > 0$ (provided the initial distribution satisfies $\pi_x > 0$). However, $\mathbb{P}\{X_0 = y, X_1 = x\} = \pi_y P_{y,x} = 0$, since there is no edge from $B$ to $A$. Thus the Markov chain is not reversible.
For a reducible Markov chain, if all communication classes are disconnected from each other, and each class is reversible in itself, then the whole Markov chain is reversible. The only case of concern is when two states $x, y$ come from different classes; however then

$\mathbb{P}\{X_0 = x, X_1 = y\} = \mathbb{P}\{X_0 = y, X_1 = x\} = 0$

since $P_{x,y} = P_{y,x} = 0$.