
Introduction to Markov Chains
Learning goal: Students see the tip of the iceberg of Markov chain theory.
Many situations can be modeled as a number of discrete states, where at fixed time
intervals the system switches from one state to another with a fixed probability. Our
people-moving-into-and-out-of-California system is one such. Another is the “gambler’s ruin,”
where two gamblers, with initial bankrolls of $A and $B, each bet a dollar on the flip of a coin.
They keep this up until one gambler has all the other’s money. What is the probability that each
will win, and how long will it take, on average?
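Before developing the theory, a quick simulation can make these questions concrete. The following Monte Carlo sketch is an added illustration, not part of the classical development; the stakes A = 3 and B = 7 and the number of trials are arbitrary choices. (For a fair coin, the known answers are A/(A + B) for the win probability and AB for the expected duration.)

    import random

    def gamblers_ruin(a, b):
        # One game: gambler one starts with a dollars, gambler two with b.
        # Each round a fair coin flip moves one dollar between them.
        total = a + b
        steps = 0
        while 0 < a < total:
            a += 1 if random.random() < 0.5 else -1
            steps += 1
        return a == total, steps  # (did gambler one win?, rounds played)

    trials = 100_000
    results = [gamblers_ruin(3, 7) for _ in range(trials)]
    print(sum(w for w, _ in results) / trials)  # about 0.3; theory: A/(A+B)
    print(sum(s for _, s in results) / trials)  # about 21;  theory: A*B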
Such processes are called Markov processes or chains. We number the states from 1 to n
(e.g. s_1 = “in California”, s_2 = “outside CA”; or s_1 = gambler one has all the money, s_2 = gambler
one has $A + B − 1 and gambler two has $1, etc.). We define p_{ij} to be the probability of passing
from state i to state j. So we create the transition matrix P = (p_{ij}). For the gambler’s ruin, the
matrix will look like

\[ P = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 \\
1/2 & 0 & 1/2 & 0 & \cdots & 0 \\
0 & 1/2 & 0 & 1/2 & \cdots & 0 \\
0 & 0 & 1/2 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}. \]

The usual situation gives us a probability vector v that gives the initial probability of being in each state. The standard is to
have this be a row vector. Then the probability vector after one time step is vP. The probability
of moving from state i to state j in two steps is (P^2)_{ij}, etc.
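As a concrete sketch (my own illustration, with a $10 total chosen arbitrarily), here is the gambler’s ruin transition matrix in numpy, indexed by gambler one’s bankroll so that the $0 and $10 states are the absorbing ones:

    import numpy as np

    def gamblers_ruin_matrix(total):
        # States 0..total = gambler one's current bankroll.
        # States 0 and total are absorbing; all others move +-1 with prob 1/2.
        P = np.zeros((total + 1, total + 1))
        P[0, 0] = 1.0
        P[total, total] = 1.0
        for i in range(1, total):
            P[i, i - 1] = 0.5
            P[i, i + 1] = 0.5
        return P

    P = gamblers_ruin_matrix(10)
    v = np.zeros(11)
    v[3] = 1.0                                # start: gambler one holds $3
    print(v @ P)                              # distribution after one step
    print(v @ np.linalg.matrix_power(P, 50))  # distribution after 50 steps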
We ask several questions:
1. In the long term, what is the probability of being in each state?
There may be absorbing states, from which you can’t escape (e.g. one gambler has all the
money). In these cases:
2. If there are several absorbing states, what is the probability of ending in each?
3. On average, how long until we are absorbed?
4. On average, how many times do we visit any particular state before being absorbed?
To get started we notice several important things about the transition matrix.
1. All of its entries are non-negative, because they are probabilities.
2. Each row adds to one (since the probability is one that from state i you will go somewhere on your next turn!).
3. P has an eigenvalue of one: since all rows add to one, (1, 1, …, 1) is an eigenvector of eigenvalue one.
4. No eigenvalue of P is larger than one in modulus. For let λ be an eigenvalue (which might be complex) with corresponding eigenvector x, and let x_k be the largest component of x (in modulus). Then since Px = λx, we have \( \lambda x_k = \sum_{j=1}^{n} p_{kj} x_j \), so

\[ |\lambda|\,|x_k| = |\lambda x_k| = \Big| \sum_{j=1}^{n} p_{kj} x_j \Big| \le \sum_{j=1}^{n} p_{kj}\,|x_j| \le \sum_{j=1}^{n} p_{kj}\,|x_k| = |x_k|, \]

so |λ| ≤ 1.
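As a quick numeric sanity check (an added illustration, reusing the gamblers_ruin_matrix helper sketched earlier), one can verify these facts directly:

    import numpy as np

    P = gamblers_ruin_matrix(10)       # helper from the earlier sketch
    print(P.sum(axis=1))               # every row sums to 1
    print(np.abs(np.linalg.eigvals(P)).max())  # no eigenvalue exceeds 1 in modulus
    print(P @ np.ones(11))             # (1, ..., 1) is an eigenvector for eigenvalue 1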
Here’s where we break the theory into three parts. One is where there is complete
“mixing” of all the states. That is, there is some positive probability of getting from any state to
any other, and in fact some positive power of the matrix P has all positive entries. Such a
matrix is called “regular” or “primitive” (a slightly stronger condition than “irreducible,” which permits
periodic behavior); if P itself has all positive entries we call it “positive.” The second case is where some state or states are
“absorbing” in the sense that once you get there you can never get out. The row in P
corresponding to such a state is just a one on the diagonal and all zeroes elsewhere. The third
case is all others: things like “inaccessible” states, or groups of states that can’t reach each
other, or loops that absorb. These last cases can sometimes be analyzed by reducing to one of
the former cases, but their complete analysis is much more complex.
The first case is also called the ergodic case. Let P be a positive matrix and v a nonzero non-negative vector (all entries ≥ 0). Then vP is positive. Similarly, if P^n is positive, then so is every larger power.
There is the following theorem, called the Perron-Frobenius Theorem:
if A is a positive matrix, then A has a unique eigenvalue of largest modulus, which is real and
positive and has algebraic multiplicity one. Furthermore, its eigenvector is positive, and no
eigenvector belonging to any other eigenvalue is non-negative.
In our case, the Perron-Frobenius eigenvalue is λ = 1. We have already proven that
|λ| ≤ 1 for every eigenvalue, so it remains to show that 1 is the only eigenvalue of modulus one. But if P
had another eigenvalue of modulus one, then some power of P would have an eigenvalue of modulus one with negative real
part. Then for small ε > 0, the matrix P^n − εI has an eigenvalue whose modulus is larger than one
(subtracting ε from a number of modulus one with negative real part pushes it farther from the origin). But P^n − εI still has non-negative entries
and all its rows add to 1 − ε, so by the argument above it cannot have an eigenvalue larger than one
in modulus, a contradiction.
There can be no eigenvectors for λ = 1 other than multiples of (1, 1, …, 1). For if w were an independent one, we
could choose ε so that z = (1, 1, …, 1) − εw is positive except for one entry, which is zero. But
then z would be an eigenvector of eigenvalue one, yet Pz has all positive entries while z has a zero entry, so Pz ≠ z
and z cannot be an eigenvector after all! Thus 1 is the only eigenvalue of modulus one, and it has geometric
multiplicity one also. Could it have larger algebraic multiplicity?
If it did, then the triangular matrix obtained from P by Schur’s lemma would look like

\[ \begin{pmatrix} 1 & a & * & * \\ & 1 & * & * \\ & & * & * \\ & & & * \end{pmatrix} \quad \text{(with } a \ne 0 \text{, since the geometric multiplicity is one),} \]

and the kth power of it would be

\[ \begin{pmatrix} 1 & ka & * & * \\ & 1 & * & * \\ & & * & * \\ & & & * \end{pmatrix}. \]

Since this entry grows without bound, the powers of the triangular matrix are unbounded; but every entry of P^k lies between 0 and 1, and a unitary change of basis preserves boundedness, so this is impossible.
(The general case of Perron-Frobenius is a bit harder to prove—we had the advantage
that all rows added to the same thing!)
Take any matrix A and any unit vector v_0. Create the sequence v_{n+1} = Av_n / ||Av_n||.
This gives a sequence of unit vectors (unless v_0 is in the nullspace of some power of A). In
most cases, this sequence will converge to an eigenvector of A corresponding to the largest (in
modulus) eigenvalue: we can express v_0 as a linear combination of
eigenvectors, and as we multiply by higher and higher powers of A, the contributions of the smaller eigenvalues
become insignificant. This is known as the power method for finding eigenvectors.
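Here is a minimal sketch of the power method in numpy (an added illustration; the tolerance and iteration cap are arbitrary choices, and the simple convergence test assumes the dominant eigenvalue is positive, as in our case):

    import numpy as np

    def power_method(A, v0, tol=1e-12, max_iter=10_000):
        # Repeatedly apply A and renormalize; under the assumptions in the
        # text (a simple, positive dominant eigenvalue) this converges to
        # the corresponding eigenvector.
        v = v0 / np.linalg.norm(v0)
        for _ in range(max_iter):
            w = A @ v
            norm = np.linalg.norm(w)
            if norm == 0:
                raise ValueError("v0 lies in the nullspace of some power of A")
            w = w / norm
            if np.linalg.norm(w - v) < tol:
                return w
            v = w
        return v

Applied to P^T with a non-negative starting vector, this produces the positive steady-state eigenvector discussed next.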
In our case, since the largest eigenvalue (1) has algebraic multiplicity one, this sequence
will converge. If we start with a nonzero non-negative vector v_0, then P^T v_0 will be positive, so our Perron-Frobenius eigenvector will be positive.
What does all this mean? It means that if you start with any probability distribution and
run the Markov process long enough, the distribution will converge to the eigenvector for λ = 1
(normalized to sum to one). Thus P^n converges to a matrix all of whose rows are this eigenvector, and this
eigenvector is the “steady state” probability of being in each state. Of course, it can most easily
be found by finding a null vector of P^T − I.
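A sketch of finding the steady state as a null vector of P^T − I, using scipy’s null_space (the three-state matrix here is made up purely for illustration):

    import numpy as np
    from scipy.linalg import null_space

    # A made-up regular transition matrix (each row sums to one).
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.4, 0.4]])

    # Solve (P^T - I) x = 0, then normalize so the entries sum to one.
    ns = null_space(P.T - np.eye(3))
    steady = ns[:, 0] / ns[:, 0].sum()
    print(steady)
    print(steady @ P)  # unchanged by one more step: equals steady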
Now what about absorbing Markov processes? The first thing to do is to re-order all the
states so that the absorbing states come last. Then the Markov matrix has the form

\[ P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}. \]

Q is a non-negative matrix that gives the probabilities of transitioning from one transient state to
another, R is a non-negative matrix that gives the probabilities of transitioning from each
transient state to one of the absorbing states, and I is the identity, because you don’t move from
the absorbing states.
It turns out Q will help us answer all our questions.
First, the probability that, if we start in transient state i, we are in transient state j after exactly n steps is the
ij-entry of Q^n. That means the expected number of visits to state j, given that we started in state i,
is the ij-entry of I + Q + Q^2 + Q^3 + ⋯. But this infinite series converges! And it happens to
converge to (I − Q)^{-1} = N.
The expected number of steps before being absorbed (when starting in state i) is the ith
entry of N1, where 1 is the vector of all ones. This is because each entry of N1 is simply the sum
of the expected number of times we visit each state before being absorbed.
Finally, the probability of being absorbed in absorbing state k, given that you start in
transient state i, is the ik-entry of NR.
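Putting this together for the gambler’s ruin with, say, $3 against $7 (an added sketch; the state ordering by bankroll and the stakes are my choices):

    import numpy as np

    A, B = 3, 7
    total = A + B
    t = total - 1                     # transient states: bankrolls $1 .. $(total-1)

    # Q: transient-to-transient transitions.
    Q = np.zeros((t, t))
    for i in range(t):
        if i > 0:
            Q[i, i - 1] = 0.5
        if i < t - 1:
            Q[i, i + 1] = 0.5

    # R: transient-to-absorbing transitions (ruin at $0, win at $total).
    R = np.zeros((t, 2))
    R[0, 0] = 0.5                     # from $1, lose the last dollar
    R[t - 1, 1] = 0.5                 # from $(total-1), win the last dollar

    N = np.linalg.inv(np.eye(t) - Q)  # the matrix N = (I - Q)^(-1)

    start = A - 1                     # row index for starting bankroll $A
    print(N.sum(axis=1)[start])       # expected steps to absorption: A*B = 21
    print((N @ R)[start])             # absorption probabilities: [0.7, 0.3]

The output matches the classical answers quoted earlier: gambler one wins with probability A/(A + B) = 0.3 and the game lasts AB = 21 flips on average.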