
CHAPTER 5
Strong Stationary Times
5.1. The top-to-random shuffle
Consider the following (slow) method of shuffling a deck of n cards: Take the top card and place
it uniformly at random in the deck. The following exercise proves that this method of shuffling will
eventually mix up the deck:
EXERCISE 5.1. Show that the top-to-random shuffle just described is a Markov chain with
stationary distribution uniform on the n! card arrangements.
FIGURE 5.1. The top-to-random shuffle: the next card is placed in one of the slots under the original bottom card.
How long must we shuffle using this method until the arrangement of the deck is close to
random?
Let T be the time one move after the first occasion when the original bottom card has moved to
the top of the deck. We show now that the arrangement of cards at time T is distributed uniformly
on the set of all permutations of {1, . . . , n}. More generally, we argue that when there are k cards
under the original bottom card, then all k! orderings of these k cards are equally likely.
This is seen by induction. When k = 1, the conclusion is obvious. Suppose that there are
(k − 1) cards under the original bottom card, and that each of the (k − 1)! arrangements is equally
probable. The next card to be placed under the original bottom card is equally likely to land in any
of the k possible positions, and by hypothesis, the remaining (k − 1) cards are in random order. We
conclude that all k! arrangements are equally likely.
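To make the argument concrete, here is a minimal simulation sketch in Python (the function name and the card labelling are ours, not from the text): it runs the top-to-random shuffle until one move after the original bottom card reaches the top and returns the arrangement at that time T. Tabulating the result over many trials should give roughly equal counts for all n! arrangements, in line with the induction above.

```python
import random
from collections import Counter

def top_to_random_until_T(n, rng=random):
    """Run the top-to-random shuffle until one move after the original bottom
    card (labelled n - 1) first reaches the top; return (deck, T)."""
    deck = list(range(n))                     # deck[0] is the top card
    original_bottom = deck[-1]
    t = 0
    while True:
        t += 1
        card = deck.pop(0)                    # take the top card
        deck.insert(rng.randrange(n), card)   # reinsert it uniformly at random
        if card == original_bottom:           # the move right after it reached the top
            return deck, t                    # the arrangement at time T

# For n = 4 there are 4! = 24 arrangements; the counts should be roughly equal.
counts = Counter(tuple(top_to_random_until_T(4)[0]) for _ in range(24000))
print(len(counts), "arrangements,", min(counts.values()), "to", max(counts.values()), "occurrences each")
```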
It should not be too surprising that bounding the size of T (in distribution) gives information
about how long this shuffle should be performed to truly mix the cards. T is an example of a
stationary time: the law of the chain at time T is the stationary distribution. In this chapter, strong
stationary times will be used as a tool for bounding the rate of convergence of Markov chains.
Before we discuss stationary times, we must first understand stopping times.
5.2. Stopping in the stationary distribution
5.2.1. Stopping Times. A friend gives you directions to his house, telling you to take Main
street and to turn left at the first street after City Hall. These are acceptable directions, because you
are able to determine when to turn using landmarks you have already encountered before the turn.
This is an example of a stopping time, which is an instruction for when to “stop” depending only
on information up until the turn.
On the other hand, his roommate also provides directions to the house, telling you to take Main
street and turn left at the last street before you reach a bridge. You have never been down Main
street, so not knowing where the bridge is located, you unfortunately must drive past the turn before
you can identify it. Once you reach the bridge, you must backtrack. This is not a stopping time,
because you must go past the turn before identifying it.
We now provide a precise definition for a stopping time. As usual, throughout we assume
(Xt )t≥0 is a finite, irreducible Markov chain with transition matrix P and stationary distribution π.
A stopping time is a random time T with values in {0, 1, 2, . . .} ∪ {∞} such that the event {T = t}
depends only on the values of the chain up to time t. More precisely, for each t there is a function
f_t : Ω^{t+1} → {0, 1} so that
\[ \mathbf{1}\{T = t\} = f_t(X_0, X_1, X_2, \ldots, X_t). \tag{5.1} \]
(1A denotes the indicator random variable for the event A, i.e. the {0, 1}-valued random variable
which equals 1 if and only if A occurs.)
For example, if τA is the first time the chain takes on a value in a subset A of Ω, then τA is a
stopping time. The history (X0 , X1 , . . . , Xt ) up to time t suffices to determine whether the chain is in
the set A for the first time. More precisely,
\[ \mathbf{1}\{\tau_A = t\} = \mathbf{1}\{X_0 \notin A,\ X_1 \notin A,\ \ldots,\ X_{t-1} \notin A,\ X_t \in A\}. \tag{5.2} \]
The right-hand side is a function of (X0 , X1 , . . . , Xt ), so τA satisfies the definition (5.1).
An example of a random time which is not a stopping time is the first time that the chain reaches
its maximum value over a time interval {0, 1, . . . , t_f}:
\[ M = \min\Bigl\{\, t : X_t = \max_{1 \le s \le t_f} X_s \Bigr\}. \tag{5.3} \]
It is impossible to check whether M = t by looking only at the first t values of the chain. Indeed,
any investor hopes to sell a stock at the time M when it achieves its maximum value. Alas, this
would require clairvoyance (the ability to see into the future), and M is not a stopping time.
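The contrast can also be seen in code. In the following minimal sketch (the chain, a simple walk on a ten-cycle, and the helper names are ours and purely illustrative), the hitting time τ_A is decided online from the trajectory seen so far, while M can only be found after the entire path up to t_f is known.

```python
import random

def hitting_time(step, x0, A, t_max):
    """tau_A: a stopping time, decided online from X_0, ..., X_t alone."""
    x = x0
    for t in range(t_max + 1):
        if x in A:
            return t
        x = step(x)
    return None                                # A not reached within the horizon

def first_argmax_time(step, x0, t_f):
    """M of (5.3): needs the whole path up to t_f, so it is not a stopping time."""
    path = [x0]
    for _ in range(t_f):
        path.append(step(path[-1]))
    m = max(path[1:])
    return next(t for t, x in enumerate(path) if x == m)

walk = lambda x: (x + random.choice([-1, 1])) % 10   # a walk on a ten-cycle, for illustration
print(hitting_time(walk, 0, {5}, 10_000), first_argmax_time(walk, 0, 100))
```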
5.2.2. Achieving equilibrium. A strong stationary time for a Markov chain (X_t) is a stopping
time T such that X_T, the chain sampled at time T, has two properties: first, the law of X_T is exactly the
stationary distribution of the chain, and second, the value of X_T is independent of T. That is, for all
t = 0, 1, 2, . . .,
\[ \mathbf{P}\{X_t = x,\ T = t\} = \pi(x)\,\mathbf{P}\{T = t\}. \tag{5.4} \]
Strong stationary times were introduced in Aldous and Diaconis (1987); see also Aldous and Diaconis (1986).
We will later need the following strengthening of the defining equation (5.4):
EXERCISE 5.2. Show that if T is a strong stationary time, then
\[ \mathbf{P}\{X_t = x,\ T \le t\} = \pi(x)\,\mathbf{P}\{T \le t\}. \tag{5.5} \]
5.3. Bounding convergence using strong stationary times
Throughout this section, we discuss a Markov chain (Xt ) with transition matrix P and stationary
distribution π. The route from strong stationary times to bounding convergence time is the following
proposition:
PROPOSITION 5.1. If T is a strong stationary time, then
\[ \max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \le \max_{x \in \Omega} \mathbf{P}_x\{T > t\}. \tag{5.6} \]
Moreover, the right-hand side in (5.6) determines the rate of geometric convergence to zero of the
left-hand side: if $\max_{x \in \Omega} \mathbf{P}_x\{T > t_0\} \le \varepsilon$, then
\[ \max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \le \varepsilon^{t/t_0}. \tag{5.7} \]
We break the proof into several lemmas. It will be convenient to introduce a parameter s(t),
called separation distance and defined by
\[ s(t) := \max_{x, y \in \Omega}\Bigl[\,1 - \frac{P^t(x, y)}{\pi(y)}\Bigr]. \tag{5.8} \]
The relationship between s(t) and T is:
LEMMA 5.2. If T is a strong stationary time, then
\[ s(t) \le \max_{x \in \Omega} \mathbf{P}_x\{T > t\}. \tag{5.9} \]
PROOF. Observe that for any x, y ∈ Ω,
\[ 1 - \frac{P^t(x, y)}{\pi(y)} = 1 - \frac{\mathbf{P}_x\{X_t = y\}}{\pi(y)} \le 1 - \frac{\mathbf{P}_x\{X_t = y,\ T \le t\}}{\pi(y)}. \tag{5.10} \]
By Exercise 5.2, the right-hand side is bounded above by
\[ 1 - \frac{\pi(y)\,\mathbf{P}_x\{T \le t\}}{\pi(y)} = \mathbf{P}_x\{T > t\}. \tag{5.11} \]
□
The next lemma, together with Lemma 5.2, proves (5.6).
LEMMA 5.3. Total variation is bounded by separation distance:
\[ \max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \le s(t). \]
PROOF. Writing
\[ \|P^t(x, \cdot) - \pi\|_{TV} = \sum_{\substack{y \in \Omega \\ P^t(x,y) < \pi(y)}} \bigl[\pi(y) - P^t(x, y)\bigr] \tag{5.12} \]
\[ = \sum_{\substack{y \in \Omega \\ P^t(x,y) < \pi(y)}} \pi(y)\Bigl[1 - \frac{P^t(x, y)}{\pi(y)}\Bigr], \tag{5.13} \]
we conclude that
\[ \|P^t(x, \cdot) - \pi\|_{TV} \le \max_{y \in \Omega}\Bigl[1 - \frac{P^t(x, y)}{\pi(y)}\Bigr] \le s(t). \tag{5.14} \]
□
The bound (5.7) follows from the submultiplicative property of s(t), which the following exercise establishes.
EXERCISE 5.3. Let s(t) be defined as in (5.8).
(a) Show that there is a stochastic matrix Q_t so that P^t(x, ·) = [1 − s(t)] π + s(t) Q_t(x, ·), and π Q_t = π.
(b) Using the representation in (a), show that
\[ P^{t+u}(x, y) = [1 - s(t)s(u)]\,\pi(y) + s(t)s(u) \sum_{z \in \Omega} Q_t(x, z)\, Q_u(z, y). \tag{5.15} \]
(c) Using (5.15), establish that s is submultiplicative: s(t + u) ≤ s(t) s(u).
PROOF OF (5.7). By Exercise 5.3,
\[ s(t) = s\Bigl(t_0 \cdot \frac{t}{t_0}\Bigr) \le s(t_0)^{t/t_0}. \]
Since s(t_0) ≤ ε by hypothesis, applying Lemma 5.3 establishes (5.7). □
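For a small chain all of these quantities can be computed exactly from the transition matrix. The following sketch (plain NumPy; the three-state matrix is an arbitrary illustrative choice) evaluates max_x ‖P^t(x,·) − π‖_TV and s(t) directly from the definitions, so Lemma 5.3 and the submultiplicativity of Exercise 5.3(c) can be checked numerically.

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],     # an arbitrary irreducible three-state chain
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: the left eigenvector of P with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

def max_tv(Pt):   # max_x || P^t(x, .) - pi ||_TV
    return 0.5 * np.abs(Pt - pi).sum(axis=1).max()

def sep(Pt):      # s(t) = max_{x,y} [ 1 - P^t(x, y) / pi(y) ]
    return (1 - Pt / pi).max()

powers = {t: np.linalg.matrix_power(P, t) for t in range(1, 10)}
for t in range(1, 5):
    assert max_tv(powers[t]) <= sep(powers[t]) + 1e-12                        # Lemma 5.3
    for u in range(1, 5):
        assert sep(powers[t + u]) <= sep(powers[t]) * sep(powers[u]) + 1e-12  # Exercise 5.3(c)
print("separation distances:", [round(sep(powers[t]), 4) for t in range(1, 6)])
```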
5.4. Two glued complete graphs
Consider the graph G obtained by taking two complete graphs on n vertices and “gluing” them
together at a single vertex. See Figure 5.2 for an illustration. We analyze here a slightly modified
simple random walk on G.
Let v⋆ be the vertex where the two complete graphs meet. When away from v⋆, with probability
$\frac{1}{2}\bigl(1 - \frac{1}{2n-1}\bigr)$ choose a neighbor uniformly at random. With the remaining probability, remain at
the current vertex. When at v⋆, the walk remains at v⋆ with probability (2n − 1)^{−1}, and with the
remaining probability chooses among the 2n − 2 neighbors uniformly at random. This modification
to simple random walk makes the stationary distribution uniform on the vertices of G. (Note that
for simple random walk, the stationary distribution at a vertex is proportional to its degree, and so
v⋆ carries extra mass under simple random walk.)
EXERCISE 5.4. Check that the stationary distribution for the walk above is uniform on all
2n − 1 vertices.
FIGURE 5.2. Two complete graphs (on 7 vertices), “glued” at a single vertex.
It is clear that when at v⋆, the next move is equally likely to be any of the 2n − 1 vertices. For
this reason, if T is the time one step after v⋆ has been visited for the first time, then T is a strong
stationary time.
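A short simulation sketch (Python; the vertex labelling and function names are ours) implements the modified walk described above, with v⋆ labelled 0, and estimates E(T) empirically. It can also be used to check Exercise 5.4 by tallying visit frequencies over a long run.

```python
import random

def glued_step(v, n, rng=random):
    """One step of the modified walk on two copies of K_n glued at vertex 0 (= v*).
    Vertices 1, ..., n-1 form one clique and n, ..., 2n-2 the other."""
    if v == 0:
        if rng.random() < 1 / (2 * n - 1):
            return 0                                   # hold at v*
        return rng.randrange(1, 2 * n - 1)             # uniform over the 2n - 2 neighbors
    if rng.random() < 0.5 * (1 - 1 / (2 * n - 1)):      # move to a uniformly chosen neighbor
        clique = range(1, n) if v < n else range(n, 2 * n - 1)
        return rng.choice([0] + [u for u in clique if u != v])
    return v                                            # otherwise stay put

def strong_time(n, start, rng=random):
    """T = one step after the first visit to v*."""
    v, t = start, 0
    while True:
        t += 1
        previous, v = v, glued_step(v, n, rng)
        if previous == 0:
            return t

n, trials = 10, 20000
print(sum(strong_time(n, n - 1) for _ in range(trials)) / trials, "vs. 2n =", 2 * n)
```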
When the walk is not at v⋆, the chance of moving to v⋆ is
\[ \frac{1}{2}\Bigl(1 - \frac{1}{2n-1}\Bigr)\frac{1}{n-1} = \frac{1}{2n-1}. \]
The time T_{v⋆} until first visiting v⋆ is then a geometric random variable with
\[ \mathbf{E}(T_{v_\star}) = 2(n-1)\cdot\frac{2n-1}{2n-2} = 2n - 1. \tag{5.16} \]
We conclude that
\[ \mathbf{E}(T) = \mathbf{E}(T_{v_\star}) + 1 \le 2n. \tag{5.17} \]
Using Markov’s inequality and (5.17) shows that
\[ \mathbf{P}_x\{T \ge t\} \le \frac{2n}{t}. \tag{5.18} \]
Taking t = 2en in (5.18) shows that P_x{T ≥ t} ≤ e^{−1}, and applying Proposition 5.1 shows that
\[ \max_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \le \exp\bigl(-t/(2en)\bigr). \tag{5.19} \]
After order n steps, the distance to stationarity is small. In fact, at least order n steps are required:
EXERCISE 5.5. By considering the set A ⊂ Ω of all vertices in one of the two complete graphs,
show that if t = cn for small c, the distance max_{x∈Ω} ‖P^t(x, ·) − π‖_{TV} is close to 1/2.
FIGURE 5.3. The 3-dimensional hypercube.
5.5. Random walk on the hypercube
We have already met the walk on the hypercube in Section 4.5; here we approach its mixing
time via strong stationary times. The coordinate-by-coordinate coupling mentioned in our previous
discussion and a natural strong stationary time are closely related. The coupon collector’s time
dominates the coupling time, and has the same distribution as our strong stationary time.
Let us restate the definition: Let Ω = {0, 1}^n be the collection of all binary words of length n.
We define the Hamming distance between x, y ∈ Ω as the number of coordinates in which x and y
differ:
\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i|. \tag{5.20} \]
The n-dimensional hypercube is the graph with vertex set Ω and with edges connecting vertices
having Hamming distance 1.
Simple random walk on the hypercube moves from a vertex by choosing one of its coordinates
at random and changing the corresponding bit.
We will instead study the lazy random walk on the cube. It can be executed as follows: At
each step, one of the n coordinates is selected. The bit at this coordinate is then replaced by an
independent fair bit.
Let T be the first time that all n coordinates have been selected. Since by time T each coordinate
bit has been replaced by an independent fair coin toss, the distribution of the chain at time T is
uniform on {0, 1}^n. This shows that T is a strong stationary time.
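Here is a minimal sketch of the lazy walk together with this strong stationary time (the function name is ours). It refreshes one uniformly chosen coordinate per step with an independent fair bit and reports the first time every coordinate has been selected.

```python
import random

def lazy_cube_until_refresh(n, rng=random):
    """Lazy random walk on {0,1}^n started from all ones: each step a uniformly
    chosen coordinate is replaced by an independent fair bit.  Returns (X_T, T),
    where T is the first time every coordinate has been selected."""
    x = [1] * n
    refreshed, t = set(), 0
    while len(refreshed) < n:
        t += 1
        i = rng.randrange(n)
        x[i] = rng.randint(0, 1)
        refreshed.add(i)
    return tuple(x), t

# E(T) should be close to n * (1 + 1/2 + ... + 1/n), roughly n log n.
n, trials = 16, 5000
print(sum(lazy_cube_until_refresh(n)[1] for _ in range(trials)) / trials)
```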
The next section gives some details about the classical “coupon collecting problem”, which will
be needed to analyze the distribution of T .
5.5.1. Coupon collecting. A card company issues baseball cards, each featuring a single player.
There are n players total, and a collector desires a complete set. We suppose each card he acquires
is equally likely to be each of the n players. How many cards must he obtain so that his collection
contains all n players? The answer is that after obtaining n log n cards, he is likely to have a full
collection.
Let T be the (random) number of cards collected when the set first contains every player. The
expectation E(T ) can be computed by writing T as a sum of geometric random variables. Let Tk be
the total number of cards accumulated when the collection first contains k distinct players. Then
T = Tn = T1 + (T2 − T1 ) + · · · + (Tn − Tn−1 ).
(5.21)
Furthermore, T_k − T_{k−1} is a geometric random variable with success probability (n − k + 1)/n: after
collecting T_{k−1} cards, n − k + 1 of the n players are missing from the collection. Each subsequent
card drawn has the same probability (n − k + 1)/n of being a player not already collected, until such
a card is finally drawn. Thus E(T_k − T_{k−1}) = n/(n − k + 1) and
\[ \mathbf{E}(T) = \sum_{k=1}^{n} \mathbf{E}(T_k - T_{k-1}) = n \sum_{k=1}^{n} \frac{1}{n-k+1} = n \sum_{k=1}^{n} \frac{1}{k}. \tag{5.22} \]
EXERCISE 5.6. By comparing the integral of 1/x with its Riemann sums, show that
\[ \log n \le \sum_{k=1}^{n} k^{-1} \le \log n + 1. \tag{5.23} \]
Applying Markov’s inequality, for any c > 0,
\[ \mathbf{P}\{T > cn(\log n + 1)\} \le \frac{\mathbf{E}(T)}{cn(\log n + 1)}. \]
Using Exercise 5.6 and (5.22), the right-hand side is not larger than 1/c, showing that
\[ \mathbf{P}\{T > cn(\log n + 1)\} \le \frac{1}{c}. \]
This can be improved without much work. Let Ai be the event that the i-th player does not appear
among the first n log n + cn cards drawn. Then
\[ \mathbf{P}\{T > n\log n + cn\} = \mathbf{P}\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) \le \sum_{i=1}^{n} \mathbf{P}(A_i) = \sum_{i=1}^{n} \Bigl(1 - \frac{1}{n}\Bigr)^{n\log n + cn} \le n \exp\Bigl(-\frac{n\log n + cn}{n}\Bigr) = e^{-c}. \tag{5.24} \]
The bound (5.24) quantifies the statement that obtaining n log n cards is sufficient to acquire a
complete set.
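A quick simulation (hypothetical function name; n = 50 and c = 1 are arbitrary illustrative choices) can be used to compare the empirical mean of T with n ∑ 1/k from (5.22) and the empirical tail probability with the bound e^{−c} of (5.24).

```python
import math
import random

def coupon_time(n, rng=random):
    """Number of uniform draws from {0, ..., n-1} until every value has appeared."""
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

n, c, trials = 50, 1.0, 20000
times = [coupon_time(n) for _ in range(trials)]
threshold = n * math.log(n) + c * n
print("mean of T:", sum(times) / trials,
      "   n * H_n:", n * sum(1 / k for k in range(1, n + 1)))           # (5.22)
print("P{T > n log n + cn}:", sum(t > threshold for t in times) / trials,
      "   bound e^{-c}:", math.exp(-c))                                  # (5.24)
```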
5.5.2. Upper bound. We now can quite easily bound the distance to stationarity for the lazy
random walker on the hypercube. The time T until all coordinates have been updated has the same
distribution as the time in the coupon collector’s problem, just analyzed.
Substituting (5.24) in Proposition 5.1 shows that if t = n log n + cn, then
\[ \sup_{x \in \Omega} \|P^t(x, \cdot) - \pi\|_{TV} \le e^{-c}. \tag{5.25} \]
5.5.3. Lower bound. In this section we prove a lower bound on the time it takes for the random
walk on the hypercube to become mixed. We begin by proving a general result which is useful for
obtaining lower bounds on total variation distance for many chains.
Let µ and ν be two probabilities on Ω, and let f be a real-valued function defined on Ω. We
write Eµ to indicate expectations of random variables (on sample space Ω) with respect to the
probability µ. Likewise Varµ ( f ) indicates variance computed with respect to the probability µ.
PROPOSITION 5.4. Let µ and ν be two probabilities on Ω, and f a real-valued function on Ω. If
\[ |\mathbf{E}_\mu(f) - \mathbf{E}_\nu(f)| = r\sigma, \]
where σ² = max{Var_µ(f), Var_ν(f)}, then
\[ \|\mu - \nu\|_{TV} \ge 1 - \frac{8}{r^2}. \]
PROOF. Without loss of generality, assume that E_µ(f) ≤ E_ν(f). Then
\[ \mu\Bigl\{ f \le \tfrac{1}{2}\bigl[\mathbf{E}_\nu(f) + \mathbf{E}_\mu(f)\bigr] \Bigr\} = \mu\Bigl\{ f - \mathbf{E}_\mu(f) \le \tfrac{r}{2}\sigma \Bigr\}. \]
Using Chebyshev’s inequality on the right-hand side above shows that
\[ \mu\Bigl\{ f \le \tfrac{1}{2}\bigl[\mathbf{E}_\mu(f) + \mathbf{E}_\nu(f)\bigr] \Bigr\} \ge 1 - \frac{4}{r^2}. \]
Similarly,
\[ \nu\Bigl\{ f \le \tfrac{1}{2}\bigl[\mathbf{E}_\mu(f) + \mathbf{E}_\nu(f)\bigr] \Bigr\} = \nu\Bigl\{ f - \mathbf{E}_\nu(f) \le -\tfrac{r}{2}\sigma \Bigr\} \le \frac{4}{r^2}. \]
Consequently, if A := { f ≤ ½[E_ν(f) + E_µ(f)] }, then
\[ \|\mu - \nu\|_{TV} \ge \mu(A) - \nu(A) \ge 1 - \frac{8}{r^2}. \]
□
For example, if r = 4, then ‖µ − ν‖_{TV} ≥ 1/2.
We return now to the random walker on the hypercube, to which we will apply Proposition 5.4.
PROPOSITION 5.5. Let $\mathbf{1} = (1, 1, \ldots, 1)$. For the random walk on the n-dimensional hypercube,
if t ≤ cn log n for c < 1/2, then
\[ \|P^t(\mathbf{1}, \cdot) - \pi\|_{TV} = 1 - o(1). \tag{5.26} \]
PROOF. The Hamming weight for x ∈ {0, 1}^n is defined by N(x) := ∑_{i=1}^{n} x_i, the number of
coordinates of x with value 1. We will compare the mean and variance of N under the uniform
probability π with the mean and variance of N_t := N(X_t), where X_t = (X_1(t), . . . , X_n(t)) is the walker
at time t started at $\mathbf{1}$ = (1, . . . , 1).
As π is uniform on {0, 1}^n, the distribution of the random variable N under π is binomial with
parameters n and p = 1/2. In particular,
\[ \mathbf{E}_\pi(N) = \frac{n}{2}, \qquad \mathrm{Var}_\pi(N) = \frac{n}{4}. \]
The chance that coordinate i has been updated at least once by time t is
\[ 1 - \Bigl(1 - \frac{1}{n}\Bigr)^{t}. \]
Given that it has been updated at least once, the chance that the last of these updates is to 0 is 1/2.
Since X_i(t) = 0 if and only if coordinate i has been updated at least once by time t and the last
of these updates is to 0,
\[ \mathbf{E}_{\mathbf{1}}(X_i(t)) = 1 - \mathbf{P}_{\mathbf{1}}\{X_i(t) = 0\} = 1 - \frac{1}{2}\Bigl[1 - \Bigl(1 - \frac{1}{n}\Bigr)^{t}\Bigr] = \frac{1}{2}\Bigl[1 + \Bigl(1 - \frac{1}{n}\Bigr)^{t}\Bigr]. \]
By the linearity of expectation,
\[ \mathbf{E}_{\mathbf{1}}(N(X_t)) = \frac{n}{2}\Bigl[1 + \Bigl(1 - \frac{1}{n}\Bigr)^{t}\Bigr]. \]
Because the bits X_i(t) are negatively correlated if X(0) = $\mathbf{1}$, it follows that Var(N_t) ≤ n/4. (See
Exercise 5.7.) Setting
\[ \sigma = \bigl(\max\{\mathrm{Var}_\pi(N), \mathrm{Var}(N_t)\}\bigr)^{1/2} = \frac{\sqrt{n}}{2}, \]
we have
\[ \bigl|\mathbf{E}_\pi(N) - \mathbf{E}_{\mathbf{1}}(N_t)\bigr| = \frac{n}{2}\Bigl(1 - \frac{1}{n}\Bigr)^{t} = \sigma\sqrt{n}\,\Bigl(1 - \frac{1}{n}\Bigr)^{t}. \]
Letting r(t, n) = \sqrt{n}\,(1 - n^{-1})^{t} and applying Proposition 5.4,
\[ \|P^{t}(\mathbf{1}, \cdot) - \pi\|_{TV} \ge 1 - \frac{8}{r(t, n)^{2}}. \tag{5.27} \]
It can be checked that if t = cn log n, where c < 1/2, then
\[ r(t, n) = \Bigl[1 + O\Bigl(\frac{\log n}{n}\Bigr)\Bigr]\, n^{1/2 - c}. \tag{5.28} \]
We conclude that if t ≤ cn log n, where c < 1/2, then
\[ \|P^{t}(\mathbf{1}, \cdot) - \pi\|_{TV} \ge 1 - \frac{8}{(1 + o(1))\, n^{1 - 2c}} = 1 - o(1). \]
□
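To get a feel for how quickly the bound takes hold, one can evaluate r(t, n) and 1 − 8/r(t, n)² numerically at t = cn log n. A minimal sketch (the function name and the choice c = 1/4 are ours):

```python
import math

def hypercube_lower_bound(n, c):
    """Evaluate r(t, n) = sqrt(n) (1 - 1/n)^t at t = c n log n and the resulting
    lower bound 1 - 8 / r^2 on the total variation distance (Proposition 5.5)."""
    t = int(c * n * math.log(n))
    r = math.sqrt(n) * (1 - 1 / n) ** t
    return t, r, max(0.0, 1 - 8 / r ** 2)

for n in (2 ** 8, 2 ** 12, 2 ** 16):
    t, r, bound = hypercube_lower_bound(n, 0.25)
    print(f"n = {n:6d}   t = {t:8d}   r = {r:7.2f}   TV lower bound = {bound:.3f}")
```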
EXERCISE 5.7. Let X(t) = (X_1(t), . . . , X_n(t)) be the position of the lazy random walker on the
hypercube, started at X(0) = $\mathbf{1}$ = (1, . . . , 1). Show that the covariance between X_i(t) and X_j(t) is
negative. Conclude that if N_t = ∑_{i=1}^{n} X_i(t), then Var(N_t) ≤ n/4. Hint: It may be easier to consider
the variables Y_i(t) = 2X_i(t) − 1.
5.6. Top-to-random shuffle revisited
Let us return now to the top-to-random shuffle introduced in Section 5.1. The time T when the
original bottom card is first placed in the deck after rising to the top is a strong stationary time, as
explained earlier.
Consider the motion of the original bottom card. When there are k cards beneath it, the chance
that the next shuffle places the top card underneath it, so that it rises by one position, is (k + 1)/n,
and this chance stays the same until such a shuffle occurs. Counting also the final shuffle that
reinserts the original bottom card after it reaches the top, the distribution of T is the same as the
coupon collector's time. From here the upper bound analysis proceeds just as for the strong
stationary time for the hypercube walk. Applying Proposition 5.1 and (5.24) produces the following result:
PROPOSITION 5.6. Let P be the top-to-random shuffle on S_n. If t = n log n + cn, then
\[ \sup_{x \in S_n} \|P^t(x, \cdot) - \pi\|_{TV} \le e^{-c}. \tag{5.29} \]
5.7. Transitive chains and move-to-front
A Markov chain is called transitive if for each pair (x, y) ∈ Ω × Ω there is a function φ = φ_{(x,y)}
mapping Ω to itself such that
\[ \varphi(x) = y \quad\text{and}\quad P(z, w) = P(\varphi(z), \varphi(w)) \quad\text{for all } z, w \in \Omega. \tag{5.30} \]
Roughly, this means the chain “looks the same” from any point in the state space Ω.
EXERCISE 5.8. Show that the random walk on the torus (Section 4.4) is transitive.
EXERCISE 5.9. Show that the stationary distribution of a transitive chain must be uniform.
5.7.1. Time reversal and transitive chains. The time-reversal of a Markov chain with transition matrix P and stationary distribution π is the chain with matrix
\[ \widehat{P}(x, y) := \frac{\pi(y)\, P(y, x)}{\pi(x)}. \tag{5.31} \]
EXERCISE 5.10. Let (X_t) be a Markov chain with transition matrix P, and write $(\widehat{X}_t)$ for the
time-reversed chain with the matrix $\widehat{P}$ defined in (5.31).
(a) Check that π is stationary for $\widehat{P}$.
(b) Show that
\[ \mathbf{P}_\pi\{X_0 = x_0, \ldots, X_n = x_n\} = \mathbf{P}_\pi\{\widehat{X}_0 = x_n, \ldots, \widehat{X}_n = x_0\}. \tag{5.32} \]
Exercise 5.10 shows that the terminology “time-reversal” is reasonable. (Note that when a chain
is reversible, as defined in Section 3.2.5, then $\widehat{P} = P$.)
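A small numerical check of the time-reversal construction may be helpful. The sketch below (NumPy; the three-state matrix is an arbitrary illustrative choice) builds P̂ from (5.31), verifies that π is stationary for it as in Exercise 5.10(a), and checks the path-reversal identity (5.32) for a path of length two.

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],     # an arbitrary irreducible three-state chain
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()

# Time reversal (5.31): P_hat(x, y) = pi(y) P(y, x) / pi(x).
P_hat = (P.T * pi) / pi[:, None]

assert np.allclose(P_hat.sum(axis=1), 1)   # P_hat is a stochastic matrix
assert np.allclose(pi @ P_hat, pi)         # Exercise 5.10(a): pi is stationary for P_hat

# Exercise 5.10(b) for a path of length two started from stationarity.
a, b, c = 0, 2, 1
forward = pi[a] * P[a, b] * P[b, c]
backward = pi[c] * P_hat[c, b] * P_hat[b, a]
assert np.isclose(forward, backward)
print("time-reversal checks passed")
```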
EXERCISE 5.11. Show that if P is transitive, then $\widehat{P}$ is also transitive.
LEMMA 5.7. Let P be a transitive transition matrix and let $\widehat{P}$ be the time-reversed matrix
defined in (5.31). Then
\[ \|\widehat{P}^t(x, \cdot) - \pi\|_{TV} = \|P^t(x, \cdot) - \pi\|_{TV}. \tag{5.33} \]
PROOF. Since our chain is transitive, Exercise 5.9 implies that it has uniform stationary distribution. For x, y ∈ Ω, let φ_{(x,y)} be a permutation carrying x to y and preserving the structure of the
chain. For any x, y ∈ Ω and any t,
\[ \sum_{z \in \Omega} \bigl|P^t(x, z) - |\Omega|^{-1}\bigr| = \sum_{z \in \Omega} \bigl|P^t\bigl(\varphi_{(x,y)}(x), \varphi_{(x,y)}(z)\bigr) - |\Omega|^{-1}\bigr| \tag{5.34} \]
\[ = \sum_{z \in \Omega} \bigl|P^t(y, z) - |\Omega|^{-1}\bigr|. \tag{5.35} \]
Averaging both sides over y yields
\[ \sum_{z \in \Omega} \bigl|P^t(x, z) - |\Omega|^{-1}\bigr| = \frac{1}{|\Omega|} \sum_{y \in \Omega} \sum_{z \in \Omega} \bigl|P^t(y, z) - |\Omega|^{-1}\bigr|. \tag{5.36} \]
Because π is uniform, we have $P(y, z) = \widehat{P}(z, y)$, and thus $P^t(y, z) = \widehat{P}^t(z, y)$. It follows that the
right-hand side above is equal to
\[ \frac{1}{|\Omega|} \sum_{y \in \Omega} \sum_{z \in \Omega} \bigl|\widehat{P}^t(z, y) - |\Omega|^{-1}\bigr| = \frac{1}{|\Omega|} \sum_{z \in \Omega} \sum_{y \in \Omega} \bigl|\widehat{P}^t(z, y) - |\Omega|^{-1}\bigr|. \tag{5.37} \]
By Exercise 5.11, $\widehat{P}$ is also transitive, so (5.36) holds with $\widehat{P}$ replacing P (and z and y interchanging
roles). We conclude that
\[ \sum_{z \in \Omega} \bigl|P^t(x, z) - |\Omega|^{-1}\bigr| = \sum_{y \in \Omega} \bigl|\widehat{P}^t(x, y) - |\Omega|^{-1}\bigr|. \tag{5.38} \]
Dividing by 2 and applying Proposition 3.9 completes the proof. □
5.7.2. Move-to-front chain. A certain professor owns many books, arranged on his shelves.
When he finishes with a book drawn from his collection, he does not waste time reshelving it in its
proper location. Instead, he puts it at the very beginning of his collection, in front of all the shelved
books.
If his choice of book is random, this is an example of the move-to-front chain. It is a very natural
chain which arises in many applied contexts. Any setting where items are stored in a stack, removed
at random locations, and placed on the top of the stack can be modeled by the move-to-front chain.
Let P be the transition matrix (on permutations of {1, 2, . . . , n}) corresponding to this method
of rearranging elements.
The time-reversal $\widehat{P}$ of the move-to-front chain is the top-to-random shuffle, as intuition would
expect. It is clear from the definition that for any permissible transition σ_1 ↦ σ_2 for move-to-front,
the transition σ_2 ↦ σ_1 is permissible for top-to-random, and both have probability n^{-1}.
Using Lemma 5.7, the mixing time for move-to-front will be identical to that of the top-to-random shuffle. Hence the upper bound of Proposition 5.6 applies to the move-to-front chain as
well.
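For concreteness, here is a minimal sketch of the two update rules (the function names are ours). Each permissible move-to-front transition σ1 ↦ σ2 reverses a top-to-random transition σ2 ↦ σ1, and both have probability 1/n, as noted above.

```python
import random

def move_to_front_step(order, rng=random):
    """One move-to-front step: a uniformly chosen item is moved to the front."""
    order = list(order)
    item = order.pop(rng.randrange(len(order)))
    return [item] + order

def top_to_random_step(order, rng=random):
    """One top-to-random step: the front item is reinserted at a uniform position."""
    order = list(order)
    item = order.pop(0)
    order.insert(rng.randrange(len(order) + 1), item)
    return order

books = list(range(5))
print(books, "->", move_to_front_step(books))
```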
5.8. Notes
A group is a set equipped with an associative multiplication operation that has an identity element
and under which every element has an inverse. Random walks on groups are always transitive.
References on strong uniform times are Aldous and Diaconis (1986) and Aldous and Diaconis
(1987).