Fast Mixing in a Markov Chain
László Lovász and Peter Winkler
December 2007
Contents

1 Introduction and preliminaries
  1.1 Notation
  1.2 Examples

2 Access times between distributions
  2.1 Hitting times
  2.2 General stopping rules
  2.3 Access times
  2.4 Mixing time
  2.5 Exit frequencies
  2.6 *Exit discrepancy and recurrent potential
  2.7 Four stopping rules
    2.7.1 Local rule
    2.7.2 Filling rule
    2.7.3 Threshold rule
    2.7.4 Chain rule
  2.8 Halting states and optimal rules
  2.9 *Strong stopping rules
  2.10 *Matrix formulas
  2.11 *Comparison of stopping rules
  2.12 *Continuous time and space: an example

3 *Time reversal
  3.1 Reverse chains
  3.2 Forget time and reset time
  3.3 Optimal and pessimal starting states
  3.4 Exit frequency matrices

4 Blind rules
  4.1 *Exact blind rules
  4.2 Averaging rules
  4.3 *Approximate blind mixing times
  4.4 *Pointwise approximation

5 *Groups of mixing times
  5.1 The relaxation group
  5.2 The forget group
  5.3 The reset group
  5.4 The mixing group
  5.5 The maxing group
  5.6 A reverse inequality

6 Estimating the mixing time
  6.1 *A linear programming bound
  6.2 Conductance
  6.3 *Coupling
Abstract
The critical issue in the complexity of Markov chain sampling techniques has been “mixing
time”, the number of steps of the chain needed to reach its stationary distribution. It turns
out that there are many ways to define mixing time—more than a dozen are considered here—
but they fall into a small number of classes. The parameters in each class lie within constant
multiples of one another, independent of the chain.
1 Introduction and preliminaries
In the past 20 years there have been numerous applications (see, e.g., [3], [18], [16]) of sampling via
Markov chains. Typically a random walk on a graph is run for a fixed number of steps after which
the distribution of the current state is nearly stationary.
There is no particular reason why such a walk must be run for a fixed number of steps; in fact,
more general stopping rules, some of which “look where they are going,” are capable of achieving
the stationary distribution exactly.
It turns out to be useful to consider stopping rules that achieve any given distribution, when
starting from some other given distribution. From a theoretical standpoint, our stopping rules result
in a generalization of the notion of “hitting time” to state-distributions. This measure of distance
between distributions behaves quite nicely.
By considering the least expected number of steps required by rules of a particular kind to reach
the stationary distribution (or some approximation thereof) we obtain several measures of mixing
time. To these we add some additional parameters, related to eigenvalues, set-hitting time, and the
reverse of the given chain. Altogether we obtain a substantial collection of numbers each of which
has mixing time implications.
Our objective is to place these numbers into a few equivalence classes, within which numbers
differ only by constant factors independent of the chain. To do so we develop a calculus of “exit
frequencies” and some linear algebraic tools which we hope shed some light on the mechanism of
mixing.
In what follows we assume that the Markov chain has finitely many states (although see [6] for
more general results) but is not generally reversible. For some of our stopping rules the transition
matrix of the Markov chain must be known, making their use as sampling mechanisms unlikely; but such
rules are useful in analysis and, as we shall see, often replaceable by simpler ones without loss in
efficiency. In any case we put no restriction on the amount of computation needed to implement a
stopping rule.
1.1 Notation
Throughout this paper we will assume a fixed irreducible Markov chain with transition matrix M = (pij), with a finite state space V of cardinality n (see [6] for extensions to infinite state space). If we say “distribution” without specifying an underlying set, we mean a distribution on V. If σ is a distribution on V and A ⊆ V, then σ(A) = Σ_{i∈A} σi is the probability of A. We denote by π the stationary distribution of the chain. The ergodic flow consists of the values
    Qij = πi pij.
1.2 Examples
In this section we discuss examples which show that “intelligent” stopping rules can sometimes
achieve specified distributions in an elegant or surprising manner.
Example 1.1 (Cycle) The following is an interesting fact from folklore. Let G be a cycle of length n and start a random walk on G from a node u. Then the probability that v is the last node visited (i.e., the random walk visits every other node before hitting v) is the same for each v ≠ u.
While this is not an efficient way to generate a uniform random point of the cycle, it indicates that there are entirely different ways to use random walks for sampling than walking a given number of steps. This particular method does not generalize; in fact, apart from the complete graph, the cycle is the only graph which enjoys this property (see [21]).
Example 1.2 (Cube) Consider another quite simple graph, the cube, which we view as the graph
of vertices and edges of [0, 1]n . Let us do a random walk on it as follows: at each vertex, we select
a direction (that is, one of the n coordinate indices) at random, then flip a coin. If we get “heads”
we walk along the incident edge corresponding to that direction; if “tails” we stay where we are.
We stop when we have selected every direction at least once (whether or not we walked along the
edge).
It is trivial that after each of the n directions has been selected, the corresponding coordinate
will be 0 or 1 with equal probability, independently of the rest of the coordinates. So the vertex we
stop at will be uniformly distributed over all vertices.
This method takes about n ln n coin flips on the average, thus about n ln n/2 actual steps, so it
is a quite efficient way to generate a random vertex of the cube (assuming we insist on using random
walks; otherwise choosing the coordinates independently is simpler and faster). We will see that it
is in fact optimal.
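To make the rule concrete, here is a minimal simulation sketch in Python; the dimension n = 4, the sample count and the function name are illustrative choices of ours, not from the text, and the empirical counts should come out roughly uniform over the 2^n vertices.

    import random
    from collections import Counter

    def cube_rule_sample(n):
        """Walk on the n-cube: pick a coordinate, flip a coin to move or stay;
        stop once every coordinate has been picked at least once."""
        v = [0] * n                  # start at the all-zeros vertex
        picked = set()
        while len(picked) < n:
            d = random.randrange(n)  # choose a direction
            picked.add(d)
            if random.random() < 0.5:
                v[d] ^= 1            # "heads": walk along the chosen edge
        return tuple(v)

    counts = Counter(cube_rule_sample(4) for _ in range(16000))
    print(len(counts), "distinct vertices; about", 16000 // 2**4, "samples expected per vertex")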
Example 1.3 (Card Shuffling) A classic application of Markov chain mixing is shuffling a deck
of playing cards; see e.g. [9] where Bayer and Diaconis argue that seven riffle shuffles are necessary
and sufficient to mix a deck of 52 cards. In Aldous and Diaconis [5], the following simple shuffling
algorithm is analyzed: a card is removed from the top of an n-card deck and replaced with equal
probability in any of the n slots among the remaining n − 1 cards.
It is not difficult to see that if we note when the card originally at the bottom of the deck
has reached the top and perform just one more shuffle, then the deck will be precisely uniformly
random! Again, this method takes about n ln n steps on the average, and also as in the cube case,
constitutes a “strong” stopping rule in the sense of [5] (the ending distribution is uniform even when
conditioned on the stopping rule having taken some fixed number of steps). However, we will see
later that this shuffling rule is not optimal.
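A small simulation sketch of this stopping rule (the deck size of 52 below is an illustrative choice, and the function name is ours): it performs the top-to-random shuffle of [5] and stops one shuffle after the original bottom card has come to the top.

    import random

    def top_to_random_stop(n):
        """Shuffle an n-card deck by repeatedly moving the top card to a
        uniformly random slot; stop one shuffle after the original bottom
        card reaches the top.  Returns the final deck and the shuffle count."""
        deck = list(range(n))            # card n-1 starts at the bottom
        bottom = deck[-1]
        t = 0
        while True:
            top = deck.pop(0)
            deck.insert(random.randrange(n), top)   # n possible slots
            t += 1
            if top == bottom:            # this pass was the "one more shuffle"
                return deck, t

    _, t = top_to_random_stop(52)
    print(t, "shuffles used; the mean is about n ln n for large n")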
Without going into the details, let us remark that the algorithm of Aldous [4] and Broder [10] can
be regarded as a stopping rule for a Markov chain on trees that generates the uniform distribution.
Finally, we note that to stop after a fixed number of steps in the “continuous” time model of
Markov chains (where the time needed by a step is an exponentially distributed random variable)
can be viewed as choosing a random number T from a Poisson distribution and stopping after
T steps. So here a (randomized) stopping rule is considered which, in many respects, has better
properties than the “stop after t steps” rule. A similar remark applies to stopping a “lazy” walk as
defined, e.g., in [20].
2 Access times between distributions
2.1 Hitting times
Let H(i, j) denote the expected hitting time from i to j (mean number of steps to reach j from i).
Hitting times satisfy the following identity:
    H(i, j) = 1 + Σ_k pik H(k, j)   if i ≠ j,        H(i, j) = 0   if i = j.    (1)
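For a small chain, identity (1) can be solved numerically as one linear system per target state; the sketch below is one way to do it (the transition matrix, a walk on the 4-cycle, and the function name are illustrative choices of ours).

    import numpy as np

    def hitting_times(P):
        """Return H with H[i, j] = expected hitting time from i to j, obtained by
        solving the linear system (1) separately for each target state j."""
        n = P.shape[0]
        H = np.zeros((n, n))
        for j in range(n):
            keep = [i for i in range(n) if i != j]
            A = np.eye(n - 1) - P[np.ix_(keep, keep)]   # (I - P) restricted to i != j
            H[keep, j] = np.linalg.solve(A, np.ones(n - 1))
        return H

    P = np.array([[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]])
    print(hitting_times(P))   # adjacent states give 3, opposite states give 4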
A more surprising identity is the “Random Target Lemma” (also known as the right averaging principle, e.g. in [1]), which says that there is a constant N for which
    Σ_j πj H(i, j) = N    (2)
for every i.
Hitting times in a reversible chain satisfy the following “cycle-reversing” identity of Coppersmith
et al. [13]:
    H(i, j) + H(j, k) + H(k, i) = H(i, k) + H(k, j) + H(j, i).    (3)
Our goal is to generalize this by defining access times between distributions.
2.2 General stopping rules
Throughout this paper we will assume a fixed irreducible Markov chain with transition matrix M = (pij) and initial distribution σ, over a state space V of finite cardinality n. Most of our results generalize in a straightforward manner in the presence of transient states and/or a countably infinite state space, but with messier conditions and arguments which we prefer to avoid. We assume discrete-time transitions but, again, most of our results can be modified to apply to the continuous-time case.
We denote by V∗ the space of finite “walks” on V, that is, the set of finite strings w = (w0, w1, . . . , wt), wi ∈ V. Relative to the Markov chain we then have
    P(w) = σ_{w0} ∏_{i=0}^{t−1} p_{wi,wi+1}.
The distribution of wt will be denoted by σ^t, so that σ^0 = σ and σ^t_i = Σ_{w: wt=i} P((w0, . . . , wt)).
It will be useful to regard a stopping rule Γ both as a random stopping time and as a (partial) function from V∗ to [0, 1] giving the probability of continuing from a walk w. Formally:
Definition. A stopping rule Γ is a partial map from V∗ to [0, 1], such that Γ(w) is defined for w = (w0, . . . , wt) just when P(w) > 0 and Γ(w0, . . . , wi) is defined and non-zero for each i, 0 ≤ i < t.
We interpret Γ(w) as the probability of continuing given that w is the walk so far observed, each
such stop-or-go decision being made independently. We can also regard Γ as a random variable with
values in {0, 1, . . . }, so that we stop at wΓ . The probability of stopping at state j, given starting
distribution σ, is
    σ^Γ_j := Σ_{w=(w0,...,wt=j)} σ_{w0} ( ∏_{i=0}^{t−1} Γ(w0, . . . , wi) p_{wi,wi+1} ) (1 − Γ(w)).
The mean length EΓ of the stopping rule Γ is its expected duration; if EΓ < ∞ then with
probability 1 the walk eventually stops, and thus σ Γ is a probability distribution. A stopping rule
Γ for which σ Γ = τ is also called a stopping rule from σ to τ .
Note that if X is a random variable whose values are stopping rules, then X is equivalent to a
stopping rule Γ whose probability of continuing at w is a sum or integral of X(w) conditioned on
X not having stopped the chain so far. Thus our stopping rules are not less general for insisting on
independent randomization at each step. For example, the stopping rule given above in Example 1.2
(Cube) does not at first appear to fit our model, but we may simply walk on the (loopless) cube and
stop according to the probability that the “draw and flip” rule would stop, given that it reached
the current time and state.
Note that for any Markov chain and any τ , there is at least one finite stopping rule Γ such that
σ Γ = τ ; namely, we select a target state j in accordance with τ and walk until we reach j. We call
this the “naive” stopping rule Ωσ,τ . The mean length of the naive rule is given by
    N(σ, τ) = EΩ_{σ,τ} = Σ_{i,j} σi τj H(i, j).
When the target is the stationary distribution π for the chain, EΩ_{σ,π} is independent of the starting
distribution σ, on account of the Random Target Lemma (2), so that N = N (σ, π).
The maximum length max(Γ) of a stopping rule Γ is the maximum length of a walk that has
positive probability. Γ is said to be bounded (for σ) if this quantity is finite. It is clear that for a
fixed M and Γ, max(Γ) only depends on the support of σ. Bounded stopping rules are available
when the target distribution τ has sufficiently large support, thus in particular when τ = π.
We often think of a stopping rule Γ as a means of moving from a starting distribution σ to a
given target distribution τ = σ Γ . Such a Γ is said to be mean-optimal or simply optimal (for σ and
τ ) if EΓ is minimal, max-optimal if max(Γ) is minimal.
2.3 Access times
The mean length of a mean-optimal stopping rule from σ to τ will be called the access time from σ to τ and denoted H(σ, τ). As suggested by the notation, H(σ, τ) generalizes the hitting times H(i, j).
Trivially, H(σ, τ) = 0 if and only if σ = τ. It is easy to see that the following triangle inequality is satisfied for any three distributions α, β and γ:
    H(α, γ) ≤ H(α, β) + H(β, γ).    (4)
(To generate γ from α, we can first use an optimal rule to generate β from α and then use the state obtained as a starting state for an optimal rule generating γ from β.) We should warn the reader, however, that H(σ, τ) ≠ H(τ, σ) in general.
We have seen that the access time H(σ, τ) has the properties of a metric on the space of state-distributions, except for symmetry; the latter is of course too much to expect since the ordinary hitting time, even for a reversible chain, is not generally symmetric.
Hence it would appear reasonable to expect that the cycle-reversing identity (3) extends to distributions in the case of reversible chains. However, this is not the case. A counterexample may be found on the 5-path: let α = (0, 0, 1, 0, 0), β = (0, 1/2, 0, 1/2, 0) and γ = (1/4, 0, 1/2, 0, 1/4). Then it is easily checked that H(α, β) + H(β, γ) + H(γ, α) = 1 + 1 + 2 but the reverse route gives 2 + 1 + 2.
If τ is concentrated at j (for which we write, rather carelessly, “τ = j”) then
    H(σ, j) = Σ_i σi H(i, j)    (5)
since clearly the only optimal stopping rule in this case is Ω_{σ,j}, “walk until state j is reached.” This is therefore both mean-optimal and max-optimal, although max(Ω_{σ,j}) will generally be infinite. In particular, we have
    N = Σ_j πj H(π, j).    (6)
By considering the naive rule Ω_{σ,τ}, we get the inequality
    H(σ, τ) ≤ Σ_{i,j} σi τj H(i, j).    (7)
This may be quite far from equality; for example, H(σ, σ) = 0 for any σ.
The access time H(σ, τ) is convex in both its arguments, i.e.,
    H(cσ + (1 − c)σ′, τ) ≤ cH(σ, τ) + (1 − c)H(σ′, τ)    (8)
and
    H(σ, cτ + (1 − c)τ′) ≤ cH(σ, τ) + (1 − c)H(σ, τ′)    (9)
for 0 ≤ c ≤ 1 and any distributions σ, σ′, τ, τ′ on S.
To see (8), let i be the starting point drawn from distribution cσ + (1 − c)σ′. Flip a biased coin; with probability cσi/(cσi + (1 − c)σ′i) follow an optimal stopping rule from σ to τ, else follow an optimal stopping rule from σ′ to τ. It is easy to check that this gives a stopping rule from cσ + (1 − c)σ′ to τ, which may not be optimal, but gives an upper bound on the length of an optimal rule.
We derive (9) similarly: with probability c follow an optimal stopping rule from σ to τ, else one from σ to τ′.
We mention one more useful elementary inequality. Let σ and τ be two distributions and denote by d(σ, τ) their total variation distance: d(σ, τ) = Σ_i max{0, σi − τi}. Also define the distributions
    (σ\τ)i := max{0, σi − τi} / d(σ, τ),        (σ ∧ τ)i := min{σi, τi} / (1 − d(σ, τ)).
Then
    H(σ, τ) ≤ d(σ, τ) H(σ\τ, τ\σ).    (10)
This can be seen by stopping with probability min{σi, τi}/σi at time 0, else following an optimal stopping rule to get from σ\τ to τ\σ.
2.4 Mixing time
Before defining our first notions of mixing time it will be useful to establish a reasonably consistent
notation scheme. We will use script letters (like H or N ) to denote mixing parameters, with different
letters often used to distinguish classes of stopping rules. The dependence of a parameter on the
(fixed) transition matrix of a Markov chain will always be understood.
As before we permit initial and target distributions as parameters, e.g. H(σ, τ ) for the access
time from σ to τ , but we will extend the scope of the arguments even further, to sets of distributions.
In the case of access time, if A and B are two sets of state-distributions, we define
    H(A, B) := min_{β∈B} max_{α∈A} H(α, β).
For example, H(V, ≤ 2π) would be the minimum over all distributions τ ≤ 2π of the access time to
τ from a worst single state (equivalently, in this case, a worst distribution of states).
Often mixing times will be computed from the worst starting distribution among all
distributions—as in the above case this is generally equivalent to worst state—and that will be
the assumption if there is only one argument to H. If there are no arguments to H then the starting
distribution is worst case and the target distribution π, as it is in most applications. In symbols:
    H := H(π) = max_{s∈V} H(s, π) = max_σ H(σ, π) := H(V, π).
We bestow upon H the honor of calling it the mixing time of the chain.
We have already seen the notation
N := N (π) := N (σ, π)
for the constant in the Random Target Lemma, where σ is any distribution.
We reserve the letter M for the minimum over all stopping rules Γ (with appropriately restricted
initial and target distributions) of the maximum number of steps, instead of the mean, taken by Γ.
Thus,
    M(σ, τ) := min_{Γ: σ^Γ=τ} max(Γ);
and
    M := M(π) := max_σ M(σ, π)
is the related mixing measure, which we cannot resist calling the “maxing time” of the chain.
A stopping rule is said to be independent if the state at which it stops the chain is independent
of the one in which the chain was started. If we write
I(σ, τ ) := min{EΓ : Γ independent, σ Γ = τ }
then we have
    I(σ, τ) = Σ_{i∈V} σi H(i, τ)
since an independent rule must produce the target distribution “separately” from each starting
state.
In many applications repeated independent samples are required from the stationary distribution
of a chain, and in that case it is natural to start the second and subsequent walks from π itself instead
of a “worst” state. Hence, we let both the default initial and target distributions for an independent
rule be π:
    I := I(π, π) = Σ_i πi H(i, π)
obtaining what we call the reset time of the chain. (This is only formally similar to formula (6) for
N ; in general, N is much larger.)
The last mixing measure which we introduce at this stage does not overtly involve the stationary
distribution; instead, we look for the best target distribution, in the sense of expected access time
from a worst starting state for that distribution. We call this parameter the forget time because it
measures, in a sense, the least time it can take to “forget” what state the chain was started in. We
denote the forget time by F and this time no arguments are needed:
    F := min_τ max_{s∈V} H(s, τ) = min_τ max_σ H(σ, τ).
The target distribution τ which minimizes maxs H(s, τ ) will be called the “forget distribution” and
denoted by ϕ.
2.5 Exit frequencies
Let us now fix a transition matrix M , an initial distribution σ and a finite stopping rule Γ.
Definition. The exit frequencies xi = xi(Γ) (i ∈ V) are defined by setting xi equal to the expected number of times the walk leaves state i before stopping. The scaled exit frequencies are defined by yi = xi/πi. In the finite case, the use of xi or yi is only a matter of taste; in the case of Markov chains with infinite state space, the difference between xi and yi is quite significant.
Exit frequencies are a special case of what Pitman [26] calls “pre-T occupation measures” for
stopping times T . Indeed, versions of Lemma 2.2 and Theorem 2.4 below are proved in [26] using
Pitman’s “occupation measure identity.”
To warm up, let us determine the exit frequencies in some simple cases. The first result is from
Aldous [1]. Several related formulas could be derived using relations to electrical networks, as in
[12], [14], or [29].
Lemma 2.1 The scaled exit frequencies for the naive stopping rule from state i to state j are given
by
yk (Ωij ) = H(i, j) + H(j, k) − H(i, k).
More generally, the scaled exit frequencies for the naive stopping rule from σ to τ are given by
    yk(Ω_{σ,τ}) = Σ_{i,j} σi τj (H(i, j) + H(j, k) − H(i, k)) = Σ_{i,j} σi τj H(i, j) + H(τ, k) − H(σ, k).
Proof. The mean number of k-visits in a walk from i to j to k and back to i must be the stationary
probability of k, multiplied by the mean length of such a walk, i.e. πk (H(i, j) + H(j, k) + H(k, i)),
since one can partition a doubly-infinite walk into such pieces. Similarly the number of k visits
(counting the last but not the 0-th) on a k-to-i-to-k round trip is πk (H(k, i) + H(i, k)).
From i to k or from j to k we have exactly one visit of k, so altogether we get that the mean number of k-visits in a walk from i to j is
    πk (H(i, j) + H(j, k) + H(k, i)) + 1 − πk (H(k, i) + H(i, k)) − 1,
giving the desired result.
¤
Exit frequencies are related to the starting and ending distributions by a simple formula, found
in Pitman [26]:
Lemma 2.2 The exit frequencies of any stopping rule Γ that reaches distribution τ from distribution
σ satisfy the equation
    Σ_i pij xi(Γ) − xj(Γ) = τj − σj.
Proof. The probability of stopping at state j is the expected number of times j is entered minus
the expected number of times j is left.
¤
One can rewrite this identity as follows.
    Σ_i pij xi(Γ) − Σ_i pji xj(Γ) = τj − σj.
In other words, the values pij xi (Γ) can be viewed as values of a flow through the underlying graph,
from supply σ to demand τ . This is also easily seen by observing that pij xi (Γ) is the expected
number of passes from i to j while following Γ.
Theorem 2.3 Fix two distributions σ and τ, and let Γ and Γ′ be two finite stopping rules from σ to τ. Then x(Γ′) − x(Γ) = (EΓ′ − EΓ)π.
Conversely, if Γ and Γ′ are two stopping rules such that, when starting from σ, x(Γ) and x(Γ′) differ by a multiple of π, then σ^Γ = σ^{Γ′}.
Proof. Let xi = xi(Γ), x′i = xi(Γ′). By Lemma 2.2,
    σ^Γ_j = σj + Σ_i pij xi − xj.
In case σ^Γ = σ^{Γ′}, we can subtract the similar identity involving x′ to get
    x′j − xj = Σ_i pij (x′i − xi),
which means that x′ − x is a multiple of the stationary distribution, say x′ − x = Dπ. Since Σ_j xj = EΓ, the difference in mean length is D. The converse statement follows by the same formulas.
¤
It follows from Theorem 2.3 that the exit frequencies of any mean-optimal stopping rule from σ
to τ are the same. We denote them by xi (σ, τ ), and the corresponding scaled exit frequencies by
yi(σ, τ). We can compute these using Theorem 2.3 and the naive rule:
    yk(σ, τ) = H(σ, τ) + Σ_{i,j} σi τj H(i, j) + H(τ, k) − H(σ, k) − Σ_{i,j} σi τj H(i, j)
             = H(σ, τ) + H(τ, k) − H(σ, k).    (11)
Thus we proved:
Theorem 2.4 The scaled exit frequencies of a mean-optimal stopping rule from σ to τ are given by
yk (σ, τ ) = H(τ, k) − H(σ, k) + H(σ, τ ).
2.6 *Exit discrepancy and recurrent potential
Another way of formulating Theorem 2.3 is to say that the values yi (Γ) − EΓ are the same for
every stopping rule Γ (optimal or not) from a given σ to a given τ . We call these values the exit
discrepancies and denote them by zi (σ, τ ). They will be very convenient quantities to use.
Formula (11) implies that
zk (σ, τ ) = H(τ, k) − H(σ, k),
which yields immediately a number of useful properties of z:
zk (σ, τ ) + zk (τ, ρ) = zk (σ, ρ) ,
and
zk (σ, τ ) = −zk (τ, σ).
Also note that zk (σ, τ ) is a linear function in each of σ and τ .
The matrix Rik = zi (k, π) is called the recurrent potential (from starting state k). The above
properties of exit discrepancies imply that the recurrent potential contains all the information needed
to express exit discrepancies, exit frequencies, and access times.
Lemma 2.5 For every fixed state k, the recurrent potential Rik is the (unique) function on S satisfying the conditions
    Σ_j pji πj Rjk − πi Rik = πi   if i ≠ k,        Σ_j pji πj Rjk − πi Rik = πi − 1   if i = k,
and
    Σ_i πi Rik = 0.
Furthermore, we have the formulas
    zi(σ, τ) = Σ_k (σk − τk) Rik,
    H(σ, τ) = − min_i zi(σ, τ) = max_i Σ_k (τk − σk) Rik
and
    yi(σ, τ) = zi(σ, τ) + H(σ, τ) = Σ_k (σk − τk) Rik − min_j Σ_k (σk − τk) Rjk.
Proof.
    zi(σ, τ) = zi(σ, π) − zi(τ, π) = Σ_k (σk − τk) zi(k, π) = Σ_k (σk − τk) Rik.
¤
By substituting using (2.4), it is easy to verify the following identity:
zk (σ, ρ) = d(σ, ρ)zk (σ \ ρ, ρ \ σ).
(12)
Let ‖·‖π denote the ℓ1 norm in R^n with weight function π, i.e., ‖z(σ, τ)‖π = Σ_{i∈S} πi |zi(σ, τ)|. We denote by Z the maximum of (1/2)‖z(σ, π)‖π over all distributions σ, and call it the discrepancy of the Markov chain. Clearly, the maximum is attained when σ is a singleton. Then
    ‖z(σ, ρ)‖π = d(σ, ρ)‖z(σ\ρ, ρ\σ)‖π ≤ d(σ, ρ)(‖z(σ\ρ, π)‖π + ‖z(ρ\σ, π)‖π) ≤ 4d(σ, ρ)Z.    (13)
We obtain another interesting formula for z by considering a simple stopping rule: walk one
step! Its exit frequencies are σi , and hence
    zk(σ, σ^1) = σk/πk − 1.
It follows that
    zk(σ, σ^t) = Σ_{m=0}^{t−1} (σ^m_k/πk − 1).    (14)
If the chain is ergodic, then we can let t → ∞ and obtain
    zk(σ, π) = Σ_{m=0}^{∞} (σ^m_k/πk − 1)
(it is easy to see that the series on the right is absolutely convergent). Hence we get that for any two distributions σ and τ,
    zk(σ, τ) = Σ_{m=0}^{∞} (σ^m_k/πk − τ^m_k/πk).
Exit discrepancies are related to access times by some simple but useful inequalities. Trivially
    zi(σ, τ) ≥ −H(σ, τ)    (15)
and hence
    zi(σ, τ) ≤ H(τ, σ).    (16)
In terms of H(σ, τ), we only have the much weaker inequality
    zi(σ, τ) ≤ (1/πi − 1) H(σ, τ).    (17)
To get a stronger inequality, note that since Σ_i πi zi(σ, τ) = 0, we have
    max_A Σ_{i∈A} πi zi(σ, τ) = Σ_i πi max{0, zi(σ, τ)} = (1/2)‖z(σ, τ)‖π.    (18)
Then for the sum of zi over any subset:
    Σ_{i∈A} πi zi(σ, τ) = Σ_{i∈A} xi(σ, τ) − πA H(σ, τ) ≤ H(σ, τ) − πA H(σ, τ).
Hence
    (1/2)‖z(σ, τ)‖π ≤ H(σ, τ).    (19)

2.7 Four stopping rules
Using the notion of exit frequencies, we describe four stopping rules that, with the right choice of their parameters, can generate any target distribution from any starting distribution. We’ll see in the next section that these rules are optimal (again, if the parameters are chosen right). Throughout, let σ and τ be arbitrary distributions and let x ∈ R^V be such that xi ≥ 0 and
    Σ_j pji xj − xi = τi − σi.    (20)
(Such numbers xi can be obtained, for example, by considering the exit frequencies of any stopping rule from σ to τ, and then adding to them any real multiple of π, as long as non-negativity is preserved.)
2.7.1 Local rule
The following “local” (or “Markovian”) rule Λ is perhaps the easiest to describe: if we are at node
i, we stop with probability τi /(xi + τi ), and move on with probability xi /(xi + τi ) (if xi + τi = 0
the stopping probability does not need to be defined). Thus the probability of stopping under Λ
depends only on the current state, not the time; in effect, it treats “stop” like a new (absorbing)
state of the Markov chain. It is clear that each walk stops eventually with probability 1.
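The following sketch simulates the local rule; the vector x (any non-negative solution of (20) for the pair σ, τ) is assumed to be supplied, and the function name and signature are our own illustrative choices.

    import random

    def local_rule(P, start, tau, x, rng=random):
        """Local rule: at state i, stop with probability tau[i]/(x[i]+tau[i]),
        otherwise take one step of the chain.  P[i] is the list of transition
        probabilities out of state i."""
        i, steps = start, 0
        while True:
            denom = x[i] + tau[i]
            if denom == 0:
                return i, steps          # unreachable when x solves (20); stop defensively
            if rng.random() < tau[i] / denom:
                return i, steps          # stop here
            i = rng.choices(range(len(P)), weights=P[i])[0]   # move on
            steps += 1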
Theorem 2.6 The local rule generates τ, i.e., σ^Λ = τ.
Proof. Let x′1, . . . , x′n be the exit frequencies of Λ, let τ′ be the distribution it produces, and assume that τ′ ≠ τ. Clearly τ′i = 0 if τi = 0 and x′i = 0 if xi = 0. Since any time we are at state i with τi ≠ 0, the probability of moving on is xi/τi times that of staying, we have that
    x′i = (xi/τi) τ′i.
For each state i, let
    αi = x′i/xi if xi ≠ 0,        αi = τ′i/τi if τi ≠ 0,        αi = 0 if xi = τi = 0,
and
    α = max_i αi.
Let I = {i : αi = α} and I1 = {i : αi = α, xi > 0}.
Since τ′ ≠ τ, we have α > 1. Moreover, x′i ≤ αxi and τ′i ≤ ατi for all i, and x′i = αxi and τ′i = ατi for all i ∈ I.
Let i ∈ I; then applying Lemma 2.2 to Λ and using the conditions on the xi, we get
    σi = τ′i + x′i − Σ_j pji x′j ≥ ατi + αxi − Σ_j pji αxj = ασi.
This can only hold if σi = 0 and x′j = αxj for all j with pji > 0. For such a j, we have either xj = 0 or j ∈ I1. But we cannot have xj = 0 for all these j, since then (20) implies that xi = τi = 0, contradicting the choice i ∈ I. Thus I1 ≠ ∅. It also follows that if we follow Λ, then we never start in I1 and never enter I1. But this is clearly impossible.
¤
2.7.2 Filling rule
This is the discrete version of the “filling scheme,” introduced by Chacon and Ornstein [11] and
shown by Baxter and Chacon [8] to minimize expected number of steps. We call it the filling rule
Φ = Φσ,τ and define it recursively as follows. Let pti be the probability of being at state i after t
steps (and thus not having stopped at a prior step); let qit be the probability of stopping at state i in
fewer than t steps. Then if we are at state i after step t, we stop with probability min(1, (τi −qit )/pti ).
Thus, Φ stops myopically as soon as it can without overshooting the target probability of its
current state. The probability of ever stopping at state i is clearly at most τi . We will see that Φ
is a finite stopping rule and thus it does in fact achieve τ when started at σ.
Let S^t denote the set of states i such that q_i^t < τi. Observe that S^1 ⊇ S^2 ⊇ · · ·. Let S^m be the last non-empty set in this sequence (it may be that S^m = S^{m+1} = · · · or S^{m+1} = ∅).
We claim that if we hit S^m we stop. Indeed, suppose that we hit a state j ∈ S^m at step t and don’t stop. This means that j will be filled after this step (else, we would stop with probability 1), and so j ∉ S^{t+1}. By the definition of S^m, this means that S^{t+1} is empty. But then every state is filled, which means that by then we must have stopped with probability Σ_i q_i^{t+1} = Σ_i τi = 1, so again we could not move on.
Since we hit S^m with probability 1, this proves that the filling rule is finite.
Let us note that the filling rule Φ can also be described by “deadlines” gi ≥ 0. For a given state i, we define gi as follows. Consider the largest t for which q_i^t < τi, and let
    gi = t + (τi − q_i^t)/p_i^t.
Since q_i^{t+1} = τi ≤ q_i^t + p_i^t, we have p_i^t > 0 and gi ≤ t + 1.
In terms of the deadlines gi , we can describe the filling rule as follows: roughly speaking, we
stop at state i if we hit it before its deadline, and move on if we hit it after the deadline. Close to
the deadline we have to be a bit more careful. To be exact, we stop if we hit it after t steps and
t ≤ gi − 1; we stop with probability gi − t if gi − 1 < t ≤ gi ; and we don’t stop if gi < t.
It is not hard to see that the deadlines are uniquely determined by σ and τ except for states
where we never stop, or states where we always stop. For these, we can take gi = 0 and gi = ∞,
respectively, to make the deadlines well-defined.
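Because the filling rule only needs the running quantities p_i^t and q_i^t, its stopping probabilities can be tabulated exactly by propagating distributions rather than by simulation. The sketch below is an assumed helper of ours (truncated at t_max steps, since the rule need not be bounded): at each time it stops as much mass as τ still allows.

    import numpy as np

    def filling_rule_table(P, sigma, tau, t_max=10000, eps=1e-12):
        """Return, for each time t, the vector of probabilities with which the
        filling rule stops at each state at time t, plus the total stopped mass.
        P is the transition matrix as a NumPy array."""
        q = np.zeros(len(sigma))            # q[i]: mass already stopped at i
        p = np.array(sigma, dtype=float)    # p[i]: mass at i among still-running walks
        stop_probs = []
        for t in range(t_max):
            stop = np.minimum(p, np.maximum(tau - q, 0.0))   # fill without overshooting
            q += stop
            p -= stop
            stop_probs.append(stop)
            if p.sum() < eps:               # everything has stopped
                break
            p = p @ P                       # remaining walks take one step
        return stop_probs, q

    # mean length of the rule: sum over t of t * stop_probs[t].sum()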
2.7.3 Threshold rule
A threshold rule is the antithesis of the filling rule: it keeps moving whenever possible without
overshooting the prescribed exit frequencies xi .
To be precise, we describe the threshold rule recursively as follows. Again, let pti be the probability of being at state i after t steps (and thus not having stopped at a prior step); let qit be the
probability of stopping at state i in fewer than t steps. Let xti be the expected number of exits from
state i, again in fewer than t steps. We’ll maintain that xti ≤ xi . Then if we are at state i after step
t, we continue with probability min{1, (xi − xti)/pti}.
It is clear that x_i^{t+1} ≤ xi remains valid. Let us also observe that q_j^t ≤ τj remains valid. Indeed, if we stop at j by time t at all, then we have x_j^{t+1} = xj; but then the expected number of times j is entered by time t + 1 is
    σj + Σ_i pij x_i^t ≤ σj + Σ_i pij xi = τj + xj = τj + x_j^{t+1},
and hence the expected number of times we stop there is at most τj.
However, Θ stops with probability 1 (since the sum of its exit frequencies is bounded) and does
achieve an ending distribution. Thus this distribution must be τ .
Another way to describe a threshold rule is dual to the “deadline” description of the filling rule. We have a “threshold” or “release time” 0 ≤ hi ≤ ∞ associated with each state i. The stopping rule is defined by
    Γ(w0, . . . , wk) = 0   if k ≥ h_{wk},
    Γ(w0, . . . , wk) = 1   if k ≤ h_{wk} − 1,
    Γ(w0, . . . , wk) = h_{wk} − k   otherwise.
Thus if we hit a state past the release time we stop; otherwise we continue, except that if we’re within 1, we use the fractional part to randomize.
For a given state i, we can define the threshold (or release time) hi as follows. Consider the largest t for which x_i^t < xi, and let
    hi = t + (xi − x_i^t)/p_i^t.
If x_i^t < xi for all t, then hi = ∞. If t is finite, then x_i^{t+1} = xi ≤ x_i^t + p_i^t, and hence p_i^t > 0 and hi ≤ t + 1.
The threshold vector may not be uniquely determined by a threshold rule Γ (e.g. all possible
thresholds hi smaller than the time before any possible walk reaches i are equivalent), but by
convention we always associate with Γ the vector each of whose coordinates is minimal. Then h is
finite if and only if Γ is bounded.
While the threshold rule may appear to be just a variation on the filling rule, it has many nice properties. In particular, it often turns out to be bounded in time, and this time bound is the least among all bounded rules.
We now show that the threshold rule is bounded whenever τ has sufficient support.
Theorem 2.7 Suppose there is no possible closed walk which passes only through states j with
τj = 0. Then every threshold rule producing τ is bounded.
Proof. Trivially every state j with τj > 0 has finite threshold (else, we could never stop there).
Let t be an upper bound on all finite thresholds. If any walk survives past time t + n, then between
time t and t + n it must have visited states with infinite threshold and, consequently, with τj = 0
only. But during this period, the walk must have traversed a closed walk, which contradicts the
assumption of the theorem.
¤
Assume that there is a closed walk through states with τj = 0. If σj > 0 for at least one state on
such a walk, then clearly every threshold rule is unbounded. Furthermore, for any σ, we can choose
xi satisfying (20) so that the corresponding threshold rule will be unbounded; it suffices to choose
the xi large enough (by adding a large multiple of π to x) so that we reach the bad closed walk
with positive probability. But some threshold rules may be bounded even in this case; for example,
a threshold rule from any σ to itself is trivially bounded.
2.7.4 Chain rule
Let ρ be a probability distribution on the subsets of the state space V , and observe that it provides
a stopping rule: “choose a subset U from ρ, and walk until some state in U is hit.” The naive rule
is of course a special case, with ρ concentrated on singletons.
Theorem 2.8 For every target distribution τ , there exists a unique distribution ρ which is concentrated on a chain of subsets and gives an optimal stopping rule for generating τ .
Proof. Starting the chain from distribution σ, let α^U be the distribution of the first node of U that is hit, for every nonempty subset U. For example, α^V = σ. Let U0 = V, µ^0 = τ. We define distinct nodes i1, . . . , in, non-negative numbers ρ1, . . . , ρn, and non-negative vectors µ^1, . . . , µ^n ∈ R^V, by induction. Assume that we have defined ij, ρj and µ^j for j ≤ k. Let U_{k+1} = V \ {i1, . . . , ik}. Select a node i_{k+1} which minimizes µ^k_i / α^{U_{k+1}}_i, and let
    ρ_{k+1} = µ^k_{i_{k+1}} / α^{U_{k+1}}_{i_{k+1}}    and    µ^{k+1}_i = µ^k_i − ρ_{k+1} α^{U_{k+1}}_i.
Note that ρ1 = τ_{i1}/σ_{i1}.
It follows by induction and by the choice of i_{k+1} that ρ_{k+1} ≥ 0, µ^{k+1} ≥ 0, and µ^{k+1}_i = 0 for i ∉ U_{k+2}. Moreover, we have
    Σ_{k=1}^{n} ρk = 1.
¤
We call this rule the “chain rule” and denote it by Ξ. A rather neat way to think of this rule is to assign a “price”, a real value r(i) = Σ{ρ_U : i ∉ U}, to each state i. The rule is then implemented by choosing a random real “budget” b uniformly from [0, 1] and walking until a state j with r(j) ≤ b (an item that we can buy) is reached.
2.8 Halting states and optimal rules
Any state j for which xj = 0 is called a halting state. By definition we stop immediately if and when
any halting state is entered. (Of course we may stop in other states too, just not all the time.)
As an example, suppose we take a random walk starting at vertex 3 of the path 1—2—3—4—5,
with target distribution τ = (1/4, 0, 1/2, 0, 1/4). There is a stopping rule here that is both mean-optimal and max-optimal, namely “take two steps and quit.” In contrast, the filling rule stops dead
at the starting point half the time, otherwise it walks until it reaches an endpoint (which takes 4
steps on the average). It thus achieves the same mean length 2, but unbounded maximum length.
For both rules the endpoints are halting states. Note that the naive rule Ω3,τ is worse, choosing its
stopping vertex beforehand to get mean length H(3, 1)/4 + H(3, 5)/4 = 6.
The following theorem gives an extremely useful characterization of optimality.
Theorem 2.9 A stopping rule Γ is mean-optimal for starting distribution σ and target distribution
τ if and only if it has a halting state.
Proof. Since xj cannot be negative, it is a trivial corollary of Theorem 2.3 that if some xj = 0
then Γ is mean-optimal. The converse is less obvious: we must provide, for any two distributions σ
and τ , a stopping rule that has a halting state. We can use any of the four rules described in Section
2.7 for this purpose. We need the observation that we can find xi such that (20) is satisfied and
mini xi = 0 (just subtract an appropriate multiple of π from the exit frequencies of any stopping
rule from σ to τ ). Then
1. — the local rule will halt in every state with xj = 0, by definition;
2. — the filling rule will halt in every state of the last non-empty S t , as discussed in the construction;
3. — the threshold rule will halt if xj = 0 since, as used in the construction, its exit frequencies
are bounded by the xj ;
4. — the chain rule will halt in every state of the least set in the chain of non-empty sets on
which ρ is concentrated.
¤
Corollary 2.10 Under the condition that mini xi = 0, each of the four rules described in Section
2.7 is optimal.
Let us apply Theorem 2.9 to our initial examples. Our stopping rule for Example 1.1 (Cycle)
has no halting state (except for n = 2) and is therefore not optimal; in fact we will see that
it is asymptotically six times as long as necessary for generating the uniform distribution among
non-starting states.
Our rule for Example 1.2 (Cube) must stop when the antipodal point is hit, since then all
directions must have been considered. Hence it is optimal.
For Example 1.3 (Card Shuffling) the permutations with least exit frequency are those with
the initial bottom card at the top; each of these is equally likely to be exited on the last step of
the chain, but at no other time, thus achieving exit frequency exactly 1/(n − 1)!. The stationary
distribution is uniform by symmetry, thus to produce a halting state each exit frequency must be
reduced by 1/(n − 1)! = nπi . The mean stopping time is the sum of the exit frequencies which, for
our stopping rule, is thus exactly n shuffles more than optimal.
Theorem 2.11 For all distributions σ and τ ,
    H(σ, τ) = max_j (H(σ, j) − H(τ, j)).
This maximum is achieved precisely when j is a halting state of some optimal rule which stops at τ
when started at σ.
Proof. The inequality H(σ, τ ) ≥ H(σ, j) − H(τ, j) is a special case of the triangle inequality (4).
On the other hand, if j is a halting state for (σ, τ ) then it costs nothing to pass through τ on the
way from σ to j, i.e. H(σ, j) = H(σ, τ ) + H(τ, j).
¤
Of course Theorem 2.11 implies that for fixed M , σ and τ any two optimal stopping rules have
the same halting states, namely those states j which maximize H(σ, j) − H(τ, j).
Since H(σ, j) is linear in σ, we can write the formula in the theorem as
    H(σ, τ) = max_j Σ_i (σi − τi) H(i, j).
As an immediate corollary we obtain the following useful fact:
Corollary 2.12 Let α, β, γ, δ be four distributions such that α − β = c(γ − δ). Then
H(α, β) = cH(γ, δ).
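Theorems 2.4 and 2.11 turn access times and optimal exit frequencies into simple array computations once the hitting-time matrix H is available (e.g. from the linear-system sketch given after identity (1)). Below is a sketch, with the complete graph K3 as an illustrative check; the function names are our own.

    import numpy as np

    def access_time(H, sigma, tau):
        """Theorem 2.11: H(sigma, tau) = max_j sum_i (sigma_i - tau_i) H(i, j)."""
        return float(np.max((sigma - tau) @ H))

    def scaled_exit_frequencies(H, sigma, tau):
        """Theorem 2.4: y_k(sigma, tau) = H(tau, k) - H(sigma, k) + H(sigma, tau)."""
        return tau @ H - sigma @ H + access_time(H, sigma, tau)

    H3 = 2 * (np.ones((3, 3)) - np.eye(3))          # hitting times of the walk on K3
    sigma, pi = np.array([1.0, 0.0, 0.0]), np.full(3, 1/3)
    print(access_time(H3, sigma, pi))               # 2/3, cf. the K3 example in Section 2.9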
Theorem 2.11 gives an explicit description of the “asymmetric metric” H(σ, τ). Let H denote the matrix of hitting times: Hij = H(i, j). Then H defines an affine map of the simplex with the unit vectors as vertices (the space of state-distributions) onto some other simplex S. Using the Random Target Lemma (2), we find that S lies in the hyperplane Σ_i πi xi = N. Furthermore, (5) implies that (σH)i = H(σ, i) for each distribution σ, regarding σ as a row vector.
For x, y in this hyperplane, we can define an asymmetric “one-sided maximum norm” by ⟪x, y⟫ := max_i (xi − yi). Since Σ_i πi(xi − yi) = 0, we have ⟪x, y⟫ > 0 for x ≠ y. This “norm” also satisfies the triangle inequality, and from Theorem 2.11 we have
    H(σ, τ) = ⟪σH, τH⟫.
Another way of stating Theorem 2.9 is the following:
    H(σ, τ) = max_j (−zj(σ, τ)).    (21)
Next we discuss the special case of time-reversible chains. In this case we can use the cycle-reversing identity (3) to obtain the following formula for the exit frequencies of the naive rule:
    x̃k = πk Σ_{i,j} σi τj (H(j, i) + H(k, j) − H(k, i)) = πk (Σ_{i,j} σi τj H(j, i) + Σ_i (τi − σi) H(k, i)),
and so, proceeding as above, the exit frequencies of a mean-optimal rule from σ to τ are
    xk(σ, τ) = πk (Σ_i (τi − σi) H(k, i) − min_j Σ_i (τi − σi) H(j, i))
and
    H(σ, τ) = Σ_k xk(σ, τ) = Σ_i (τi − σi) H(π, i) − min_j Σ_i (τi − σi) H(j, i).
In the case when τ = π and σ is concentrated at a single state i (which is perhaps the most common in applications of Markov chain techniques to sampling), this formula can be simplified using the Random Target Lemma (2) to get that any mean-optimal rule from i to π has exit frequencies
    xk = πk (max_j H(j, i) − H(k, i))    (22)
and
    H(i, π) = max_j H(j, i) − H(π, i).    (23)
We have thus identified the halting state of a reversible chain, in attaining the stationary distribution
from a fixed state i, as the state j from which the mean time to hit i is greatest. This seems slightly
perverse in that we are interested in getting from i to j, not the other way ’round!
Time-reversibility is also reflected by the exit discrepancies in a very natural way:
Lemma 2.13 The Markov chain is reversible if and only if Rik = Rki for all i, k ∈ S.
Proof. By (22) and (23), we have
Rik = H(π, k) − H(i, k).
On the other hand, (11) implies
Rki = H(π, k) − H(i, k) ,
which proves the lemma.
¤
As an application, we prove a curious fact. Call a state s pessimal if it maximizes H(s, π), i.e.,
if H(s, π) = H.
Corollary 2.14 Let i be a pessimal state of a time-reversible Markov chain and let j be a halting
state from i to π. Then j is also pessimal, and i is a halting state for H(j, π). In particular, every
time-reversible chain with at least two states has at least two pessimal states.
Proof. By Lemma 2.13,
Rij = Rji.
Using that xj (i, π) = 0, we can write this equation as
−H(i, π) = yi (j, π) − H(j, π).
This implies that H(i, π) ≤ H(j, π). Since i is pessimal, this implies that we must have H(j, π) =
H(i, π) and yi (j, π) = 0.
¤
Theorem 2.15 The threshold rule Θσ,τ is max-optimal.
Proof. It is perhaps already plausible to the reader that by getting all its exiting “over with” as
soon as possible, Θσ,τ achieves the least maximum number of steps.
Let Γt (i) denote the probability that a stopping rule Γ continues at time t given that the chain
has reached state i at time t (and thus has not yet been stopped). Two stopping rules Γ1 and Γ2
are said to be equivalent if Γt1 (i) = Γt2 (i) for all i and t such that both conditional probabilities are
defined. In that case σ Γ1 = σ Γ2 , EΓ1 = EΓ2 and max Γ1 = max Γ2 , so the rules behave similarly
for our purposes. It is easy to see that every stopping rule Γ is equivalent to a unique balanced rule
Γ∗ whose continuation probability depends only on the current state and the time; Γ∗ is defined
simply by Γ∗ (w0 , w1 , . . . , wt ) = Γt (wt ).
To prove the theorem, consider any balanced, bounded mean-optimal stopping rule Γ which
generates τ , and suppose there are times s < t and a state j such that Γs (j) < 1 and Γt (j) > 0. Let
p and q be the (positive) probabilities of being at state j at times s and t, respectively, under Γ,
and put ε = min{p(1 − Γs(j)), qΓt(j)}. Then increasing Γs(j) by ε/p and decreasing Γt(j) by ε/q
produces a new stopping rule θΓ with the same exit frequencies as Γ but with either θΓs (j) = 1 or
θΓt (j) = 0; and max(θΓ) ≤ max(Γ).
Finitely many repeated applications of the operator θ produce a threshold rule Θ with max(Θ) ≤
max(Γ). Since Θ has the same exit frequencies as Γ it also generates τ , concluding the proof of the
theorem.
¤
We conclude this section by computing the exit frequencies in some specific cases.
Example 2.16 (Path) Consider first the classic case of a random walk on the path of length n,
with nodes labeled 0, 1, . . . , n. We begin at 0, with the object of terminating at the stationary
distribution.
The hitting times from endpoints are H(0, j) = H(n, n − j) = j² and the stationary distribution is
    π = (1/(2n), 1/n, 1/n, . . . , 1/n, 1/(2n))
since for any random walk on a graph πi is proportional to the degree of the node i. Owing to the special topology of the path, the filling rule and the chain rule are equivalent; moreover since we begin at an endpoint both rules are equivalent to the naive stopping rule Ω, for which node n is a halting state. From (22) we have
    xk = πk (H(n, 0) − H(k, 0)) = πk H(n, k) = πk (n − k)²
and
    H = H(0, π) = Σ_{k=0}^{n} xk = (1/(2n)) n² + (1/n) Σ_{k=1}^{n−1} (n − k)² = n²/3 + 1/6.
It follows that for a cycle of length n, n even,
    H = n²/12 + 1/6,
as compared with expected time
    ((n − 1)/n) · (n(n − 1)/2) = (n − 1)²/2
for staying at 0 with probability 1/n else walking until the last new vertex is hit as in Example 1.1.
Example 2.17 (Winning Streak) The following Markov process, sometimes called the “winning
streak” chain, will be quite useful later on. Let S = {0, 1, . . . , n − 1}, 0 < c < 1, and define the
transition probabilities by
    pij = c   if j = i + 1,
    pij = 1 − c   if j = 0 and 0 ≤ i ≤ n − 2,
    pij = 1   if j = 0 and i = n − 1,
    pij = 0   otherwise.
It is easy to check that the stationary distribution is
    πi = c^i (1 − c)/(1 − c^n).
Hence the exit frequencies xi for an optimal rule from 0 to π can be determined using Lemma 2.2, working backwards from i = n − 1, n − 2, . . . , obtaining
    xi = (n − i − 1) c^i (1 − c)/(1 − c^n).
Summing over all states, we get
    H(0, π) = (n − 1 − nc + c^n)/((1 − c)(1 − c^n)) = n + 1/(c − 1) + O(n c^n)
(if c is fixed and n → ∞).
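The winning-streak formulas are easy to check numerically; the sketch below (with illustrative parameters n = 10, c = 1/2, chosen by us) verifies the conservation equation of Lemma 2.2 for the stated exit frequencies and re-derives H(0, π) by summing them.

    import numpy as np

    n, c = 10, 0.5
    P = np.zeros((n, n))
    for i in range(n - 1):
        P[i, i + 1], P[i, 0] = c, 1 - c
    P[n - 1, 0] = 1.0
    pi = c**np.arange(n) * (1 - c) / (1 - c**n)            # stationary distribution
    x = (n - 1 - np.arange(n)) * pi                        # exit frequencies from 0 to pi
    print(np.allclose(P.T @ x - x, pi - np.eye(n)[0]))     # Lemma 2.2: True
    print(x.sum(), (n - 1 - n*c + c**n) / ((1 - c)*(1 - c**n)))   # H(0, pi), two ways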
2.9 *Strong stopping rules
A stopping rule Γ is said to be strong if for any time t with P(Γ = t) > 0, the conditional distribution
{σ t |Γ = t} is the same as σ Γ . In other words, the final state v Γ is independent of the number of
steps Γ.
A balanced strong stopping rule Ψ is determined completely by the numbers q t , defined as the
probability that the chain is stopped by Ψ precisely at time t. To see that this is so let q t (i) be the
conditional probability of stopping at time t given that the chain has reached time t and is in state
i; and let σit (Ψ) be the probability that the chain has survived to time t and enters state i at that
time. For any t with q t > 0, the state distribution given that the chain stops at that point is the
target distribution τ . Hence we must have q t (i)σit (Ψ) = q t τi for each state i. It follows that q t (i) is
determined for the first t for which q t > 0, and similarly for all subsequent such t.
Assuming that the chain is ergodic, there will always be strong stopping rules; in fact we can
stop the chain at any point we wish, as long as it has positive probability of being in any state i for
which τi > 0, with any probability q t up to mini (σit (Ψ)/τi ). In fact, suppose that the chain achieves
at least (1/2)π after k steps from any state, and let c = (1/2) min_i (πi/τi); then Ψ can stop the chain every
k steps with probability at least c times the probability it is still alive. Hence EΨ ≤ k/c.
If the chain is periodic, with period k, let Sj be the set of states accessible at times t ≡ j (mod
k) starting from state 1. Then for a strong stopping rule to exist it is necessary and sufficient that
for some fixed integer m and real r,
    Σ_{i∈S_{j+m}} τi ∈ { 0, r · Σ_{i∈S_j} σi }
for all j = 0, . . . , k − 1, with indices taken modulo k.
The “greedy” strong stopping rule stops as soon as it can and with maximum probability. This
cannot cost if the target distribution is π; for, suppose Ψ is not greedy and let t be the first time
that q t falls short of mini (σit (Ψ)/πi ). Then there is a positive portion of π within σ t which we may
think of as a new chain starting, and thus remaining, in state π. But, obviously, the new chain must
be stopped immediately with probability 1 to minimize length.
As an example, consider a random walk on the graph K3 , starting at vertex 1, and aiming for
the (uniform) stationary distribution. The greedy strong stopping rule takes two steps and stops
unless the walk has returned to vertex 1, in which case it stops with probability 1/2 and otherwise
repeats the experiment. Thus its expected length is 2/(1 − 14 ) = 8/3, which is strictly greater than
the access time H(1, π) = 2/3.
From the above analysis it is clear that for a strong stopping rule Ψ to be mean-optimal over
all rules, it must be the greedy strong stopping rule and moreover at every time t, the states i
minimizing σit (Ψ)/πi must include all of the halting states. If in the K3 example we add loops to
the vertices, so that the probability of remaining at the same vertex in a given step is some fixed
p ≥ 13 , then the two halting states remain underdogs to vertex 1, and the greedy strong stopping
rule is optimal.
Let us say that a state i minimizing σ_i^t(Ψ)/τi is “hot” at time t. Notice that when a strong rule
(or no rule at all) is in effect and the target is π, the list Lt of “hot” states at time t is independent of
Ψ. The reason is that if st (Ψ) is the probability that the chain has been stopped prior to time t then
σit (Ψ) = σit − st (Ψ)πi , since the stationary distribution persists. Thus σit (Ψ)/πi = σit /πi − st (Ψ),
affecting all states equally. Putting these various facts together, we have:
Theorem 2.18 For any Markov chain, and any initial distribution σ and target distribution τ , the
greedy strong stopping rule is the only one (up to equivalence) which can be mean-optimal among
all stopping rules. If τ = π then the greedy strong stopping rule is mean-optimal if and only if Lt
contains the halting states for all t.
2.10 *Matrix formulas
We describe some useful formulas connecting the transition matrix M with the matrix H whose
entry in position (i, j) is the hitting time H(i, j). So in particular the diagonal of H is 0. We denote
by R the diagonal matrix with the return time 1/πi in the i-th position of the diagonal.
It is easy to see that these matrices satisfy the equation
(I − M )H = J − R.
(24)
Unfortunately, the matrix I − M is singular, and so (24) does not uniquely determine H. But the
only left eigenvector of I − M with eigenvalue 0 is π T , and the only right eigenvector with eigenvalue
0 is 1, hence I − M + 1π T is non-singular. Moreover, we have
(I − M + 1π T )(I − 1π T ) = I − M
(using that M1 = 1 and π^T M = π^T). Hence
(I − M + 1π T )(I − 1π T )H = (I − M )H = J − R
and thus
H = (I − M + 1π T )−1 (J − R) + 1π T H
(25)
For convenience, let us denote the matrix (I − M + 1π T )−1 (J − R) = J − (I − M + 1π T )−1 R by G;
this matrix turns out to carry a lot of information about combinatorial properties of the random
walk. It is not difficult to see that for time-reversible walks, G is a symmetric matrix.
Equation (25) is still not the right formula for H, since H also occurs on the right hand side.
But we can determine π T H by looking at the diagonal: 0 = Gii + (π T H)i and hence
(π T H)i = −Gii .
So the negative of the i-th diagonal entry of G gives the expected time H(π, i) needed to hit state
i, starting from the stationary distribution. But then we can express H purely from the matrix G. In
fact, (25) implies that
Hij = Gij − Gjj .
Now consider the mixing time. Using Theorem 2.11, we get
H(s, π) = max_t (H(s, t) − H(π, t)) = max_t (Gst − Gtt + Gtt) = max_t Gst.
Hence the mixing time is the largest entry of G.
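These matrix formulas are straightforward to implement. The sketch below computes G, recovers the hitting-time matrix from Hij = Gij − Gjj, and reads off the mixing time as the largest entry of G; the helper name and the choice of the path of Example 2.16 as test chain are ours.

    import numpy as np

    def mixing_time_via_G(M, pi):
        """G = (I - M + 1 pi^T)^{-1} (J - R); H_ij = G_ij - G_jj; mixing time = max entry of G."""
        n = M.shape[0]
        J, R = np.ones((n, n)), np.diag(1.0 / pi)
        G = np.linalg.solve(np.eye(n) - M + np.outer(np.ones(n), pi), J - R)
        return G - np.diag(G), G.max()

    n = 6                                                   # reflecting walk on the path 0,...,6
    M = np.zeros((n + 1, n + 1)); M[0, 1] = M[n, n - 1] = 1.0
    for i in range(1, n):
        M[i, i - 1] = M[i, i + 1] = 0.5
    pi = np.array([0.5] + [1.0] * (n - 1) + [0.5]) / n
    H, mix = mixing_time_via_G(M, pi)
    print(H[0, n], mix)                                     # n^2 = 36 and n^2/3 + 1/6 = 12.166...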
2.11 *Comparison of stopping rules
Each of the stopping rules can be implemented in time polynomial in the number of states. In
fact, the constructions of the rules as described in Section 2.7 yield polynomial-time algorithms to compute the exit frequencies, deadlines, release times, and prices, and these quantities are enough
to implement the stopping rules. (Unfortunately, this is not good enough in a typical application
of these techniques to sampling, where the size of the state space is exponential.)
The chain rule Ξ shares with the filling rule Φ the “now-or-never” property that once a state is
exited, it can never be the state at which the rule stops. In fact, once Ξ exits state ik it can never
stop at any ij for j ≤ k in the notation of the proof.
Note that the chain rule is not generally balanced; for example, we cannot stop at state j if some
state i with r(i) ≤ r(j) has already been hit.
The fact that this rule is optimal follows from Theorem 2.9 since the state in is never exited.
The uniqueness of ρ follows easily by induction on k.
It is interesting that this result defines an ordering (with ties—technically, a “preorder”) of
the states for every σ and τ , starting with the state having the largest surplus (σi /τi maximum),
and ending with the halting state. (This ordering is in general different from the ordering by exit
frequencies, the ordering implicit in the filling rule, or the ordering by thresholds defined in the next
section.)
The continuous-time versions of both Φ and Θ enjoy more elegant threshold descriptions: in the
first case we stop whenever we transfer to a state ahead of its threshold, in the second whenever a
threshold is reached while we sit in the corresponding state. In both cases the thresholds themselves
are uniquely defined.
2.12 *Continuous time and space: an example
Although we have not presented statements or proofs of our results either for continuous time or for
continuous state spaces, the temptation to examine our repertoire of stopping rules in the context
of Brownian motion is too difficult to resist. Let n −→ ∞ in Example 2.16, the path of length
n, while time contracts at rate n2 . Then in the limit we have Brownian motion B(t) in the unit
interval [0, 1] with reflecting barriers, and hitting time normalized to H(0, 1) = 1. Suppose that we
again wish to start at 0 and stop at the stationary (uniform) distribution.
Skorokhod [28], Dubins [15] and Root [27] have proposed different rules for stopping B(t) at a
given distribution. All three rules are optimal, running in expected time 1/3 in our case. Skorokhod’s
rule is the filling rule (or equivalently, the chain rule) applied to Brownian motion; here it amounts
merely to choosing u uniformly from [0, 1] and stopping as soon as B(t) = u.
Dubin’s is a strong stopping rule, operating in our case as follows. Define a sequence U1 , U2 , . . .
of state-sets by
n 1 + 2j
o
Ui :=
: j = 1, . . . , 2i−1 − 1 .
i
2
Put T1 := inf{t : B(t) ∈ U1 }, Ti := inf{t > Ti−1 : B(t) ∈ Ui }. When we stop at T = limi→∞ Ti ,
the bits of the binary expansion of B(T ) will be fair and independent, hence B(T ) is uniformly
distributed. (Note that there is no strong stopping rule, optimal or otherwise, for the discrete path.
To get one we would need to append loops to the nodes or to operate in continuous time.)
Root’s is a threshold rule: there is a lovely, smooth function h on the unit interval with h(1) = 0
such that B(inf{t : h(B(t)) = t}) is uniform. Unfortunately, no one seems to have an explicit
description of h.
To these three we can add a local rule. The continuous analog of exit frequency is given by
x(u) = (1 − u)2 , thus we can attain the uniform distribution by stopping B(t) between T and
T + dt, given that we have not stopped before, with probability
dt
.
1 + (1 − B(t))2
From limiting forms of our results it follows that each of these stopping rules is unique in the
appropriate context: for example, Skorokhod’s is the only optimal chain rule, Dubin’s the only
optimal strong rule, Root’s the only rule which minimizes the maximum stopping time, and ours
the only optimal rule whose infinitesimal stopping probability does not depend on time.
3
3.1
*Time reversal
Reverse chains
Given a Markov chain with transition probabilities pij , we define the reverse chain as the Markov
−
chain on the same set of states, with transition probabilities ←
p ij = πj pji /πi . We will generally use
the reverse arrow over a symbol to indicate that it refers to this reverse chain.
We start with the trivial but important remark that the reverse chain has the same stationary
distribution as the original. This follows by straightforward substitution.
The key to the proof of many properties of the reverse chain is the following general “duality
formula.”
Lemma 3.1 Let α, β, γ, δ be four distributions on the states. Then
X
X
−
(βi − αi )←
y i (γ, δ) =
(δi − γi )yi (α, β).
i
i
Proof. Let v0 , v1 , . . . be a walk in the forward chain started from α, and consider the random
variables
−
−
Yt = ←
y vt+1 (γ, δ) − ←
y vt (γ, δ) + (γvt − δvt )/πvt .
20
From the conservation equation (Lemma 2.2) for exit frequencies applied to the reverse chain, we
have
1 X←
←
−
−
− (γ, δ)
y i (γ, δ) − (γi + δi )/πi =
p ji ←
x
j
πi j
=
X pij
j
πj
←
− (γ, δ) =
x
j
X
−
pij ←
y j (δ, γ) ,
j
and therefore
E(Yt |vt = i) =
X
−
pij ←
y j (γ, δ) −
X
j
−
pij ←
y j (γ, δ) = 0
j
for each i, thus EYt = 0 a priori. In particular,
PT−1 if T is the number of steps taken by some optimal
stopping rule from α to β, then the sum t=0 Yt has zero expectation; but
T−1
X
−
−
Yt = ←
y vT (γ, δ) − ←
y v0 (γ, δ) +
t=0
T−1
Xµ
i
γi − δi
πi
¶
Xi
where Xi is the number of times state i is exited. Taking expectation, the lemma follows.
¤
We derive some corollaries of this lemma.
Corollary 3.2 For every state j, we have
←
−
H(π, j) = H(π, j).
←
−
Hence also N = N .
Proof. Choose α = γ = π and β = δ = j in Lemma 3.1.
¤
Corollary 3.3 For every Markov chain, the mixing time and reverse mixing time are the same:
←
−
H = H.
Moreover, if i is a pessimal state for the first chain and j is a halting state from i to π, then j is a
pessimal state for the reverse chain and i is a halting state from j to π in the dual chain.
Proof. Let i be a pessimal starting point for the primal chain and let j be a halting state from i
to π. Apply the lemma with α = i, γ = j, and β = δ = π. Then we get
←
−
−
H(j, π) − ←
y i (j, π) = H(i, π) − yj (i, π) = H(i, π).
Hence
←
− ←
−
H ≥ H(j, π) ≥ H(i, π) = H.
The reverse inequality follows similarly.
¤
Corollary 3.4 For every two states i and j, we have
←
−
H(i, j) = H(j, i) − H(π, i) + H(π, j).
21
A consequence worth mentioning is that if the Markov chain has a state-transitive automorphism
←
−
group, then H(i, j) = H(j, i).
Proof. Apply the lemma with α = i, β = π, γ = j and δ = i. Then we get
←
−
H(i, j) = yj (j, π) − yi (j, π).
(26)
Now Theorem 2.4 implies the assertion.
¤
←
−
Corollary 3.5 Let i be any state. A state j is a halting state from i to π if and only if H(j, i) ≥
←
−
H(u, i) for every state k. In this case,
←
−
H(j, i) = H(i, π) + H(π, i)
Proof. By (26),
←
−
H(k, i) = yi (i, π) − yk (i, π).
It follows that the left hand side is maximized when k is a halting state j from i to π. Note that
then
←
−
H(j, i) = yi (i, π).
(27)
Now apply the lemma with α = δ = i and β = γ = π. Then we get
←
−
H(π, i) = yi (i, π) − H(i, π).
Combining with Corollary 3.2 and (27), we get the second statement of the corollary.
¤
Remark. Several results about reversible Markov chains can be extended by using the notion of
reverse chain. For example, the “cycle reversing” identity of [13] can be generalized (with a virtually
identical proof) as
←
−
←
−
←
−
H(i, j) + H(j, k) + H(k, i) = H(i, k) + H(k, j) + H(j, i)
for any three states i, j and k. (From here we could get another way of deriving, among others,
Corollary 3.4 above.)
All the linear algebra formulas also relate nicely to the reverse chain. Clearly this has transition
←
−
←
−
matrix M = RM T R−1 , and simple substitution gives G = GT . We could use this fact to derive the
results of the previous section in an algebraic way.
3.2
Forget time and reset time
We begin with a surprising identity: that the forget time of a chain is equal to the reset time of its
reverse.
←
−
←
−
Theorem 3.6 For every finite Markov chain, F = I and I = F . In particular, if the Markov
chain is time-reversible, then F = I.
First, we describe a reformulation of the forget time F as the optimum value of a linear program,
and derive a few consequences. Let us rewrite the definition, using Theorem 2.11:
F = min max H(s, τ ) = min max max(H(s, j) − H(τ, j)) = min max(H(j 0 , j) − H(τ, j)) ,
τ
s
τ
s
τ
j
22
j
where j 0 is an “antipode” of j, i.e., a state for which H(j 0 , j) is maximal among all values H(k, j).
We can write this expression as a linear program:
F=
minimize
subject to
t
τi ≥ 0
P
i = 1,
iτ
P
t + i τi H(i, j) ≥ H(j 0 , j)
(i ∈ V ),
(28)
(j ∈ V ).
Let us formulate the dual program; we have a variable r for the equation and a variable ρj for each
j ∈ V , and then the dual program is:
P
F = maximize
r + j ρj H(j 0 , j)
subject to P
ρj ≥ 0
(j ∈ V ),
(29)
ρ
=
1
j
j P
r + j ρj H(i, j) ≤ 0
(i ∈ V ).
So ρ is a probability distribution; moreover, the best value of r is clearly
X
r = min −
ρj H(i, j) ,
i
(30)
j
and so
F = max min
ρ
i
X
¡
¢
ρj H(j 0 , j) − H(i, j) .
(31)
j
Any probability distribution ρ gives a lower bound on the forget time. An attractive choice is ρ = π,
in which case we can use the Random Target Lemma (2) to see that r = N and the minimum in
(30) is attained for all i. Hence
X
F≥
πj (H(j 0 , j) − H(π, j)).
(32)
j
The key result here is that equality holds in (32).
Theorem 3.7 The forget time of any Markov chain can be computed by
X
F=
πj (H(j 0 , j) − H(π, j)).
j
Moreover, the forget distribution ϕ is uniquely determined and is given by the formula
¡
¢ X
¡
¢
ϕj = πj 1 − H(j 0 , j) + H(π, j) +
pmj πm H(m0 , m) − H(π, m) .
(33)
m
Remark: Using the reverse chain and Corollary 3.5, we can write these formulas in the following
neater forms:
X ←
−
F=
πj H(j, π) ,
j
and
Ã
ϕ j = πj
!
X
←
−
←
−
←
−
p jm H(m, π) .
1 − H(j, π) +
m
In particular, Theorem 3.7 will imply Theorem 3.6.
23
(34)
Proof. It suffices to show that ρ = π and r = −N form an optimal solution of (29). For this, it
suffices to exhibit a solution (ϕ, t) of the primal such that the complementary slackness conditions
hold:
X
ϕi > 0 ⇒ −N +
πj H(i, j) = 0,
(35)
j
and
πj > 0
⇒
t+
X
ϕi H(i, j) = H(j 0 , j).
(36)
i
P
We choose t = j πj (H(j 0 , j) − H(π, j)) (recall that we want equality in (32)). To choose the right
ϕ, observe that the first set of conditions is fulfilled, on account of the Random Target Lemma (2),
independently of the choice of ϕ. Since π > 0, the second set applies for every j. This
P gives n = |V |
linear equations on ϕ (and we have one more from the original system, namely i ϕi = 1)). We
show that this (seemingly overdetermined) system has a unique solution, which is non-negative, and
hence is a solution of the linear program (28). This will prove the theorem.
Let hj = H(j 0 , j). Then (36) can be re-written as
H T ϕ = h − t1.
(37)
and we also must have
1T ϕ = 1.
(38)
We know from (25) that
H = J − (I − M + 1π T )−1 R + 1π T H ,
and hence
H T ϕ = Jϕ + R(I − M T + π1T )−1 ϕ + H T π1T ϕ = 1 − R(I − M T + π1T )−1 ϕ + H T π.
Equating the two expressions for H T µ and rearranging, we can express ϕ:
¡
ϕ = (I − M T + π1T )R−1 (1 + t)1 + H T π − h) = π + (I − M T )R−1 (H T π − h).
(39)
Sustitution shows that this µ satisfies (37) and (38). The fact that it is non-negative follows by
noticing that (39) is equivalent to equation (33), which in turn is equivalent to (34), where the
non-negativity of the right hand side is trivial:
X
←
−
←
−
←
−
H(j, π) ≤ 1 +
p jm H(m, π) ,
m
since making one step from i and then following an optimal rule to π is a (not necessarily optimal)
stopping rule from j to π.
Thus we have a feasible solution of (28), which satisfies, together with (r, π), the complementary
slackness conditions. Our argument also proves the uniqueness of the optimizing distribution, as
well as the optimality of π as a dual solution.
¤
3.3
Optimal and pessimal starting states
We conclude with some further links between the forget time and mixing-optimal and mixingpessimal states.
Lemma 3.8 The exit frequencies from the stationary distribution π to the forget distribution ϕ,
and vice versa, can be expressed by the formulas
←
−
←
−
yi (π, ϕ) = H(i, π) − min H(s, π)
s∈V
and
←
−
yi (ϕ, π) = H − H(i, π).
24
Proof. By Theorem 33, we have
X
←
−
←
−
←
−
ϕ i = πi 1 +
p ij H(j, π) − H(i, π)
j
and hence
X
←
−
←
−
pji πj H(j, π) − πi H(i, π) = ϕi − πi .
j
←
−
This equation means the numbers yi = H(i, π) are the scaled exit frequencies of some (not neces←
−
←
−
sarily optimal) stopping rule from π to ϕ, and so the numbers H(i, π) − mins H(s, π) are the exit
frequencies of an optimal stopping rule. The second equation follows similarly.
¤
The previous lemma has the following corollaries.
Corollary 3.9 A state j is a halting state from π to ϕ if and only if it is a mixing-optimal state
for the reverse chain. Moreover, for the reverse mixing time from j we have
←
−
min H(s, π) = F − H(π, ϕ).
s
Corollary 3.10 A state j is a halting state from ϕ to π if and only if it is a mixing-pessimal state
for the reverse chain. Moreover, for the mixing time we have
H = F + H(ϕ, π).
So from a pessimal point, a best stopping rule to get to π is to follow an optimum stopping rule
to the forget distribution ϕ, and then follow an optimum rule from there. It also follows that
Corollary 3.11 Every mixing-pessimal state is forget-pessimal.
We conclude with a discussion of our second introductory example, the winning streak chain.
The stopping rules to π exhibited for the reverse chain are optimal, since each has a halting state.
Hence
←
−
H = H = n − 1.
To compute the forget time, we use Theorem 3.7. It can easily be checked that H(0, i) = 2i+1 − 2
for all i and H(j, i) = 2i+1 for all i < j ≤ n − 1. It is clear that the state i0 which maximizes H(i0 , i)
is i + 1 (mod n). Using the Random Target Lemma we have that for any state i,
N =
n−1
X
πj H(i, j) =
j=0
n−1
X
πj H(0, j) = n − 1
j=0
and
n−1
X
πj H(j 0 , j) = n + 1 − 2−(n−2) ,
j=0
so that by Theorem 3.7, the forget time is
F = 2 − 2−(n−2) .
←
−
This is indeed the value for the reverse reset time I calculated in the introduction. The forget
distribution ϕ is given by
−(n−1)
, if i = 0,
1 − 2
−(n−1)
ϕi = 2
,
if i = n − 1,
0,
otherwise.
25
In fact, H(i, ϕ) = 2 − 2−(n−2) for all i (walk until you hit 0 or make n − 1 steps, whichever comes
first).
For the reverse forget time, a more tedious calculation shows that
←
−
n−k
1
F =n−k+1−
− n−1 ,
2k
2
where k is the largest integer with 2k ≤ n − k + 1 (so k ≈ log2 n; H(i, π) behaves differently for
i < k and i ≥ k). This value is strictly smaller than the mixing time H, but still asymptotically n.
3.4
4
4.1
Exit frequency matrices
Blind rules
*Exact blind rules
We now turn temporarily to a type of stopping rule which is not generally capable of achieving
arbitrary target distributions, and is almost never optimal. We call a stopping rule Γ blind if it is
balanced and Γt (i) depends only on t. The simplest blind stopping rule is the stopping rule used
most often: “stop after t steps.” Several other practical methods to generate elements from the
stationary distribution (approximately) can also be viewed as blind rules. For example, the method
(to be discussed below) of walking u steps and then choosing one of the exited points uniformly
can be viewed as a blind rule: after t steps, we stop with probability 1/(u − t). Lazy random walks
have been considered (see Lovász and Simonovits [20]) because they have better convergence to the
stationary distribution; the lazy version of a Markov chain is obtained by flipping a coin before each
move and staying where we are if we see “heads.” Stopping the lazy version of a Markov chain after
s steps is equivalent to following the original walk (w0 , w1 , . . . , w¡u ¢) for u steps and then choosing a
wt according to the binomial distribution, i.e. with probability ut 2−u . This is again equivalent to
a blind rule, where we stop after t steps with probability
µ ¶ Áµµ ¶ µ
¶
µ ¶¶
u
u
u
u
+
+ ··· +
.
t
t
t+1
u
It is often more convenient to describe a blind stopping rule by the probabilities at that state
vt is selected (not conditioning on notP
having stopped before). Trivially at ≥ 0 and the finite
termination of the rule is equivalent to t at = 1. The rule is bounded if and only if the sequence
at has a finite number of non-0 terms. Thus a blind rule can always be thought of as an averaging,
using the distribution (at ).
One cannot generate any distribution by a blind stopping rule; for example, starting from the
stationary distribution, every blind rule generates the stationary distribution itself. We shall restrict
our attention to stopping rules generating the stationary distribution (or at least approximations of
it).
Let λ1 , . . . , λq be the eigenvalues of M , λ1 = 1. From the Perron-Frobenius Theorem we know
that |λi | ≤ 1. Our next theorem gives a characterization for the existence of a blind stopping rule
for the stationary distribution.
Theorem 4.1 (a) If λk is positive real for some k ≥ 2, then there exists a state s from which no
blind stopping rule can generate π.
(b) If every λk , k ≥ 2, is either non-real, negative or zero, then there is a finite blind stopping
rule that generates π from any starting distribution.
Proof. (a) Assume that there exists a blind stopping rule, described by a0 , a1 , . . . (this sequence
may be finite or infinite), that generates π from a starting distribution σ. Let z be a complex
variable, and φ the expansion
X
at z t .
φ(z) =
t
26
P
Since at ≥ 0 and t at = 1, this series is convergent for |z| ≤ 1. Moreover, by the definition of the
stopping rule,
X
at σ T M t = π T ,
t
or
σ T φ(M ) = π T .
(40)
Let uk denote a right eigenvector of M belonging to λk ; note that π T uk = 0 for k ≥ 2. Apply
(40) to uk (where k ≥ 2) to get that
φ(λk )σ T uk = σ T φ(M )uk = π T uk = 0.
Choose σ here so that σ T uk 6= 0 for k = 2, . . . , n (such a σ exists even among those distributions
concentrated on single nodes). Then it follows that
φ(λk ) = 0
for all k = 2, . . . , n. Clearly, this cannot hold if any λk is positive.
(b) Conversely, assume that λ2 , . . . , λp are negative or zero and λp+1 , . . . , λq are non-real. For
each λk , k > p, we choose a positive integer bk such that
Re λbkk ≤ 0.
Let mk denote the multiplicity of eigenvalue λk . The polynomial
Ã
!mk Ã
!
¶mk Y
bk mk
q
p µ
Y
z bk − λbkk
z − λk
z b k − λk
φ(z) =
= a0 + a1 z + a2 z 2 + . . .
bk
1 − λk
1 − λbkk
1−λ
k=2
k=p+1
k
P
has non-negative coefficients with t at = φ(1) = 1. Considering, e.g., the Jordan normal form of
M , we can see that φ(M ) = 1π T . Thus for any starting distribution σ, φ(M T )σ = π.
¤
Interestingly, the condition formulated in the theorem is most restrictive for time-reversible
chains; then all the eigenvalues are real, and typically many of them are positive. For example, for
random walks on graphs, only complete multipartite graphs give a spectrum with just one positive
eigenvalue. More generally, it is easy to argue that for any two states i and j that are not “twins”
(i.e., pik 6= pjk or pki 6= pkj for some k) we must have pij > 0. (If the chain is symmetric and pii = 0
for all i, then the condition is equivalent to saying that that states can be represented by points in
√
some euclidean space so that the distance between i and j is pij .)
But one can show that for every chain, there is an “almost blind” stopping rule which achieves
the stationary distribution exactly: namely, a rule in which the probability of stopping after the
walk w1 , w2 , . . . , wT depends only on which pairs ws , wt are equal (walking in an unknown city
with no map, we recognize a corner where we have already been). In fact, the result of [7] and
its improvement in [22] implies the existence of such a rule whose description depends only on the
number of states in the Markov chain. We do not go into details in this paper.
It is virtually never possible to find a blind rule for generating π which is mean-optimal. If we
consider only stopping at some fixed time, even approximating π may take far more time than a
mean-optimal rule (for example when the chain is almost periodic). However, the blind approximate
stopping rule discussed in the next section will take time not much more than H(σ, π) for any starting
distribution σ. We should also remark that if the chain is time-reversible and all the eigenvalues
λ2 , . . . , λn other than λ1 are non-positive, then the blind rule described in the proof of Theorem 4.1
is optimal among blind rules. Assuming forP
simplicity that these eigenvalues are distinct, an easy
n
computation shows that its mean length is k=1 1/(1 − λk ). This value is just the same as N in
(2), which is an average hitting time, so in general it is much larger than the mixing time.
27
4.2
Averaging rules
Now we turn to the issue of finding practical rules for generating the stationary distribution which
are optimal or near-optimal. We will describe simple, easily implementable (in particular, blind)
rules and prove that they give a good approximation of the stationary distribution, in expected time
only a constant factor more than the mixing time.
As a preliminary remark, we note that making a move independently of where we are does not
hurt; more exactly, recalling that d is total variation distance and σ 1 is the state distribution after
one step of the chain, we have
Lemma 4.2
d(σ 1 , π) ≤ d(σ, π);
H(σ 1 , π) ≤ H(σ, π).
Proof. The first inequality is easily checked. To prove the second, consider the following rule from
σ to π: make one step, then follow an optimal rule from σ 1 to π. Comparing this with an optimal
rule from σ to π using Theorem 2.3, we get
¡
¢
πi H(σ, π) − H(σ 0 , π) = xi (σ, π) − xi (σ 0 , π) + πi − σi ≥ −xi (σ 0 , π)
by Lemma 2.2. Since there is a state i for which the right hand side is 0, the second inequality also
follows.
¤
The uniform averaging rule Υ = Υt (t ≥ 0) is defined as follows: choose a random integer Y
uniformly from the interval 0 ≤ Z ≤ t−1, and stop after Y steps. (To describe this as a stopping
rule: stop after the u-th step with probability 1/(t−u) (j = 0, . . . , t−1).) We shall give estimates on
how close the distribution σ Υ of the state generated by the averaging rule is to the stationary. To
this end, we derive an explicit formula for the distribution of the state produced by the averaging
rule.
Lemma 4.3 Let Y be chosen uniformly from {0, . . . , t − 1}. Then
¡
¢
1
σiY = πi 1 + zi (σ, σ t ) .
t
Proof. Let xi denote the expected number of times state i is exited during the first t steps. Note
that σiY = xi /t. Then the xi are the exit frequencies of a (σ, σ t ) rule, and hence by Theorem 2.3,
we have
yi − t = zi (σ, σ t ) ,
whence the lemma follows.
¤
Theorem 4.4 Let Y be chosen uniformly from {0, . . . , t−1}. Then for any starting distribution σ
and 0 ≤ ε ≤ 1,
1
d(σ Y , π) ≤ ε + Hε (σ, π).
t
In particular,
d(σ Y , π) ≤
1
H(σ, π).
t
Proof. Let τ be a distribution such that d(τ, π) ≤ ε and H(σ, τ ) ≤ Hε (σ, π). Then
z(σ, σ t ) = z(σ, τ ) + z(τ, τ t ) + z(τ t , σ t )
28
and hence for every U ⊆ V ,
X
(σkY − πk ) =
k∈U
1X
1X
1X
1X
πk zk (σ, σ t ) =
πk zk (σ, τ ) +
πk zk (τ, τ t ) +
πk zk (τ t , σ t ).
t
t
t
t
k∈U
k∈U
k∈U
k∈U
The first term above is at most (1 − π(U ))H(σ, τ ) and the last term is at most
π(U )H(σ t , τ t ) ≤ π(U )H(σ, τ )
by (18) and (16). For the middle term, we use the rough bound
X
t−1
XX
πk zk (τ, τ t ) =
k∈U m=0
k∈U
t−1
X
(τkm − πk ) ≤
d(τ m , π) ≤ td(τ, π) ≤ tε.
m=0
Substituting these bounds, the theorem follows. (See [23] for a simple direct proof.)
¤
An immediate consequence of this lemma is the following converse to Lemma 5.20. Choosing
t = (2/ε)Z, the uniform averaging rule yields a distribution which is within variation distance ε of
π.
4.3
*Approximate blind mixing times
We introduce the letter B to denote mixing time using blind rules, so that
B(σ, τ ) :=
min
Γ blind, σ Γ =τ
EΓ
and for distribution-sets A and B,
B(A, B) := min max B(α, β).
β∈B α∈A
Let dε (τ ) denote the ball of radius ε and center τ in the total variation metric. We introduce
the shorthand notation
Bε (A, τ ) := B(A, dε (τ ))
and similarly for H and other mixing measures. In particular,
Bε := Bε (π) = max B(σ, dε (π))
σ
and
Hε := Hε (π) = max H(σ, dε (π)) = max
σ
s
min
τ : d(τ,π)<ε
H(s, τ ).
In terminology, Hε becomes the “approximate mixing time” and Bε the “blind approximate
mixing time.” Since the averaging rule Υ is a blind rule, we have shown
Corollary 4.5 For every 0 ≤ ε ≤ 1,
Hε ≤ Bε ≤
2
Z.
ε
29
4.4
*Pointwise approximation
Theorem 4.4 asserts closeness in the total variation distance; approaching π pointwise (as noted, for
the reversible case, in [1]) turns out to be somewhat harder in general.
We indicate pointwise approximation from above and from below by a bar over or under the
subscript, as follows:
Hε := H(≤ (1 + ε)π)
and
Hε := H(≥ (1 − ε)π).
The former will be called the “dispersion time” and the latter the “filling time.”
Below we establish results about the pointwise distance of distributions obtained by various
averaging rules. However, we shall have to use the worst-case bound on the mixing time.
The following example, simplified from [1], illustrates the problem in trying to approach π
pointwise from both above and below. Suppose there are just two states a and b, with paa = pab = 12 ,
pba = ε, pbb = 1 − ε. Then πa = ε/(ε + 12 ), and so starting from a it takes about log(1/ε) steps to
decrease the probability of being in a to twice its stationary probability. On the other hand, clearly
H1/2 ≤ 1 and so H ≤ 2. Thus we cannot generally control the quotient σiY /πi by averaging over
times of order H. In fact, we’ll see in Section 5.5 that the time needed to make a two-sided approach
to π using averaging is at least the order of the maximum length of a max-optimal stopping rule
that achieves π exactly.
We can show, however, that in time O(H), the excess of σ Y over π can be decreased to an
arbitrarily small fraction of its original value.
Theorem 4.6 Assume that σ ≤ Kπ. Then
µ
¶
1
Y
σ ≤ 1 + KH π.
t
Proof. By Lemma 4.3, the triangle inequality, and Lemma 4.2,
³
´
³
´
1
1
σiY = πi 1 + zi (σ, σ t ) = πi 1 + zi (σ t , σ)
t
t
³
´
³
¢´
1
1¡
t
≤ πi 1 + H(σ , σ) ≤ πi 1 + H(σ t , π) + H(π, σ)
t
t
³
¢´
1¡
≤ πi 1 + H(σ, π) + H(π, σ) .
t
Here we know that π ≥
1
K σ,
and so by Lemma 5.6 we get
H(π, σ) ≤ (K − 1)H ,
and thus
σiY
µ
¶
1
≤ 1 + KH πi .
t
¤
Our next goal is to describe a rule giving a point that has a probability of at least (1 − ε)πi of
being at state i. We assume that the starting distribution is already close to π in the total variation
distance (this can be achieved, by the above, using the averaging rule), and do another averaging.
As before, let Y be chosen uniformly from {0, . . . , t−1}. The main result is the following.
Theorem 4.7 Let 0 ≤ δ ≤ ε ≤ 1 and assume that d(σ, π) = ε and σ ≥ (1 − δ)π. Then
µ
¶
δ
σ Y ≥ 1 − ε − H π.
t
30
Proof. Let α = σ\π and β = π\σ, so that we can write
σ = π + εα − δβ.
Clearly, the supports of α and β are disjoint, and hence σ ≥ (1 − δ)π implies that β ≤ (δ/ε)π.
Clearly
σ Y = π Y + εαY − εβ Y = π + εαY − εβ Y ≥ π − εβ Y .
Now by Theorem 4.6, we have
µ
¶
δ
βY ≤ 1 + H π ,
εt
and hence
σY ≥
¶
µ
δ
1−ε− H π
t
as claimed.
¤
It follows that in order to fill π up to a factor 1 − ε, it suffices to choose two integers Y1
and Y2 uniformly and independently from {0, . . . , H/ε}, and then do Y1 + Y2 steps; symbolically,
Hε ≤ Bε ≤ H/ε.
This result is not entirely satisfactory, however; one would like to see that the error diminishes
exponentially with t or, in other words, that the time needed is proportional to log(1/ε) rather than
to 1/ε. Next we describe another simple averaging rule that achieves this. The result below also
follows by adaptation of the “multiplicativity property” in Aldous [1].
Let M > 0, t = 8dHe, and let X be the sum of M independent random variables Y1 , . . . , YM ,
distributed uniformly in {0, . . . , t − 1}. Stop at v X .
To analyze this rule, let Xk = Y1 + . . . Yk .
Lemma 4.8 For all k ≥ 1, we have
d(σ Xk , π) ≤ 2−(k+1)
and
³
´
σ Xk ≥ 1 − 2−(k−1) π.
Proof. We prove this inequality by induction on k. For k = 1 the inequality follows by Theorem
4.4. Assume that k > 1. Then σ Xk can be obtained from σ Xk−1 by the uniform averaging rule, and
hence we get by Theorem 4.7 that
µ
¶
1
(41)
σ Xk ≥ 1 − 2−k − 2−(k−2) π > (1 − 2−(k−1) )π.
8
On the other hand, we have by the induction hypothesis and Lemma 5.6 that
H(σ Xk−1 , π) ≤ 2−(k−2) H ,
and hence by Theorem 4.4,
d(σ Xk , π) ≤
1 −(k−2)
2
= 2−(k+1) .
8
This completes the induction.
¤
Choose M = dlog(1/ε)e, and denote the resulting rule by Γε . Then we have, as an immediate
consequence of Lemma 4.8, the following theorem.
31
Theorem 4.9 For any starting distribution σ, the rule Γε produces a distribution τ satisfying
τ ≥ (1 − ε)π ,
and has mean length O(H log(1/ε)).
Lemma 4.10 Assume that for some 0 < δ < 1, for any two states i and j, pij ≤ (1 + δ)πj . Let the
starting distribution σ satisfy σ > (1 − ε)π for some 0 < ε < 1. Then
σ 1 ≤ (1 + δε)π.
Proof. Consider the starting distribution ρ = (σ − (1 − ε)π)/ε. Then clearly ρ1 ≤ (1 + δ)π. On
the other hand, ρ1 = (σ 1 − (1 − ε)π)/ε. Thus
σ 1 − (1 − ε)π
≤ (1 + δ)π ,
ε
whence the assertion follows immediately.
¤
The following two assertions follow along the same lines:
Lemma 4.11 (a) Assume that for some 0 < ε < 1, for any two states i and j, pij ≥ (1 − ε)πj . Let
the starting distribution σ satisfy σ > (1 − ε)π. Then
σ 1 ≥ (1 − ε2 )π.
(b) Assume that for some 0 ≤ δ < 1, for any two states i and j, pij ≤ (1 + δ)π. Let the starting
distribution σ satisfy σ < (1 + δ)π. Then
σ 1 ≥ (1 − δ 2 )π.
Corollary 4.12 Assume that for some 0 ≤ δ < 1, for any two states i and j, pij ≤ (1 + δ)πj . Then
for any starting distribution σ, and all k ≥ 2,
¡
¢
σ k ≤ 1 + δ(1 − δ 2 )k−2 π.
Lemma 4.13 Suppose that there exists an averaging rule Y and a 0 < δ < 1 such that
σ Y ≤ (1 + δ)π
for every starting distribution σ. Then there exists an averaging rule W such that max W ≤ (2/δ)EY
and
σ W ≤ (1 + 3δ)π
for every starting distribution σ.
Proof. Let t be the smallest integer such that P(Y > t) ≤ δ/2. Define a non-negative integer
valued random variable W by
P(W = k) = P(Y = k | Y ≤ t).
Then clearly
σiW = P(v W = i) ≤
σiY
≤ σiY
P(Y ≤ t)
µ
1+δ
1 − δ/2
¶
π.
¤
32
5
*Groups of mixing times
Let σ and τ be two distributions on S, 0 ≤ ε ≤ 1.
Recall that Hε (σ, τ, ) is the minimum mean length of any stopping rule Γ such that σiΓ ≥ (1−ε)τi
for all i, and analagously for Hε , while Hε (σ, τ ) requires d(σ Γ , τ ) ≤ ε. As usual, absence of the first
argument indicates worst-case starting distribution, so that for example
Hε (τ ) := max Hε (σ, τ ).
σ
We can define similar approximate versions of the forget time, by Fε := minτ Hε (τ ) etc. Of
course, when ε = 0 all these versions are the same:
H0 = H0 = H0 = H
and
F0 = F0 = F0 = F.
It is clear that
Hε (σ, τ ) ≤ Hε (σ, τ ) ≤ H(σ, τ )
and hence
Hε ≤ Hε ≤ H
and
Fε ≤ Fε ≤ F.
Similar inequalities hold for the disperse times.
We can also restrict the rules used in the definition of these mixing times, to get closer to
practically implementable algorithms. To the letter B, reserved for blind rules, we can add the even
more restrictive U for uniform averaging rules (Υ). The exact forms B(σ, τ ) and U(σ, τ ) are not
generally defined but the approximate forms, e.g. Uε , make sense.
From matrix algebra we easily derive:
Theorem 5.1
←
−
B ε = Bε
and
←
−
U ε = Uε .
At this point, the distinction between mixing in the “filling”, “disperse” and “total variation”
sense may seem pedantic, but in fact these three mixing measures behave quite differently.
The next lemma describes a folklore procedure to access a distribution, if we can access a positive
fraction of it.
Lemma 5.2 For every two distributions τ and σ, and every 0 ≤ ε < 1,
H(σ, τ ) ≤ Hε (σ, τ ) +
ε
Hε (τ ).
1−ε
Proof. Let σ be any starting distribution, and consider the following stopping rule: follow an
optimal stopping rule Γ rule from σ to σ Γ such that σ Γ > (1 − ε)τ . We get a random state w from
33
Γ
distribution σ Γ . We flip a biased coin and stop with probability (1 − ε)τw /σw
(by our assumption
on Γ, this is at most 1). The probability that we stop at state i is
σiΓ
(1 − ε)τi
= (1 − ε)τi ,
σiΓ
and hence the probability that we stop at all is 1 − ε.
If we do not stop, the continuation of the walk can be considered as a walk starting from the
0
0
0 Γ0
distribution σ 0 = 1ε σ Γ − 1−ε
ε τ . We follow an optimal stopping rule Γ from σ to (σ ) , such that
0
0
(σ 0 )Γ > (1 − ε)τ , to get a random state j; there we stop with probability (1 − ε)τj /(σ 0 )Γj , else follow
0
an optimal stopping rule Γ00 from σ 00 = 1ε (σ 0 )Γ − 1−ε
ε τ to get close to τ etc. The probability that
we stop (eventually) at state i is
(1 − ε)τi + ε(1 − ε)τi + ε2 (1 − ε)τi + · · · = τi .
These numbers add up to 1, hence we stop with probability 1. The expected number of steps is
EΓ + εEΓ0 + ε2 EΓ00 + · · · ≤
1
Hε (τ ).
1−ε
¤
Corollary 5.3 For any target distribution τ and 0 ≤ ε < 1,
H(τ ) ≤
1
Hε (τ ).
1−ε
Another corollary concerns exact access times, and will be useful later.
Corollary 5.4 Let τ and τ 0 be distributions and 0 < ε < 1 such that τ 0 ≥ (1 − ε)τ . Then for any
starting distribution σ,
H(σ, τ ) ≤ H(σ, τ 0 ) +
ε
H(τ 0 ).
1−ε
In particular,
H(τ ) ≤
1
H(τ 0 ).
1−ε
Another simple lemma:
Lemma 5.5 For any three distributions σ and σ and τ there exists a distribution τ such that
d(τ, τ ) ≤ d(σ, σ)
and
H(σ, τ ) ≤ H(σ, τ ).
Another way of saying this is that if d(σ, σ) = ε then for every τ ,
Hε (σ, τ ) ≤ H(σ, τ ).
Proof. Let Γ be an optimal stopping rule from σ to τ . Let v 0 be a state from distribution σ and
v 0 , a state from distribution σ, coupled so that
P(v 0 = v 0 ) ≥ 1 − ε.
34
Define a new stopping rule Γ by
(
Γ, if v 0 = v 0 ,,
Γ=
0 otherwise.
Then v Γ = v Γ with probability at least 1 − ε, and so
d(σ Γ , σ Γ ) = d(σ Γ , τ ) ≤ ε.
Since trivially EΓ ≤ EΓ, this proves the lemma.
¤
Let us state here a simple lemma analogous to Corollary 5.4 in the sense that it relates access
times if the starting distributions are close. Recall that M(τ ) is the maximum over σ of M(σ, τ ),
which in turn is the least maximum number of steps taken by any stopping rule from σ to τ .
Lemma 5.6 Assume that σ 0 ≥ (1 − ε)σ. Then for every target distribution τ ,
H(σ 0 , τ ) ≤ (1 − ε)H(σ, τ ) + εM(τ ).
In particular,
H(σ 0 , σ) ≤ εM(σ).
Proof. This follows from the decomposition
µ
¶ ¶
µ
1
1
σ−
−1 ρ .
σ = (1 − ε)ρ + ε
ε
ε
and the convexity of H(σ, τ ) as a function of σ.
5.1
¤
The relaxation group
We now consider the effect of a “warm start,” where the starting distribution σ is required to be
bounded by a constant multiple of the stationary distribution. We will indicate a warm start in our
mixing time notation by placing a tilde over the caligraphic letter, but for the value of the constant
the reader will need to refer to the precise definitions.
Let
1
Z̃ := max kz(σ, π)k.
σ≤2π 2
Let
H(σ, τ ) ,
min
H̃ε := H(≤ 2π, d| eps(π)) = max
τ
σ
σ≤2π d(τ,π)<c
and
H̃ := H̃0 .
√
T −1
√Let 1 = λ1 ≥ λ2 > · · · ≥ λn denote the eigenvalues of the matrix L = (1/4) D(M +I) D (M +
I) D, where D = diag(1/π1 , . . . , 1/πn ) is the diagonal matrix of return times. The relaxation time
is defined by
L=
1
1 − λ2
(we use L to remind us of λ).
35
We say that two random variables X and Y (with values from a set V ) are ε-independent, if for
every two sets A ⊆ V and B ⊆ V , we have
¯
¯
¯
¯
¯P(X ∈ A, Y ∈ B) − P(X ∈ A)P(Y ∈ B)¯ ≤ ε.
This is a very weak notion of independence, but in some applications of Markov chain techniques
to sampling [19], this is exactly what is needed.
Let the warm reset time Ĩε be the smallest t such that for every starting distribution σ with
σ ≤ (1 + ε)π, there exists a stopping rule Γ with EΓ ≤ t such that if v 0 is from σ then v Γ is
ε-independent of the starting state, and σ Γ ≤ (1 + ε)π.
Theorem 5.7 We have
H̃ε ≤ ln(1/ε)L
and
Ĩ16ε ≤
1
H̃ε ≤ Z̃.
ε
For two distributions α, β on V such that β > 0, we define their χ2 -distance by
χ2 (α, β) =
X (αi − βi )2
i
βi
=
X α2
i
i
βi
−1
(this is not a proper distance function since it is not symmetric, but this will not matter). Note
that trivially χ2 gives an upper bound on the total variation distance:
d(α, β) ≤
1 2
χ (α, β).
2
The χ2 -distance is particularly well suited for the study of convergence to the stationary distribution,
because of the following nice property proved by Fill [17]:
Lemma 5.8 For every starting distribution σ we have
χ2 (
σ + σ1
, π) ≤ λ2 χ2 (σ, π).
2
Corollary 5.9 Let t ≥ L. Let Y be the sum of t independent coin flips. Then
χ2 (σ Y , π) <
1 2
χ (σ, π).
e
Corollary 5.10 Let t ≥ kL. Let Y be the sum of t independent coin flips. Assume that σ ≤ 2π.
Then
χ2 (σ Y , π) <
1
.
ek
Corollary 5.11 Let Y be the sum of t independent coin flips. Assume that σ ≤ 2π. Then
d(σ Y , π) ≤ λt2 .
Corollary 5.12
H̃ε ≤ (ln ε)L.
36
Lemma 5.13 Let Y be chosen uniformly from the set {0, 1, . . . , t − 1}. Then
d(σ Y , π) ≤
1
Z̃.
t
Proof.
σiY − πi =
1
πi zi (σ, σ t ) ,
t
and so with an appropriate A ⊆ V ,
d(σ Y , π) =
X1
X
2
πi zi (σ, σ t ) ≤ Z̃.
(σiY − πi ) =
t
t
i∈A
i∈A
¤
Defining the warm uniform mixing time Ũε in a manner analagous to H̃ε but for uniform stopping
rules, we have
Corollary 5.14
Ũε ≤ εZ̃.
The next lemma bounds Hε (ρ, π) if we only know that ρ ≤ rπ for some c > 2.
Lemma 5.15 Let ρ be a distribution such that ρ ≤ rπ for some c ≥ 2. Then
Hε (ρ, π) ≤ (c − 1)H̃ε/(2c−3) .
Proof. Let
h=
ε
2c − 3
and
α=
ρ + (c − 2)π
.
c−1
Then α is a distribution with α ≤ 2π, and hence there exists a distribution τ such that d(τ, π) = h
and H(α, τ ) ≤ H̃h . Then we can write τ = π − hβ + hγ for some distributions β, γ.
Let
u=
1
,
1 + h(c − 2)
and consider the distributions ρ0 = uρ + (1 − u)β and τ 0 = uτ + (1 − u)γ. Then
ρ0 − τ 0 = u(ρ − τ ) + (1 − u)(β − γ) =
c−1
(α − τ ).
1 + h(c − 2)
Hence
H(ρ0 , τ 0 ) =
c−1
H(α, τ ) < (c − 1)H̃h .
1 + h(c − 2)
But d(ρ, ρ0 ) ≤ 1 − u < h(c − 2), and thus by Lemma 5.5, there exists a distribution τ 00 such that
d(τ 00 , τ 0 ) < h(c − 2) and H(ρ, τ 00 ) ≤ H(ρ0 τ 0 ). Now we have
d(τ 00 , π) ≤ d(τ 00 , τ 0 ) + d(τ 0 , τ ) + d(τ, π) ≤ h(c − 2) + h(c − 2) + h = (2c − 3)h = ε.
Thus
Hε (ρ, π) ≤ H(ρ, τ 00 ) ≤ H(ρ0 , τ 0 ) < (c − 1)H̃h .
¤
37
Lemma 5.16 Let 0 < ε < 1 and t ≥ (1/ε)H̃ε . Let v 0 be chosen from a starting distribution σ ≤ 2π
and let Y be chosen uniformly from {0, . . . , t − 1}. Then v 0 and v Y are (16ε)-independent.
Note that Theorem 4.4 implies that the distribution of v Y is closer than 2ε to π in total variation
distance.
Proof. The assertion is trivial if σA < 16ε, so suppose that σA ≥ 16ε. Then we have
¯
¯
¯
¯
¯
¯
¯
¯
¯P(v 0 ∈ A, v Y ∈ B) − P(v 0 ∈ A)P(v Y ∈ B)¯ = σA ¯P(wY ∈ B | w0 ∈ A) − P(wY ∈ B)¯
¯
¯
¯
¯
= σA ¯ρY (B) − σ Y (B)¯ ≤ σA d(ρY , σ Y )
where ρ is the restriction of σ to A: ρS = σS∩A /σA . Notice that ρ ≤ cπ, where c = 2/σA ≤ 1/(8ε).
By Theorem 4.4, we have
1
d(σ Y , π) ≤ 2cε + H2cε (σ, π)
t
and
1
d(ρY , π) ≤ 2cε + H2cε (σ, π).
t
Here
H2cε (σ, π) ≤ H̃2cε < H̃ε .
Furthermore, by Lemma 5.15,
H2cε (ρ, π) ≤ (c − 1)H̃2cε/(2c−3) < (c − 1)H̃ε .
Hence
1
1
c
d(ρY , σ Y ) ≤ d(ρY , π) + d(σ Y , π) ≤ 4cε + H̃ε + (c − 1)H̃ε ≤ 4cε + H̃ε ≤ 8cε ,
t
t
t
and thus
¯
¯
2
¯
¯
¯P(v 0 ∈ A, v Y ∈ B) − P(v 0 ∈ A)P(v Y ∈ B)¯ ≤ σA d(ρY , σ Y ) ≤ 8cε = 16ε.
c
¤
Corollary 5.17
Ĩ16ε ≤
5.2
1
H̃ε .
ε
The forget group
This group contains a number of interesting invariants: the forget time F; the discrepancy Z; for
any fixed ε > 0, the approximate mixing time Hε ; and both approximate forget times Fε and Fε .
To these we add a new parameter, the “set access time” defined below. These parameters are all
within constant factors of each other (depending only on ε). Note that the approximate mixing
time Hε is not in this group.
The set access time S is the mean number of steps required to hit the “toughest” set of states
(adjusted by the stationary probability of the set) from the worst start:
S :=
max
s∈V, U ⊆V
πU H(s, U ).
38
Theorem 5.18 For every finite Markov chain and 0 < ε < 1/2, we have the following inequalities.
(1 − 2ε)Z ≤ Fε ≤ Fε ≤ 2Hε/2 ≤
4
4
S ≤ Z.
ε
ε
In addition,
Z≤F ≤
1
Fε .
1−ε
Corollary 5.19
S ≤ Z ≤ F ≤ 16S.
We break down the proof into several steps, one for each inequality claimed. The first step is
(up to the constant) a stronger version of (19):
Lemma 5.20 Let 0 < ε < 1/2, then
Z≤
1
Fε .
1 − 2ε
Proof. Let ρ be any distribution and let σ be a distribution maximizing kz(σ, π)k. By definition,
there exists a distribution µ such that d(µ, ρ) ≤ ε and H(σ, µ) ≤ Hε (ρ). Then for an appropriate
set A ⊆ V , we have
X
X
X
Z=
πi zi (σ, π) =
πi zi (σ, µ) +
πi zi (µ, ρ).
i∈A
i∈A
i∈A
Here by (19),
X
X
πi zi (σ, µ) =
πi zi (µ, σ) ≤ (1 − πA )H(σ, µ) ≤ (1 − πA )Hε (ρ).
i∈A
i∈V \A
To estimate the other term, we have by (13)
X
πi zi (µ, ρ) ≤
i∈A
1
kz(µ, ρ)kπ ≤ 2d(µ, ρ)Z ≤ 2εZ.
2
Hence
Z ≤ Hε (ρ) + 2εZ ,
and the lemma follows.
¤
The inequality
Fε ≤ F ε
is trivial.
Lemma 5.21 For ε < 1/2,
F2ε ≤ 2Hε .
Proof. For each state k, let τ (k) be a distribution such that d(τ (k), π) ≤ ε and H(k, τ (k)) ≤ Hε .
Let
X
fi = min
τ (k)i µk ,
µ
k
39
where µ ranges over all distributions with d(µ, π) ≤ ε. Let µ(i) be the distribution achieving
this minimum. Write µ(i)k = πk + α(i)k and τ (k)i = πi + β(k)i . Let a(i)k = min(0, α(i)k ) and
b(k)i = min(0, β(k)i ). Then
τ (k)i µ(i)k ≥ (πi + a(i)k )(πk + b(k)i ) ≥ πi πk + πi a(i)k + πk b(k)i
and hence
X X
X
X
X X
a(i)k +
πk
b(k)i .
fi =
τ (k)i µ(i)k ≥ 1 +
πi
i
i
i,k
k
k
i
Now here, for any fixed i,
X
a(i)k = −d(π, µ(i)) ≥ −ε ,
k
and similarly
X
b(k)i = −d(π, τ (k)) ≥ −ε.
i
Hence
X
fi ≥ 1 −
i
X
πi ε −
X
i
πk ε = 1 − 2ε.
k
Consider the distribution
,
X
ρi = f i
fj ,
j
and the following stopping rule: starting at state k, follow an optimal rule from k to τ (k); if you
end at j, then follow an optimal rule from j to τ (j). Let θ(k) be the distribution produced. Then
X
θ(k)i =
τ (k)j τ (j)i ≥ fi ≥ (1 − 2ε)ρi .
j
Thus H2ε (ρ) ≤ 2Hε .
¤
Lemma 5.22 For every 0 < ε < 1,
εHε ≤ S.
Proof. We want to prove that for any starting state s,
Hε (s, π) ≤
1
S.
ε
Consider the chain rule from s to π. There exists a labelling {0, . . . , n − 1} of the nodes and a
probability distribution ρ such that selecting Sk = {k, k + 1, . . . , n − 1} with probability ρk and then
walking until Sk is reached, we generate π. Choose k such that πSk+1 ≤ ε < πSk , and consider the
modified chain rule Ak that selects Sj with probability ρj for j = 1, . . . , k − 1, but selects Sk with
the remaining probability ρ0k = 1 − ρ1 − · · · − ρk−1 . Let this rule generate distribution τ (k). Then
τ (k)i = πi for i ∈
/ Sk , and τ (k)k ≥ πk . Hence
d(π, τ (k)) ≤ πSk+1 ≤ ε.
The mean length of this rule is
k−1
X
j=1
ρj H(s, Sj ) + ρ0k H(s, Sk ) ≤ H(s, Sk ) ≤
1
1
S < S.
π Sk
ε
¤
40
Lemma 5.23
S ≤ Z.
Proof. Let σ be any distribution and let A ⊆ V . Let ρ be the distribution of the first element
in A, starting the chain from σ. Note that H(σ, ρ) = H(σ, A) is just the expected number of steps
before hitting A, starting from σ. Also note that trivially xi (σ, ρ) = 0 for i ∈ A, and hence
H(σ, ρ) = −zi (σ, ρ) = zi (ρ, σ).
Hence
πA H(σ, A) =
X
pii zi (ρ, σ) ≤ Z.
i∈A
¤
To complete the proof of Theorem 5.18, it suffices to notice that letting ε → 0 in Lemma 5.20
we get
Z≤F ,
while Corollary 5.3 implies that
F ≤ H(ρ) ≤
5.3
1
1
Hε (ρ) =
Fε .
1−ε
1−ε
The reset group
Let
X
←
−
←
−
z i (π, v)
Z = max
v∈V
S⊆V
i∈S
be the discrepancy of the reverse chain.
Theorem 5.24
−
1
1←
I ≤ Z ≤ H̃ ≤ 2I.
32
2
Proof. The first inequality is immediate by Corollary 5.19 and Theorem 3.6. To prove the second,
←
−
let u, v ∈ V and S ⊂ V attain the maximum in the definition of Z . Let A = {i ∈ V : H(i, v) ≥
H(π, v)}.
Case 1. Assume that πA ≥ 1/2. Define
(
πi /πA , if i ∈ A,
ρi =
0
otherwise.
Then ρ is a distribution with ρ ≤ 2π. Moreover,
X πi ¡
¢
H(ρ, π) ≥ H(ρ, v) − H(π, v) =
H(i, v) − H(π, v)
πA
i∈A
1 X
1
Z > Z.
=
πi zi (v, π) =
πA
πA
i∈A
Case 2. Assume that πA < 1/2. Set t = (1 − 2πA )/(1 − πA ) and define
(
2πi , if i ∈ A,
ρi =
tπi , otherwise.
41
Then again ρ is a distribution with ρ ≤ 2π. Moreover,
H(ρ, π) ≥ H(ρ, v) − H(π, v)
X
¡
¢ X
¡
¢
=
2πi H(i, v) − H(π, v) +
tπi H(i, v) − H(π, v)
i∈A
= (2 − t)
X
i∈V \A
¡
πi H(i, v) − H(π, v)
¢
i∈A
X
1
1
πi zi (v, π) =
Z > Z.
=
1 − πA
1 − πA
i∈A
This proves the second inequality.
Finally, the third inequality is easy. Consider the following rule from σ to π: look at the starting
state and follow an optimal rule from that state. Hence
X
X
H(σ, π) ≤
σi H(i, π) ≤ 2
πi H(i, π) = I.
i
i
¤
5.4
The mixing group
Theorem 5.25
Hε ≤ H ≤
1
Hε .
1−ε
Proof. The first inequality is trivial; the second follows immediately from corollary 5.3.
¤
Theorem 5.26
³
´
H≤2 S +I .
Proof. Let S := {j : H(j, π) ≤ 2I}. Then
X
X
I=
πi H(i, π) ≥
πi H(i, π) ≥ π(V \ S)(2I),
i
i∈V \S
and hence π(S) ≥ 1/2. Now for any starting state s, we can walk from s until we hit some state
j ∈ S, and then follow an optimal rule from j to π. The expected length of this walk is at most
H(s, S) + 2I ≤
1
S + 2I ≤ 2S + 2I.
π(S)
¤
Using the inequality S ≤ F (see corollary 5.19), we get
Corollary 5.27
³
´
max{F, I} ≤ H ≤ 2 F + I .
42
5.5
The maxing group
We return once more to M(σ, τ ), the maximum length of a max-optimal stopping rule from σ to τ .
Theorem 5.28 Suppose that for any initial distribution σ,
4π/5 ≤ σiY ≤ 5π/4.
where Y is chosen uniformly from {0, . . . , t−1}. Then for any σ there is a (blind) stopping rule Γ
with σ Γ = π and max(Γ) < 2t.
Proof.
Replacing the transition matrix M by
while the condition of the theorem becomes
Pt−1
j=0
M j preserves the stationary distribution,
4π/5 ≤ σ 1 ≤ 5π/4.
for all σ. It now suffices to produce a stopping rule Γ with σ Γ = π and max(Γ) ≤ 2; to do this we
bound the exit frequencies xi (θ, π) when θ is close to π.
Note that for our new fast-mixing Markov chain, the conditions of Corollary *** obtain with
t = 1 and ε = 15 , so for any σ, H(σ, π) ≤ M = H ≤ 54 . Suppose that 4π/5 ≤ θ ≤ 5π/4; then we
may substitute θ for τ and π for τ 0 in Corollary 5.4 to get
M(θ) ≤
5
25
M≤
.
4
16
Now we can apply Lemma 5.6 twice to get
H(θ, π) ≤
1
1
M≤
5
4
H(π, θ) ≤
1
5
M(θ) ≤
.
5
16
and
It follows from (16) that for each state i,
yi (θ, π) = zi (θ, π) + H(θ, π) ≤ H(π, θ) + H(θ, π) ≤
9
.
16
Let Θ be the unique mean-optimal threshold rule for initial distribution θ and target π, constructed as in Section 2.7.3. From above we have that for every i,
θi ≥
4
9
πi >
πi ≥ xi (θ, π)
5
16
so that the exit frequency for every state is used up already on the first step of the construction. It
follows that all of the thresholds are less than 1 and therefore max(Θ) ≤ 1.
We may now define Γ for arbitrary starting state σ simply by taking one step, then setting
θ = σ 1 and implementing Θ.
¤
Lemma 5.29 Let Φ : σ → τ and Ψ : σ → ρ be two stopping rules such that Φ ≤ Ψ for any walk.
Then
H(τ, ρ) ≤ EΨ − EΦ.
Proof. Define a stopping rule from τ to ρ as follows: the starting state from distribution τ may
as well be generated by starting from σ and following Φ. But then we can just consider our walk
as a continuation and stop it following the rule Ψ. This way we get a rule from τ to ρ with mean
length EΨ − EΦ.
¤
43
Lemma 5.30 Assume that t = M(σ, τ ) is finite. Then
H(σ, τ ) + H(τ, σ t ) ≤ t.
Proof. Since H(σ, τ ) is the mean time of the threshold rule Θσ,τ , t is the mean time of the rule
“walk t steps”, and Θσ,τ ≤ t, Lemma 5.29 implies that H(τ, σ t ) ≤ t − H(τ, σ t ).
¤
Theorem 5.31 Let t = M(σ, π). Choose a random starting state from σ. Choose a random integer
Y uniformly from the interval [t, 2t − 1], and walk for Y steps. Then the probability of being at state
i is at most 2πi .
Proof. The probability that we stop at state i is
2t−1
1 X k
1
1
1
σi = πi + πi zi (σ t , σ 2t ) = πi − πi zi (π, σ t ) − πi zi (σ 2t , π).
t
t
t
t
k=t
Here we use that
zi (π, σ t ) = yi (π, σ t ) − H(π, σ t ) ≥ −H(π, σ t ) ≥ H(σ, π) − t
(by Lemma 5.30), and
zi (σ 2t , π) = yi (σ 2t , π) − H(σ 2t , π) ≥ −H(σ 2t , π) ≥ −H(σ, π)
(by Lemma 4.2). Hence we get that
2t−1
1
1 X k
1
σi ≤ πi − πi (H(σ, π) − t) + πi H(σ, π) = 2πi .
t
t
t
k=t
¤
5.6
A reverse inequality
For any state distribution µ, let µ̂ := mini {µi }. The log of the smallest probability in the stationary
distribution links the maxing and relaxation times, and also the mixing and set access times.
Theorem 5.32
M ≤ ln(1/π̂)L.
Proof. ***
¤
Theorem 5.33
H ≤ ln(1/π̂)S.
We prove the inequality in a more general form, fixing the starting distribution σ, and allowing
an arbitrary target distribution. We define
S(σ, τ ) = max τA H(σ, A).
A⊆V
Then we have the following lemma.
Lemma 5.34 Let σ, τ be two arbitrary distributions; then
H(σ, τ ) ≤ ln(1/τ̂ )S(σ, τ ).
44
Proof. We use the chain rule, more exactly Theorem 2.8. This implies that for every starting
state s there exists a labelling {0, . . . , n − 1} of the nodes and a probability distribution ρ such that
H(σ, τ ) =
X
ρk H(σ, Sk ),
(42)
k
where Sk = {k, k + 1, . . . , n − 1} and H(σ, S) denotes the expected number of steps before a walk
sarting from σPhits the set S. (One could formulate a similar expression for the exit frequencies.)
Setting ωk = m≥k ρm , (42) can be rewritten as
H(σ, τ ) =
X
ωk (H(σ, Sk ) − H(σ, Sk−1 )).
(43)
k
Now we use that
ωk ≤ τSk
(44)
(if we choose any Sm , m ≥ k in the chain rule as our target set, we certainly end up in Sk ), and
hence
X
X
X
τk
τSk (H(σ, Sk ) − H(σ, Sk−1 ) =
τk H(σ, Sk ) ≤ S(σ, τ )
H(σ, τ ) ≤
.
τk + · · · + τn
k
k
k
Using the inequality x ≤ ln 1/(1 − x), we get
H(σ, τ ) ≤ S(σ, τ )
X
k
ln
τk+1 + · · · + τn
= ln(1/τn )Tset (σ, τ ).
τk + · · · + τn
¤
Using a similar argument, one could replace S by maxA⊆S πA H(π, A) in this theorem. This
latter quantity seems to be related to the “conductance” of the chain ([18]), but we have not been
able to prove an explicit connection.
The winning streak chain shows that the logarithmic factor in the upper bound in the theorem
cannot be saved even if the target distribution is the stationary distribution π.
6
6.1
Estimating the mixing time
*A linear programming bound
Theorem 6.1 Assume that for some T ≥ 0 and for every u ∈ V , there exists a vector r ∈ IRV such
that
r ≥ 0;
X
pij rj − ri ≤ 1
(45)
(for all i 6= u);
(46)
j
X
πi ri ≤ T.
(47)
i
Then H ≤ T .
45
Proof. The proof is a variation on the proof of Lemma 3.1. Let s be a pessimal starting state and
u, a halting state from s to π. Let v 0 = s, v 1 , . . . be a walk in the chain, and consider the random
variables
Xt = rvt+1 − rvt + 1 +
1
(i = u) ,
πu
where the ri satisfy (45)-(47) for this choice of u. From (46) we get, for all i 6= u, that
X
E(Xt |v t = i) =
pij rj − ri − 1 ≥ 0.
j
hence if T is the number of steps taken by some optimal stopping
Thus Xt is a supermartingale, and
PT−1
rule from u to π, then the sum t=0 Xt has non-negative expectation. But (using that u is never
exited during the first T steps), this implies
0≤
T−1
X
EXt = E(rvT ) − ru − ET =
t=0
X
πi ri − ru − H(u, π) ≤ T − H.
i
¤
6.2
Conductance
Recall that the conductance of a Markov chain is defined by
P
P
i∈A
j∈V \A pij πi
φ = min
,
A
πA πV \A
where the minimum is extended over all non-empty proper subsets A of V . Standard results,
first obtained by Jerrum and Sinclair [18], and extended in various directions ([25], [20]) use the
conductance φ to bound the mixing time from above. Let us start from distribution σ. Define
K = max
i∈V
σi
.
πi
Choose Y from the binomial distribution with mean t (in other words, do a lazy random walk for
2t steps); then Corollary 1.5 in [20] says that
µ
¶2t
√
φ2
d(σ Y , π) ≤ K 1 −
.
8
Hence
Hε ≤
4
K
log .
2
φ
ε
(48)
This implies for time-reversible chains, by Lemma 5.20, that
H≤
2
64
log .
φ2
π̂
Below we prove that this relation (with a better constant even) holds for all chains. We in fact
prove a more general result.
Define the conductance function by
Φ(x) =
min
S⊂V,
0<π(S)≤x
Q(S, S)
.
π(S)π(S)
P
(here Q(S, T ) = i∈S,j∈T Qij . Obviously, 0 < Φ(x) ≤ 2, and Φ(x) is monotone decreasing as a
function of x. For x ≥ 1/2, we have Φ(x) = Φ.
Our main theorem about mixing in general Markov chains is the following.
46
Theorem 6.2 The mixing time H of any Markov chain can be bounded by
Z 1
dx
H ≤ 30
2
π̂ xΦ(x)
Corollary 6.3
H(σ, π) ≤
30
1
log .
φ2
π̂
Lemma 6.4 Let 0 = y1 ≤ y2 ≤ · · · ≤ yn be the scaled exit frequencies of an optimal stopping rule
from s to π. Let 1 ≤ k < m ≤ n, and set A = {1, . . . , k}, and C = {m, . . . , n}. Then
ym − yk ≤
π(A)
Q(C, A)
Proof. It is easy to see that s = n. Let B = {k + 1, . . . , m − 1}. We start with the identity
XX
XX
yj Qji −
yi Qij = π(A),
(49)
i≤k j>k
i≤k j>k
since the left hand side counts the expected number of steps from V \ A to A, less the expected
number of steps from A to V \ A, when following an optimal stopping rule from s to π. Since we
never start in A but stop in A with probability π(A), this proves (49).
Now we estimate the left hand side of (49) as follows:
XX
X
X
yj Qji ≥
ym Qji +
yk Qji = ym Q(C, A) + yk Q(B, A),
i≤k j>k
i≤k
j≥m
i≤k
k<j<m
and
XX
yi Qij ≤
XX
yk Qij = yk Q(A, B ∪ C) = yk Q(B ∪ C, A).
i≤k j>k
i≤k j>k
Substituting in (49), we get
ym Q(C, A) + yk Q(B, A) − yk Q(B ∪ C, A) = (ym − yk )Q(C, A) ≤ π(A),
which proves the lemma.
¤
Our other preliminary observation is that
X
X
H(s, π) =
πi y i =
(yj+1 − yj )π(> j).
i
(50)
j
Proof. [of Theorem 6.2] Let 1 = m0 < m1 < · · · < mk < mk+1 . Set Ti = [1..mi ], T i = V \ Ti , and
ai = π(Ti ). We choose the sequence (mi ) recursively so that
1
ai+1 − πmi+1 < ai (1 + Φ(ai )) ≤ ai+1 .
4
We stop with
ak ≤ 1/2 < ak+1 .
We bound a portion of the sum in (50) as follows:
mi+1 −1
X
(yj+1 − yj )π(> j) ≤ (1 − ai )(ymi+1 − ymi )
j=mi
47
≤
ai (1 − ai )
.
Q(T i+1 ∪ mi+1 , Ti )
Now here
Q(T i+1 ∪ mi+1 , Ti ) = Q(T i , Ti ) − Q(T i \ T i+1 \ mi+1 , Ti ) ≥ Q(T i , Ti ) − π(T i \ T i+1 \ mi+1 )
1
≥ Φ(ai )ai (1 − ai ) − ai+1 + πmi+1 + ai > Φ(ai )ai (1 − ai ).
2
Hence we get
mi+1 −1
X
2
.
Φ(ai )
(yj+1 − yj )π(> j) ≤
j=mi
Next we show that
Z ai+1
2
dx
≤ 10
.
2
Φ(ai )
xΦ(x)
ai
Indeed,
Z
ai+1
ai
1
dx
≥
xΦ(x)2
Φ(ai )2
Z
ai+1
ai
µ
¶
1
ai+1
1
1
1
dx
=
ln
≥
ln
1
+
Φ(a
)
≥
i
x
Φ(ai )2
ai
Φ(ai )2
4
5Φ(ai )
(using that Φ(ai ) ≤ 2). Summing up, we get
Z
mk+1
X
ak+1
(yj+1 − yj )π(> j) ≤ 10
π̂
j=1
dx
< 10
xΦ(x)2
Z
1
π̂
dx
xΦ(x)2
(51)
The estimate on the other half of the sum (50) is similar. We define a sequence n0 = n > n1 >
· · · > nr , and set Si = [ni ...n], Si = V \ Si and bi = π(Si ). We choose the numbers ni recursively
so that
1
bi+1 − πni+1 < bi (1 + Φ(bi )) ≤ bi+1 .
4
We stop with
br ≤ 1/2 < br+1 .
Similarly as before, we consider the partial sum
nX
i −1
(yj+1 − yj )π(> j) ≤ (bi+1 − pni+1 )(yni − yni+1 ),
j=ni+1
and we use again lemma 6.4 to estimate it:
≤
(bi+1 − pni+1 )(1 − bi )
Q(S i+1 ∪ ni+1 , Si )
Here
Q(S i+1 ∪ ni+1 , Si ) = Q(S i , Si ) − Q(S i \ S i+1 \ ni+1 , Si ) ≥ Q(S i , Si ) − π(S i+1 \ S i − πni+1 )
1
≥ Φ(bi )bi (1 − bi ) − bi+1 + πni+1 + bi > Φ(bi )bi (1 − bi ).
2
Hence
nX
i −1
j=ni+1
(yj+1 − yj )π(> j) ≤
2(bi+1 − πni+1 ) 1
bi
Φ(bi )
48
As before,
2
≤ 10
Φ(bi )
Z
bi+1
bi
dx
,
xΦ(x)2
Moreover,
1
bi+1 − πni+1 < bi (1 + Φ(bi )) < 2bi .
4
Summing up, we get
n−1
X
Z
br+1
(yj+1 − yj )π(> j) ≤ 20
πn
j=nr+1
dx
< 20
xΦ(x)2
Z
1
π̂
dx
.
xΦ(x)2
(52)
Since obviously nr+1 < mk+1 , all terms in (50) have been accounted for, and it follows that
Z
1
H ≤ 30
π̂
dx
,
xΦ(x)2
which proves the theorem.
6.3
¤
*Coupling
A coupling in a Markov chain is a sequence of random variables ((v 0 , w0 ), (v 1 , w1 ), . . . ) such that
each of the sequences (v 0 , v 1 , . . . ) and (w0 , w1 , . . . ) is a walk of the chain. For a given coupling, the
collision time is the expectation of the first index t with v t = wt . The coupling time C(α, β) between
two distributions α and β is the minimum collision time of couplings where v 0 is from distribution
α and w0 is from distribution β.
We shall only be concerned with Markovian coupling, which is defined as a Markov chain on
V
×
P probabilities pij,kl such that for all (i, j) ∈ V × V , and every l ∈ V , we have
P V with tranisition
p
=
p
and
ij,kl
jl
l pij,kl = pik . Let D = {(i, i) : i ∈ V } be the diagonal. The collision time for
k
a given Markovian coupling is the maximum hitting time to D from any starting state (i, j).
Lemma 6.5 For every two distributions,
kz(α, β)kπ ≤ C(α, β).
Proof. Consider an optimal coupling for α and β, and let τ be the distribution of the state where
the two walks first collide. The coupling rule defines a stopping rule Γ from α to τ (by constructing
the chain (w0 , w1 , . . . ) only in the background, to provide a stopping criterion), and similarly, it
defines a stopping rule Γ0 from β to τ . Trivially, EΓ = EΓ0 = C(α, β).
Thus we have
¡
¢ ¡
¢
zi (α, β) = zi (α, τ ) − zi (β, τ ) = yi (Γ) − EΓ − yi (Γ0 ) − EΓ0 = yi (Γ) − yi (Γ0 ) ≤ yi (Γ) ,
and hence
kz(α, β)kπ ≤
X
yi (Γ) = EΓ = C(α, β).
i
¤
49
References
[1] D.J. Aldous and J. Fill, Reversible Markov Chains and Random Walks on Graphs (book), to
appear. URL for draft at http://www.stat.Berkeley.EDU/users/aldous/book.html.
[2] D.J. Aldous, Some inequalities for reversible Markov chains, J. London Math. Soc. 25 (1982),
564–576.
[3] D.J. Aldous, Applications of random walks on graphs, preprint (1989).
[4] D.J. Aldous, The random walk construction for spanning trees and uniform labelled trees,
SIAM J. Discrete Math. 3 (1990), 450–465.
[5] D.J. Aldous and P. Diaconis, Shuffling cards and stopping times, Amer. Math. Monthly 93 #5
(1986), 333–348.
[6] D.J. Aldous, L. Lovász and P. Winkler, Mixing times for uniformly ergodic Markov chains,
Stochastic Processes and their Applications, to appear.
[7] S. Asmussen, P.W. Glynn and H. Thorisson, Stationary detection in the initial transient problem, ACM Transactions on Modeling and Computer Simulation 2 (1992), 130–157.
[8] J.R. Baxter and R.V. Chacon, Stopping times for recurrent Markov processes, Illinois J. Math.
20 (1976), 467–475.
[9] D. Bayer and P. Diaconis, Trailing the dovetail shuffle to its lair, Ann. Appl. Probab. 2 (1992),
294–313.
[10] A. Broder, Generating random spanning trees, Proc. 30th Annual Symp. on Found. of Computer
Science, IEEE Computer Soc. (1989), 442–447.
[11] R.V. Chacon and D.S. Ornstein, A general ergodic theorem, Illinois J. Math. 4 (1960), 153–160.
[12] A.K. Chandra, P. Raghavan, W.L. Ruzzo, R. Smolensky, and P. Tiwari, The Electrical Resistance of a Graph Captures its Commute and Cover Times, Proceedings of the 21st Annual
ACM Symposium on Theory of Computing, May (1989).
[13] D. Coppersmith, P. Tetali, and P. Winkler, Collisions among Random Walks on a Graph, SIAM
J. on Discrete Mathematics 6 #3 (1993), 363–374.
[14] P.G. Doyle and J.L. Snell, Random Walks and Electric Networks, Mathematical Assoc. of
America, Washington, DC 1984.
[15] L.E. Dubins, On a theorem of Skorokhod, Ann. Math. Statist. 39 (1968), 2094–2097.
[16] M. Dyer, A. Frieze and R. Kannan, A random polynomial time algorithm for estimating volumes
of convex bodies, Proc. 21st Annual ACM Symposium on the Theory of Computing (1989), 375–
381.
[17] J. Fill, ***
[18] M. Jerrum and A. Sinclair, Conductance and the rapid mixing property for Markov chains: the
approximation of the permanent resolved, Proc. 20nd Annual ACM Symposium on Theory of
Computing (1988), 235–243.
[19] R. Kannan, L. Lovász and M. Simonovits, ***
[20] L. Lovász and M. Simonovits, Random walks in a convex body and an improved volume
algorithm, Random Structures and Alg. 4 (1993), 359–412.
[21] L. Lovász and P. Winkler, A note on the last new vertex visited by a random walk, J. Graph
Theory 17 (1993), 593–596.
50
[22] L. Lovász and P. Winkler, Exact mixing in an unknown Markov chain, Electronic J. Comb. 2,
(1995), Paper R15.
[23] L. Lovász and P. Winkler, Mixing of random walks and other diffusions on a graph, Surveys in
Combinatorics, 1995, P. Rowlinson, ed., London Math. Soc. Lecture Note Series 218, Cambridge
U. Press (1995), 119–154.
[24] L. Lovász and P. Winkler, Reversal of Markov chains and the forget time, Combinatorics,
Probability and Computing, to appear.
[25] M. Mihail, ***
[26] J.W. Pitman, Occupation measures for Markov chains, Adv. Appl. Prob. 9 (1977), 69–86.
[27] D.H. Root, The existence of certain stopping times on Brownian motion, Ann. Math. Statist.
40 (1969), 715–718.
[28] A. Skorokhod, Studies in the Theory of Random Processes, orig. pub. Addison-Wesley (1965),
2nd ed. Dover, New York 1982.
[29] P. Tetali, Random walks and effective resistance of networks, J. Theoretical Prob. #1 (1991),
101–109.
51
© Copyright 2026 Paperzz