Markov Processes: Estimation and Sampling in Undersampled Regime
A DISSERTATION SUBMITTED TO THE GRADUATE DIVISION OF THE
UNIVERSITY OF HAWAI‘I AT MĀNOA IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
IN
ELECTRICAL ENGINEERING
August 2015
By
Ramezan Paravitorghabeh
Dissertation Committee:
Narayana Prasad Santhanam, Chairperson
Anders Høst-Madsen
Aleksander Kavčić
Anthony Kuh
Yuriy Mileyko
© Copyright 2015
by
Ramezan Paravitorghabeh
To my parents
Acknowledgements
First and foremost, I would like to express my special appreciation and thanks to my adviser,
Prof. Santhanam, for his support during these past five years. Prasad has supported me not
only logistically over the span of almost five years, but also academically and emotionally
through my Ph.D. journey. I am grateful for his insightful comments and constructive
criticism at different stages of my research.
I would like to give special thanks to my committee members, Prof. Høst-Madsen, Prof. Kavčić,
Prof. Kuh and Prof. Mileyko, for their precious comments and patience. A special "Shaka" to
Prof. Kuh and Prof. Kavčić for providing advice many times during my graduate school
career. Special thanks to Dr. Rui Zhang for the precious advice and comments after my
graduate seminar presentation. I am also grateful to all the faculty and staff members
of the Electrical Engineering department, University of Hawai‘i at Mānoa. I will always be
thankful for the opportunity that I was given in this department to learn and to grow.
Furthermore, I would like to thank my fellow labmates of many years: Maryam and Meysam.
We had a great time together in the office in every aspect, from boring research discussions
to fun chillaxing tea breaks, and you kept me sane! You guys are GREAT. Also, my
good friends Masoud, Saeed, Navid, Elyas, Reza, Harir Che, Ashkan, Ali and Ehsaneh have
accompanied me through this journey in many ways. Mahalo nui loa!
Finally, I will forever be thankful to my parents, Tahereh and Hassan, my older brother
Naser and my lovely sister Nasrin for their great support and encouragement. Words cannot
express my feelings, nor my thanks for all your help.
Abstract
This work concentrates on estimation and sampling aspects of slow mixing Markov processes.
When a process mixes slowly, a large number of observations is needed before the state
space of the process is explored properly: empirical properties of finite sized samples
from Markov processes need not reflect stationary properties. When the empirical counts of
samples eventually reflect the stationary properties, we say that the process has mixed.
The contributions of this work revolve around the interpretation of samples before mixing has
occurred. In the first part of the work, we deal with estimation of the parameters of the process,
while in the second part, we show how the evolution of the samples obtained from the
process can be exploited to identify subsets of the state space which communicate well with
each other. This insight is translated into algorithmic rules for detecting communities in graphs.
Estimation: We observe a length-n sample generated by an unknown, stationary ergodic
Markov process (model) over a finite alphabet A. Given any string w of symbols from A,
we want estimates of the conditional probability distribution of symbols following w (the
model parameters), as well as the stationary probability of w.
Two distinct problems complicate estimation in this setting: (i) long memory, and (ii)
slow mixing, which can happen even with only one bit of memory. Any consistent estimator
can only converge pointwise over the class of all ergodic Markov models. Namely, given any
estimator and any sample size n, the underlying model could be such that the estimator
performs poorly on a sample of size n with high probability. But can we look at a length-n
sample and identify whether an estimate is likely to be accurate?
Since the memory is unknown a priori, a natural approach is to estimate a potentially
coarser model with memory kn = O(log n). As n grows, the estimates get refined; this
approach is consistent, and the above scaling of kn is known to be essentially optimal.
But while effective asymptotically, the situation is quite different when we want the best
answers possible with a length-n sample, rather than just consistency. Combining results in
universal compression with Aldous' coupling arguments, we obtain sufficient conditions on
the length-n sample (even for slow mixing models) to identify when naive (i) estimates of
the model parameters and (ii) estimates related to the stationary probabilities are accurate;
we also bound the deviations of the naive estimates from the true values.
Sampling: We introduce a new randomized algorithm for community detection in graphs,
obtained by defining Markov random walks on them. The mixing properties of the random walks we
construct are used to identify communities: the more polarized the communities are, the
slower the random walk mixes. Our algorithm creates a random walk on the nodes of
the graph. We start different, coupled random walks from different nodes, and then adapt
the coupling from the past approach to identify clusters before the chain mixes (rather than
to sample from the stationary distribution). The Markov random walks are built such that
the restriction of the random walk to any one cluster mixes much faster than the
overall walk itself. Finally, we analyze the performance of the algorithm on specific graph
structures, including Stochastic Block Models, LFR benchmark random graphs and real
world networks. The number of communities is not known to our algorithm in advance, nor
is any generative model we may have used. Where relevant, we used the cluster edit distance
(the number of edge additions and deletions needed to turn a graph into disjoint cliques) and
compared the results with the state-of-the-art correlation clustering algorithm CC-PIVOT.
Contents

Acknowledgements
Abstract
1 Introduction
   1.1 Overview
   1.2 Prior Work
      1.2.1 Prior Work on Compression of Markov Processes
      1.2.2 Prior Work on Estimation of Markov Processes
      1.2.3 Prior Work on Entropy Rate of Markov Processes
      1.2.4 Prior Work on Sampling From Markov Processes
   1.3 Contributions and Future Directions
2 Markov Processes
   2.1 Alphabet and Strings
   2.2 Trees
   2.3 Models
   2.4 Difficulties in Estimation
3 Summary of Results
   3.1 Results on Estimation
   3.2 Results on Sampling
4 Background Topics
   4.1 Context Tree Weighting
   4.2 Coupling For Markov Processes
5 Model Aggregation
   5.1 Aggregations
   5.2 Upper Bound on Entropy Rate
6 Dependencies in the Markov Process
   6.1 Modeling Dependencies Die Down
   6.2 Aggregated Models in M_d
7 Estimation of Model Parameters
   7.1 Naive Estimators
   7.2 Estimate of Transition Probabilities
8 Estimation Along a Sequence of Stopping Times
   8.1 Restriction of p_{T,q} to G̃
      8.1.1 Properties of {Z_m}_{m≥1}
   8.2 Estimate of Stationary Probabilities
      8.2.1 Preliminaries
      8.2.2 The Coupling Argument
      8.2.3 Description of Coupling ω
         8.2.3.1 Sampling from the Joint Distribution ω(Y′_{j1}, Y″_{j1} | Z′_j, Z″_j)
         8.2.3.2 Sampling from ω({Y′_{ji}, Y″_{ji}}_{i≥2} | Z′_j Y′_{j1}, Z″_j Y″_{j1})
      8.2.4 Some Observations on Coupling
      8.2.5 Deviation Bounds on Stationary Probabilities
9 Sampling From Slow Mixing Markov Processes
   9.1 First Order Markov Processes
   9.2 Overview of CFTP Algorithm
      9.2.1 CFTP Algorithm
      9.2.2 Analysis of CFTP
   9.3 Partial Coupling From the Past
      9.3.1 Motivation
      9.3.2 Restriction of a Markov Process to G
      9.3.3 Analysis of Partial Coupling
      9.3.4 Analysis of Partial Coupling Along a Sequence of Stopping Times
      9.3.5 Algorithm for Community Detection
10 Modeling Community Detection Using Slow Mixing Markov Random Walks
   10.1 Prior Work
      10.1.1 Prior Work on Community Detection
         10.1.1.1 Deterministic Approaches
         10.1.1.2 Spectral Clustering
         10.1.1.3 Random Graph Generative Models
         10.1.1.4 Prior Work on Correlation Clustering
      10.1.2 Overview of Correlation Clustering
   10.2 Simulation Results
      10.2.1 Stochastic Block Model
         10.2.1.1 Comparing the Cost
         10.2.1.2 Finding Communities
      10.2.2 LFR Random Graphs
      10.2.3 Performance on Benchmark Networks
         10.2.3.1 American Football College
Bibliography
List of Figures

2.1 (a) States and parameters of the binary Markov process in Example 2.1, (b) the same Markov process reparameterized as a complete tree of depth 2. We can similarly reparameterize the process on the left with a complete tree of any depth larger than 2.
2.2 Markov processes in Example 2.2 with stationary probabilities µ(1) = m/(m+1) and µ(0) = 1/(m+1).
2.3 Markov process in Example 2.3. With high probability, we cannot distinguish p_{T,q} from an i.i.d. Bernoulli(1/2) process if the sample size n satisfies k = ω(log n).
2.4 Markov processes in Example 2.4 with stationary probabilities (a) µ(1) = µ(0) = 1/2, (b) µ′(1) = 2/3, µ′(0) = 1/3. Given a sample of size n with ε = o(1/n), we cannot distinguish between these two models.
5.1 (a) Markov process in Example 5.1, (b) aggregated model at depth 1. From Observation 1, the model on the left can be reparameterized as a complete tree at any depth ≥ 2. We can hence ask for its aggregation at any depth. Aggregations of the model on the left at depths ≥ 2 will hence be the model itself.
5.2 (a) Markov process in Example 5.2, (b) the same process when ε = 0.
8.1 (a) The conditional probabilities with which Y′_{j1} and Y″_{j1} have to be chosen are q(·|u) and q(·|v) respectively. The line on the left determines the choice of Y′_{j1} and the one on the right the choice of Y″_{j1}. For example, if U_{j1} is chosen uniformly in [0,1], the probability of choosing Y′_{j1} = a1 is q(a1|u). Instead of choosing Y′_{j1} and Y″_{j1} independently, we reorganize the intervals in the lines so as to encourage Y′_{j1} = Y″_{j1}. (b) Reorganizing the interval [0,1] according to the described construction. Here r(a1) = min{q(a1|u), q(a1|v)}, and similarly for r(a2). If U_{j1} falls in the interval corresponding to r(a1), then (Y′_{j1}, Y″_{j1}) = (a1, a1). If U_{j1} > C(A) in this example, then (Y′_{j1}, Y″_{j1}) = (a1, a2). When U_{j1} is chosen uniformly in [0,1], the probability that Y′_{j1} outputs any symbol is the same as in the picture on the left, and similarly for Y″_{j1}.
9.1 CFTP for Example 9.1. The event that both chains coalesce with output s1 (s2) can only happen at even (odd) time indices. The paths leading to coalescence are highlighted with thick arrows.
9.2 CFTP for Example 9.4, where partial coalescence happens w.r.t. G = {s2, s3}.
10.1 Bitmap of the original graph.
10.2 Comparing our algorithm with CC-PIVOT when no scaling is performed.
10.3 Comparing the scaled version of our algorithm with CC-PIVOT.
10.4 Comparing the scaled version of our algorithm with CC-PIVOT.
10.5 A realization of an LFR network with n = 200 nodes and 6 communities.
10.6 The output of our algorithm when the cost is optimized.
10.7 The output of the CC-PIVOT algorithm.
10.8 The output of our algorithm before coalescence happens.
10.9 The schedule of American football games during the 2000 season. Nodes represent teams and edges represent games between teams.
10.10 The output of the CC-PIVOT algorithm.
10.11 The output of our algorithm.
List of Tables

10.1 All clusters have the same size; m = 10 samples were taken uniformly at random from G(n, p, q).
10.2 All clusters have the same size. Here, n = 600 and m = 10 samples were taken uniformly at random from G(n, p, q).
10.3 All clusters have the same size. Here, n = 1000 and m = 10 samples were taken uniformly at random from G(n, p, q).
List of Algorithms

9.1 Coupling From the Past
9.2 Detecting Communities in a Markov Process
1 Introduction
1.1 Overview
Science in general, and information theory and learning in particular, have long concerned
themselves with parsimonious descriptions of data. One of the driving motivations is that if we
understand the statistical model underlying the data, we should be able to describe the data
using the fewest possible bits—a concept that is formalized by the notion of entropy and
that forms the foundation for quantifying information.
A more interesting situation arises when the underlying statistical model is unknown. Depending
on the application at hand, we then consider a set P of all possible models that are
consistent with what is known about the application—say, all i.i.d. or Markov models. It
is often possible that we can still describe the data, without knowing the underlying model
other than that it belongs to P, using almost as succinct a description as if we knew the
underlying statistics. This notion is formalized as universal compression [1, 2]. Of course,
the universal description always incurs a penalty—the redundancy or excess bits beyond the
minimum needed if we knew the model.
Closely related to universal compression is the concept of Minimum Description Length
(MDL), where the redundancy above is interpreted as the complexity of the model class. A
natural question then is—if the redundancy is small (i.e., can be upper bounded uniformly
for all models in P by a function sublinear in the data length), in the process of obtaining
a universal description did we somehow estimate the unknown model in P?
While the general answer is no, universal compression and estimation often do go hand in
hand, at least in several elementary settings. Suppose P is the collection of all distributions
over a finite set A, and the data at hand is a length-n sequence in A^n (the set of all length-n strings of symbols from A) obtained by independent and identically distributed (i.i.d.)
sampling from an unknown distribution in P. It is then possible to estimate the underlying distribution using a universal description, as is the case with the universal estimators
by [3] or [4, 5]. As a more complex example, the Good–Turing estimator (see [6]) can also
be interpreted as being obtained from such a universal compression [7] in a more general
setting—data is exchangeable, rather than i.i.d.
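As a concrete illustration of the i.i.d. case, here is a minimal sketch of the add-half (Krichevsky–Trofimov) rule, one classical universal estimator; it is shown for illustration only and is not necessarily the estimator of [3] or [4, 5], and the function name is ours:

```python
from collections import Counter

def kt_estimate(sample, alphabet):
    """Add-half (Krichevsky-Trofimov) estimate of an i.i.d. distribution.

    Each symbol gets probability (count + 1/2) / (n + |A|/2), the predictive
    distribution of a Dirichlet(1/2, ..., 1/2) prior; the induced sequential
    code is a classical universal code for i.i.d. sources over the alphabet.
    """
    counts = Counter(sample)
    n = len(sample)
    denom = n + len(alphabet) / 2
    return {a: (counts[a] + 0.5) / denom for a in alphabet}

est = kt_estimate("aabab", alphabet="ab")
# The estimate is a proper distribution, and unseen symbols keep positive mass.
assert abs(sum(est.values()) - 1.0) < 1e-12
```

Note that, unlike the maximum likelihood estimate, the add-half rule never assigns probability zero, which is what keeps the codelength penalty (redundancy) logarithmic in n.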
With Markov processes, the connections between universal compression and estimation become
more subtle. At a gross level, compression guarantees hold uniformly over Markov
model classes, but estimation guarantees usually do not. Despite this apparent dichotomy, our
estimation results rely fundamentally on universal compression being possible. Consider a length-n
sample obtained from an unknown source in the class of binary Markov sources with memory
one. No matter what the source is or what the sample looks like, there are universal
algorithms which describe the sample using at most O(log n) bits more than the best description
if the source were known. Yet, as we will see, no matter how large n is, we may not be
in a position to reliably provide estimates of the stationary probabilities of 0s and 1s.
Let the transition probability from 1 to 0 in our memory-one source be 1/n. By changing
the transition probability from 0 to 1 appropriately, we can vary the stationary probabilities
of 1s and 0s over a wide range, without changing how a length-n sample will look.
Example 2.4 in Chapter 2.4 gives two such binary, one-bit-memory Markov sources with
stationary probabilities (1/2, 1/2) and (2/3, 1/3) respectively. But if we start from 1, both
sources will, with high probability, yield a sequence of n 1s. We cannot distinguish between
these two sources with a sample of this size—and hence it is futile to estimate stationary
probabilities from this sample. This particular phenomenon, where the number of times
each state (0 and 1 here) appears is very different from its stationary probability, is often
formalized as slow mixing, see [8].
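This two-source construction can be simulated directly. The sketch below follows the parameter choices of the example above (transition probability 1/n out of state 1, and 1/n versus 2/n out of state 0, giving stationary probabilities of 1 equal to 1/2 and 2/3); the function name and seeds are ours:

```python
import random

def sample_chain(n, p10, p01, start=1, rng=random):
    """Sample n steps of a binary Markov chain with one bit of memory.

    p10 = P(next = 0 | current = 1) and p01 = P(next = 1 | current = 0).
    """
    x, out = start, []
    for _ in range(n):
        out.append(x)
        if x == 1:
            x = 0 if rng.random() < p10 else 1
        else:
            x = 1 if rng.random() < p01 else 0
    return out

n = 10_000
# Both chains leave state 1 with probability 1/n, but their stationary
# probabilities of 1 are 1/2 and 2/3 respectively.  Started from 1, either
# chain emits a run of n ones with probability (1 - 1/n)^n ~ 1/e, so a
# length-n sample cannot reliably tell the two apart.
x_half = sample_chain(n, p10=1 / n, p01=1 / n, rng=random.Random(0))
x_twothirds = sample_chain(n, p10=1 / n, p01=2 / n, rng=random.Random(0))
```

With the same driving randomness the two samples are essentially identical: each contains only a handful of state changes in 10,000 steps, so empirical frequencies say almost nothing about the stationary distribution.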
Motivated by an abstraction of a channel estimation problem to be mentioned shortly, we ask
how best to estimate properties of an ergodic, yet potentially slow mixing, Markov process
from a sample of size n. As the above example shows, we have an estimation problem where
any estimator can only converge pointwise to the true values, rather than uniformly over
the model class. One way to get around this impasse is to add restrictions to the model
space, as is done in most prior work. However, very few such restrictions are justified in
our application. So we take a different approach: can we look at some characteristics of our
length-n sample and say whether any estimates are doing well?
Say, for the sake of a concrete example, that we have a sample x1 with n − log n 1s followed
by a string of log n 0s. Perhaps this may have come from a one-bit-memory, slow mixing
Markov source as in Example 2.4. As we saw, it is futile to estimate stationary probabilities
in this case. Contrast this sample with a new sample x2, also with n − log n 1s and log n
0s, but with the 0s spread uniformly through the sequence. Unlike with x1, upon seeing x2 we may
want to conclude that we have an i.i.d. source with a high probability of 1.
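One crude way to see why the two samples invite different conclusions is to compare their empirical bigram counts. The construction below and the bigram diagnostic are our own illustrative choices, not the formal criterion developed later in the dissertation:

```python
import math

def transition_counts(x):
    """Count occurrences of each bigram (a, b) at consecutive positions."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, x[1:]):
        counts[(a, b)] += 1
    return counts

n = 1024
k = int(math.log2(n))          # = 10 zeros in each sample
x1 = [1] * (n - k) + [0] * k   # zeros bunched together at the end
x2 = [1] * n
for i in range(k):             # zeros spread evenly through the sequence
    x2[(i * n) // k] = 0

# x1 contains a single 1->0 transition: consistent with a slow mixing chain
# that switched state once.  x2 contains k isolated zeros, so the empirical
# probability of "0 follows 1" is about k/n regardless of the recent past,
# which looks like an i.i.d. source with a high probability of 1.
c1, c2 = transition_counts(x1), transition_counts(x2)
```

The symbol counts of x1 and x2 are identical; only the arrangement of the 0s, captured here by the bigram counts, distinguishes the two hypotheses.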
The particular application we are motivated by arises in high speed chip-to-chip
communications, and is commonly called the backplane channel [9]. Residual reflections between
inter-chip connects introduce significant complications, and the most commonly used modeling
assumptions are unjustified here. But these channels form the backbone of data centers, and
operating efficiently on them is a significant commercial imperative. Without going into details,
engineering questions on such channels can be reduced to computing entropy rates of, or
estimating parameters or stationary probabilities of, Markov processes.
Our main results on estimation are summarized in Chapter 3.1. The results show how to
look at a data sample and identify properties of the process that are amenable to accurate
estimation from the sample. They also allow us to sometimes (depending on how the data
looks) conclude that certain naive estimators of stationary probabilities or model parameters
happen to be accurate, even if the process is slow mixing.
To obtain estimates of transition probabilities we use universal compression approaches. As
we observed above, the stationary probabilities can be a very sensitive function of the transition
probabilities. What can we say about them from the few approximate transition probabilities
obtained from the sample? We use a coupling argument [10] in Chapter 8.2 to answer this
question.
While estimation of parameters when the source has not yet mixed is a daunting task on its
own, the second part of this work revolves around how to interpret samples before mixing
has occurred. More specifically, can tracking the evolution of samples from a Markov source,
before mixing happens, reveal something interesting about the polarization of the state space?
To answer this question, we consider the elegant coupling from the past (CFTP) algorithm [11],
which allows for perfect sampling from the stationary distribution of any Markov
chain. Chapter 9 is devoted to theoretical results on how to identify subsets of states which
coalesce faster during the execution of the CFTP algorithm. These theoretical results can be
translated into algorithmic rules (Algorithm 9.2) for detecting communities in graphs. We
show how to construct Markov random walks on graphs and use the proposed algorithm to
identify communities within the graphs.
1.2 Prior Work
In this section, we review prior work on general theoretical results on estimation and sampling
in Markov processes. We will review prior work on more specific topics related to community
detection on graphs in Chapter 10.
1.2.1 Prior Work on Compression of Markov Processes
For k-memory Markov processes with known k, optimal redundancy rates for universal
compression and estimation have been established; see e.g. [12] for an overview and also [13–15].
These universal compression results imply consistent estimators for the probabilities of
sequences. Moreover, the rate of convergence of these estimators can be bounded uniformly
over the entire k-memory Markov model class, e.g., [16–19]. This rate typically depends
exponentially on k and diminishes with the sample length as log n/n. We point out two
complications when confronted with our problem.
First, we deal with the case of unbounded memory—namely, no a priori bound on k. For the
set of all finite memory Markov sources, only weakly universal [20] compression schemes—
those that converge in a pointwise sense—can be built (see [21] for a particularly nice
construction). Namely, the convergence of weakly universal algorithms varies depending
on the true, unseen memory of the source. However, as we will see in Example 2.3, it may
be impossible to estimate the memory of the source from a finite length sample. There has
been a lot of work on estimating the memory of the source consistently when a
prior bound on the memory is unknown, see [22–25]—but, as one would expect, given a finite
length sample no estimator developed will always have a good answer.
Second, despite the positive result for estimation of the probabilities of sequences, as
mentioned in the introduction, there cannot be estimators for transition and stationary
probabilities whose rate of convergence is uniform over the entire model class. This negative
observation follows simply from the way we are forced to sample from slow mixing
processes—and this is a complication compression does not encounter. For instance, in the
example outlined above in the introduction, both samples x1 and x2 can be well compressed
by universal algorithms, but estimation is a whole different ballgame.
The complications above apart, the nature of the questions we ask is different as well. Rather
than consistency results, or establishing process-dependent rates of convergence for the case
where the memory can be unbounded, we ask how to give the best possible answer with a
given sample of length n. This is not to say, however, that the above compression results are
irrelevant to our problem. Far from it: one of our results, Theorem 7.1 in Chapter 7, builds on
(among other things) the universal compression results obtained for k-memory processes.
1.2.2 Prior Work on Estimation of Markov Processes
Estimation for Markov processes has been extensively studied and falls into three major
categories: (i) consistency of estimators, e.g., [22, 23, 26–28]; (ii) guarantees on estimates that
hold eventually almost surely, e.g., [29, 30]; and (iii) guarantees that hold for all sample
sizes but which depend on the model parameters, e.g., [24, 25, 31, 32]. The list above is not
exhaustive; rather, it focuses on the work closest to the approaches we take.
As mentioned earlier, the performance of any estimator cannot be bounded uniformly over
all Markov models, something reflected in line (iii) of research and in our work. While
unavoidable, this poses a problem, since the deviation bounds now depend on the unknown
model. How then do we say whether our estimate is doing well? Our thrust in this work focuses
on exactly this question—it is not just about consistency; rather, we want to gauge from
the observed sample whether our estimator is doing well relative to the unknown probability law
in force.
A survey of consistent estimators for the conditional probabilities of Markov processes
is provided in [28]. For instance, given a realization of a Markov process, they provide a sequence
of estimators for transition probabilities, along a subsequence of time steps, which converges
almost surely to the true values.
Consistent estimators for the order of Markov processes have been studied in the prior
literature, e.g., [30, 33–35]. In [30, 34], the penalized maximum likelihood technique is used to
provide a consistent order estimator. In [35], a different consistent order estimator based
on empirical counts is proposed, which minimizes the asymptotic underestimation exponent
while keeping the overestimation exponent at a certain level. Using the same technique, an
optimal order estimator is provided in [36]; however, it assumes an a priori upper bound on
the memory of the underlying process.
In [22], estimation of the minimal context tree of a Markov process is addressed. Two different
information criteria, namely the Bayesian Information Criterion and Minimum Description
Length, are used, and consistency of estimation of the underlying context tree is established
provided that the depth of the hypothetical trees grows as o(log n). Moreover, it was shown in
[23] that when the process has finite memory, the o(log n) condition is not necessary for
estimation consistency.
In [24, 25, 32], exponential upper bounds on the probability of incorrectly estimating (i)
conditional and stationary probabilities and (ii) the underlying context tree are provided for
variants of Rissanen's algorithm Context and the penalized maximum likelihood estimator. The
deviation bounds introduced there depend on the model parameters of the underlying process
(e.g., the minimum stationary probability of all contexts p_min, the depth of the tree and the
continuity rate coefficients).
One particular paper that we would like to highlight is [31], where the problem of estimating
a stationary ergodic process by finite memory Markov processes, based on a length-n
sample of the process, is addressed. A measure of distance between the true process and its
estimate is introduced, and a convergence rate with respect to that measure is provided.
However, the bounds proved there hold only when the infimum of the conditional probabilities
of symbols given the past is bounded away from zero.
In this work, given a realization of a Markov process, we consider a coarser model and provide
deviation bounds for sequences which have occurred frequently enough in the sample. We
make an assumption justified by the physical motivation we consider—that dependencies
die down in the Markov model class we consider. What we insist is that our bounds, while
model dependent as is to be expected, must also be computable using only parameters which
are well approximated from data. In particular, we do not assume a priori knowledge of
the depth of the context tree of the process, nor do we assume that the conditional probabilities
given the past are bounded uniformly away from zero.
1.2.3 Prior Work on Entropy Rate of Markov Processes
Entropy rate estimation methods fall into two major categories: (i) methods which use
empirical counts, and (ii) methods built on universal compression results.
In approaches based on empirical counts (sometimes called plug-in methods), the maximum
likelihood estimates of the transition and stationary probabilities are calculated from the sample
and used to approximate the entropy rate. For instance, in [37, 38], the entropy rate of
a stationary Markov chain is estimated using empirical counts, and the estimator is shown
to be strongly consistent. Furthermore, it is proven that if the chain has a finite state
space, the estimator is asymptotically normal. In [39], sufficient conditions under which the
plug-in estimator of the entropy rate of a denumerable Markov chain is asymptotically normal
are provided. In [40], an estimator for the entropy rate of ergodic Markov chains with finite state
space, based on sample path simulation, is proposed, for which the convergence rate is exponential.
The extension to Markov chains with countable state space is treated in [41]. However, the
deviation bounds depend on the unknown true parameters of the underlying process.
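A minimal sketch of the plug-in approach for a memory-one chain is given below; the two test sequences are our own toy examples, not those used in [37–41]:

```python
import math
import random
from collections import Counter

def plugin_entropy_rate(x):
    """Plug-in estimate of the entropy rate (bits/symbol) of a memory-one chain.

    The empirical stationary probabilities mu(a) and transition probabilities
    p(b|a) are read off the sample and substituted into
    H = -sum_a mu(a) sum_b p(b|a) log2 p(b|a).
    """
    pair = Counter(zip(x, x[1:]))     # bigram counts
    first = Counter(x[:-1])           # counts of the conditioning symbol
    total = len(x) - 1
    rate = 0.0
    for (a, b), c in pair.items():
        p_cond = c / first[a]         # empirical P(b | a)
        rate -= (first[a] / total) * p_cond * math.log2(p_cond)
    return rate

# Deterministic alternation: every transition is certain, so the estimate is 0.
rate_alt = plugin_entropy_rate([0, 1] * 500)

# Fair-coin i.i.d. bits: the true entropy rate is 1 bit per symbol, and the
# plug-in estimate should be close to (and never above) 1 for a binary chain.
rng = random.Random(0)
rate_iid = plugin_entropy_rate([rng.randrange(2) for _ in range(20000)])
```

As the surrounding discussion notes, consistency of such plug-in estimates is not the issue; for slow mixing chains the empirical counts themselves need not be anywhere near the stationary quantities at a given n.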
Universal entropy rate estimators for sources with memory have been studied in earlier work;
see e.g., [42–46].
1.2.4 Prior Work on Sampling From Markov Processes
In many practical applications, it is desirable to simulate a random variable X over a finite
state space S according to a given probability distribution π. A promising method for
obtaining samples according to π is to construct an irreducible, aperiodic Markov chain over
S whose unique stationary distribution is π. The class of Markov Chain Monte Carlo
(MCMC) random sampling methods, e.g., the Gibbs sampler, the Metropolis chain and Glauber
dynamics, falls into this category, and these methods are especially well suited to high dimensional
state spaces. However, in most cases, these approaches suffer from the fact that it is difficult to
obtain an upper bound on the sample size which guarantees a prescribed discrepancy.
To overcome the aforementioned drawbacks of MCMC methods, Propp
and Wilson devised an algorithm known as Coupling From The Past (CFTP), which generates
exact samples according to π. We provide a brief summary of the algorithm and a proof of
its correctness in Section 9.2. In short, the idea is to simulate |S| different
copies of a Markov chain, each starting from a different initial state, backward in time. When
all copies of the chain coalesce at time 0, a sample is obtained which is distributed perfectly
according to π.
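The backward simulation just described can be sketched in a few lines. The following is a minimal illustration rather than the construction developed in Section 9.2: the helper names `cftp` and `phi` are ours, and the chain (a lazy reflecting walk on {0, 1, 2}, whose stationary distribution is uniform) is chosen only so the sketch is easy to check. Note how the shared randomness is reused as the algorithm looks further into the past; this is the reuse of random numbers referred to later in this section.

```python
import random

def phi(x, u):
    """One step of a lazy reflecting walk on {0, 1, 2} driven by u in [0, 1).
    Its stationary distribution is uniform."""
    return max(x - 1, 0) if u < 0.5 else min(x + 1, 2)

def cftp(states, update, rng=None):
    """Coupling From The Past (Propp-Wilson): returns an exact sample from
    the stationary distribution of the chain with update rule `update`."""
    rng = rng or random.Random(0)
    us = []                          # shared randomness for times -1, -2, ...
    T = 1
    while True:
        while len(us) < T:           # extend further into the past,
            us.append(rng.random())  # REUSING the earlier randomness
        ends = set()
        for s in states:             # run every copy from time -T to 0
            x = s
            for t in range(T, 0, -1):
                x = update(x, us[t - 1])   # us[t-1] drives time -t
            ends.add(x)
        if len(ends) == 1:           # all copies coalesced by time 0
            return ends.pop()
        T *= 2                       # otherwise look further into the past

sample = cftp([0, 1, 2], phi)        # one exact draw from the uniform law
```

Because `phi` is monotone in x for each fixed u, tracking only the copies started at the minimal and maximal states would already suffice here; that is the monotone-CFTP idea discussed next.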
The first drawback associated with CFTP is that when the state space is large, one must keep
track of a large number of chains. To reduce the number of trajectories to be tracked, different
approaches such as the sandwiching technique have been proposed [47, 48]. This approach is
especially suited to state spaces which admit a partial order with a maximal and a minimal
element. If one can come up with a coupling construction which preserves the order, then
only the chains starting from the maximal and minimal elements need to be tracked. This
approach is sometimes referred to as monotone-CFTP [11, 49]. The second drawback associated
with CFTP is the need to reuse random numbers in different runs of the algorithm. To
circumvent this issue, the read-once version of CFTP was developed in [50].
A third issue with CFTP is the running time of the algorithm, which is intrinsically
random. More precisely, if the coupling design is poor, coalescence may take a
long time to occur. Besides the effect of the coupling construction on performance, if the
underlying Markov process is slow mixing, the time needed to get an exact sample from the
target distribution may be extremely long.
1.3 Contributions and Future Directions
This work addresses statistical estimation using finite samples from Markov sources
before they have mixed. We have shown how to use data generated by potentially slow
mixing Markov sources to identify those states for which naive approaches will estimate well
both parameters and functions related to stationary probabilities. To do so, we require that
the underlying Markov source have dependencies that are not completely arbitrary, but die
down eventually. In such cases, we show that even while the source may not have mixed
(explored the state space properly), certain properties related to contexts w can be well
estimated if |w| grows as Θ(log n).
This work also uncovers a number of open problems. The estimation results are sufficient to say
that some estimates are approximately accurate with high confidence. A natural, but perhaps
difficult, question is whether we can give necessary conditions on how the data must look for
a given estimate to be accurate. This work also forms a cog in the growing understanding of
the information theoretic underpinnings of estimation problems with memory. These
results add to the understanding of model classes that only admit estimators converging
pointwise over the class (namely, at rates that could be arbitrarily slow depending on the
underlying model), but are special in the sense that it is possible to say whether the algorithm
is doing well or not.
We provide algorithmic primitives based on Coupling From the Past in order to build novel,
promising and fast spectral clustering and community detection algorithms. Our algorithm
creates a random walk on the nodes of the graph. We start different, coupled random walks
from different nodes. We then adapt the coupling from the past approach to identify clusters
before the chain mixes (rather than sample from the stationary distribution).
This work can be extended in the following directions: (i) developing the theoretical foundation
that explains the proposed community detection algorithm, (ii) obtaining theoretical
conditions for recovery of clusters in stochastic block models and comparing them with
information theoretic limits, and (iii) expanding our algorithmic primitives to handle a broader
set of community detection problems, in particular, to understand and model inherent biases
in widely prevalent means of information collection today.
2 Markov Processes
Most notation, while standard, is included for completeness.
2.1 Alphabet and Strings
A is a finite alphabet with cardinality |A|, A∗ = ∪_{k≥0} A^k, and A^∞ denotes the set of all
semi-infinite strings of symbols in A.
We denote the length of a string u = u_1, . . . , u_l ∈ A^l by |u|, and use u_i^j = (u_i, · · · , u_j). The
concatenation of strings w and v is denoted by wv. A string v is a suffix of u, denoted by
v ≼ u, if there exists a string w such that u = wv. A set T of strings is suffix-free if no
string of T is a suffix of any other string in T.
2.2 Trees
As in [18] for example, we use full A-ary trees to represent the states of a Markov process.
We denote full trees T as suffix-free sets T ⊂ A∗ of strings (the leaves) whose lengths satisfy
Kraft's lemma with equality. The depth of the tree T is defined as κ(T) = max{|u| : u ∈ T}.
A string v ∈ A∗ is an internal node of T if either v ∈ T or there exists u ∈ T such
that v ≼ u. The children of an internal node v in T are those strings (if any) av, a ∈ A,
which are themselves either internal nodes or leaves in T.

For any internal node w of a tree T, let T_w = {u ∈ T : w ≼ u} be the subtree rooted at
w. Given two trees T_1 and T_2, we say that T_1 is included in T_2 (T_1 ≼ T_2) if all the leaves in
T_1 are either leaves or internal nodes of T_2.
2.3 Models
Let P + (A) be the set of all probability distributions on A such that every probability is
strictly positive.
Definition 2.1.
A context tree model is a finite full tree T ⊂ A∗ with a collection of
probability distributions q(·|s) ∈ P + (A) assigned to each s ∈ T . We will refer to the
elements of T as states (or contexts), and q(T ) = {q(a|s) : s ∈ T , a ∈ A} as the set of state
transition probabilities or the model parameters.
Every model (T , q(T )) allows for an irreducible, aperiodic 1 and ergodic [51] Markov process
1. Irreducible since q(·|s) ∈ P + (A), aperiodic since any state s ∈ T can be reached in either |s| or |s| + 1
steps.
with a unique stationary distribution µ satisfying
µ Q = µ,
(2.1)
where Q is the standard transition probability matrix formed using q(T ). Let pT ,q be the
unique stationary Markov process {. . . , Y0 , Y1 , Y2 , . . .} which takes values in A satisfying
p_{T,q}(Y_1 | Y^0_{−∞}) = q(Y_1 | s)

whenever s is the unique suffix s ≼ Y^0_{−∞} in T, which we denote by s = c_T(Y^0_{−∞}).
Namely the mapping cT : A∞ → T maps any left-infinite sequence in A∞ to its unique suffix
in T .
As a note, when we write out actual strings in transition probabilities as in q(0|1000), the
state 1000 is the sequence of bits as we encounter them when reading the string left to right.
If 0 follows · · · 1100, the next state is a suffix of · · · 11000, and if 1 follows · · · 1100, the next
state is a suffix of · · · 11001.
Observation 2.1.  A useful observation is that any model (T, q(T)) yields the same
Markov process as a model (T′, q(T′)) where T ≼ T′ and for all s′ ∈ T′, q(·|s′) = q(·|c_T(s′)).
Example 2.1.  Let (T, q(T)) be a binary Markov process with T = {11, 01, 0} and transition
probabilities q(1|11) = 1/4, q(1|01) = 1/3, q(1|0) = 3/4, as shown in Fig. 2.1(a). Observe that
Fig. 2.1(b) shows the same Markov process as a model (T′, q(T′)) with T′ = {11, 01, 10, 00}
satisfying the conditions in Observation 2.1.
We will sometimes refer to the stationary probability of any string u as µ(u), not just strings
in T. This is just a synonym for p_{T,q}(Y_1^{|u|} = u).
Figure 2.1: (a) States and parameters of a binary Markov process in Example 2.1, (b)
Same Markov process reparameterized to be a complete tree of depth 2. We can similarly
reparameterize the process on the left with a complete tree of any depth larger than 2.

As emphasized in the introduction, we do not assume the true model is known, nor do we
assume it is fast mixing. We would like to know whether we can estimate the model parameters
and the stationary probabilities of various states even when we are in the regime where the
mixing has not happened.
2.4 Difficulties in Estimation
There are two distinct difficulties in estimating Markov processes such as the ones we are
interested in. The first is memory that is too long to handle given the size of the sample at
hand. The second is that even though the underlying process might be ergodic, the transition
probabilities may be so small that the process effectively acts like a non-ergodic process
given the sample size available. We illustrate these problems in the following simple examples.
It is quite possible that all strings in a finite sample, no matter how large, have arbitrarily small
mass under the stationary distribution. We illustrate this in Example 2.2 below. Our
results, particularly Theorem 8.3, incorporate this phenomenon, and we try to provide the
best results possible despite this apparent difficulty.
Example 2.2.  Let A = {0, 1} and T = {0, 1} with q(1|1) = 1 − ε and q(1|0) = ε/m (see
Figure 2.2). For ε > 0 and a constant m ∈ R with m > ε, this model represents a stationary
ergodic Markov process p_{T,q} with stationary distributions µ(1) = 1/(m+1) and µ(0) = m/(m+1).

Figure 2.2: Markov process in Example 2.2 with stationary probabilities µ(1) = 1/(m+1) and
µ(0) = m/(m+1).

Note that µ(1) can be arbitrarily small for sufficiently large m.
Now suppose we have a length-n sample with ε ≪ 1/n. If we start from 1, with high
probability we see a sequence of n consecutive 1's. For instance, if ε = 1/n^j for some j ≥ 2,
then with probability ≥ 1 − 1/n under p_{T,q}, we see a sequence of n consecutive 1's. Clearly,
the stationary probability of any sequence of 1's is ≤ 1/(m+1), and this can be made arbitrarily
small by choosing m large enough.
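As a quick numeric sanity check of this example (using the reconstructed parameters q(1|1) = 1 − ε and q(1|0) = ε/m; the helper name is ours), the following sketch verifies the stationary law and the claim about n consecutive 1's:

```python
def stationary_two_state(p01, p10):
    """Stationary law of a chain on {0, 1} with P(1|0) = p01, P(0|1) = p10."""
    mu1 = p01 / (p01 + p10)
    return 1.0 - mu1, mu1

n, m = 1000, 50.0
eps = 1.0 / n ** 2                               # eps = 1/n^j with j = 2
mu0, mu1 = stationary_two_state(eps / m, eps)    # q(1|0) = eps/m, q(0|1) = eps
assert abs(mu1 - 1.0 / (m + 1)) < 1e-12          # mu(1) = 1/(m+1)
assert abs(mu0 - m / (m + 1)) < 1e-12            # mu(0) = m/(m+1)
# starting from 1, the chance of n consecutive 1's exceeds 1 - 1/n:
assert (1.0 - eps) ** n > 1.0 - 1.0 / n
```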
The next example illustrates one pitfall of having no bound on the memory.
Example 2.3.  Let T = A^k denote a full tree with depth k and A = {0, 1}. Assume
that q(1|0^k) = 2ε and q(1|10^{k−1}) = 1 − ε with ε > 0, and let q(1|s) = 1/2 (where 0^k indicates
a string with k consecutive zeros) for all other s ∈ T. Let p_{T,q} represent the stationary
ergodic Markov process associated with this model. Observe that the stationary probability of
being in state 0^k is 1/(2^{k+1} − 1), while all other states have stationary probability
2/(2^{k+1} − 1). Let Y_1^n be a realization of this process with initial state 1^k ≼ Y^0_{−∞}. Suppose k = ω(log n).² With
high probability we will never find a string of k − 1 zeros among the n samples, and every bit
is generated with probability 1/2. Thus with samples of size n, no matter how large n may
be, with high probability we cannot distinguish certain long-memory processes from even an
i.i.d. Bernoulli(1/2) process.

2. A function f_n = ω(g_n) if lim_{n→∞} f_n/g_n = ∞.

Figure 2.3: Markov process in Example 2.3. With high probability, we cannot distinguish
p_{T,q} from an i.i.d. Bernoulli(1/2) process if the sample size n satisfies k = ω(log n).
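The stationary distribution claimed in this example can be checked numerically. The sketch below (with our reading of the garbled parameters, q(1|0^k) = 2ε and q(1|10^{k−1}) = 1 − ε; helper name ours) builds the chain on A^k and verifies that µ(0^k) = 1/(2^{k+1} − 1), independently of ε:

```python
from itertools import product

def stationary_context_chain(k, q1, iters=20000):
    """Power iteration for the stationary law of a binary order-k chain;
    q1[s] = P(next bit = 1 | last k bits s), states as bit tuples."""
    states = list(product((0, 1), repeat=k))
    mu = {s: 1.0 / len(states) for s in states}
    for _ in range(iters):
        nxt = {s: 0.0 for s in states}
        for s, p in mu.items():
            nxt[s[1:] + (1,)] += p * q1[s]
            nxt[s[1:] + (0,)] += p * (1 - q1[s])
        mu = nxt
    return mu

k, eps = 4, 0.01
q1 = {s: 0.5 for s in product((0, 1), repeat=k)}
q1[(0,) * k] = 2 * eps                 # q(1|0^k) = 2*eps (our reading)
q1[(1,) + (0,) * (k - 1)] = 1 - eps    # q(1|10^{k-1}) = 1 - eps
mu = stationary_context_chain(k, q1)
c = 1.0 / (2 ** (k + 1) - 1)
assert abs(mu[(0,) * k] - c) < 1e-9        # mu(0^k) = 1/(2^{k+1}-1)
assert abs(mu[(1,) * k] - 2 * c) < 1e-9    # every other state: 2/(2^{k+1}-1)
```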
We therefore require that dependencies die down by requiring that model parameters satisfy (6.1) in Chapter 6.
The next example illustrates complications while estimating stationary probabilities.
Example 2.4.  Let A = {0, 1} and T = {0, 1} with q(1|1) = 1 − ε and q(1|0) = ε (see
Figure 2.4). For ε > 0, this model represents a stationary ergodic Markov process with
stationary distribution µ(1) = 1/2, µ(0) = 1/2. Let T′ = {0, 1} with q′(1|1) = 1 − ε, q′(1|0) =
2ε. Similarly, for ε > 0 this model represents a stationary ergodic Markov process with
stationary distribution µ′(1) = 2/3, µ′(0) = 1/3.

Suppose we have a length-n sample and suppose ε ≪ 1/n. If we start from 1 (or 0), both
models will yield a sequence of n 1's (or 0's) with high probability. Therefore, the length-n
samples from the two sources look identical. Hence no estimator can distinguish between
these two models with high probability if ε = o(1/n), and therefore no estimator can obtain
their stationary probabilities either.

Figure 2.4: Markov processes in Example 2.4 with stationary probabilities (a) µ(1) = µ(0) =
1/2, (b) µ′(1) = 2/3, µ′(0) = 1/3. Given a sample of size n with ε = o(1/n), we cannot
distinguish between these two models.
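The contrast in this example is easy to check numerically. The sketch below (helper name ours; parameters as reconstructed, with ε restored) shows the two models have different stationary laws, even though, started from 1, either model outputs n consecutive 1's with high probability:

```python
def stationary_two_state(p01, p10):
    """Stationary law of a chain on {0, 1} with P(1|0) = p01, P(0|1) = p10."""
    mu1 = p01 / (p01 + p10)          # detailed balance: mu0*p01 = mu1*p10
    return 1.0 - mu1, mu1

eps, n = 1e-9, 1000
mu_a = stationary_two_state(eps, eps)        # model (a): q(1|0) = eps
mu_b = stationary_two_state(2 * eps, eps)    # model (b): q'(1|0) = 2*eps
assert abs(mu_a[1] - 1 / 2) < 1e-9           # mu(1) = 1/2
assert abs(mu_b[1] - 2 / 3) < 1e-9           # mu'(1) = 2/3
# yet, started from 1, both models emit n consecutive 1's w.h.p.:
assert (1 - eps) ** n > 0.999
```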
3 Summary of Results

3.1 Results on Estimation
We consider the following abstraction: we have a length-n sample from a stationary, ergodic,
A-ary Markov source p_{T,q}, where both T and q(T) are unknown. Using this sample, we want
(i) to approximate, as best as possible, the parameter set q(T), (ii) to approximate the
stationary probabilities µ(s) of strings s ∈ T, and (iii) to estimate, or at least obtain
heuristics for, the entropy rate of the process.
Two distinct problems complicate estimation of q(T ) and the stationary probabilities. First
is the issue that the memory may be too long to handle—in fact, if the source has long
enough memory it may not be possible, with n samples, to distinguish the source even from
a memoryless source (Example 2.3). Second, even if the source has only one bit of memory,
it may be arbitrarily slow mixing (Example 2.4). No matter what n is, there will be sources
against which our estimates perform very poorly.
A natural way to estimate a general Markov process with unknown states T is to try
estimating a potentially coarser model with states T̃ = A^{k_n} for some known k_n. With the benefit
of hindsight, we take k_n = O(log n) and write k_n = α_n log n for some function α_n = O(1).
This scaling of k_n also reflects well known conditions for consistency of estimation of Markov
processes [22]. This is the high-level approach we take as well.
An important point needs to be noted here. Given Y^0_{−∞}, we get the sample sequence Y_1^n
from the (unknown) model parameters q(T); equivalently, via the conditional probabilities
q(Y_i|s) = P(Y_i|s), where s ∈ T are contexts (and in this case, s = c_T(Y^{i−1}_{−∞})).
But we do not know T. So the best we can ask for is to estimate (using samples from the
true model) the conditional probability P(Y_1|u), where u ∈ A^{k_n} is a known string. The
probability P(Y_1|u) refers to the probability under the true model of seeing Y_1, given that
Y^0_{−k_n+1} = u, and is well defined whether or not u is a context in T. This is not the same as
estimating a Markov process with memory k_n; the distinction will be clarified below.
For convenience, we rephrase the above problem by defining, in Chapter 5, an aggregation of a
Markov process at depth k_n. The aggregation of the true model can be thought of as a
coarse approximation of the true model, one that has memory k_n and (unknown) parameters
corresponding to u ∈ A^{k_n} that are set to P(Y_1|u).
In any estimation problem posed in this work, we do not get to see observations corresponding
to the aggregated model. We need to estimate the aggregated parameters using
the observations from the true, underlying model (T, q(T)). Therefore the task is not the
same as estimating a model with memory k_n; indeed, estimating a model with memory k_n could
have been easily approached via known universal compression techniques. The aggregation
here is simply an abstraction that helps us state results more transparently: we never see
samples from it.
Entropy rate  We first compare the entropy rate of the true model with that of the aggregated
model. This is, of course, a purely hypothetical comparison; it is not as if we get to observe
the aggregated model. Rather, the hope is that we might somehow be able to estimate the
aggregated model by observing the true model.
We show in Proposition 1 that the entropy rate corresponding to the aggregated model
(T̃ , q̃(T̃ )) is an upper bound on the entropy rate of the true model (T , q(T )). While it may
not always be possible to directly use Proposition 1 in the slow mixing case, we develop the
notion of a partial entropy rate, a useful heuristic when the source has not mixed. Moreover,
Proposition 1 also motivates the estimation questions that form our main results.
Naive estimates  The issue remains that the sample Y_1^n at hand (given past Y^0_{−∞}) is from
p_{T,q}, not from p_{T̃,q̃}. To obtain the parameters of the aggregated process, q̃(T̃), we use a naive
estimator: we just pretend that our sample was in fact generated from p_{T̃,q̃}. Equivalently,
we assume that for any w ∈ T̃, the subsequence of symbols in our sample that follow w is
i.i.d. Consider the following illustration.
Suppose our sample is 1101010100, and let Y^0_{−∞} = ···00. Suppose we want to estimate, given a
two-bit string, the conditional probability under q(T) that bit a appears in the next step;
or, to rephrase, the aggregated parameters at depth 2.
In particular, say we want the aggregated parameters associated with the string 10, namely
q̃(a|10). The subsequence following 10 in the sample is 1110. Then the naive estimate of
q̃(1|10), denoted by q̂(1|10), is 3/4, and the naive estimate of q̃(0|10), denoted by q̂(0|10), is
1/4. In general, there is no reason why this naive approach even makes sense, because the
sample is not generated from p_{T̃,q̃}.
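The illustration above is easy to mechanize. Below is a small sketch (the helper `naive_estimates` is ours) that recovers q̂(1|10) = 3/4 and q̂(0|10) = 1/4 for the sample 1101010100 with past ···00:

```python
from collections import defaultdict

def naive_estimates(sample, past, k):
    """Naive aggregated-parameter estimates q-hat(a|w) at depth k: treat the
    symbols following each length-k context w in the sample as i.i.d."""
    counts = defaultdict(lambda: defaultdict(int))
    seq = past + sample
    for i in range(len(past), len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1       # symbol seq[i] follows context
    return {w: {a: c / sum(cs.values()) for a, c in cs.items()}
            for w, cs in counts.items()}

qhat = naive_estimates("1101010100", past="00", k=2)
assert abs(qhat["10"]["1"] - 3 / 4) < 1e-12
assert abs(qhat["10"]["0"] - 1 / 4) < 1e-12
```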
Dependencies die down  However, it is reasonable given our physical motivation that
the influence of prior symbols dies down as we look further into the past. We formalize this
notion in (6.1) with a function d(i) that controls how much symbols i locations apart can
influence each other, and require Σ_{i≥1} d(i) < ∞. Let M_d be the set of all models that
conform to (6.1). Note that there is no bound on the memory of models in M_d.

For sources in M_d, we show how to obtain (by looking at the data sample) G̃ ⊆ T̃, a set of
good states¹ or good length-k_n strings from the sample sequence. These are strings that will be
amenable to concentration results, and hence the adjective "good".
Model parameters  In Theorem 7.1, we show that with probability (under the underlying
unknown model in M_d) ≥ 1 − 1/(2|A|^{k_n+1} log n), conditioned on any past Y^0_{−∞}, for all
states w ∈ G̃ simultaneously,

‖q̃(·|w) − q̂(·|w)‖_1 ≤ 2 √( ln 2/log n + ln 2/log(1/δ_{k_n}) ).    (3.1)

Here, δ_{k_n} = Σ_{i≥k_n} d(i) and q̂(·|w) is the naive estimator of q̃(·|w) as described above. In
Theorem 7.3 we strengthen the convergence rate to polynomial in n if the dependencies die
down exponentially.
The above estimation result for length-k_n strings w ∈ G̃ rests on two facts:
(i) dependencies dying down, and (ii) universal compression results on M_d, which rely on the fact
that length-n sequences generated by Markov sources with memory k_n can be universally
compressed if k_n = O(log n). Notably, and perhaps surprisingly, it does not matter
whether the empirical frequencies of the strings w ∈ G̃ are anywhere near their stationary
probabilities.
A second, related curiosity also arises because the above result does not depend on
the empirical frequencies being close to the stationary probabilities: the result is sometimes
tight for strings w while being vacuous for their suffixes w′. For example, it could be that
we estimate parameters associated with a string w of length Θ(log n) (say w is a string of
ten 0's) but not those associated with w′, where w′ ≼ w (say w′ is a string of five 0's).

1. Strings in T̃ may not be states of p_{T,q}, but we abuse notation here for convenience.
We are now in a position to tackle the hardest problem.
Stationary probabilities of strings  In general, the stationary probabilities of strings can
be a very sensitive function of the model parameters. We now have approximations to the
model parameters of the contexts in G̃. With this little bit of information we have
gleaned from the sample, can we even hope to say anything about the stationary probabilities
of w ∈ G̃? How, then, do we interpret the empirical counts N_w of various strings w? Here
we come to the most involved problem in this work.
To answer this question, we define a parameter η_G̃ in (8.2) that is calculated using just the
parameters associated with w ∈ G̃ (note that G̃ is a function of the sample). Suppose {δ_i}_{i≥1}
is summable as well, and let Δ_j = Σ_{i≥j} δ_i. In Theorem 8.3 we prove that if the evolution of
the process restricted (see Chapter 8 for a description) to just the contexts in G̃ is aperiodic,
then for all t > 0, all pasts Y^0_{−∞} and all w ∈ G̃, the count N_w of w in the sample
concentrates:

p_{T,q}( |N_w − ñ µ(w)/µ(G̃)| ≥ t | Y^0_{−∞} ) ≤ 2 exp( −(t − B)² / (2ñB²) ),    (3.2)

where B ≈ 4 max{ℓ_n, k_n}/[η_G̃^{k_n}(1 − Δ_{k_n})]. Here, ℓ_n is the smallest integer such that
Δ_{ℓ_n} ≤ 1/n, ñ is the total count of good states in the sample, and µ denotes the stationary
distribution of p_{T,q}. Note once again that p_{T,q} is the probability law under the underlying
unknown model in M_d.
A few points about the above bound need to be noted. First, k_n, Δ_{k_n} and ℓ_n (the latter
two related to how fast dependencies die down) are known a priori. But ñ is a random
variable found from the sample. So is η_G̃, but one that can also be well estimated from the
sample: it depends only on the transition probabilities we already estimated in (3.1)
(simultaneously well for all contexts in G̃, with confidence ≥ 1 − 1/(2|A|^{k_n+1} log n)). It is thus
perfectly possible that for some samples we may be able to say little about the counts N_w,
while for other samples we can interpret them well.
Thus (3.2) is a natural deviation bound where the right side is a random variable generated
from the model p_{T,q}, but one that can be well estimated from the sample. To use it when
confronted with a sample, we would further lower bound η_G̃ by an estimate derived from
the data using (3.1), with the effect that the model dependent right side is replaced by
another upper bound: potentially worse, but entirely data dependent (with a new, reduced
confidence obtained by a union bound on (3.1) and (3.2)).
The above estimation result (3.2) follows from a coupling argument [10] on a natural Doob
Martingale construction in Chapter 8.2.
Remark  A couple of remarks about notation, since there is quite a bit to keep track of.
All logarithms are base 2. We use bold font w or s for strings. Typically s is a generic state,
while w, as a mnemonic, is a "good" state. A subscript s usually refers to an instance of the
Markov process whose past corresponds to s; for example q(·|s), or N_s (the count of s
in the sample, i.e., the number of times the sample had s as its immediate past).
3.2 Results on Sampling
In Chapter 9, we review Coupling From the Past (CFTP), an algorithm for exact
sampling from the stationary distribution of a Markov process. In Section 9.3.2, we introduce the
restriction of the process to a subset of states, and in Theorem 9.6 we provide sufficient
conditions under which CFTP yields a sample from the restricted process. Based on Theorem
9.6, we then provide Algorithm 9.2, which can be used as a tool to identify communities within
the state space of the process.
Finally, in Chapter 10, we use Algorithm 9.2 to identify communities in graphs by defining
Markov random walks on them. The mixing properties of the random walks we construct
are used to identify communities. We then demonstrate the performance of our algorithms on
specific graph structures, including the stochastic block model and a few real-world benchmark
networks.
4 Background Topics

4.1 Context Tree Weighting
The context tree weighting algorithm is a universal data compression algorithm for Markov
sources [18, 19], and it can be used to capture several insights into how Markov
processes behave in the non-asymptotic regime. Let y_1^n be a sequence of symbols from an
alphabet A. Let T̂ = A^K for some positive integer K. For all s ∈ T̂ and a ∈ A, let n_{sa}
be the empirical count of the string sa in y_1^n. The depth-K context tree weighting
algorithm constructs a distribution p_c satisfying
p_c(y_1^n | y^0_{−K}) ≥ 2^{−|A|^{K+1} log n} ∏_{s∈T̂} ∏_{a∈A} ( n_{sa} / Σ_{a∈A} n_{sa} )^{n_{sa}}.
Note that no Markov source with memory K could have given a higher probability to y_1^n
than

∏_{s∈T̂} ∏_{a∈A} ( n_{sa} / Σ_{a∈A} n_{sa} )^{n_{sa}}.
So, if |A|K log n = o(n), then pc underestimates any memory-K Markov probability by only a
subexponential factor. Therefore, K = O(log n) is going to be the case of particular interest.
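The maximum-likelihood quantity above is straightforward to compute. The sketch below (helper names are ours) evaluates ∏_{s,a}(n_{sa}/Σ_a n_{sa})^{n_{sa}} for a toy binary sequence and checks that it indeed upper bounds the probability assigned by a few specific memory-K sources:

```python
import math
from collections import Counter
from itertools import product

def ml_markov_prob(y, past, K):
    """Max over memory-K Markov sources of the probability assigned to y
    (given `past`): equals prod over (s, a) of (n_sa / n_s)^(n_sa)."""
    seq = past + y
    n_sa = Counter((seq[i - K:i], seq[i]) for i in range(K, len(seq)))
    n_s = Counter()
    for (s, a), c in n_sa.items():
        n_s[s] += c
    return math.exp(sum(c * math.log(c / n_s[s])
                        for (s, a), c in n_sa.items()))

def markov_prob(y, past, K, q1):
    """Probability of y under a specific binary memory-K source,
    q1[s] = P(next symbol is '1' | context s)."""
    seq, p = past + y, 1.0
    for i in range(K, len(seq)):
        p *= q1[seq[i - K:i]] if seq[i] == "1" else 1 - q1[seq[i - K:i]]
    return p

y, past, K = "1101010100", "00", 2
pml = ml_markov_prob(y, past, K)
for q in (0.3, 0.5, 0.7):                       # a few i.i.d.-style sources
    q1 = {"".join(s): q for s in product("01", repeat=K)}
    assert markov_prob(y, past, K, q1) <= pml + 1e-12
```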
4.2 Coupling For Markov Processes
Coupling is an elegant technique that will help us understand how the counts of certain
strings in a sample behave. Let pT ,q be our Markov source generating sequences from an
alphabet Ã.
A coupling ω for pT ,q is a special kind of joint probability distribution on the sequences
{Ym , Ȳm }m≥1 where Ym ∈ Ã and Ȳm ∈ Ã. ω has to satisfy the following property: individually
taken, the sequences {Ym } and {Ȳm } have to be faithful evolutions of pT ,q . Specifically, for
m ≥ 0, we want
ω(Y_{m+1} | Y^m_{−∞}, Ȳ^m_{−∞}) = p_{T,q}(Y_{m+1} | Y^m_{−∞}) = p_{T,q}(Y_{m+1} | c_T(Y^m_{−∞})),    (4.1)
and similarly for {Ȳm }.
In the context of this work, we think of {Y_m} and {Ȳ_m} as copies of p_{T,q} that were
started with two different states s, s′ ∈ T respectively, but the chains evolve jointly as ω
instead of independently. For any r and w ∈ Ã^r, N_w (respectively N̄_w) is the number of times
w forms the context of a symbol in a length-n time frame {Y_i}_{i=1}^n given Y^0_{−∞} (respectively
{Ȳ_i}_{i=1}^n given Ȳ^0_{−∞}). Then, for any ω,
|E_{p_{T,q}}[N_w | Y^0_{−∞}] − E_{p_{T,q}}[N̄_w | Ȳ^0_{−∞}]|
  = | Σ_{i=1}^n E_ω[ 1{c_{Ã^r}(Y^i_{−∞}) = w} − 1{c_{Ã^r}(Ȳ^i_{−∞}) = w} ] |
  ≤ Σ_{i=1}^n E_ω | 1{c_{Ã^r}(Y^i_{−∞}) = w} − 1{c_{Ã^r}(Ȳ^i_{−∞}) = w} |
  ≤ Σ_{i=1}^n ω( c_{Ã^r}(Y^i_{−∞}) ≠ c_{Ã^r}(Ȳ^i_{−∞}) ),

where the first equality follows from (4.1).
The art of a coupling argument stems from the fact that ω is completely arbitrary apart
from having to satisfy (4.1). If we can find any ω such that the chains coalesce, namely
ω( c_{Ã^r}(Y^i_{−∞}) ≠ c_{Ã^r}(Ȳ^i_{−∞}) ) becomes small as i increases, then we know that
E_{p_{T,q}}[N_w | Y^0_{−∞}] cannot differ too much from E_{p_{T,q}}[N̄_w | Ȳ^0_{−∞}].

Now if, in addition, such a coalescence holds no matter what Ȳ^0_{−∞} is, we could then pick
Ȳ^0_{−|T|} according to the stationary distribution of p_{T,q}. Then N̄_w would be close to the
stationary count of w, and from the coupling argument above, so is N_w. For tutorials, see
e.g., [8, 52, 53].
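The chain of inequalities above has a direct pathwise analogue that a tiny simulation makes visible. In the sketch below (helper name ours; a two-state chain, so a context is just the current state, and the shared-uniform update is one arbitrary but valid choice of ω), the difference between the two copies' counts of the symbol 1 is bounded by the number of time steps on which their states disagree:

```python
import random

def coupled_run(n, p01, p10, x0, y0, seed=1):
    """Two copies of a two-state chain driven by the SAME uniforms: one
    (arbitrary but valid) coupling omega of the chain with itself."""
    rng = random.Random(seed)
    x, y = x0, y0
    n1_x = n1_y = disagree = 0
    for _ in range(n):
        u = rng.random()
        x = 1 if u < (p01 if x == 0 else 1 - p10) else 0
        y = 1 if u < (p01 if y == 0 else 1 - p10) else 0
        n1_x += x
        n1_y += y
        disagree += (x != y)
    return n1_x, n1_y, disagree

n1_x, n1_y, d = coupled_run(10000, 0.3, 0.4, x0=0, y0=1)
# pathwise version of the coupling bound: the count difference is at most
# the number of steps on which the two copies' states (contexts) disagree
assert abs(n1_x - n1_y) <= d
assert d < 100       # with shared uniforms the copies coalesce quickly
```

Once the two copies agree, the shared uniform keeps them together forever, which is why the disagreement count stays tiny here.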
5 Model Aggregation

5.1 Aggregations
Since the memory is unknown a priori, a natural approach, known to be consistent, is to use a
potentially coarser model with depth k_n. Here, k_n increases logarithmically with the sample
size n, reflecting well known results on consistent estimation of Markov processes [22]. We
show that coarser models formed by properly aggregating states of the original context tree
model are useful in upper bounding the entropy rate of the true process.
Definition 5.1.  Suppose T̃ = A^k for some positive integer k. The aggregation of p_{T,q} at
level k, denoted by p_{T̃,q̃}, is a stationary Markov process with state transition probabilities
given by

q̃(a|w) = Σ_{v∈T_w} µ(v) q(a|v) / Σ_{v′∈T_w} µ(v′)    (5.1)

for all w ∈ T̃ and a ∈ A, where µ is the stationary distribution associated with p_{T,q}. Using
Observation 2.1, without loss of generality, no matter what T̃ is, we will assume p_{T,q} has
states T such that T̃ ≼ T.

Figure 5.1: (a) Markov process in Example 5.1, (b) Aggregated model at depth 1. From
Observation 2.1, the model on the left can be reparameterized to be a complete tree at any
depth ≥ 2. We can hence ask for its aggregation at any depth. Aggregations of the
model on the left at depths ≥ 2 will hence be the model itself.
Example 5.1.  This example illustrates the computations in the definition above. Let p_{T,q}
be a binary Markov process with T = {11, 01, 0} and q(1|11) = 1/4, q(1|01) = 1/3, q(1|0) = 3/4.
For this model, we have µ(11) = 4/25, µ(01) = 9/25 and µ(0) = 12/25. Fig. 5.1(b) shows the
aggregated process p_{T̃,q̃} with T̃ = {1, 0}. Notice that
q̃(1|1) = (4/25 · 1/4 + 9/25 · 1/3)/(4/25 + 9/25) = 4/13 and q̃(1|0) = 3/4.

Lemma 5.1.
Let p_{T,q} be a stationary Markov process with stationary distribution µ. If
p_{T̃,q̃} aggregates p_{T,q}, then it has a unique stationary distribution µ̃ and for every w ∈ T̃

µ̃(w) = Σ_{v∈T_w} µ(v).
Proof  For all w ∈ T̃, let F(w) be the set of states w′ ∈ T̃ that reach w in one step, i.e.,

F(w) = {w′ ∈ T̃ : ∃a ∈ A s.t. w ≼ w′a}.

Let Q̃ be the transition probability matrix formed by the states of p_{T̃,q̃}. First notice that by
definition, for all w, w′ ∈ T̃,

Q̃(w|w′) = q̃(a|w′) if w′ ∈ F(w),  and  Q̃(w|w′) = 0 if w′ ∉ F(w),

where q̃ is as in Definition 5.1. Since p_{T,q} is irreducible and aperiodic, p_{T̃,q̃} will also be
irreducible and aperiodic. Thus, there is a unique stationary distribution µ̃ corresponding
to p_{T̃,q̃}, i.e., a unique solution of

µ̃(w) = Σ_{w′∈T̃} µ̃(w′) Q̃(w|w′)  for all w ∈ T̃.    (5.2)
We will consider a candidate solution of the form

µ̃(w) = Σ_{v∈T_w} µ(v)    (5.3)

for every w ∈ T̃, and show that this candidate satisfies (5.2). Then the claim will follow
by uniqueness of the solution. To show this, note that for all w ∈ T̃,
Σ_{w′∈T̃} µ̃(w′) Q̃(w|w′) = Σ_{w′∈F(w)} µ̃(w′) q̃(a|w′)
  = Σ_{w′∈F(w)} ( Σ_{v∈T_{w′}} µ(v) ) · ( Σ_{v∈T_{w′}} µ(v) q(a|v) ) / ( Σ_{v′∈T_{w′}} µ(v′) )
  = Σ_{w′∈F(w)} Σ_{v∈T_{w′}} µ(v) q(a|v)
  (i)= Σ_{s∈T_w} µ(s)
  (ii)= µ̃(w),

where (i) holds because

∪_{w′∈F(w)} T_{w′} = T_w,

and (ii) follows from the definition of the proposed solution given in (5.3).
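Lemma 5.1 and Definition 5.1 can be checked numerically on Example 5.1, expanded to a complete depth-2 tree via Observation 2.1 (a sketch under that reading; variable names are ours). It recovers µ̃(1) = 13/25 and q̃(1|1) = 4/13:

```python
from itertools import product

# Example 5.1, expanded to depth 2 via Observation 2.1: q(.|10) = q(.|00) = q(.|0)
q1 = {"11": 1 / 4, "01": 1 / 3, "10": 3 / 4, "00": 3 / 4}  # P(next bit = 1 | state)
states = ["".join(s) for s in product("01", repeat=2)]
mu = {s: 0.25 for s in states}
for _ in range(5000):                       # power iteration to stationarity
    nxt = {s: 0.0 for s in states}
    for s, p in mu.items():
        nxt[s[1] + "1"] += p * q1[s]
        nxt[s[1] + "0"] += p * (1 - q1[s])
    mu = nxt

# Lemma 5.1: aggregated stationary mass = sum of mu over the subtree T_w
mu1_agg = mu["11"] + mu["01"]
assert abs(mu1_agg - 13 / 25) < 1e-9
# Definition 5.1: the aggregated parameter is the mu-weighted average
q_agg = (mu["11"] * q1["11"] + mu["01"] * q1["01"]) / mu1_agg
assert abs(q_agg - 4 / 13) < 1e-9
```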
5.2 Upper Bound on Entropy Rate
The entropy rate of a stationary Markov process p_{T,q}, denoted by H_T, is defined as [54]

H_T = − Σ_{s∈T} µ(s) Σ_{a∈A} q(a|s) log q(a|s) ≜ Σ_{s∈T} µ(s) H_s.
Suppose p_{T̃,q̃} aggregates p_{T,q}, and let the entropy rate of the aggregated process be H_T̃.
Then,

Proposition 1.  H_T ≤ H_T̃.
Proof  Note that for all w ∈ T̃, T_w = {s ∈ T : w ≼ s}. Since T̃ ≼ T, we have

H_T̃ = Σ_{w∈T̃} µ̃(w) Σ_{a∈A} q̃(a|w) log (1/q̃(a|w))
  ≥ Σ_{w∈T̃} µ̃(w) Σ_{a∈A} Σ_{v∈T_w} ( µ(v) / Σ_{v′∈T_w} µ(v′) ) q(a|v) log (1/q(a|v))
  (a)= Σ_{w∈T̃} Σ_{a∈A} Σ_{v∈T_w} µ(v) q(a|v) log (1/q(a|v))
  = Σ_{w∈T̃} Σ_{v∈T_w} µ(v) Σ_{a∈A} q(a|v) log (1/q(a|v))
  = Σ_{s∈T} µ(s) Σ_{a∈A} q(a|s) log (1/q(a|s)) = H_T,

where the first inequality follows because

q̃(a|w) = Σ_{v∈T_w} µ(v) q(a|v) / Σ_{v′∈T_w} µ(v′)

and g(x) = x log (1/x) is concave for x ∈ [0, 1]. The equality (a) follows since
µ̃(w) = Σ_{v∈T_w} µ(v).
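Proposition 1 can likewise be checked numerically on Example 5.1, expanded to depth 2 (a sketch under that reading; helper names are ours). It computes both entropy rates and verifies H_T ≤ H_T̃:

```python
import math
from itertools import product

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

q1 = {"11": 1 / 4, "01": 1 / 3, "10": 3 / 4, "00": 3 / 4}   # Example 5.1
states = ["".join(s) for s in product("01", repeat=2)]
mu = {s: 0.25 for s in states}
for _ in range(5000):                    # stationary law by power iteration
    nxt = {s: 0.0 for s in states}
    for s, p in mu.items():
        nxt[s[1] + "1"] += p * q1[s]
        nxt[s[1] + "0"] += p * (1 - q1[s])
    mu = nxt

H_true = sum(mu[s] * h2(q1[s]) for s in states)     # H_T
H_agg = 0.0                                         # H_T~ at depth 1
for w in "01":
    m = mu["0" + w] + mu["1" + w]                   # mu~(w), by Lemma 5.1
    q_w = (mu["0" + w] * q1["0" + w] + mu["1" + w] * q1["1" + w]) / m
    H_agg += m * h2(q_w)
assert H_true <= H_agg + 1e-12                      # Proposition 1
```

The inequality is strict here because the two contexts aggregated into state 1 have different parameters (1/4 vs. 1/3), which is exactly where concavity bites.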
Remark  In this work, we are particularly concerned with the slow mixing regime.
As our results will show, in general it is not possible to obtain a simple upper bound on
the entropy rate using the data (given a particular starting state) and taking recourse to
the proposition above. Instead, we introduce the partial entropy rate, which can be reliably
obtained from the data:

H_G̃ = Σ_{w∈G̃} ( µ(w) / µ(G̃) ) H_w,

where G̃ ⊆ T̃ will be a set of good states that we show how to identify. The partial entropy
rate is not necessarily an upper bound, but in slow mixing cases it is sometimes the best
heuristic possible. We systematically handle the entropy rates of slow mixing processes using
these estimation results in a separate paper.
Notwithstanding the previous remark, we will focus on estimating the aggregated parameters
q̃(T̃ ), where T̃ = Akn has depth kn where kn grows logarithmically as αn log n for some
αn = O(1).
Remark  The aggregated model defined in this section is purely to facilitate our analysis.
We never get to observe the aggregated model, though we will pose estimation problems in
terms of its parameters.
We can only observe samples from the underlying (original, as opposed to aggregated) model.
From these observations we must figure out q(T), if at all possible. In general, there is no
real reason why this should even be possible, but we will be able to handle the task because
of (6.1). And there is, of course, no guarantee that the counts of short strings are any more
reliable in a long-memory, slow mixing process.
Example 5.2.
Let T = {11, 01, 10, 00} with q(1|11) = ε, q(1|01) = 1/2, q(1|10) = 1 − ε, and q(1|00) = ε. If ε > 0, then p_{T,q} is a stationary ergodic binary Markov process. Let μ denote the stationary distribution of this process. A simple computation shows that

μ(11) = 1/(7 − 6ε),   μ(01) = μ(10) = μ(00) = (2 − 2ε)/(7 − 6ε),

and hence

μ(1) = μ(11) + μ(01) = (3 − 2ε)/(7 − 6ε)   and   μ(0) = (4 − 4ε)/(7 − 6ε).

Suppose we have a length-n sample. If ε ≪ 1/n, then μ(1) ≈ 3/7 and μ(0) ≈ 4/7. If the initial state belongs to {11, 01, 10}, the state 00 will not be visited with high probability in n samples, and it can be seen that the counts of 1 or 0 will not be near the stationary probabilities μ(1) or μ(0). For this sample size, the process effectively acts like the irreducible, aperiodic Markov chain in Fig. 5.2 (b), which can be shown to be fast mixing. Therefore the stationary probabilities of the chain in Fig. 5.2 (b),

μ(11)/(μ(1) + μ(10)),   μ(01)/(μ(1) + μ(10))   and   μ(10)/(μ(1) + μ(10)),

converge quicker than μ(1) or μ(0). Indeed, this observation guides our search for results in Chapter 8.2.
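The closed-form stationary distribution above can be checked numerically. Below is a minimal sketch (hypothetical code, not from the dissertation) that runs power iteration on the four-state chain of Example 5.2, with each state string holding the last two symbols, oldest first:

```python
def q1(s, eps):
    # q(1|s) for Example 5.2; s holds the last two symbols, oldest first.
    return {"11": eps, "01": 0.5, "10": 1 - eps, "00": eps}[s]

def stationary(eps, iters=20000):
    # Power iteration on the induced chain over pairs of symbols.
    mu = {s: 0.25 for s in ("00", "01", "10", "11")}
    for _ in range(iters):
        nxt = {s: 0.0 for s in mu}
        for s, p in mu.items():
            p1 = q1(s, eps)
            nxt[s[1] + "1"] += p * p1        # emit 1: drop the oldest symbol
            nxt[s[1] + "0"] += p * (1 - p1)  # emit 0
        mu = nxt
    return mu

eps = 0.01
mu = stationary(eps)
den = 7 - 6 * eps
# Closed form from the text: mu(11) = 1/(7-6e), the rest are (2-2e)/(7-6e).
assert abs(mu["11"] - 1 / den) < 1e-9
assert all(abs(mu[s] - (2 - 2 * eps) / den) < 1e-9 for s in ("01", "10", "00"))
assert abs(mu["11"] + mu["01"] - (3 - 2 * eps) / den) < 1e-9  # mu(1)
```

With ε = 0.01 the iterates agree with the closed form to well within the tolerance; the slow escape from state 00 is exactly what makes the unrestricted chain slow mixing.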
Figure 5.2: (a) The Markov chain in Example 5.2; (b) the same process when ε = 0. (Transition diagrams not reproduced in this extraction.)
April 29, 2014
DRAFT
6
Dependencies in the Markov Process
6.1
Modeling Dependencies Dying Down
As noted before in Example 2.3, if the dependencies in a Markov process could be arbitrary, we cannot estimate the model accurately no matter how large the sample is. Keeping in mind Observation 2.1, we formalize dependencies dying down by means of a function d : Z⁺ → R⁺ with Σ_{i=1}^∞ d(i) < ∞.

Let M_d be the set of all models (T, q(T)) that satisfy, for all u ∈ A*, all b, b′ ∈ A, and all a ∈ A,

| q(a|bu)/q(a|b′u) − 1 | ≤ d(|u|),    (6.1)

where q(a|bu) = P(Y_1 = a | c_T(Y^0_{−∞}) = bu). Note that our collection M_d has bounded memory iff there exists a finite K such that d(i) = 0 for all i > K.
Remark
We emphasize that the restriction {d(i)}_{i≥1} does not preclude slow mixing processes in M_d; we clarify this point in the following example. The restriction {d(i)}_{i≥1} we use is related to the notion of "continuity rate" of stochastic processes used in [31, 32, 55, 56].
Example 6.1.
Let (T, q(T)) be a context tree model with T = {0, 1} and transition probabilities q(1|1) = 1 − ε, q(1|0) = ε as in Fig. 2.2 (a). A simple calculation shows that the stationary probability of being at state 0 or 1 is 1/2. Note that even with a very strong restriction on d, namely d(i) = 0 for i ≥ 1, (T, q(T)) belongs to M_d no matter what ε is.
While we do not need the notions of φ-mixing and β-mixing coefficients for stationary stochastic processes [57] in the rest of the paper, we compute them for this example as an illustration that {d(i)}_{i≥1} are unrelated to mixing. Recall that the j-th φ-mixing coefficient satisfies

φ_j ≥ max_{Y_0∈{0,1}} max_{Y_j∈{0,1}} |P(Y_j|Y_0) − P(Y_j)|
    ≥ |P(Y_j = 1|Y_0 = 1) − P(Y_j = 1)|
    ≥ P(Y_j = 1, Y_{j−1} = 1, ···, Y_1 = 1 | Y_0 = 1) − P(Y_j = 1)
    = (1 − ε)^j − 1/2.

Similarly, the j-th β-mixing coefficient satisfies

β_j ≥ E_{Y_0} max_{Y_j∈{0,1}} |P(Y_j|Y_0) − P(Y_j)| ≥ (1 − ε)^j − 1/2.

No matter what ε > 0 is, (T, q(T)) ∈ M_d even under the stringent restriction d(i) = 0 for i ≥ 1. But any (β, φ)-mixing coefficient can be made arbitrarily close to 1/2 by picking ε small enough. Therefore, the condition (6.1) we impose does not preclude slow mixing.
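The bound above is easy to check numerically. A small sketch (illustrative, with an assumed value of ε) iterates the two-state kernel to get P(Y_j = 1 | Y_0 = 1) and compares its gap from the stationary value 1/2 against (1 − ε)^j − 1/2:

```python
def p_one_after(eps, j):
    # P(Y_j = 1 | Y_0 = 1) for the chain with q(1|1) = 1 - eps, q(1|0) = eps.
    p = 1.0
    for _ in range(j):
        p = p * (1 - eps) + (1 - p) * eps
    return p

eps, j = 0.01, 20
gap = abs(p_one_after(eps, j) - 0.5)   # |P(Y_j = 1 | Y_0 = 1) - P(Y_j)|
assert (1 - eps) ** j - 0.5 <= gap <= 0.5
# The exact gap is (1/2)*(1 - 2*eps)**j: it decays only at rate (1 - 2*eps),
# so for small eps the chain mixes slowly, as the example claims.
assert abs(gap - 0.5 * (1 - 2 * eps) ** j) < 1e-12
```

The second assertion uses the closed form obtained by diagonalizing the 2×2 kernel; both the lower bound from the text and the exact gap stay near 1/2 when ε is small.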
As mentioned in the last section, we will focus on the set of aggregated parameters at depth kn, q̃(A^{kn}), where kn = αn log n. If kn is large enough, these aggregated parameters start to reflect the underlying parameters q(T). Indeed, by using an elementary argument in Chapter
7 we will show that both the underlying and aggregated parameters will then be close to the
empirically observed values for states that occur frequently enough—even though the sample
we use comes from the true model (T , q(T )) instead of the aggregated model (T̃ , q̃(Akn )).
In the context of information theory, the function d controls the incremental value of one
additional data point in the history. This can be characterized by means of conditional
mutual information [54].
Lemma 6.1.
Let p_{T,q} ∈ M_d. Condition (6.1) implies that

I(Y_0; Y_{i+1} | Y_1^i) ≤ log(1 + d(i)),    (6.2)

where I(Y_0; Y_{i+1} | Y_1^i) denotes the conditional mutual information between Y_0 and Y_{i+1} given Y_1^i.
Proof
Note that (writing p for p_{T,q} throughout)

I(Y_0; Y_{i+1}|Y_1^i)
= Σ_{u∈A^i} p(Y_1^i = u) Σ_{a,b∈A} p(Y_0 = a, Y_{i+1} = b | Y_1^i = u) log [ p(Y_0 = a, Y_{i+1} = b | Y_1^i = u) / ( p(Y_0 = a | Y_1^i = u) p(Y_{i+1} = b | Y_1^i = u) ) ]
= Σ_{u∈A^i} p(Y_1^i = u) Σ_{a,b∈A} p(Y_0 = a | Y_1^i = u) p(Y_{i+1} = b | Y_0^i = au) log [ p(Y_{i+1} = b | Y_0^i = au) / p(Y_{i+1} = b | Y_1^i = u) ]
= Σ_{u∈A^i} p(Y_1^i = u) Σ_{a∈A} p(Y_0 = a | Y_1^i = u) Σ_{b∈A} p(Y_{i+1} = b | Y_0^i = au) log [ p(Y_{i+1} = b | Y_0^i = au) / Σ_{c∈A} p(Y_{i+1} = b, Y_0 = c | Y_1^i = u) ]
= Σ_{u∈A^i} p(Y_1^i = u) Σ_{a∈A} p(Y_0 = a | Y_1^i = u) Σ_{b∈A} [ Σ_{c∈A} p(Y_0 = c | Y_1^i = u) p(Y_{i+1} = b | Y_0^i = au) ] log [ Σ_{c∈A} p(Y_0 = c | Y_1^i = u) p(Y_{i+1} = b | Y_0^i = au) / Σ_{c∈A} p(Y_0 = c | Y_1^i = u) p(Y_{i+1} = b | Y_0^i = cu) ]
(i) ≤ Σ_{u∈A^i} p(Y_1^i = u) Σ_{a∈A} p(Y_0 = a | Y_1^i = u) Σ_{b∈A} Σ_{c∈A} p(Y_0 = c | Y_1^i = u) p(Y_{i+1} = b | Y_0^i = au) log [ p(Y_{i+1} = b | Y_0^i = au) / p(Y_{i+1} = b | Y_0^i = cu) ]
(ii) ≤ Σ_{u∈A^i} p(Y_1^i = u) Σ_{a∈A} p(Y_0 = a | Y_1^i = u) Σ_{b∈A} Σ_{c∈A} p(Y_0 = c | Y_1^i = u) p(Y_{i+1} = b | Y_0^i = au) log(1 + d(i))
= log(1 + d(i)),

where (i) follows by applying the log-sum inequality to the sum inside the logarithm and (ii) holds by equation (6.1).
It is interesting to note that not all Markov processes, even in practical settings, need satisfy this condition (in fact, problems in DNA folding specifically cannot make this assumption), but this assumption is sufficient for statistical estimation problems to be well posed.
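Lemma 6.1 can be sanity-checked on a small model. The sketch below uses hypothetical order-2 binary parameters (not from the dissertation), computes I(Y_0; Y_2 | Y_1) exactly under stationarity along with the smallest valid d(1), and verifies (6.2) in natural logarithms:

```python
import math
from itertools import product

# Hypothetical q(1 | x y): x is the older context symbol, y the more recent.
Q1 = {"11": 0.4, "01": 0.5, "10": 0.6, "00": 0.5}

def q(a, ctx):
    return Q1[ctx] if a == "1" else 1 - Q1[ctx]

# Stationary distribution over contexts via power iteration.
mu = {s: 0.25 for s in Q1}
for _ in range(5000):
    nxt = {s: 0.0 for s in Q1}
    for s, p in mu.items():
        for b in "01":
            nxt[s[1] + b] += p * q(b, s)
    mu = nxt

# Smallest d(1) satisfying (6.1): contexts b+u and b'+u share the recent u.
d1 = max(abs(q(a, b + u) / q(a, bp + u) - 1)
         for a in "01" for u in "01" for b in "01" for bp in "01")

# Stationary joint p(y0, y1, y2).
joint = {}
for y0, y1, y2 in product("01", repeat=3):
    joint[y0 + y1 + y2] = sum(mu[ym + y0] * q(y1, ym + y0) * q(y2, y0 + y1)
                              for ym in "01")

def marginal(keep):
    out = {}
    for s, p in joint.items():
        k = "".join(s[i] for i in keep)
        out[k] = out.get(k, 0.0) + p
    return out

p01, p12, p1 = marginal((0, 1)), marginal((1, 2)), marginal((1,))
cmi = sum(p * math.log(p * p1[s[1]] / (p01[s[:2]] * p12[s[1:]]))
          for s, p in joint.items() if p > 0)

assert -1e-12 < cmi <= math.log(1 + d1)  # Lemma 6.1 for i = 1
```

For these parameters d(1) = 0.25, so the bound is log(1.25) ≈ 0.223 nats, while the computed conditional mutual information is far smaller, as the weak dependence suggests.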
6.2
Aggregated Models in Md
In this section, we investigate the consequences of a Markov process being in M_d.
Lemma 6.2.
Let {d(i)}_{i≥1} be a sequence of real numbers such that there exists some n_0 ∈ N for which 0 ≤ d(i) ≤ 1 for all i ≥ n_0. Then for all j ≥ n_0, we have

1 − Σ_{i≥j} d(i) ≤ Π_{i≥j} (1 − d(i)) ≤ 1 / Π_{i≥j} (1 + d(i)).

Proof
Without loss of generality, let d(i) be decreasing, and consider a distribution q over N such that

Σ_{j≥i} q(j) = d(i).

Let {X_n}_{n≥1} be a sequence of i.i.d. random variables distributed according to q and let E_i be the event {X_i ≥ i}. The events E_i are independent with P(E_i) = d(i). Then

P( ∪_{i≥j} E_i ) = 1 − Π_{i≥j} (1 − d(i)).

Since

P( ∪_{i≥j} E_i ) ≤ Σ_{i≥j} P(E_i) = Σ_{i≥j} d(i),

we have

1 − Σ_{i≥j} d(i) ≤ Π_{i≥j} (1 − d(i)).

Since by assumption 0 ≤ d(i) ≤ 1 for all i ≥ n_0, the second inequality follows from the fact that

Π_{i≥j} (1 − d²(i)) ≤ 1.
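A quick numerical check of Lemma 6.2 with a truncated geometric tail (illustrative values of γ and j, chosen only for this check):

```python
# d(i) = gamma**i for i in [j, N); N is large enough that the tail beyond N
# is negligible for this illustration.
gamma, j, N = 0.3, 2, 400
ds = [gamma ** i for i in range(j, N)]

tail_sum = sum(ds)
prod_minus, prod_plus = 1.0, 1.0
for d in ds:
    prod_minus *= 1 - d
    prod_plus *= 1 + d

# 1 - sum d(i) <= prod (1 - d(i)) <= 1 / prod (1 + d(i))
assert 1 - tail_sum <= prod_minus <= 1 / prod_plus
```

The second inequality is exactly the observation closing the proof: the two products multiply to Π (1 − d²(i)) ≤ 1.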
Lemma 6.3. Let (T, q(T)) ∈ M_d. Suppose p_{T̃,q̃} aggregates p_{T,q} with T̃ = A^{kn}. If Σ_{i≥kn} d(i) ≤ 1, then for all w ∈ T̃ and a ∈ A

( 1 − Σ_{i≥kn} d(i) ) max_{s∈T_w} q(a|s) ≤ q̃(a|w) ≤ min_{s∈T_w} q(a|s) / ( 1 − Σ_{i≥kn} d(i) ).

Proof
Let w ∈ T̃ and fix a ∈ A. Note that for T̃ = A^{kn}, by assumption we have for all b′, b″ ∈ A

| q(a|b′w)/q(a|b″w) − 1 | ≤ d(kn).    (6.3)

According to Lemma 5.1, q̃(a|w) is a weighted average of q(a|bw), b ∈ A. Hence,

min_{b∈A} q(a|bw) ≤ q̃(a|w) ≤ max_{b∈A} q(a|bw).    (6.4)

Combining (6.3) with (6.4), straightforward elementary algebra shows that for all b ∈ A

( 1 − d(kn) ) q̃(a|w) ≤ q(a|bw) ≤ ( 1 + d(kn) ) q̃(a|w).

Proceeding inductively, for all s ∈ T_w we have

Π_{i≥kn} ( 1 − d(i) ) q̃(a|w) ≤ q(a|s) ≤ Π_{i≥kn} ( 1 + d(i) ) q̃(a|w).

Now, Lemma 6.2 implies that

( 1 − Σ_{i≥kn} d(i) ) max_{s∈T_w} q(a|s) ≤ q̃(a|w) ≤ min_{s∈T_w} q(a|s) / ( 1 − Σ_{i≥kn} d(i) ).
7
Estimation of Model Parameters
7.1
Naive Estimators
Even in the slow mixing case, we want to see if any estimator can be accurate at least partially. In particular, we consider the naive estimator that operates on the assumption that samples are from the aggregated model p_{T̃,q̃}. Without our assumption on dependencies falling off, there is no a priori reason that the naive estimates should reflect the parameters associated with the true model p_{T,q}.
Throughout this chapter, we assume that we start with some past Y^0_{−∞}, and we see n samples Y_1^n from p_{T,q}. All confidence probabilities are conditional probabilities on Y_1^n obtained from the underlying unknown model p_{T,q}, given the past outputs Y^0_{−∞}. The results hold for all Y^0_{−∞} (not just with probability 1).
Definition 7.1.
Given a sample sequence Y_1^n, let T̃ = A^{kn} with kn = αn log n for some function αn = O(1). For s ∈ T̃ let Y_s be the sequence of symbols that follow the string s. Hence, the length of Y_s is

N_s = Σ_{i=1}^n 1{ c_{T̃}(Y^{i−1}_{−∞}) = s }.

Therefore, the number of a's in Y_s is

n_{sa} = Σ_{i=1}^n 1{ c_{T̃}(Y^i_{−∞}) = sa }.

Observe that N_s = Σ_{a∈A} n_{sa}. The naive estimate of the aggregated parameters is

q̂(a|s) = n_{sa} / N_s.

Remark
Note that Y_s is i.i.d. only if s ∈ T, the set of states for the true model. In general, since we do not necessarily know if any of the n_{sa} are close to their stationary frequencies, there is no obvious reason why q̂(a|s) should reflect q̃(a|s).
Since the process could be slow mixing, not all parameters are going to be accurate. Rather,
there will be a set of good states in which we can do estimation properly.
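The naive estimator of Definition 7.1 is straightforward to compute. In this sketch the semi-infinite past of the definition is approximated by a finite sample, with counting starting after the first k symbols (an assumption made only for this illustration):

```python
from collections import Counter

def naive_estimates(y, k):
    """N_s, n_{sa} and qhat(a|s) = n_{sa}/N_s over depth-k contexts."""
    N, nsa = Counter(), Counter()
    for i in range(k, len(y)):
        s = y[i - k:i]            # the k symbols preceding position i
        N[s] += 1
        nsa[s, y[i]] += 1
    qhat = {(s, a): nsa[s, a] / N[s] for (s, a) in nsa}
    return N, nsa, qhat

N, nsa, qhat = naive_estimates("0110100110010110", 2)
assert N["01"] == sum(nsa["01", a] for a in "01")  # N_s = sum_a n_{sa}
```

On this toy string the context 01 occurs five times and is followed by a 1 three times, so q̂(1|01) = 0.6; nothing here asserts that this is close to q̃(1|01), which is exactly the point of the remark above.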
Let δ_j = Σ_{i≥j} d(i). Note that δ_j → 0 as j → ∞ and that −δ_j log δ_j → 0 as δ_j → 0.
Definition 7.2.
Given a sample sequence of size n from p_{T,q}, let

G̃ = { w ∈ T̃ : N_w ≥ max { n δ_{kn} log(1/δ_{kn}), |A|^{kn+1} log² n } }.
Note that the set G̃ is obtained using only the sample and the known function d. We call the set G̃ "good" in an anticipatory fashion because we are going to prove concentration results for estimates attached to these states.
Secondly, since we include arbitrarily slow mixing sources, there is no way to estimate all model parameters with a sample of length n. Therefore, while the definition of G̃ above may not be the tightest, some notion of good states is unavoidable.
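Given the observed counts and the known function d, the set G̃ is directly computable from the sample. A minimal sketch with made-up numbers (all values hypothetical; natural logarithms assumed for concreteness):

```python
import math

def good_states(counts, n, k, delta_k, A):
    """G~ of Definition 7.2; counts maps each w in A^k to N_w, delta_k is
    sum_{i>=k} d(i), and A is the alphabet size."""
    thresh = max(n * delta_k * math.log(1 / delta_k),
                 A ** (k + 1) * math.log(n) ** 2)
    return {w for w, c in counts.items() if c >= thresh}

n, k, A = 10**4, 3, 2
gamma = 0.4
delta_k = gamma ** k / (1 - gamma)              # geometric tail sum
counts = {"000": 5000, "010": 1500, "111": 40}  # hypothetical N_w values
G = good_states(counts, n, k, delta_k, A)
```

With these numbers the binding threshold is n δ_k log(1/δ_k) ≈ 2387, so only the state with count 5000 qualifies.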
7.2
Estimate of Transition Probabilities
Let T̃ = A^{kn}. Using samples from p_{T,q}, we consider the estimation of the parameters q̃(T̃) of the aggregated model at depth kn, and derive deviation bounds on the estimates in Theorem 7.1. However, before proving the theorem, we want to make the following important remark.
Remark
Observe that because we do not assume the source has mixed, the theorem below does not imply that the parameters are accurate for contexts shorter than kn. We may therefore be able to handle longer states' parameters (say, a sequence of ten 0s) without being able to infer those attached to their suffixes (say, a sequence of five 0s).
This is perhaps counterintuitive at first glance. To see why this could happen, note that the result below shows what can be obtained without using anything about the mixing properties of the source: namely, it does not rely on the empirical frequencies of various strings being close to their stationary probabilities. Therefore, results on longer strings do not automatically translate to results on shorter ones. Secondly, longer strings have attached conditional probabilities closer to the true conditional probabilities with which the source generates the data, so there is less bias to counter.
This is contrary to most prior work, which obtains bounds on transition probabilities subsequent to concentration of empirical counts of strings around their stationary probabilities. Additional information about the mixing of the source would further strengthen the following results.
Theorem 7.1.
Let Y_1^n be generated by an unknown model p_{T,q} ∈ M_d. Let kn = αn log n. Given any Y^0_{−∞}, with probability under p_{T,q} ≥ 1 − 2^{−|A|^{kn+1} log n}, for all w ∈ T̃ simultaneously

D( q̂(·|w) ‖ q̃(·|w) ) ≤ 2( |A|^{kn+1} log n + n δ_{kn} ) / N_w.
Proof
As before, let δ_{kn} = Σ_{i≥kn} d(i) and let n be large enough that δ_{kn} ≤ 1/2. Note that Lemma 6.3 implies that for all sequences y_1^n ∈ A^n and all Y^0_{−∞}

p_{T,q}(y_1^n | Y^0_{−∞}) ≤ ( 1/(1 − δ_{kn}) )^n Π_{w∈T̃} Π_{a∈A} q̃(a|w)^{n_{wa}} ≤ 4^{n δ_{kn}} Π_{w∈T̃} Π_{a∈A} q̃(a|w)^{n_{wa}},

where n_{wa} was defined in Definition 7.1 and the second inequality is because ( 1/(1−t) )^n ≤ 4^{nt} whenever 0 ≤ t ≤ 1/2. Now, let B_n be the set of all sequences that satisfy

4^{n δ_{kn}} Π_{w∈T̃} Π_{a∈A} q̃(a|w)^{n_{wa}} ≤ Π_{w∈T̃} Π_{a∈A} q̂(a|w)^{n_{wa}} / 2^{2|A|^{kn+1} log n}.

Using a depth-kn context tree weighting algorithm [19], we obtain a distribution p_c satisfying¹

p_c(y_1^n | Y^0_{−∞}) ≥ Π_{w∈T̃} Π_{a∈A} q̂(a|w)^{n_{wa}} / 2^{|A|^{kn+1} log n}.

1. While we use the context tree weighting algorithm, any worst case optimal universal compression algorithm would do for this theorem to follow.
Now, for all sequences y_1^n ∈ B_n, we have

p_c(y_1^n | Y^0_{−∞}) ≥ Π_{w∈T̃} Π_{a∈A} q̂(a|w)^{n_{wa}} / 2^{|A|^{kn+1} log n}
≥ 4^{n δ_{kn}} Π_{w∈T̃} Π_{a∈A} q̃(a|w)^{n_{wa}} · 2^{|A|^{kn+1} log n}
≥ p_{T,q}(y_1^n | Y^0_{−∞}) · 2^{|A|^{kn+1} log n}.

Thus, B_n is a set of sequences y_1^n to which p_c assigns a much higher probability than p_{T,q}. Such a set B_n cannot have high probability under p_{T,q}:

p_{T,q}(B_n) ≤ Σ_{y_1^n ∈ B_n} p_{T,q}(y_1^n | Y^0_{−∞}) ≤ Σ_{y_1^n ∈ B_n} p_c(y_1^n | Y^0_{−∞}) 2^{−|A|^{kn+1} log n} ≤ 2^{−|A|^{kn+1} log n}.

Therefore, with probability ≥ 1 − 2^{−|A|^{kn+1} log n} (no matter what Y^0_{−∞} is), we have

Π_{w∈T̃} Π_{a∈A} q̃(a|w)^{n_{wa}} ≥ Π_{w∈T̃} Π_{a∈A} q̂(a|w)^{n_{wa}} / ( 2^{2|A|^{kn+1} log n} 4^{n δ_{kn}} ),

which implies, simultaneously for all w ∈ T̃,

Π_{a∈A} q̃(a|w)^{n_{wa}} ≥ Π_{a∈A} q̂(a|w)^{n_{wa}} / ( 2^{2|A|^{kn+1} log n} 4^{n δ_{kn}} )

(since q̂(·|w) maximizes the per-state likelihood, each factor Π_{a∈A} ( q̂(a|w)/q̃(a|w) )^{n_{wa}} is at least 1, so the bound holds for each w individually). The above equation implies that q̃ and q̂ are close distributions, since we can take logarithms and divide both sides by N_w to obtain

Σ_{a∈A} (n_{wa}/N_w) log( q̂(a|w)/q̃(a|w) ) = D( q̂(·|w) ‖ q̃(·|w) ) ≤ 2( |A|^{kn+1} log n + n δ_{kn} ) / N_w,

where the first equality follows by writing out the value of the naive estimate, q̂(a|w) = n_{wa}/N_w.
Corollary 7.2.
Let Y_1^n be generated by an unknown model p_{T,q} ∈ M_d. Let kn = αn log n. Given any Y^0_{−∞}, with probability under p_{T,q} ≥ 1 − 2^{−|A|^{kn+1} log n}, for all w ∈ T̃ simultaneously

‖q̃(·|w) − q̂(·|w)‖_1 ≤ 2 √( (ln 2)( |A|^{kn+1} log n + n δ_{kn} ) / N_w ).

Proof
The proof follows immediately from Theorem 7.1 and Pinsker's inequality (see e.g., [54]):

D( q̂(·|w) ‖ q̃(·|w) ) ≥ ‖q̃(·|w) − q̂(·|w)‖_1² / (2 ln 2).
Remark
We emphasize that the above results do not depend on Y^0_{−∞}; they hold for every history Y^0_{−∞}.
Remark
For all w ∈ T̃ with |A| = 2, by Corollary 7.2, we have

‖q̃(·|w) − q̂(·|w)‖_1 ≤ 2 √( (2 ln 2) 2^{kn+1} log n / N_w )   if kn ≥ log( n δ_{kn} / (2 log n) ),

and

‖q̃(·|w) − q̂(·|w)‖_1 ≤ 2 √( (2 ln 2) n δ_{kn} / N_w )   otherwise.

If d(i) = γ^i for some 0 < γ < 1/2, then δ_{kn} = γ^{kn}/(1 − γ), and by choosing kn = γ log n, the accuracy of the above estimate is

Θ( √( n^{1 + γ log γ} / N_w ) ).

Therefore, if N_w = n/log n, then the accuracy of the estimation will be

Θ( √( n^{γ log γ} log n ) ).

If 1/2 < γ < 1, then the accuracy will be

Θ( √( n^γ log n / N_w ) ).

Therefore, if N_w = n/log n, then the accuracy of the estimation will be

Θ( n^{(γ−1)/2} log n ).

If d(i) = 1/i^r for some r > 2, then δ_{kn} ≈ kn^{1−r}, and by choosing kn = (1/r)(log n − log log n), the accuracy will be

Θ( √( n / (kn^{r−1} N_w) ) ).

Therefore, if N_w = n/log n, then the accuracy of the estimation will be

Θ( √( 1 / (log n)^{r−2} ) ).

The good states G̃ in Definition 7.2 identify all strings w in the sample whose transition probabilities are accurate to at least 1/√(log n). It is quite possible that for all w ∈ G̃, the accuracy is significantly better.
When the dependencies among strings die down exponentially, we can strengthen Theorem 7.1 with a more careful calculation to get a stronger convergence rate, polynomial in n.
Theorem 7.3.
Suppose d(i) = γ^i for some 0 < γ < 1. Let ζ be a nonnegative constant and let kn = log n / (log |A| − log γ). In analogy with G̃, we define

F̃ ≜ { w ∈ T̃ : N_w ≥ n^{ζ + log |A| / (log |A| − log γ)} }.

Then, conditioned on any past Y^0_{−∞}, with probability under p_{T,q} greater than 1 − 2^{−|A|^{kn+1} log n}, simultaneously for all w ∈ F̃

‖q̃(·|w) − q̂(·|w)‖_1 ≤ 2 √( ln 2 · ( (1 − γ)|A| log n + 1 ) / ( (1 − γ) n^ζ ) ).

Proof
The proof of Theorem 7.3 is similar to that of Theorem 7.1, but involves more careful (though elementary) algebra specific to the exponential decay case, using the values of kn and F̃ given in the statement. Note that we need ζ < log(1/γ) / log(|A|/γ) for the theorem not to be vacuously true.

Remark
From the definition of the good states in Theorem 7.3 and the fact that T̃ = A^{kn}, we obtain (since Σ_w N_w ≤ n)

|F̃| ≤ n^{1 − ζ − log |A|/(log |A| − log γ)},   |T̃| = n^{log |A|/(log |A| − log γ)},

implying that if γ ≤ 1/|A|, all states of T̃ can potentially be good.
Remark
The rate of convergence in Theorem 7.1 is the minimum that holds for all strings of length kn simultaneously, not just good strings; strings that appear more often automatically have stronger bounds. Specifically, strings that appear n^β times for any β > 0 have convergence rates that are polynomial in n in the exponential decay case. We emphasize that our results are not too far off: even in the i.i.d. case, the Chernoff bounds are not much stronger than the bounds we obtain in the exponential decay case.
8
Estimation Along a Sequence of Stopping Times
Note that the parameters q̃(·|w) associated with any good state w ∈ G̃ can be well estimated
from the sample while the rest may not be accurate. From Example 2.4, we know that
the stationary probabilities may be a very sensitive function of the parameters associated
with states. It is therefore perfectly possible that we estimate the parameters at all states
reasonably well, but are unable to gauge what the stationary probabilities of any state may
be. How do we tell, therefore, if the few model parameters we estimated can say anything
at all about stationary probabilities?
Before we can interpret the counts of various strings w in the sample, we make some observations about the process {Yi }i≥1 restricted to states in G̃ in this chapter. Some of the
observations regarding stopping times in this chapter are well known, see for example [58].
In Chapter 8.2, we use these observations with a coupling argument to derive Theorem 8.3
that provides deviation bounds on the counts of strings.
8.1
Restriction of pT ,q to G̃
To find deviation bounds for the stationary distribution of good states, we construct a new process {Z_m}_{m≥1}, Z_m ∈ T, from the process {Y_i}_{i≥1}. At the outset, note that Z_m ∈ T, where T is unknown. We use this process {Z_m}_{m≥1} as an analytical tool, and we will not need to actually observe it. We will need to know whether the process is aperiodic, but that can be resolved by looking only at G̃.
If Y_{i_m} is the (m+1)-th symbol in the sequence {Y_i}_{i≥1} such that c_{T̃}(Y^{i_m}_{−∞}) ∈ G̃, then Z_m = c_T(Y^{i_m}_{−∞}). The strong Markov property allows us to characterize {Z_m}_{m≥1} as a Markov process with transitions that are lower bounded by those transitions of the process {Y_i}_{i≥1} that can be well estimated by the theorems above. More specifically, let

T_0 = min { j ≥ 0 : c_{T̃}(Y^j_{−∞}) ∈ G̃ },

and let Z_0 = c_T(Y^{T_0}_{−∞}). For all m ≥ 1, T_m is the (m+1)-th occurrence of a good state in the sequence {Y_i}_{i≥1}, namely

T_m = min { j > T_{m−1} : c_{T̃}(Y^j_{−∞}) ∈ G̃ },

and Z_m = c_T(Y^{T_m}_{−∞}). Note that T_m is a stopping time [8], and therefore {Z_m}_{m≥1} is a Markov chain by itself. Let B̃ = { s ∈ T̃ : s ∉ G̃ }. The transitions between states w, w′ ∈ G̃ are then the minimal, non-negative solution of the following set of equations in { Q(w|s) : s ∈ A^{kn}, w ∈ G̃ }:

Q(w|w′) = p_{T,q}( c_{T̃}(Y^1_{−∞}) = w | c_{T̃}(Y^0_{−∞}) = w′ ) + Σ_{s∈B̃} Q(w|s) p_{T,q}( c_{T̃}(Y^1_{−∞}) = s | c_{T̃}(Y^0_{−∞}) = w′ ).

An important point to note here is that if w and w′ are good states,

Q(w|w′) ≥ p_{T,q}( c_{T̃}(Y^1_{−∞}) = w | c_{T̃}(Y^0_{−∞}) = w′ ),

and the lower bound above can be well estimated from the sample as shown in Theorem 7.1.
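Only the occurrence times of good aggregated states are observable. The hypothetical helper below records the stopping times T_m and the depth-kn suffixes seen at those times; the true contexts Z_m = c_T(·) themselves remain unknown:

```python
def good_times(y, k, good):
    """Times i at which the depth-k suffix of y[0..i] lies in the good set,
    together with those suffixes (the observable trace of {Z_m})."""
    times, suffixes = [], []
    for i in range(k - 1, len(y)):
        s = y[i - k + 1:i + 1]
        if s in good:
            times.append(i)
            suffixes.append(s)
    return times, suffixes

times, zs = good_times("0110100110010110", 2, {"01", "10"})
```

On this toy sample the recorded good suffixes alternate between 01 and 10, in the spirit of the periodic restriction discussed in Example 8.1 below.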
Definition 8.1.
We will call {Z_m}_{m≥1} the restriction of p_{T,q} to G̃.

8.1.1
Properties of {Z_m}_{m≥1}
Property 1.
A few properties of {Z_m}_{m≥1} are in order. {Z_m}_{m≥1} is constructed from an irreducible process {Y_i}_{i≥1}; thus {Z_m}_{m≥1} is irreducible as well. Since {Y_i}_{i≥1} is positive recurrent, so is {Z_m}_{m≥1}. But despite {Y_i}_{i≥1} being aperiodic, {Z_m}_{m≥1} could be periodic, as in the example below. The periodicity of {Z_m}_{m≥1} can, however, be determined from G̃ alone (because T, while unknown, is a full, finite A-ary tree).
Example 8.1.
Let {Y_i}_{i≥1} be a process generated by a context tree model p_{T,q} with T = {11, 01, 10, 00} and q(1|11) = 1/2, q(1|01) = ε, q(1|10) = 1 − ε, q(1|00) = 1/2. If ε > 0, then p_{T,q} represents a stationary aperiodic Markov process. If {Z_m}_{m≥1} is the restriction of the process {Y_i}_{i≥1} to G̃ = {01, 10}, the restricted process is periodic with period 2.
Property 2.
Suppose {Z_m}_{m≥1} is aperiodic. Let μ_Y and μ_Z denote the stationary distributions of the processes {Y_i}_{i≥1} and {Z_m}_{m≥1} respectively, with n samples of a sequence {Y_i}_{i≥1} yielding m_n samples of {Z_m}_{m≥1}. Similarly, let μ_Z(w) denote the stationary probability of the event that w is a suffix of samples in the process {Z_m}_{m≥1}. Then for all w, w′ ∈ G̃ (note that μ_Y(w′) > 0)

μ_Z(w)/μ_Z(w′) = lim_{m_n→∞} [ Σ_{j=1}^{m_n} 1(w ≼ Z_j) / Σ_{j=1}^{m_n} 1(w′ ≼ Z_j) ]   (w.p. 1)
= lim_{n→∞} [ Σ_{i=1}^n 1(c_{T̃}(Y^i_{−∞}) = w) / Σ_{i=1}^n 1(c_{T̃}(Y^i_{−∞}) = w′) ] = μ_Y(w)/μ_Y(w′)   (w.p. 1),

where w ≼ Z_j denotes that w is a suffix of Z_j.
8.2
Estimate Of Stationary Probabilities
Thus far, we have identified a set G̃ ⊆ A^{kn} of good strings using n observations from p_{T,q}. For strings w ∈ G̃, we have been able to approximately estimate the conditional distributions given the past strings w, namely P(Y|w), or equivalently the aggregated parameters q̃(·|w). As mentioned before, it is not at all clear if this information obtained from the sample will allow us to say anything at all about the stationary probabilities of strings w ∈ G̃. This section develops this question, and shows how to interpret the naive counts of w ∈ G̃ in the sample.
8.2.1
Preliminaries
For any (good) state w, let G_w ⊂ A be the set of letters that take w to another good state,

G_w = { a ∈ A : c_{T̃}(wa) ∈ G̃ }.    (8.1)

Our confidence in the empirical counts of good states matching their (aggregated) stationary probabilities follows from a coupling argument, and depends on the following parameter

η_{G̃} = min_{u,v∈G̃} Σ_{a∈G_u∩G_v} min { q̃(a|u), q̃(a|v) },    (8.2)
where q̃(a|u) is as defined before in (5.1) for the aggregated model parameters. Note that for any state s ∈ T of the original process p_{T,q}, if s ∈ T_u, then using Lemma 6.3

q̃(a|u) ≤ q(a|s) / ( 1 − δ_{|u|} ).    (8.3)

Remark
Recall that the deviation bounds in Theorems 7.1 and 7.3 hold simultaneously for all contexts in G̃. The definition above depends only on parameters associated with G̃. Hence we can estimate η_{G̃} from the sample with the same accuracy and confidence given in those theorems.
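Since η_{G̃} depends only on quantities attached to G̃, it can be computed directly from the estimated parameters. A sketch with hypothetical estimates q̃ (all numbers invented for illustration):

```python
def eta(qt, good, alphabet="01"):
    """eta_{G~} of (8.2). qt[s][a] is the (estimated) q~(a|s); G(w) collects
    the letters whose one-step extension keeps the suffix in the good set."""
    def G(w):
        return {a for a in alphabet if (w + a)[-len(w):] in good}
    return min(sum(min(qt[u][a], qt[v][a]) for a in G(u) & G(v))
               for u in good for v in good)

qt = {"00": {"0": 0.8, "1": 0.2}, "10": {"0": 0.3, "1": 0.7}}
assert abs(eta(qt, {"00", "10"}) - 0.3) < 1e-12

# For the periodic restriction of Example 8.1, G("01") and G("10") are
# disjoint, so eta_{G~} = 0 and the coupling-based guarantee degenerates.
qt2 = {"01": {"0": 0.6, "1": 0.4}, "10": {"0": 0.3, "1": 0.7}}
assert eta(qt2, {"01", "10"}) == 0.0
```

The second case illustrates why aperiodicity of the restricted chain matters: a vanishing η_{G̃} gives no one-step chance for the coupled chains to agree.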
The counts of the various w ∈ G̃ now concentrate as shown in the theorem below, and how good the concentration is can be estimated as a function of η_{G̃} (and δ_{kn}) and the total count of all states in G̃. Since G̃ as well as η_{G̃} are well estimated from the sample, we can look at the data to interpret the empirical counts of various substrings of the data.
We are now ready to interpret the counts of various strings in the sample. Let

Δ_j = Σ_{i≥j} δ_i.

For the following results, we require {δ_i}_{i≥1} to be summable. Thus, Δ_j is finite for all j and decreases to 0 as j increases. If d(i) ∼ γ^i, then Δ_j also diminishes as γ^j. But if d(i) ∼ 1/i^r diminishes polynomially, then Δ_j diminishes as 1/j^{r−2}. If d(i) = 1/i^{2+η} for any η > 0, we therefore satisfy the summability of {δ_i}_{i≥1}. Indeed, d(i) can even diminish as 1/(i² poly(log i)) for appropriate polynomials of log i and still allow the counts of good states to converge. In what follows, we assume that δ_i ≤ 1/i.
Definition 8.2.
Let G̃ be the set of good states from Definition 7.2. Let ñ be the total count of all good states in the sample and let ℓ_n denote the smallest integer such that Δ_{ℓ_n} ≤ 1/n. To analyze the naive counts of w ∈ G̃, we define

V_m ≜ E[ N_w | Z_0, Z_1, ..., Z_m ],

where N_w is as in Definition 7.1.

Observe that {V_m}_{m=0}^{ñ} is a Doob martingale. Note that V_0 = E[N_w | Z_0] and V_ñ = N_w.
Remark
To prove Theorem 8.3, we first bound the differences |V_m − V_{m−1}| of the martingale using a coupling argument in Lemma 8.1. Since the memory of the process p_{T,q} could be large, our coupled chains may not actually coalesce in the usual sense. But they get "close enough" that the chance they diverge again within n samples is less than 1/n. Once we bound the differences in the martingale {V_m}_{m=0}^{ñ}, Theorem 8.3 follows as an easy application of Azuma's inequality.
8.2.2
The Coupling Argument
Since for all m ≥ 1

|V_m − V_{m−1}| = | E[N_w | Z_0, ..., Z_m] − E[N_w | Z_0, ..., Z_{m−1}] |
≤ max_{Z′_m, Z″_m} | E[N_w | Z_0, ..., Z′_m] − E[N_w | Z_0, ..., Z″_m] |,

we bound the maximum change in N_w if the m-th good state were changed into another (good) state.
Suppose there are sequences {Z′_j}_{j=m}^n (starting from state Z′_m) and {Z″_j}_{j=m}^n (starting from state Z″_m), both faithful copies of the restriction of p_{T,q} to G̃ but coupled with a joint distribution ω to be described below. From the coupling argument of Chapter 4.2, we have for w ∈ G̃ (hence |w| = kn) and for all ω

| E[N_w | Z_0, ..., Z′_m] − E[N_w | Z_0, ..., Z″_m] | ≤ Σ_{j=m+1}^n ω( Z′_j ≉^{kn} Z″_j ),    (8.4)

where we write

Z′_j ≈^{kn} Z″_j   for   c_{A^{kn}}(Z′_j) = c_{A^{kn}}(Z″_j).
We will bound the right side of (8.4) using properties of ω that we describe next. Note that the summation goes up to n since, no matter what Z′_m and Z″_m are, a length-n sample can have at most n good states on the right side of (8.4). The coupling technique is a convenient "thought experiment" that culminates in Theorem 8.3, giving a deviation bound on the naive counts of states. We do not actually have to generate the two chains as part of the estimation algorithm, nor do we need to observe the martingale V, except for noting V_ñ = N_w.
In Subsection 8.2.3 below, we describe ω; we highlight some properties of the described coupling in Subsection 8.2.4, and use the said properties to obtain a bound on the martingale differences in Subsection 8.2.5.
8.2.3
Description of Coupling ω
Suppose we have Z′_j and Z″_j. This subsection describes how to obtain the next samples Z′_{j+1} and Z″_{j+1} of the two coupled chains, namely how to sample from

ω( Z′_{j+1}, Z″_{j+1} | Z′_j, Z″_j ).

Recall from Chapter 4.2 that, individually taken, both Z′_{j+1} and Z″_{j+1} are faithful evolutions of the restriction of p_{T,q} to G̃. However, given Z′_j and Z″_j, the samples Z′_{j+1} and Z″_{j+1} are not necessarily independent.
To obtain Z′_{j+1} and Z″_{j+1}, starting from states Z′_j and Z″_j we run two copies {Y′_{ji}}_{i≥1} and {Y″_{ji}}_{i≥1}¹ of coupled chains individually faithful to p_{T,q}. If l is the smallest number such that

c_{T̃}( Z′_j Y′_{j1} ··· Y′_{jl} ) ∈ G̃,

then

Z′_{j+1} = c_T( Z′_j Y′_{j1} ··· Y′_{jl} ).

Similarly for Z″_{j+1}.
8.2.3.1
Sampling from the Joint Distribution ω( Y′_{j1}, Y″_{j1} | Z′_j, Z″_j )
While the following description appears verbose, Fig. 8.1 (b) represents it pictorially. Specifically, the chains {Y′_{ji}}_{i≥1} and {Y″_{ji}}_{i≥1} are coupled as follows. We generate a number U_{j1} uniformly distributed in [0, 1]. Given Z′_j and Z″_j, with suffixes u and v respectively in G̃, we let G_u ⊂ A (and G_v similarly) be the set of symbols in A defined as in (8.1). We split the interval from 0 to 1 as follows: for all a ∈ A, we assign intervals r(a) of length min { q(a|u), q(a|v) }, in the following order: we first stack the intervals corresponding to a ∈ G_u ∩ G_v (in any order) starting from 0, and then we put in the intervals corresponding to all other symbols. Now let

( Y′_{j1}, Y″_{j1} ) = (a, a)   if U_{j1} ∈ r(a).

Let

C(A) = Σ_{a′∈A} r(a′) = Σ_{a′∈A} min { q(a′|Z′_j), q(a′|Z″_j) }    (8.5)

be the part of the interval that is already filled up. Thus if U_{j1} < C(A), equivalently with probability C(A), the two chains output the same symbol. We use the rest of the interval [C(A), 1] in any valid way to satisfy the fact that Y′_{j1} is distributed as p_{T,q}(·|Z′_j) and Y″_{j1} is
1. Note that {Y′_{ji}}_{i≥1} and {Y″_{ji}}_{i≥1} are sequences of symbols from A, generated according to transitions defined by p_{T,q}.
Figure 8.1: (a) The conditional probabilities with which Y′_{j1} and Y″_{j1} have to be chosen are q(·|u) and q(·|v) respectively; in the pictured binary example, q(a_1|u) = 0.7, q(a_2|u) = 0.3, q(a_1|v) = 0.4 and q(a_2|v) = 0.6. The line on the left determines the choice of Y′_{j1} and the one on the right the choice of Y″_{j1}. For example, if U_{j1} is chosen uniformly in [0, 1], the probability of choosing Y′_{j1} = a_1 is q(a_1|u). Instead of choosing Y′_{j1} and Y″_{j1} independently, we reorganize the intervals in the lines so as to encourage Y′_{j1} = Y″_{j1}. (b) Reorganizing the interval [0, 1] according to the described construction. Here r(a_1) = min { q(a_1|u), q(a_1|v) } and similarly for r(a_2). If U_{j1} falls in the interval corresponding to r(a_1), then (Y′_{j1}, Y″_{j1}) = (a_1, a_1). If U_{j1} > C(A) in this example, then (Y′_{j1}, Y″_{j1}) = (a_1, a_2). When U_{j1} is chosen uniformly in [0, 1], the probability that Y′_{j1} outputs any symbol is the same as in the picture on the left, and similarly for Y″_{j1}. (Diagrams not reproduced in this extraction.)
distributed as p_{T,q}(·|Z″_j). For one standard approach, for all a assign

r_u(a) = ( q(a|u) − q(a|v) )⁺ = max { q(a|u) − q(a|v), 0 }
and similarly r_v(a). Note that at most one of r_u(a) and r_v(a) can be strictly positive and that for all a, r(a) + r_u(a) = q(a|u) while r(a) + r_v(a) = q(a|v). Therefore,

Σ_{a′∈A} r_u(a′) = Σ_{a′∈A} r_v(a′) = 1 − C(A).
An example of such a construction for a binary alphabet is illustrated in Fig. 8.1, in which we have assumed G_u ∩ G_v = {a_1}. We keep two copies of the interval [C(A), 1]; if U_{j1} > C(A), we output (Y′_{j1}, Y″_{j1}) based on where U_{j1} falls in both copies. We stack the first copy of [C(A), 1] with intervals of length r_u(a) for all a, and the second copy of [C(A), 1] with intervals of length r_v(a) for all a. We say U_{j1} ∈ (r_u(a), r_v(a′)) if U_{j1} ∈ r_u(a) in the first copy and U_{j1} ∈ r_v(a′) in the second copy, and set

( Y′_{j1}, Y″_{j1} ) = (a, a′)   if U_{j1} ∈ (r_u(a), r_v(a′)).
Note in particular that

ω( Y′_{j1} | Z′_j, Z″_j ) = p_{T,q}( Y′_{j1} | Z′_j ),

and similarly for Y″_{j1}. If c_{T̃}(Z′_j Y′_{j1}) ∈ G̃ and c_{T̃}(Z″_j Y″_{j1}) ∈ G̃, we have Z′_{j+1} and Z″_{j+1}. There is therefore no need for further samples Y′_{j2} and Y″_{j2} onwards.
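The one-step coupling just described can be written out explicitly. The sketch below makes one valid choice for the part of [C(A), 1] that the text leaves unspecified: shared mass min{q(a|u), q(a|v)} on the diagonal, with the residuals r_u and r_v paired off proportionally.

```python
def coupled_joint(qu, qv):
    """Joint distribution of the coupled first symbols given suffixes u, v.
    Marginals are qu and qv; agreement probability is at least C(A)."""
    r = {a: min(qu[a], qv[a]) for a in qu}
    C = sum(r.values())
    ru = {a: qu[a] - r[a] for a in qu}   # (q(a|u) - q(a|v))^+
    rv = {a: qv[a] - r[a] for a in qv}   # (q(a|v) - q(a|u))^+
    joint = {(a, a): r[a] for a in qu}
    if C < 1:  # spread residual mass over the two copies of [C(A), 1]
        for a in qu:
            for b in qv:
                joint[a, b] = joint.get((a, b), 0.0) + ru[a] * rv[b] / (1 - C)
    return joint

# The numbers from Fig. 8.1:
qu = {"a1": 0.7, "a2": 0.3}
qv = {"a1": 0.4, "a2": 0.6}
J = coupled_joint(qu, qv)
agree = sum(J.get((a, a), 0.0) for a in qu)
assert all(abs(sum(J.get((a, b), 0.0) for b in qv) - qu[a]) < 1e-12 for a in qu)
assert all(abs(sum(J.get((a, b), 0.0) for a in qu) - qv[b]) < 1e-12 for b in qv)
assert abs(agree - 0.7) < 1e-12   # C(A) = min(0.7, 0.4) + min(0.3, 0.6)
```

On the numbers of Fig. 8.1 the agreement probability is exactly C(A) = 0.7, both marginals are preserved, and the only off-diagonal mass sits on (a_1, a_2), matching the figure's description.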
8.2.3.2
Sampling from ω( {Y′_{ji}, Y″_{ji}}_{i≥2} | Z′_j Y′_{j1}, Z″_j Y″_{j1} )
In case at least one of the following holds, c_{T̃}(Z′_j Y′_{j1}) ∉ G̃ or c_{T̃}(Z″_j Y″_{j1}) ∉ G̃, we proceed as follows.
1. If Y′_{j1} = Y″_{j1} but only one of c_{T̃}(Z′_j Y′_{j1}) ∈ G̃ and c_{T̃}(Z″_j Y″_{j1}) ∈ G̃ holds, then we have one of Z′_{j+1} and Z″_{j+1}. To get the other, we continue (according to transitions defined by p_{T,q}) only its corresponding chain till we get a good state.

2. If Y′_{j1} = Y″_{j1}, c_{T̃}(Z′_j Y′_{j1}) ∉ G̃ and c_{T̃}(Z″_j Y″_{j1}) ∉ G̃, we need to continue both chains. We generate Y′_{j2}, Y″_{j2} as we did for the first samples: by generating a new random number U_{j2} uniform in [0, 1], and by coupling, as in Fig. 8.1 (b), the two distributions q(·|Z′_j Y′_{j1}) and q(·|Z″_j Y″_{j1}). We continue in this fashion so long as the samples in the two chains remain equal but do not hit good contexts. This will be the case that is most important for us later on.

3. If Y′_{jl} ≠ Y″_{jl} at any point and neither chain has seen a good state yet, we just run the chains independently from that point on, for however long it takes each to hit a good aggregated state.
Once again we have

ω( Y′_{j(r+1)} | Z′_j Y′_{j1}, ..., Y′_{jr}, Z″_j Y″_{j1}, ..., Y″_{jr} ) = p_{T,q}( Y′_{j(r+1)} | Z′_j Y′_{j1}, ..., Y′_{jr} ),

and similarly for Y″_{j(r+1)}.
8.2.4
Some Observations on Coupling
For any $r$, let $Z'_r \sim Z''_r$ denote the following event, which happens to be a subset of the case where we do not need $Y'_{r2}$ and $Y''_{r2}$ onwards:

$$\big\{ Y'_{r1} = Y''_{r1} \ \text{and}\ c_{\tilde T}(Z'_r Y'_{r1}) \in \tilde G \ \text{and}\ c_{\tilde T}(Z''_r Y''_{r1}) \in \tilde G \big\}.$$

Recall the definition of $\eta_{\tilde G}$ from (8.2).

Observation 8.1. $\omega(Z'_r \sim Z''_r \mid Z'_{r-1}, Z''_{r-1}) \ge \eta_{\tilde G}(1-\delta_{k_n})$.

Proof. Combining (8.2) and (8.3), we have

$$\omega(Z'_r \sim Z''_r \mid Z'_{r-1}, Z''_{r-1}) = \sum_{a \in A} \min\big\{ q_{Z'_{r-1}}(a),\ q_{Z''_{r-1}}(a) \big\} \ge \eta_{\tilde G}(1-\delta_{k_n}).$$
Furthermore, if $Z'_i \sim Z''_i$ for the $k_n$ consecutive samples $j - k_n + 1 \le i \le j$, then we have $Z'_j \overset{k_n}{\approx} Z''_j$.

To proceed, once $Z'_j \overset{k_n}{\approx} Z''_j$, we would like the two chains to coalesce tighter in every subsequent step, namely we want, for all $1 \le l \le n$, $Z'_{j+l} \overset{k_n+l}{\approx} Z''_{j+l}$. Starting from $Z'_j \overset{k_n}{\approx} Z''_j$, we can have $Z'_{j+1} \overset{k_n+1}{\approx} Z''_{j+1}$ if

1. $Z'_{j+1} \sim Z''_{j+1}$, or

2. the chains $\{Y'_{ji}\}_{i\ge1}$ and $\{Y''_{ji}\}_{i\ge1}$ evolve through a sequence of $m > 1$ steps before hitting a context in $\tilde G$ on the $m$th step, with $Y'_{jl} = Y''_{jl}$ for each $l \le m$.
Observation 8.2. Suppose $Z'_j \overset{k_n}{\approx} Z''_j$. While sampling from $\omega$ for the samples $Z'_{j+1}$ and $Z''_{j+1}$, suppose $Y'_{ji} = Y''_{ji}$ for $i \ge 1$. If $m$ is the first time the first chain hits a good context, namely $m$ is the smallest number such that

$$c_{A^{k_n}}\big(Z'_j \{Y'_{ji}\}_{i=1}^{m}\big) \in \tilde G,$$

it follows that the second chain also hits a good context at the same time, namely

$$c_{A^{k_n}}\big(Z''_j \{Y''_{ji}\}_{i=1}^{m}\big) = c_{A^{k_n}}\big(Z'_j \{Y'_{ji}\}_{i=1}^{m}\big) \in \tilde G.$$

Note that we may not be able to say the above if $Z'_j \not\overset{k_n}{\approx} Z''_j$. Furthermore, now we also have $Z'_{j+1} \overset{k_n+m}{\approx} Z''_{j+1}$.
Let us now bound how likely this sort of increasingly tighter merging is. Because of the way we have set up our coupling, the probability

$$\omega(Y'_{j1} = Y''_{j1} \mid Z'_j \overset{k_n}{\approx} Z''_j) = \sum_{a\in A} \min\big\{ q(a|Z'_j),\ q(a|Z''_j) \big\} \ge \sum_{a\in A} \tilde q\big(a \mid c_{\tilde T}(Z'_j)\big)(1-\delta_{k_n}) = 1-\delta_{k_n},$$

where $q$ and $\tilde q$ are the model parameters associated with $p_{T,q}$ and $p_{\tilde T,\tilde q}$ respectively. Similarly,

$$\omega\Big( Y'_{j(l+1)} = Y''_{j(l+1)} \,\Big|\, Z'_j \overset{k_n}{\approx} Z''_j,\ \{Y'_{ji}\}_{i=1}^{l} = \{Y''_{ji}\}_{i=1}^{l} \Big) \ge 1 - \delta_{k_n+l}.$$
It is important to note that the above statement holds whether

$$c_{A^{k_n}}\big(Z'_j \{Y'_{ji}\}_{i=1}^{l}\big) \in \tilde G \quad \text{or} \quad c_{A^{k_n}}\big(Z'_j \{Y'_{ji}\}_{i=1}^{l}\big) \notin \tilde G,$$

and is simply a consequence of dependencies dying down. Therefore (no matter what $m$ is),

$$\omega(Z'_j \overset{k_n+1}{\approx} Z''_j \mid Z'_{j-1} \overset{k_n}{\approx} Z''_{j-1}) \ge \prod_{l=k_n}^{\infty} (1-\delta_l) \overset{(i)}{\ge} 1 - \Delta_{k_n},$$

where the inequality in (i) is by Lemma 6.2.
The above equation, while accurate, is not the strongest we can say with essentially the same argument. We are now going to progressively make stronger statements with the same arguments. First, note that for all $\ell$, the event

$$\{\exists\, \ell' \le \ell \ \text{s.t.}\ Z'_{j+\ell'} \overset{k_n+\ell}{\approx} Z''_{j+\ell'}\}$$

can happen by going through a sequence of tighter and tighter coalesced transitions of $p_{T,q}$ (no matter in how many of those steps we saw contexts in $\tilde G$). Therefore,

$$\omega(\exists\, \ell' \le \ell \ \text{s.t.}\ Z'_{j+\ell'} \overset{k_n+\ell}{\approx} Z''_{j+\ell'} \mid Z'_j \overset{k_n}{\approx} Z''_j) \ge \prod_{l=k_n}^{\infty}(1-\delta_l) \ge 1 - \Delta_{k_n}.$$
And we can easily strengthen the above to say, for all $\ell$,

$$\omega(Z'_{j+\ell} \overset{k_n+\ell}{\approx} Z''_{j+\ell} \mid Z'_j \overset{k_n}{\approx} Z''_j) \ge 1 - \Delta_{k_n}$$

for the same reason. Indeed, we can further strengthen the above statement to note:
Observation 8.3. If $Z'_j \overset{\ell}{\approx} Z''_j$ for any $\ell \ge k_n$, the chance of ever diverging satisfies

$$\omega(\exists\, l > 0 \ \text{s.t.}\ Z'_{j+l} \not\overset{\ell}{\approx} Z''_{j+l} \mid Z'_j \overset{\ell}{\approx} Z''_j) \le \Delta_\ell.$$
This motivates the following definition of coalescence when dealing with finite length-$n$ samples.

Definition 8.3. We say the chains $\{Z'_i\}_{i\ge1}$ and $\{Z''_i\}_{i\ge1}$ have coalesced if for any $j$,

$$Z'_j \overset{\max\{k_n,\,\ell_n\}}{\approx} Z''_j,$$

where, as in Definition 8.2, $\ell_n$ is the smallest number satisfying $\Delta_{\ell_n} \le 1/n$. Note that our definition of coalescence differs from the one standard in the literature, which requires equality in the above definition.
8.2.5 Deviation Bounds on Stationary Probabilities

Definition 8.4. Let

$$B \overset{\text{def}}{=} B(k_n, \ell_n, \eta_{\tilde G}) \overset{\text{def}}{=} 1 + \frac{4\max\{\ell_n, k_n\}}{\eta_{\tilde G}^{k_n}(1-\Delta_{k_n})},$$

where $\ell_n$ is as in Definition 8.3 and $\eta_{\tilde G}$ is from (8.2).
Remark. The above quantity $B$ will be a determining factor in how well we can estimate stationary probabilities. At this point, it may be useful to estimate the order of magnitude of the various terms. First, $k_n = \alpha_n \log n$ as mentioned before. The parameter $\eta_{\tilde G} < 1$; the results below hold when $\eta_{\tilde G}$ is not too small, namely a constant such that $\eta_{\tilde G}^{k_n} = \omega(1/\sqrt n)$. Since $k_n = \alpha_n \log n$, we can typically take $\alpha_n$ as large as a small constant in light of the above. While the results below do not require it, it is helpful to keep the case $\tilde n = O(n)$ in mind while interpreting them.

Note that if the dependencies die down exponentially, namely $d(i) = \gamma^i$ for some $0 < \gamma < 1$, then $\ell_n = \big\lceil \log\big(n/(1-\gamma)^2\big)/\log(1/\gamma) \big\rceil$. If the dependencies die down polynomially, namely $d(i) = 1/i^r$ for some $r > 2$, then $\ell_n = \Big\lceil 2 + \Big(\frac{n}{(r-1)(r-2)}\Big)^{\frac{1}{r-2}} \Big\rceil$.
Lemma 8.1. Let $\{V_m\}_{m=0}^{\tilde n}$ be the Doob martingale described in Definition 8.2. For all $m \ge 1$,

$$|V_m - V_{m-1}| \le B(k_n, \ell_n, \eta_{\tilde G}),$$

where $B(k_n, \ell_n, \eta_{\tilde G})$ is as defined in Definition 8.4.
Proof. Recall that

$$|V_m - V_{m-1}| = \big| \mathbb{E}[N_w \mid Z_0, \ldots, Z_m] - \mathbb{E}[N_w \mid Z_0, \ldots, Z_{m-1}] \big| \le \max_{Z'_m, Z''_m} \big| \mathbb{E}[N_w \mid Z_0, \ldots, Z'_m] - \mathbb{E}[N_w \mid Z_0, \ldots, Z''_m] \big| \le \sum_{j=m+1}^{n} \omega(Z'_j \not\overset{k_n}{\approx} Z''_j).$$
Let $\tau$ be the smallest number bigger than $m$ such that $Z'_\tau \overset{\ell_n}{\approx} Z''_\tau$. For any positive integer $t_0$, the probability $\tau > t_0 \ell_n$ can be upper bounded by splitting the $t_0 \ell_n$ samples of the two chains $\{Z'_j\}_{j=m+1}^{m+t_0\ell_n}$ and $\{Z''_j\}_{j=m+1}^{m+t_0\ell_n}$ into blocks of length $\ell_n$. While the approach remains the same, we consider two calculations below: (i) $k_n \le \ell_n$ and (ii) $k_n \ge \ell_n$.

If $k_n \le \ell_n$, the probability that the two chains coalesce in any single block, using Observation 8.2 $k_n$ times and then Observation 8.3, is

$$\ge \eta_{\tilde G}^{k_n}(1-\delta_{k_n})^{k_n}(1-\Delta_{k_n}) \overset{(i)}{\ge} \eta_{\tilde G}^{k_n}\Big(1-\frac{1}{k_n}\Big)^{k_n}(1-\Delta_{k_n}) \ge \eta_{\tilde G}^{k_n}(1-\Delta_{k_n})/4,$$

where (i) is because $\delta_{k_n} \le \frac{1}{k_n}$.
Thus,

$$\omega(\tau > t_0 \ell_n) \le \big(1 - \eta_{\tilde G}^{k_n}(1-\Delta_{k_n})/4\big)^{t_0}.$$

Furthermore, $\mathbb{E}\tau - m$ can be bounded using the expected number of blocks before the chains merge in any single block; thus,

$$\mathbb{E}\tau \le m + \frac{4\ell_n}{\eta_{\tilde G}^{k_n}(1-\Delta_{k_n})}.$$
Then for all $j \ge m+1$,

$$\omega(Z'_j \not\overset{\ell_n}{\approx} Z''_j) = \omega(Z'_j \not\overset{\ell_n}{\approx} Z''_j \ \text{and}\ \tau < j) + \omega(Z'_j \not\overset{\ell_n}{\approx} Z''_j \ \text{and}\ \tau \ge j) \overset{(i)}{\le} \Delta_{\ell_n} + \omega(Z'_j \not\overset{\ell_n}{\approx} Z''_j \ \text{and}\ \tau \ge j) \le \frac{1}{n} + \omega(\tau \ge j),$$

where inequality (i) above follows because $Z'_\tau \overset{\ell_n}{\approx} Z''_\tau$ by definition and from Observation 8.3.
Finally, we upper bound (8.4):

$$\sum_{j=m+1}^{n} \omega(Z'_j \not\overset{k_n}{\approx} Z''_j) \le \sum_{j=m+1}^{n} \omega(Z'_j \not\overset{\ell_n}{\approx} Z''_j) \le n \cdot \frac{1}{n} + \sum_{j=m+1}^{n} \omega(\tau \ge j) \le 1 + \mathbb{E}\tau - m \le 1 + \frac{4\ell_n}{\eta_{\tilde G}^{k_n}(1-\Delta_{k_n})}. \quad (8.6)$$
If $k_n > \ell_n$, we follow an identical line of argument, with the exception that we divide the processes into blocks of $k_n$ samples and take $\tau$ to be the first time the two processes satisfy $Z'_\tau \overset{k_n}{\approx} Z''_\tau$. We then bound

$$\omega(Z'_j \not\overset{k_n}{\approx} Z''_j) = \omega(Z'_j \not\overset{k_n}{\approx} Z''_j \ \text{and}\ \tau < j) + \omega(Z'_j \not\overset{k_n}{\approx} Z''_j \ \text{and}\ \tau \ge j) \le \frac{1}{n} + \omega(\tau \ge j),$$

and finally obtain

$$\sum_{j=m+1}^{n} \omega(Z'_j \not\overset{k_n}{\approx} Z''_j) \le n \cdot \frac{1}{n} + \sum_{j=m+1}^{n} \omega(\tau \ge j) \le 1 + \frac{4k_n}{\eta_{\tilde G}^{k_n}(1-\Delta_{k_n})}.$$
Corollary 8.2. Let $\{V_m\}_{m=0}^{\tilde n}$ be the Doob martingale described in Definition 8.2. Then

$$\Big| V_0 - \tilde n\, \frac{\mu(w)}{\mu(\tilde G)} \Big| \le B(k_n, \ell_n, \eta_{\tilde G}),$$

where $B(k_n, \ell_n, \eta_{\tilde G})$ is as defined in Definition 8.4.

Proof. We bound the value of $V_0 = \mathbb{E}[N_w \mid Z_0]$ by a coupling argument identical to Lemma 8.1. Suppose $\{Z'_m\}$ and $\{Z''_m\}$ are coupled copies of the restriction of $p_{T,q}$ to $\tilde G$, where $\{Z'_m\}$ starts from state $Z_0$, while $\{Z''_m\}$ starts from a state chosen randomly according to the stationary distribution of $\{Z_m\}$. The same analysis holds, and the corollary follows using Property 2.
Theorem 8.3. Let $(T, q(T))$ be an unknown model in $\mathcal{M}_d$. If $\{Z_m\}_{m\ge1}$ is aperiodic, then for any $t > 0$, $Y_{-\infty}^{0}$ and $w \in \tilde G$, we have

$$p_{T,q}\Big( \Big| N_n(w) - \tilde n\, \frac{\mu(w)}{\mu(\tilde G)} \Big| \ge t \,\Big|\, Y_{-\infty}^{0} \Big) \le 2\exp\Big( -\frac{(t-B)^2}{2\tilde n B^2} \Big),$$

where $B = B(k_n, \ell_n, \eta_{\tilde G})$ is as defined in Definition 8.4.

Proof. Note that aperiodicity of the restriction $\{Z_m\}_{m\ge1}$ of $p_{T,q}$ to $\tilde G$ does not require an observation of $\{Z_m\}_{m\ge1}$; we can check for this property using only $\tilde G$, as noted in Property 1. The theorem follows by Lemma 8.1, Corollary 8.2 and Azuma's inequality.
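The concentration step above rests on Azuma's inequality: any martingale with bounded differences $|V_m - V_{m-1}| \le B$ satisfies $P(|V_{\tilde n} - V_0| \ge t) \le 2\exp(-t^2/(2\tilde n B^2))$. As a sanity check, the following sketch compares that bound against a simple $\pm B$ random-walk martingale; the walk and all parameter values are illustrative choices of ours, not objects from this chapter:

```python
import math
import random

def azuma_bound(t, n, B):
    """Azuma's inequality: P(|V_n - V_0| >= t) <= 2 exp(-t^2 / (2 n B^2))."""
    return 2.0 * math.exp(-t * t / (2.0 * n * B * B))

def empirical_tail(t, n, B, trials=20000, seed=1):
    """Empirical P(|V_n| >= t) for V_n = sum of n i.i.d. +-B steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        v = sum(B if rng.random() < 0.5 else -B for _ in range(n))
        if abs(v) >= t:
            hits += 1
    return hits / trials

bound = azuma_bound(25, 100, 1)    # about 0.088
tail = empirical_tail(25, 100, 1)  # the observed tail is smaller
print(tail <= bound)
```

As expected, the observed deviation probability sits below the Azuma bound, which is loose but dimension-free in exactly the way the proof of Theorem 8.3 needs.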
Remark. The theorem only has at most constant confidence if $t \le \sqrt{\tilde n}\, B$. Generally speaking,

$$B \approx \frac{\max\{k_n, \ell_n\}}{\eta_{\tilde G}^{k_n}(1-\Delta_{k_n})}.$$

Therefore, the best accuracy to which we can estimate the stationary probability ratio is roughly

$$\frac{t}{\tilde n} \approx \frac{\max\{k_n, \ell_n\}}{\sqrt{\tilde n}\, \eta_{\tilde G}^{k_n}(1-\Delta_{k_n})}.$$

For exponential decay of dependencies, if $\tilde n = O(n)$, both $k_n$ and $\ell_n$ are of the order of $\log n$. Therefore, if $\eta_{\tilde G}$ is such that $\eta_{\tilde G}^{k_n} = O(n^{-\beta})$ for some constant $\beta < 1/2$, the accuracy to which we can estimate the stationary probability ratio $\mu(w)/\mu(\tilde G)$ in Theorem 8.3 is

$$\approx \frac{\log n}{n^{1/2-\beta}}.$$
We conclude with a couple of remarks and conjectures.

It is to be noted that for the concentration bounds on stationary probabilities to hold, we must have $\sum_i \delta_i < \infty$, something that was unnecessary for the transition probabilities. At this point, lacking a matching lower bound on deviation, we cannot say whether this is an artifact of our arguments or an interesting nuance that actually holds. However, we conjecture that estimating stationary probabilities is harder: when $\sum_i \delta_i$ is not finite, we may be able to estimate only transition probabilities without ever estimating stationary probabilities.
Finally, to actually use Theorem 8.3, we further lower bound $\eta_{\tilde G}$ by estimates of aggregate transition probabilities derived from the data using Theorem 7.1 (or 7.3), with the effect that the model-dependent right side in Theorem 8.3 is replaced by another upper bound, potentially worse but entirely data dependent. The new data-dependent upper bound holds with a reduced confidence obtained by a union bound on the confidences of Theorems 7.1 (or 7.3) and 8.3.

Note also that the accuracy to which $\eta_{\tilde G}$ can be estimated is of the same order of magnitude as the bound on the $\ell_1$ distance (between the naive and aggregated parameters) given in Theorem 7.1 (or 7.3). The accuracy in Theorem 7.1 (or 7.3) suffices, since we intend to use Theorem 8.3 when $\eta_{\tilde G}$ scales as $1 - \frac{1}{\log n}$ (as mentioned before, we would like $\eta_{\tilde G}$ to be $\Theta(1)$). While it is unclear whether this scaling is necessary for $\eta_{\tilde G}$, we believe it could be mildly improved.
9 Sampling From Slow Mixing Markov Processes

9.1 First Order Markov Processes
Let $p_q$ be an irreducible and aperiodic first-order Markov process $\{Y_i\}_{i=-\infty}^{\infty}$ taking values in a finite alphabet $S$ with $|S| = r$. Let $Q = [q(s|s')]$ denote the transition probability matrix of the process, where $q(s|s')$ denotes the transition probability from $s'$ to $s$. Furthermore, let $\pi$ denote the unique stationary distribution of $p_q$, which is the unique solution of $\pi Q = \pi$, i.e. $\pi$ is the unique solution to

$$\pi(s) = \sum_{s' \in S} \pi(s')\, q(s|s'), \quad (9.1)$$

for all $s \in S$. As a remark, note that $p_q(Y_1 = s) = \pi(s)$ and $p_q(Y_1 = s \mid Y_0 = s') = q(s|s')$.
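For a small chain, (9.1) can be solved numerically by simply iterating the map $\pi \mapsto \pi Q$; a minimal Python sketch (the two-state chain below is only an illustration, with `Q[i][j]` holding the probability of moving from state `i` to state `j`):

```python
def stationary(Q, iters=2000):
    """Approximate the stationary distribution by power iteration.

    Q[i][j] is the transition probability from state i to state j, so
    pi(s) = sum over s' of pi(s') * Q[s'][s], matching equation (9.1).
    """
    r = len(Q)
    pi = [1.0 / r] * r  # start from the uniform distribution
    for _ in range(iters):
        pi = [sum(pi[i] * Q[i][j] for i in range(r)) for j in range(r)]
    return pi

# From state 0 always move to state 1; from state 1, return w.p. 1/3.
Q = [[0.0, 1.0], [1/3, 2/3]]
print(stationary(Q))  # approximately [0.25, 0.75]
```

Power iteration converges geometrically at a rate governed by the second-largest eigenvalue modulus of $Q$, which is precisely why the slow mixing chains studied in this chapter are difficult to handle from samples.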
9.2 Overview of CFTP Algorithm

9.2.1 CFTP Algorithm
Coupling from the past (CFTP) is an algorithm for exact sampling from $\pi$, introduced in [11], [8, 59]. More specifically, we want to obtain a sample distributed exactly according to $\pi$. To do so, we run $r$ different copies of $p_q$, each starting from a distinct initial state in $S$. More formally, let $\{\cdots, -n_2, -n_1\}$ be an increasing sequence of negative integers and $\{U_{-n}\}_{n\ge1} = \{\cdots, U_{-2}, U_{-1}\}$ be a sequence of i.i.d. random variables chosen uniformly in $[0, 1]$. In the CFTP algorithm, we run $r$ copies of $p_q$ starting with different initial states at times $-n_1, -n_2$ and so on, as described in Algorithm 9.1, until all paths coalesce.
Algorithm 9.1: Coupling From the Past
Input: A Markov process $p_q$ over $S$.
Output: A sample according to the stationary distribution $\pi$.
Initialize: $k \leftarrow 1$, CFlag $\leftarrow$ False.
while CFlag = False do
    Run $r$ copies of the chain starting at time $-n_k$ using $\{U_{-n_k}, \cdots, U_{-2}, U_{-1}\}$.
    if all $r$ chains reach the same state $s$ at time 0 then
        Output $s$;
        CFlag $\leftarrow$ True;
    else
        $k \leftarrow k + 1$;
    end
end
Remark. We emphasize that the same random sequence $\cdots, U_{-2}, U_{-1}$ is used in every iteration of the CFTP algorithm.

When all paths end up in a common state, we say that the chains have coalesced. The term coupling in "coupling from the past" means that while each copy of the chain is individually faithful to $p_q$, together the copies do not evolve independently of each other. This happens because the same sequence of random variables $\cdots, U_{-2}, U_{-1}$ is used to evolve all the chains.
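Algorithm 9.1 is short enough to sketch in full. The sketch below is ours: it doubles the look-back $n_k$ on each failure (any increasing sequence works) and takes an arbitrary user-supplied update rule `step(s, u)`; the essential point is that the same variates $U_{-i}$ are reused every time we restart further in the past:

```python
import random

def cftp(states, step, seed=0):
    """Coupling from the past: one exact draw from the stationary
    distribution.  step(s, u) maps a state s and a uniform variate u
    in [0, 1] to the next state."""
    rng = random.Random(seed)
    us = []   # us[i] plays the role of U_{-(i+1)}; reused across restarts
    n = 1
    while True:
        while len(us) < n:
            us.append(rng.random())      # fresh randomness, deeper in the past
        chains = {s: s for s in states}  # one copy per initial state
        for t in range(n, 0, -1):        # steps at times -n, ..., -1
            u = us[t - 1]                # the SAME u drives every copy
            chains = {s: step(y, u) for s, y in chains.items()}
        if len(set(chains.values())) == 1:   # all copies coalesced at time 0
            return next(iter(chains.values()))
        n *= 2                           # look further into the past

# Two-state chain (the one of Example 9.1 below): s1 -> s2 always;
# s2 -> s1 with probability 1/3, so pi = (1/4, 3/4).
def step(s, u):
    return "s2" if s == "s1" else ("s1" if u < 1/3 else "s2")

samples = [cftp(["s1", "s2"], step, seed=k) for k in range(4000)]
print(samples.count("s2") / len(samples))  # close to pi(s2) = 3/4
```

Each call returns an exact stationary sample, so the empirical frequency of $s_2$ over independent runs concentrates around $3/4$.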
At this point, we provide a simple example which clarifies how the CFTP algorithm works, and why it is necessary to couple from the past instead of coupling into the future.

Example 9.1. Let $p_q$ be a Markov process taking values in $S = \{s_1, s_2\}$ with transition probability matrix

$$Q = \begin{pmatrix} 0 & 1 \\ \frac{1}{3} & \frac{2}{3} \end{pmatrix}.$$
Observe that the process is aperiodic and irreducible with stationary distribution $\pi = (\frac{1}{4}, \frac{3}{4})$. Suppose we perform coupling into the future. Let $k \ge 1$ be the first time at which two copies of the chain, namely $\{Y_n\}_{n\ge0}$ and $\{\bar Y_n\}_{n\ge0}$ starting from $Y_0 = s_1$ and $\bar Y_0 = s_2$, coalesce. Denote the common value of the chains after coalescence by $X$, a random variable taking values in $S$. By the structure of $Q$ and the definition of $k$, no matter what $k$ is, we always have $Y_k = \bar Y_k = s_2$. Therefore, $P(X = s_2) = 1$, which is not equal to the stationary probability. Notice that $P(X = s_2)$ denotes the probability that, at the first time coalescence happens, the common output equals $s_2$.
In contrast, consider the case where we perform coupling from the past according to Algorithm 9.1. Let $-k$ be the first time at which two copies of the chain, $\{Y_{-n}\}_{n\ge0}$ and $\{\bar Y_{-n}\}_{n\ge0}$, coalesce. If the chains coalesce, the outputs at time 0 satisfy $Y_0 = \bar Y_0 = X$ for some $X \in \{s_1, s_2\}$. Notice that if $-k$ is odd, we always have $X = s_2$; similarly, if $-k$ is even, we always have $X = s_1$. See Figure 9.1 for more details. Hence,

$$P(X = s_2) = \sum_{m=1}^{\infty} P(X = s_2,\ -k = -m) = \frac{2}{3} + 0 + \frac{2}{3}\Big(\frac{1}{3}\Big)^2 + 0 + \frac{2}{3}\Big(\frac{1}{3}\Big)^4 + \cdots = \frac{2}{3} \sum_{m=0}^{\infty} \Big(\frac{1}{3}\Big)^{2m} = \frac{3}{4}.$$
[Figure 9.1: CFTP for Example 9.1. The event that both chains coalesce with output $s_1$ ($s_2$) can only happen at even (odd) time indices. The paths leading to coalescence are highlighted with thick arrows.]

Similarly, $P(X = s_1) = \frac{1}{4}$. Thus, we successfully obtain samples according to $\pi$.
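The failure of forward coupling in Example 9.1 is easy to reproduce empirically: when one shared uniform variate drives both chains at every step, the first common state is always $s_2$. A small sketch (the inverse-CDF style update rule is our choice):

```python
import random

def step(s, u):
    # Example 9.1: from s1 always go to s2; from s2, go to s1 w.p. 1/3.
    return "s2" if s == "s1" else ("s1" if u < 1/3 else "s2")

def forward_coupling(seed):
    """Run two coupled chains forward from s1 and s2 until they meet."""
    rng = random.Random(seed)
    y, ybar = "s1", "s2"
    while y != ybar:
        u = rng.random()                  # the SAME u drives both chains
        y, ybar = step(y, u), step(ybar, u)
    return y

# Coupling INTO THE FUTURE is biased: every run coalesces in s2,
# even though pi = (1/4, 3/4).
outputs = {forward_coupling(seed) for seed in range(1000)}
print(outputs)  # {'s2'}
```

Restarting from the past with reused variates, as in Algorithm 9.1, is exactly what removes this bias.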
9.2.2 Analysis of CFTP
For the purpose of analysis, we represent the $r$ different copies of $p_q$ starting from time $-n_k$ as $\{Y_{-n}^{(s,-n_k)}\}_{n\ge0}$, where the superscript $s$ denotes the initial state of each copy, i.e. $Y_{-n_k}^{(s,-n_k)} = s$ for all $s \in S$.

Definition 9.1. The coalescence time of the CFTP algorithm is defined as

$$\tau \triangleq \min\big\{ n : Y_0^{(s,-n)} = Y_0^{(s',-n)} = Y_0,\ \forall s, s' \in S \ \text{and for some}\ Y_0 \in S \big\}. \quad (9.2)$$

Notice that the coalescence time $\tau$ is a random variable. It is important to note that if coalescence happens, the output of the algorithm could be any state from $S$. We are interested in the probability distribution of the output.
Remark. Generally speaking, how fast coalescence happens depends on the coupling construction. It is well known that if the Markov process is irreducible and aperiodic, then the CFTP algorithm terminates with probability 1. In what follows, we assume $\tau$ is finite with probability 1 and use the symbol $P(E)$ to denote the probability of an event $E$ under the coupling distribution.
Observation 9.1. Let $\tau$ be the coalescence time and assume that the output of the algorithm is $Y_0$ (a random variable over $S$). For all starting times $-n_k \le -\tau$, we have

$$Y_0^{(s',-n_k)} = Y_0 \quad \forall s' \in S.$$

In other terms, if we simulate the chains starting from any time before $-\tau$, coalescence still happens with the same output $Y_0$. Observe that $Y_0$ is a random variable taking values in $S$, but no matter what $Y_0$ is, if we run the chains from before $-\tau$, the output will be the same state. The reason for this phenomenon is that we use the same random sequence $\cdots, U_{-2}, U_{-1}$ in different runs of the CFTP algorithm.
Next, we prove the correctness of the algorithm, i.e., we show in Theorem 9.1 that the output of the CFTP algorithm is an exact sample from $\pi$.

Theorem 9.1. Assume we run the CFTP algorithm based on the aforementioned construction, and let $Y_0$ be its output. Then, for all $s \in S$,

$$P(Y_0 = s) = \pi(s),$$

where $P$ denotes the coupling distribution.
Proof. By Observation 9.1, we only need to find the distribution of $Y_0$ at the first time coalescence happens. Let $\tau$ denote the coalescence time and $Y_0$ the corresponding output. Similarly, we define

$$\tau' = \min\big\{ n : Y_{-1}^{(s,-n)} = Y_{-1}^{(s',-n)} = Y_{-1},\ \forall s, s' \in S \ \text{and for some}\ Y_{-1} \in S \big\}. \quad (9.3)$$
Hence, $Y_{-1}$ denotes the common random value when coalescence happens at time $-1$. Note that we always have $\tau' \ge \tau$, i.e., whenever coalescence happens at time $-1$, coalescence also happens at time 0. However, $Y_0$ and $Y_{-1}$ are not necessarily equal. First, we show that $Y_0$ and $Y_{-1}$ have the same distribution. To see this, note that for all $s \in S$, we have

$$P(Y_{-1} = s) = \sum_{n=2}^{\infty} P(Y_{-1} = s, -\tau' = -n) \overset{(a)}{=} \sum_{n=1}^{\infty} P(Y_0 = s, -\tau = -n) = P(Y_0 = s), \quad (9.4)$$

where (a) holds since $p_q$ is time-homogeneous. On the other hand,

$$P(Y_0 = s) \overset{(a)}{=} \sum_{s' \in S} P(Y_{-1} = s')\, q(s|s') \overset{(b)}{=} \sum_{s' \in S} P(Y_0 = s')\, q(s|s'),$$

where (a) holds since the evolution between $Y_{-1}$ and $Y_0$ is according to $p_q$, and (b) holds by (9.4). Hence, $P(Y_0 = s)$ is a solution to (9.1), and uniqueness of the solution implies that $P(Y_0 = s) = \pi(s)$.
9.3 Partial Coupling From the Past

9.3.1 Motivation
Consider the scenario in which, in the CFTP algorithm, all the paths starting from a nonempty subset $G \subseteq S$ coalesce. The natural question that arises is: from which distribution is $Y_0$ then sampled? What are the conditions under which we can be sure about the distribution of $Y_0$?

In Section 9.3.2, we will show that the restriction to $G \subseteq S$ of a Markov process $p_q$ over $S$ has a unique stationary distribution given by $\pi(\cdot)/\pi(G)$. In Section 9.3.3, we derive sufficient conditions under which the CFTP output is an exact sample from $\pi(\cdot)/\pi(G)$. This is particularly important in the slow mixing regime, when the time needed for coalescence to happen is long.
9.3.2 Restriction of a Markov Process to G
Let $G$ be an arbitrary nonempty subset of $S$ with $|G| = r_0$. Let $\pi(G)$ denote the stationary probability of $G$, i.e.

$$\pi(G) = \sum_{w \in G} \pi(w).$$

The transition probability matrix of the process, $Q$, can be written as

$$Q = \begin{pmatrix} Q_{GG} & Q_{GB} \\ Q_{BG} & Q_{BB} \end{pmatrix}, \quad (9.5)$$

where $B = S \setminus G$. Note that $Q_{GG}$ and $Q_{BB}$ are square matrices of size $r_0 \times r_0$ and $(r-r_0) \times (r-r_0)$, respectively.
Let $\{\tilde Y_j\} = \{\cdots, \tilde Y_{-1}, \tilde Y_0, \tilde Y_1, \cdots\}$ be the restriction of $p_q$ to $G$. By the strong Markov property, $\{\tilde Y_j\}$ is a Markov process over $G$, which we denote by $p_{\tilde q}$. Let $\tilde Q = [\tilde q(w|w')]$, with $w, w' \in G$, represent the transition probability matrix of $p_{\tilde q}$. In Theorem 9.3, we will show how to compute $\tilde Q$ from the block matrices within $Q$ given in (9.5). Observe that since $p_q$ is irreducible, $p_{\tilde q}$ is irreducible as well. On the other hand, $|G| = r_0 < \infty$; thus $p_{\tilde q}$ has a unique stationary distribution, denoted by $\tilde\pi$ (see e.g., [60]). Furthermore, if in addition $p_{\tilde q}$ is aperiodic, then empirical counts converge to the stationary distribution $\tilde\pi(\cdot)$. Before proceeding further, we prove the following useful lemma.
Lemma 9.2. Let $A$ be an $m \times m$ matrix with $\max_{1 \le i \le m} |\lambda_i| < 1$, where the $\lambda_i$ are the eigenvalues of $A$. Then

$$(I - A)^{-1} = I + A + A^2 + A^3 + \cdots$$

Proof. The condition $\max_{1 \le i \le m} |\lambda_i| < 1$ ensures that the infinite series $I + A + A^2 + A^3 + \cdots$ is convergent. The equality follows by direct multiplication.
Theorem 9.3. The transition probability matrix of the Markov process $p_{\tilde q}$ is given by

$$\tilde Q = Q_{GG} + Q_{GB}(I - Q_{BB})^{-1} Q_{BG}, \quad (9.6)$$

where $I$ is the identity matrix of size $r - r_0$. Furthermore, $p_{\tilde q}$ has a unique stationary distribution $\tilde\pi$ given by

$$\tilde\pi(w) = \frac{\pi(w)}{\pi(G)}, \quad (9.7)$$

for all $w \in G$.
Proof. Fix $w, w' \in G$. For convenience, we define the following vectors obtained from $Q$:

$$V \triangleq [q(s|w')], \quad s \in B, \qquad \text{and} \qquad W \triangleq [q(w|s)], \quad s \in B.$$

Note that $V$ is the row vector containing the transition probabilities from $w'$ to all states in $B$. Similarly, $W$ is the column vector containing the transition probabilities from all states in $B$ to $w$. To compute $\tilde q(w|w')$, we have
$$\tilde q(w|w') = p_{\tilde q}(\tilde Y_1 = w \mid \tilde Y_0 = w') = p_q(Y_1 = w \mid Y_0 = w') + p_q(Y_2 = w, Y_1 \in B \mid Y_0 = w') + \cdots = q(w|w') + \sum_{k=1}^{\infty} p_q(Y_{k+1} = w, Y_1^k \in B^k \mid Y_0 = w') = q(w|w') + \sum_{k=1}^{\infty} V Q_{BB}^{k-1} W = q(w|w') + V \sum_{k=0}^{\infty} Q_{BB}^{k}\, W = q(w|w') + V (I - Q_{BB})^{-1} W. \quad (9.8)$$
It only remains to show that $(I - Q_{BB})^{-1}$ in (9.8) exists or, equivalently, that $\sum_{k=0}^{\infty} Q_{BB}^k$ is convergent. By Lemma 9.2, a sufficient condition for convergence is that all the eigenvalues of $Q_{BB}$ (say $\lambda_\ell$, $1 \le \ell \le r - r_0$) lie strictly inside the unit circle. To see this, first note that, by irreducibility of $Q$, $Q_{BB}$ and all its minors cannot be stochastic matrices. Second, every diagonal element of $Q_{BB}$ must be strictly less than 1; otherwise, starting from the corresponding state, we would always remain in that state, which contradicts irreducibility. Hence, $Q_{BB}$ is a non-negative matrix which can always be dominated, element-wise, by an irreducible¹ stochastic matrix $L$. Then, the Perron-Frobenius theorem (see e.g., [61]) guarantees that $|\lambda_\ell| \le 1$.

Next, we claim that $|\lambda_\ell| < 1$ for all $1 \le \ell \le r - r_0$. If not, then there exists some $\ell_0$ such that $|\lambda_{\ell_0}| = 1$. Since $0 \le Q_{BB} \le L$ with $L$ irreducible and $|\lambda_{\ell_0}| = 1$, the Perron-Frobenius theorem for irreducible matrices implies that $Q_{BB} = L$, which is clearly a contradiction. Arranging (9.8) for every pair of states in $G$ into matrix form, we obtain (9.6).
Since $p_q$ is irreducible, $p_{\tilde q}$ is irreducible as well. Also, since $G$ is finite, every state in $p_{\tilde q}$ is positive recurrent. Hence, $p_{\tilde q}$ has a unique stationary distribution (see e.g., [60]).

¹ A non-negative matrix $A_{m \times m}$ is called irreducible if for all possible partitions $J$ and $K$ of $\{1, 2, \cdots, m\}$, there exist $j \in J$ and $k \in K$ such that $a_{jk} \ne 0$.
Let $T_w^m$, with $m \ge 1$, denote the $m$'th recurrence time of the state $w$ in $\{Y_i\}$. Note that the $T_w^m - T_w^{m-1}$ are i.i.d. with expected value $1/\pi(w) = \pi(S)/\pi(w)$ [60]. Similarly, let $\tilde T_w^n$, with $n \ge 1$, denote the $n$'th recurrence time of the state $w$ in $\{\tilde Y_j\}$. Observe that the $\tilde T_w^n - \tilde T_w^{n-1}$ are i.i.d. with expected value $\pi(G)/\pi(w)$. Therefore, since the stationary probability of $w$ under $p_{\tilde q}$ equals the reciprocal of its mean recurrence time, we have

$$\tilde\pi(w) = \frac{\pi(w)}{\pi(G)}.$$

Remark. Note that for all $w \in G$, $\tilde\pi(w)$ is the unique solution to

$$\tilde\pi(w) = \sum_{w' \in G} \tilde\pi(w')\, \tilde q(w|w'). \quad (9.9)$$

Furthermore, if $p_{\tilde q}$ is aperiodic, then for all $w, w' \in G$,

$$\lim_{j \to \infty} p_{\tilde q}(\tilde Y_j = w \mid \tilde Y_0 = w') = \tilde\pi(w).$$
Example 9.2. Let $p_q$ be a Markov process on $S = \{s_1, s_2, s_3, s_4\}$ with transition probability matrix

$$Q = \begin{pmatrix} \frac14 & \frac12 & \frac14 & 0 \\ \frac13 & \frac23 & 0 & 0 \\ 0 & \frac23 & 0 & \frac13 \\ 0 & \frac12 & 0 & \frac12 \end{pmatrix}.$$

The unique stationary distribution of the process is $\pi = (\frac{12}{44}, \frac{27}{44}, \frac{3}{44}, \frac{2}{44})$. Consider the restriction of $p_q$ to $G = \{s_2, s_3\}$. Hence, we can rewrite $Q$ as in (9.5) with block matrices given by

$$Q_{GG} = \begin{pmatrix} \frac23 & 0 \\ \frac23 & 0 \end{pmatrix}, \quad Q_{GB} = \begin{pmatrix} \frac13 & 0 \\ 0 & \frac13 \end{pmatrix}, \quad Q_{BG} = \begin{pmatrix} \frac12 & \frac14 \\ \frac12 & 0 \end{pmatrix}, \quad Q_{BB} = \begin{pmatrix} \frac14 & 0 \\ 0 & \frac12 \end{pmatrix},$$

where $B = \{s_1, s_4\}$. Using (9.6), we obtain

$$\tilde Q = \begin{pmatrix} \frac89 & \frac19 \\ 1 & 0 \end{pmatrix}.$$

Note that

$$\tilde\pi = \Big( \frac{9}{10}, \frac{1}{10} \Big) = \Big( \frac{\pi(s_2)}{\pi(s_2)+\pi(s_3)},\ \frac{\pi(s_3)}{\pi(s_2)+\pi(s_3)} \Big).$$
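The numbers in Example 9.2 can be verified mechanically from equation (9.6) using exact rational arithmetic; a small pure-Python sketch (the helper functions are ours):

```python
from fractions import Fraction as F

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_inv(A):
    """Gauss-Jordan inverse of a small square matrix of Fractions."""
    n = len(A)
    M = [list(row) + [F(int(i == j)) for j in range(n)] for i, row in enumerate(A)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)  # find a pivot row
        M[i], M[p] = M[p], M[i]
        piv = M[i][i]
        M[i] = [x / piv for x in M[i]]
        for r in range(n):
            if r != i:
                M[r] = [x - M[r][i] * y for x, y in zip(M[r], M[i])]
    return [row[n:] for row in M]

def restricted_kernel(QGG, QGB, QBG, QBB):
    """Equation (9.6): Q~ = Q_GG + Q_GB (I - Q_BB)^(-1) Q_BG."""
    n = len(QBB)
    I_minus = [[F(int(i == j)) - QBB[i][j] for j in range(n)] for i in range(n)]
    corr = mat_mul(mat_mul(QGB, mat_inv(I_minus)), QBG)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(QGG, corr)]

# Blocks of Example 9.2 (G = {s2, s3}, B = {s1, s4}).
QGG = [[F(2, 3), F(0)], [F(2, 3), F(0)]]
QGB = [[F(1, 3), F(0)], [F(0), F(1, 3)]]
QBG = [[F(1, 2), F(1, 4)], [F(1, 2), F(0)]]
QBB = [[F(1, 4), F(0)], [F(0), F(1, 2)]]
print(restricted_kernel(QGG, QGB, QBG, QBB))  # Fractions equal to [[8/9, 1/9], [1, 0]]
```

Running the same helpers on the blocks of Example 9.3 below reproduces the periodic kernel $[[0, 1], [1, 0]]$.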
Example 9.3. Let $p_q$ be a Markov process on $S = \{s_1, s_2, s_3, s_4\}$ with transition probability matrix

$$Q = \begin{pmatrix} \frac12 & 0 & \frac12 & 0 \\ \frac13 & 0 & \frac23 & 0 \\ 0 & \frac23 & 0 & \frac13 \\ 0 & \frac12 & 0 & \frac12 \end{pmatrix}.$$

The stationary distribution is $\pi = (\frac15, \frac{3}{10}, \frac{3}{10}, \frac15)$. Consider the restriction of $p_q$ to $G = \{s_2, s_3\}$. Using (9.6), we have

$$\tilde Q = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Clearly, the Markov process $p_{\tilde q}$ with transition probability matrix $\tilde Q$ has stationary distribution

$$\tilde\pi = \Big( \frac12, \frac12 \Big) = \Big( \frac{\pi(s_2)}{\pi(s_2)+\pi(s_3)},\ \frac{\pi(s_3)}{\pi(s_2)+\pi(s_3)} \Big).$$

Observe that in this case, $p_{\tilde q}$ is periodic (with period 2).
9.3.3 Analysis of Partial Coupling

In this section, we show that if all the copies of $p_q$ starting from a nonempty subset $G \subseteq S$ (i) maintain their paths in $G$ and (ii) coalesce with the output belonging to $G$, then the output is an exact sample from $\pi(\cdot)/\pi(G)$. We will refer to such situations as partial coalescence with respect to $G$.
As before, we represent the $r$ different copies of $p_q$ starting from time $-n$, with $n \ge 1$, as $\{Y_{-i}^{(s,-n)}\}_{i\ge0}$, where the superscript $s$ denotes the initial state of each copy, i.e. $Y_{-n}^{(s,-n)} = s$ with $s \in S$.
Definition 9.2. We say that the CFTP algorithm at time $-n$ is partially coalesced with respect to $G$ if

$$Y_0^{(w,-n)} = Y_0^{(w',-n)} = \tilde Y_0 \in G \quad \text{and} \quad Y_{-i}^{(w,-n)} \in G,\ \forall i \in \{0, 1, \ldots, n\}, \quad (9.10)$$

for all $w, w' \in G$. By abuse of notation, we refer to $\tilde Y_0$ as the output of the algorithm.
Note that partial coalescence means that the paths starting from $G$ maintain their trajectories in $G$ and, by time 0, have coalesced into some element of $G$. Obviously, if the chains starting from $S \setminus G$ at some point hit $G$ as well, then we have a perfect sample from $\pi$. We are interested in answering the following question: if partial coalescence happens, to which distribution does $\tilde Y_0$ belong?

Definition 9.3. Define the coalescence time w.r.t. $G$ as

$$\tau_G \triangleq \min\big\{ n : \text{partial coalescence w.r.t. } G \text{ occurs at time } -n \big\}. \quad (9.11)$$

Notice that $\tau_G$ is a random variable. It is important to note that if partial coalescence w.r.t. $G$ happens, the output of the algorithm at time 0 could be any symbol from $G$. In other terms, $\tilde Y_0$ is a random variable over $G$. We are interested in the probability distribution of $\tilde Y_0$.
Observation 9.2. Let $\tau_G$ be the coalescence time w.r.t. $G$ and assume that $\tilde Y_0 \in G$ is the output of the algorithm. Suppose there exists a starting time $-n_0 \le -\tau_G$ such that partial coalescence w.r.t. $G$ happens at $-n_0$. Then partial coalescence w.r.t. $G$ still happens at time 0 and, more importantly, the output $\tilde Y_0$ remains the same, irrespective of which state $\tilde Y_0$ is. The reason for this phenomenon is that we use the same random sequence $\cdots, U_{-2}, U_{-1}$ in different runs of the CFTP algorithm.

[Figure 9.2: CFTP for Example 9.4, where partial coalescence happens w.r.t. $G = \{s_2, s_3\}$.]
Example 9.4. A typical example in which partial coalescence with respect to a subset of states occurs is illustrated in Fig. 9.2. In this case, $S = \{s_1, s_2, s_3, s_4\}$ and $G = \{s_2, s_3\}$. Partial coalescence w.r.t. $G$ happens at $-n = -4$. Note that all the paths starting from $s_2$ and $s_3$ maintain their trajectories within $G$, and the output is $s_3$.
Theorem 9.4. Let $G$ be an arbitrary nonempty subset of $S$. Assume that the CFTP algorithm at time $-n$ is partially coalesced with respect to $G$, and let $\tilde Y_0 \in G$ be the output of the CFTP algorithm. Then, for all $w \in G$, we have

$$P(\tilde Y_0 = w) = \frac{\pi(w)}{\pi(G)},$$

where $P$ and $\pi$ denote the coupling distribution and the stationary distribution of $p_q$, respectively.
Proof. By Observation 9.2, we only need to find the distribution of $\tilde Y_0$ at the first time that partial coalescence with respect to $G$ happens. Let $\tau_G$ be the coalescence time w.r.t. $G$ and $\tilde Y_0 \in G$ the output of the algorithm. Suppose we further continue the run of the algorithm (purely for the sake of analysis). Let $-\tau_G'$ be the smallest time for which partial coalescence w.r.t. $G$ happens at time $-1$, and let $\tilde Y_{-1}$ be the corresponding output at time $-1$. Note that we always have $-\tau_G' \le -\tau_G$, i.e., whenever partial coalescence w.r.t. $G$ happens at time $-1$, partial coalescence happens at time 0 as well. However, $\tilde Y_0$ and $\tilde Y_{-1}$ are not necessarily equal. First, we show that $\tilde Y_0$ and $\tilde Y_{-1}$ have the same distribution. To see this, note that for all $w \in G$, we have
$$P(\tilde Y_{-1} = w) = \sum_{i=2}^{\infty} P(\tilde Y_{-1} = w, -\tau_G' = -i) \overset{(a)}{=} \sum_{i=1}^{\infty} P(\tilde Y_0 = w, -\tau_G = -i) = P(\tilde Y_0 = w), \quad (9.12)$$

where (a) holds since $p_q$ is time-homogeneous and thus has a shift-invariant distribution.
On the other hand,

$$P(\tilde Y_0 = w) \overset{(a)}{=} \sum_{w' \in G} P(\tilde Y_{-1} = w')\, \tilde q(w|w') \overset{(b)}{=} \sum_{w' \in G} P(\tilde Y_0 = w')\, \tilde q(w|w'),$$

where (a) holds since the evolution between $\tilde Y_{-1}$ and $\tilde Y_0$ is according to $p_{\tilde q}$, and (b) is by (9.12). Hence, $P(\tilde Y_0 = w)$ is a solution to (9.9), and uniqueness of the solution implies that

$$P(\tilde Y_0 = w) = \tilde\pi(w) = \frac{\pi(w)}{\pi(G)},$$

which completes the proof.
Example 9.5. Let $p_q$ be a Markov process taking values in $S = \{s_1, s_2, s_3\}$ with transition probability matrix

$$Q = \begin{pmatrix} 0 & 1-\epsilon & \epsilon \\ \frac12 & \frac12 & 0 \\ 0 & \epsilon & 1-\epsilon \end{pmatrix},$$

for some $0 < \epsilon < 1$. The stationary distribution is $\pi = (\frac14, \frac12, \frac14)$. Note that this is a slow mixing source for sufficiently small $\epsilon$: if we start from $s_1$ or $s_2$, with high probability we remain in this set. Consider the restriction of $p_q$ to $G = \{s_1, s_2\}$. Using (9.6), we have

$$\tilde Q = \begin{pmatrix} 0 & 1 \\ \frac12 & \frac12 \end{pmatrix}.$$

Clearly, the Markov process $p_{\tilde q}$ with transition probability matrix $\tilde Q$ has stationary distribution

$$\tilde\pi = \Big( \frac13, \frac23 \Big) = \Big( \frac{\pi(s_1)}{\pi(s_1)+\pi(s_2)},\ \frac{\pi(s_2)}{\pi(s_1)+\pi(s_2)} \Big).$$
Suppose we perform coupling from the past and that partial coalescence w.r.t. $G$ happens. Let $\tau_G$ be the coalescence time w.r.t. $G$, and denote the output at time 0 by $\tilde Y_0$. The probability distribution of $\tilde Y_0$ can be computed as follows:

$$P(\tilde Y_0 = s_2) = \sum_{i=1}^{\infty} P(\tilde Y_0 = s_2,\ -\tau_G = -i) = \frac12 + 0 + \frac12\Big(\frac12\Big)^2 + 0 + \frac12\Big(\frac12\Big)^4 + \cdots = \frac12 \sum_{i=0}^{\infty} \Big(\frac12\Big)^{2i} = \frac23.$$

Similarly, $P(\tilde Y_0 = s_1) = \frac13$. Thus, we successfully obtain samples according to $\tilde\pi$.
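The distribution in Example 9.5 can be spot-checked by running CFTP on the restricted chain $p_{\tilde q}$ directly, i.e., on $\tilde Q$ above; the sketch below (the update rule and the restart schedule are our choices) recovers $\tilde\pi = (1/3, 2/3)$ empirically:

```python
import random

def step(s, u):
    # Restricted chain of Example 9.5: s1 -> s2 always; s2 -> s1 w.p. 1/2.
    return "s2" if s == "s1" else ("s1" if u < 0.5 else "s2")

def cftp_restricted(seed):
    """CFTP over G = {s1, s2}, reusing the same U's on every restart."""
    rng = random.Random(seed)
    us, n = [], 1
    while True:
        while len(us) < n:
            us.append(rng.random())      # us[i] stands for U_{-(i+1)}
        chains = {"s1": "s1", "s2": "s2"}
        for t in range(n, 0, -1):        # shared randomness at every step
            chains = {s: step(y, us[t - 1]) for s, y in chains.items()}
        if len(set(chains.values())) == 1:
            return next(iter(chains.values()))
        n += 1                           # start one step further in the past

samples = [cftp_restricted(seed) for seed in range(3000)]
print(samples.count("s2") / len(samples))  # close to pi~(s2) = 2/3
```

The parity pattern of the coalescence times (only odd times yield $s_2$) is exactly the structure used in the series computation above.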
9.3.4 Analysis of Partial Coupling Along a Sequence of Stopping Times
In this section, we further relax the sufficient conditions for partial coalescence given in Definition 9.2. Recall that we represent the $r$ different copies of $p_q$ starting from time $-n$, with $n \ge 1$, as $\{Y_{-i}^{(s,-n)}\}_{i\ge0}$, where the superscript $s$ denotes the initial state of each copy, i.e. $Y_{-n}^{(s,-n)} = s$ with $s \in S$. For a fixed $n$, we define

$$N_{-n}^{w} \triangleq \sum_{i=0}^{n-1} \mathbf{1}\big( Y_{-i}^{(w,-n)} \in G \big). \quad (9.13)$$

Definition 9.4. We say that the CFTP algorithm at time $-n$ is partially coalesced with respect to $G$ if $-n$ is the smallest time index such that

$$Y_0^{(w,-n)} = Y_0^{(w',-n)} = \tilde Y_0 \in G \quad \text{and} \quad N_{-n}^{w} = N_{-n}^{w'} \quad (9.14)$$

for all $w, w' \in G$. By abuse of notation, we refer to $\tilde Y_0$ as the output of the algorithm.
Note that partial coalescence here means that the paths starting from $G$ have coalesced into some element of $G$ by time 0. Furthermore, the restrictions of the paths to $G$ make an equal number of visits to $G$. We are interested in answering the following question: if partial coalescence happens, to which distribution does $\tilde Y_0$ belong?

Definition 9.5. Define the coalescence time w.r.t. $G$ as

$$\tau_G \triangleq \min\big\{ n : \text{partial coalescence w.r.t. } G \text{ occurs at time } -n \big\}. \quad (9.15)$$

Notice that $\tau_G$ is a random variable. It is important to note that if partial coalescence w.r.t. $G$ happens, the output of the algorithm at time 0 could be any symbol from $G$. In other terms, $\tilde Y_0$ is a random variable over $G$. We are interested in the probability distribution of $\tilde Y_0$.
Observation 9.3. Let $\tau_G$ be the coalescence time w.r.t. $G$ and assume that $\tilde Y_0 \in G$ is the output of the algorithm. Suppose there exists a starting time $-n_0 \le -\tau_G$ such that partial coalescence w.r.t. $G$ happens in the interval $[-n_0, -\tau_G]$, i.e.,

$$Y_{-\tau_G}^{(w,-n_0)} = Y_{-\tau_G}^{(w',-n_0)} = \tilde Y_{-\tau_G} \in G,$$

for all $w, w' \in G$. Then partial coalescence w.r.t. $G$ still happens at time 0 and, more importantly, the output $\tilde Y_0$ remains the same, irrespective of which state $\tilde Y_0$ is. The reason for this phenomenon is that we use the same random sequence $\cdots, U_{-2}, U_{-1}$ in different runs of the CFTP algorithm.
Lemma 9.5. Let $p_q$ be an aperiodic, irreducible Markov process taking values in $S$ with stationary distribution $\pi$, the unique solution to $\pi = \pi Q$. If there exists a probability distribution $\nu$ satisfying $\nu = \nu Q^k$ for some $k \in \mathbb{N}$, then $\pi = \nu$.

Proof. Pick an arbitrary $k \in \mathbb{N}$ and suppose $\nu = \nu Q^k$. Note that $Q^k$ is a stochastic matrix which naturally defines a Markov process taking values in $S$. Such a Markov process is irreducible and positive recurrent; thus it has a unique stationary distribution. By assumption, $\nu$ must be the unique stationary distribution of the new process. On the other hand, $\pi$ satisfies $\pi = \pi Q^k$. Hence, $\pi = \nu$.
Theorem 9.6. Let $G$ be an arbitrary nonempty subset of $S$. Assume that the CFTP algorithm at time $-n$ is partially coalesced with respect to $G$ in the sense of Definition 9.4, and let $\tilde Y_0 \in G$ be the output of the CFTP algorithm. Then, for all $w \in G$, we have

$$P(\tilde Y_0 = w) = \frac{\pi(w)}{\pi(G)},$$

where $P$ and $\pi$ denote the coupling distribution and the stationary distribution of $p_q$, respectively.
Proof. Let $\tau_G$ be the coalescence time w.r.t. $G$ and $\tilde Y_0 \in G$ the output of the algorithm. Note that, by Definition 9.4, for all $w, w' \in G$,

$$N_{-\tau_G}^{w} = N_{-\tau_G}^{w'} \triangleq N,$$

for some $N \in \mathbb{N}$. Suppose we further continue the run of the algorithm (purely for the sake of analysis). Let $-\tau_G'$ be the smallest time for which partial coalescence w.r.t. $G$ happens in the interval $[-\tau_G', -\tau_G]$, and let $\tilde Y_{-\tau_G}$ be the corresponding output at time $-\tau_G$.

Note that $\tilde Y_0$ and $\tilde Y_{-\tau_G}$ are not necessarily equal. First, we show that $\tilde Y_0$ and $\tilde Y_{-\tau_G}$ have the same distribution. To see this, note that for all $w \in G$, we have

$$P(\tilde Y_{-\tau_G} = w) = \sum_{i=\tau_G+1}^{\infty} P(\tilde Y_{-\tau_G} = w, -\tau_G' = -i) \overset{(a)}{=} \sum_{i=1}^{\infty} P(\tilde Y_0 = w, -\tau_G = -i) = P(\tilde Y_0 = w), \quad (9.16)$$

where (a) holds since $p_q$ is time-homogeneous and $\tau_G$ is a stopping time.
Let ν = [P(Ỹ0 = w)] and ν′ = [P(Ỹ−τ = w)], for w ∈ G, be the column-vector representations
of the distributions of Ỹ0 and Ỹ−τ, respectively. Hence, by (9.16) we have ν = ν′. Observe
that

ν =(b) Q̃^N ν′ =(c) Q̃^N ν,

where (b) holds because Ỹ0 and Ỹ−τ are samples of pq̃ taken N steps apart, and (c) holds since
ν = ν′. Hence, Lemma 9.5 implies that ν = π̃, i.e.,

P(Ỹ0 = w) = π̃(w) = π(w)/π(G),
which completes the proof.
Corollary 9.7.
Suppose there exists an arbitrary nonempty subset G of S such that
all the paths starting from w ∈ G have coalesced in the regular sense at time −n.
Furthermore, assume that partial coalescence w.r.t. G happens in the interval [−n, −n0]
for some n0 < n. If, for all w, w′ ∈ G,

N^w_{−n} = N^{w′}_{−n} ≜ N ≥ 1,

where N^w_{−n} is defined in (9.13), then all the occurrences of common elements of G within
each path are distributed according to π(w)/π(G) as well.
Remark
Note that compared to Definition 9.4, in the above Corollary we relax the condition
that the output at time 0 belongs to G.
9.3.5 Algorithm for Community Detection
The mixing properties of the process can be used to identify communities in the set of states
of the process. To this end, we consider finding a partition C = {C1, · · · , Cm} of the
state space S such that Ci ⊆ S, Ci ≠ ∅ for all i, Ci ∩ Cj = ∅ for i ≠ j, and ∪_{i=1}^{m} Ci = S.
We will refer to each individual Ci as a cluster.
Note that we can associate with every partition C a cost function J(C) depending on the
application. For instance, in the context of random walks on graphs, one convenient choice
is the cluster-graph edit distance, as we will discuss in Section 10.1.2. The other parameter is the number of clusters m. In our approach, in contrast to traditional methods
such as k-means, m need not be known a priori. In Algorithm 9.2, the number
of clusters is determined automatically by the algorithm, and its value depends on
the cost function J we consider. The algorithm is based on Corollary 9.7, which essentially
describes how to identify a subset of states that have partially coalesced. In essence,
the algorithm identifies communities which have coalesced together during one sample run
of the CFTP algorithm.
Description of the Algorithm: We consider one run of the CFTP algorithm, and our
setup is the same as in Algorithm 9.1 (basic CFTP). We emphasize that the chains are
simulated backward in time and that the random variables used during execution are reused.
Recall that |S| = r. Initially, we have a partition of the state space which consists of all
singletons, i.e., C0 = {C1, · · · , Cr} with each Ci = {s} for some s ∈ S. As we proceed
backward in time to obtain a sample from the stationary distribution, we identify a set of
critical times, denoted by T, which are the times at which some clusters have coalesced.
Definition 9.6.
During the execution of the CFTP algorithm, we say that a time index n is a
critical time if there exist at least two clusters C1 and C2 such that, for all w, w′ ∈ C1 ∪ C2,

Y^{(w,−n)}_0 = Y^{(w′,−n)}_0   and   N^w_{−n} = N^{w′}_{−n},

where

N^w_{−n} = Σ_{i=0}^{n−1} 1{ Y^{(w,−n)}_{−i} ∈ C1 ∪ C2 }.
Note that the above definition can naturally be extended to more than two clusters. As a
consequence, we can identify clusters which have merged at critical times. Observe
that when all the chains coalesce, we have a single partition containing all states. In
other terms, we start off with the all-singleton partition and proceed until we arrive at the
single-cluster partition. During this process, at critical times, we can identify clusters which
merge together to yield a coarser partition. We can compute the cost associated with each
partition that the sampling process yields and find the optimal partition C∗ with
respect to the cost function J.
In the following pseudo-code, we represent the all-singleton partition by C0 and CFlag denotes the
coalescence flag. Here, we assume that we are minimizing a cost function J. A container T
is used to store the critical times.
Remark
We need to carefully define the Markov process depending on the application
at hand in order to exploit the mixing properties of the process for revealing the underlying
community structure. Furthermore, the choice of cost function depends on how the Markov
process is defined.
Remark
Note that in general, the algorithm can stop at any critical time and still yield
a partition (though perhaps not an optimal one). A criterion which empirically seems to be
effective is to record the difference ∆Ti ≜ Ti − Ti−1 between consecutive critical times: if
the difference exceeds a prescribed threshold, the algorithm stops. As we discuss the
simulation results in the next chapter, we will see that if the process is slow mixing, then
the resulting clusters usually coincide with the ground-truth communities.

Algorithm 9.2: Detecting Communities in a Markov process
Input: A Markov process pq over S.
Output: A partition C∗ of S.
Initialize: k ← 1, C = C0, C∗ = C0, T = [0], CFlag = False.
while CFlag = False do
    Simulate r copies of the chain starting at time −k.
    if k is a critical time then
        Update C by merging the clusters which have coalesced;
        Append k to T;
        if J(C) ≤ J(C∗) then
            C∗ ← C;
        end
    end
    if all r chains reach a common state at time 0 then
        CFlag ← True;
    else
        k ← k + 1;
    end
end
return C∗
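The following is a simplified sketch of the idea behind Algorithm 9.2, not the algorithm itself: it runs a forward-in-time grand coupling (one shared uniform per step drives every chain), merges clusters whenever chains coalesce, and keeps the cheapest partition seen. The toy transition structure P and the choice of cost (number of clusters) are illustrative assumptions; the actual algorithm simulates backward in time as in Algorithm 9.2.

```python
import random

def coupled_step(states, rng, P):
    """Advance every chain one step with shared randomness (grand coupling).
    P[s] is a list of (next_state, prob) pairs; one uniform drives all chains."""
    u = rng.random()
    new = {}
    for w, s in states.items():
        acc = 0.0
        for t, p in P[s]:
            acc += p
            if u < acc:
                new[w] = t
                break
        else:
            new[w] = P[s][-1][0]   # guard against floating-point round-off
    return new

def detect_communities(P, start_states, horizon, cost, seed=0):
    """Merge clusters whenever their chains coalesce; keep the cheapest partition."""
    rng = random.Random(seed)
    states = {s: s for s in start_states}              # chain started at s is at s
    clusters = {s: frozenset([s]) for s in start_states}
    best = set(clusters.values())
    best_cost = cost(best)
    for _ in range(horizon):
        states = coupled_step(states, rng, P)
        by_pos = {}                                    # group chains by current state
        for w, s in states.items():
            by_pos.setdefault(s, []).append(w)
        for group in by_pos.values():                  # coalesced chains: merge clusters
            merged = frozenset().union(*(clusters[w] for w in group))
            for w in group:
                clusters[w] = merged
        part = set(clusters.values())
        if cost(part) <= best_cost:
            best, best_cost = part, cost(part)
    return best

# Toy 4-state chain: states {0, 1} and {2, 3} form two well-knit communities.
P = {
    0: [(0, 0.45), (1, 0.45), (2, 0.05), (3, 0.05)],
    1: [(0, 0.45), (1, 0.45), (2, 0.05), (3, 0.05)],
    2: [(2, 0.45), (3, 0.45), (0, 0.05), (1, 0.05)],
    3: [(2, 0.45), (3, 0.45), (0, 0.05), (1, 0.05)],
}
part = detect_communities(P, [0, 1, 2, 3], horizon=30, cost=len, seed=1)
```

Under this coupling, chains started inside the same community coalesce quickly, so the partition {0, 1} / {2, 3} is recovered.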
10 Modeling Community Detection Using Slow Mixing Markov Random Walks
In this chapter, we use the randomized algorithm introduced in Section 9.3.5 for community
detection in graphs by defining Markov random walks on them. The mixing properties of
the random walks we construct are used to identify communities: the more polarized the
communities are, the slower the random walk mixes. We use coupling from the past to
translate the mixing properties of the walk into algorithmic rules that reveal the community
structure of the graph. We then show the performance of our algorithms on specific graph
structures, including stochastic block models and LFR random graphs. We also use a
real-world benchmark network to identify communities in the graph.
10.1 Prior Work

10.1.1 Prior Work on Community Detection
In the community detection problem, the goal is to find the underlying community structure
within a network, which is usually represented as a graph. Communities, also referred to as
clusters, are subsets of vertices in the graph with common properties [62]. It is commonly
understood that nodes within a cluster must be more tightly connected than nodes in disparate
clusters. From a graph-theoretic perspective, the community detection problem refers to
partitioning the vertices into clusters that are strongly connected within themselves while the
cross-cluster connections are weak. In the study of the community detection problem, variants
of random graph models play an important role as flexible probabilistic models that generate
the underlying graph, where the resulting networks resemble real networks [63–65]. While the
literature on this topic is vast, we briefly review the main approaches and recent advances
in the next subsections.
10.1.1.1 Deterministic Approaches
Finding community structure can be translated into graph-theoretic problems, most of
which are NP-hard in the worst case. In these approaches, the problem is usually converted
into a constrained optimization instance where we are interested in partitioning a graph with
respect to an appropriate objective function. In [66, 67] the concept of centrality indices (edge
betweenness) is used to discover communities in social and biological networks. The idea
behind centrality is to remove inter-community edges, namely those which are crossed by many
shortest paths between different pairs of nodes in the network. The modularity-based approaches
[68, 69] are also used for this problem as a tool for identifying the structure of the underlying graph.
10.1.1.2 Spectral Clustering
Spectral clustering is an algebraic technique for community detection in graphs [70, 71]. This
method uses pairwise similarities between nodes in a network to form a (weighted) similarity
graph G (equivalently, a similarity matrix S) and then finds the clusters using spectral
properties of G (or S). The advantages are: (i) there is no assumption on the underlying model
which generates the data, and (ii) the implementation is simple and fast. While
the implementation is simple, a couple of drawbacks need to be addressed: the pairwise
similarity is often not automatically given in community detection problems, the Laplacian
could be ill-conditioned, and the number of clusters needs to be known in advance [72, 73].
In its simplest form, one considers the eigenvectors v1, v2, · · · , vk corresponding to the k largest
eigenvalues of S, where k is the number of clusters. Then, a node i in the graph is assigned
to cluster j if vj is the eigenvector, among v1, v2, · · · , vk, with the largest value in entry
i. Essentially, each node (equivalently, each row of S) is assigned to the eigenvector which
is closest to that node. For two clusters, the Fiedler eigenvector can be partitioned around
its median to yield the clusters, and for more than two clusters, similar extensions can be used
involving k eigenvectors.
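The two-cluster case described above can be sketched in a few lines: threshold the Fiedler vector (the eigenvector of the graph Laplacian's second-smallest eigenvalue) at its median. The small example graph below (two triangles joined by one edge) is an illustrative assumption, not from the text.

```python
import numpy as np

def fiedler_bisection(A):
    """Split a graph into two clusters by thresholding the Fiedler vector
    (eigenvector of the Laplacian's second-smallest eigenvalue) at its median."""
    L = np.diag(A.sum(axis=1)) - A      # unnormalized Laplacian L = D - A
    evals, evecs = np.linalg.eigh(L)    # eigh returns ascending eigenvalues
    f = evecs[:, 1]                     # Fiedler vector
    return f >= np.median(f)            # boolean cluster labels

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
labels = fiedler_bisection(A)
```

On this graph the two triangles land on opposite sides of the median, recovering the natural bisection.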
In [74], motivated by the image segmentation problem, the connection between Markov random walks and spectral clustering is addressed. The authors consider the classical bisection problem
on a graph, with the normalized cut criterion [75] as the clustering cost. In the optimal clustering, each point is assigned to one of the clusters by thresholding the entries of the
eigenvector corresponding to the second smallest eigenvalue of a generalized eigenvalue
problem (involving the Laplacian matrix). It was shown that the same goal can be achieved by
finding the eigenvector corresponding to the second largest eigenvalue of the transition matrix of a
uniform random walk on the graph.
In [76], spectral clustering is analyzed with emphasis on defining a good measure to assess
the quality of clustering. The authors argue that a bicriteria measure is needed to evaluate the
quality of clusters. If we consider the subgraph induced by a cluster, the conductance of the
subgraph represents how good an individual cluster is. However, the weight ratio, defined
as the ratio between the number of edges which fall inside clusters and the total number of
edges in the graph, also plays an important role in the overall quality of the clustering. So, one
needs to maximize the smallest conductance associated with the individual clusters while also
keeping the weight ratio large. Intuitively, the first criterion prevents the formation of clusters
which have smaller, well-separated clusters inside them, while the second avoids the formation
of clusters of small size. Based on these criteria, they proposed a heuristic recursive algorithm
for spectral clustering which recursively partitions the graph into two subgraphs (using
approximate bisection).
Compared to spectral methods, our proposed primitives are based on coupling rather than
explicit spectral decomposition: we just need to be able to run a Markov chain, instead
of defining distances and worrying about the conditioning of the resulting Laplacians. The
natural advantage this offers mirrors the way coupling sometimes circumvents problems with
spectral approaches in proofs of mixing.
10.1.1.3 Random Graph Generative Models
Random graphs which exhibit a specific structure are widely used in studying community
detection algorithms. In the simplest form, a network with two communities can be modeled
as a planted bisection model [77]: a random graph G(n, p, q) in which the n nodes of the graph
are divided into two roughly equal-sized clusters, each pair of nodes within the same
cluster is connected with probability p, and each pair of nodes in different clusters is connected
with probability q. The generalization of this model is called the Stochastic Block Model (SBM)
or Planted Partition Model (PPM), where one allows for more than two communities (with
variable sizes) and the model parameters are not necessarily symmetric between clusters.
There are other variants of planted random graph models, such as the planted dense subgraph
model, the hidden clique problem, planted clustering, etc., which have been studied for
modeling community detection problems.
As a remark, it is worth mentioning that one needs the random graph in the SBM setting
to be connected with high probability in order to recover the communities. So, most of the
literature focuses on regimes in which the average degrees grow logarithmically as n → ∞,
though the special case of constant degree (the sparse regime) is of special interest from a
real-world networks perspective.
The study of planted random graph models concentrates on the following major aspects:
• Exact Recovery: The goal is to identify the regimes in which exact recovery of the clusters
is possible asymptotically, i.e., finding the necessary and sufficient conditions on the model
parameters under which the probability of error vanishes as n → ∞. Much of the effort in this
line has been dedicated to finding tight bounds on the model parameters, namely lower bounds
on p − q and minimum cluster sizes, for which recovering the clusters is possible asymptotically.
• Weak Recovery: The aim here is to find a partition of the nodes which is positively correlated
with the true underlying planted partition (sometimes referred to as the detection
problem in the literature [78, 79]).
• Phase Transition Phenomena: In the context of exact/partial recovery, planted partition
models exhibit a phase transition: there is a threshold, depending on the model
parameters, beyond which exact recovery is impossible asymptotically. The goal is to
find sharp thresholds for different random graph models.
• Algorithmic Issues: The goal here is to provide (i) efficient consistent algorithms for exact/partial recovery and (ii) efficient algorithms for learning/estimation of the underlying
parameters of the model. The approaches that have been widely considered are the maximum
likelihood estimator, spectral methods, semidefinite programming and belief propagation
methods.
For the planted bisection model, it was shown in [80] that in the sparse regime where p, q =
O(1/n) (so that the average degree remains constant), exact recovery is impossible if
(a − b)² < 2(a + b), where p = a/n and q = b/n. Specifically, it was shown that in G(n, a/n, b/n)
with a + b > 2 (which ensures G has a giant component) and (a − b)² ≤ 2(a + b), if we
take two random nodes from G, the probability that they belong to the same cluster tends to
1/2 eventually almost surely, which makes finding the true bisection impossible. Also, the
estimation of the model parameters is impossible in this setting. In [81], it was shown that if
(a − b)² > 2(a + b), the clustering problem is information-theoretically solvable, meaning there
is an algorithm with running time O(n² log n) which finds a bisection whose correlation with
the original partition is bounded away from zero as n → ∞. An independent proof was given in [82].
A spectral method for discovering communities in SBMs with k blocks and constant
edge density is proposed in [83].
In [79], it was shown that for the planted bisection model G(n, α log n/n, β log n/n) with
α > β (where the average degrees grow logarithmically), exact recovery is possible if
(α + β)/2 − √(αβ) > 1. Note that (α + β)/2 > 1 is necessary for the graph to be connected.
They used ideas from information theory which cast the problem of community detection as
a channel coding problem, where the true cluster assignment (0/1 labels on the nodes) is
encoded as a pairwise relation indicator which in turn passes through a memoryless channel
with parameters p and q. Then, they showed that the Maximum Likelihood (ML) estimator
(under the mild assumption that the nodes are uniformly assigned to the clusters, i.e., a
flat prior) will fail to recover the true clusters if (α + β)/2 − √(αβ) < 1.
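The exact-recovery condition above is easy to evaluate directly; a tiny helper makes the threshold concrete (the sample values of α and β are illustrative only).

```python
import math

def exact_recovery_possible(alpha, beta):
    """Threshold from [79] for the planted bisection model
    G(n, alpha*log(n)/n, beta*log(n)/n):
    exact recovery is possible iff (alpha + beta)/2 - sqrt(alpha*beta) > 1."""
    return (alpha + beta) / 2 - math.sqrt(alpha * beta) > 1

# (9 + 1)/2 - sqrt(9) = 2 > 1: recoverable
assert exact_recovery_possible(9, 1)
# (4 + 1)/2 - sqrt(4) = 0.5 < 1: not recoverable
assert not exact_recovery_possible(4, 1)
```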
In [84], the recovery threshold for the general SBM with multiple communities (community sizes
scaling linearly) is characterized in terms of a generalized form of the Chernoff and
Hellinger divergences. Essentially, they used ideas from information theory to quantify how
far apart the clusters are from each other (in terms of the CH-divergence) and then showed that
exact recovery is possible if and only if the minimum pairwise CH-divergence
between community profiles is greater than 1. For the constant-degree regime, they provide
a quasi-linear algorithm for detecting communities.
In [85], for the planted bisection model G(n, α log n/n, β log n/n) with α, β fixed and n → ∞,
it was shown that the semidefinite relaxation of the ML estimator achieves the optimal
threshold given in [79] eventually almost surely. First, they obtained the ML estimator as a
constrained optimization problem which essentially minimizes the bisection size, and then they
relaxed the optimization problem into an instance of an SDP whose optimal solution coincides
with the true cluster assignment with high probability. They used the fact that in an Erdős-Rényi
random graph, the spectral norm (largest singular value) of the adjacency matrix concentrates
around its mean, in addition to some tail bounds on the binomial distribution. Extensions
relaxing the conditions on the number and sizes of the communities were provided in [86].
For exact recovery in the random bisection model, a multi-stage algorithm with running time
linear in the number of edges was provided in [87]. Furthermore, they showed that recovery is
possible iff, with high probability, every node belongs to the same community as the majority
of its neighbors.
The problem of community detection in planted clustering is considered in depth in [88].
The authors show that the space of model parameters can be divided into four regimes:
(i) an impossible regime (via information-theoretic arguments using Fano's
inequality), (ii) a hard regime where the ML estimator succeeds, (iii) an easy regime where a
convex relaxation of ML succeeds, and (iv) a simple regime where simple counting techniques
succeed.
In [89, 90], for the SBM in the sparse regime, a penalized combinatorial objective function
was proposed, based on the observation that in the sparse regime the weight of errors
corresponding to missing edges within a cluster should be smaller than that of errors emanating
from existing edges across different clusters. By convex relaxation of the objective function, they
showed that whenever the minimum cluster size is Ω(√n log² n) and p − q is of the order
Ω(√(pn) log² n / K), where K is the minimum cluster size, exact recovery is possible with
high probability.
In [91], for the sparse SBM, it was shown via spectral arguments that if the adjacency
matrix of G is approximately close (in operator norm) to the density matrix corresponding
to the model, then the partitions can be recovered in polynomial time. In [92], it was shown
via spectral arguments that if (p − q)/√p ≥ Ω(√(log n / n)), then exact recovery of multiple
communities with symmetric parameters is possible with high probability.
Remark
In most of the random graph models mentioned above, it is implicitly assumed that
the true number of clusters is known a priori. As will be discussed in Section 10.1.1.4, this
requirement is no longer necessary in the Correlation Clustering setting, where the objective
function is designed in a way that the optimal cost automatically captures the true
underlying partition. Specifically, in the ML estimator setting for the planted bisection model, one
needs to at least know the true number of clusters in advance (as long as α > β, the ML can
be formulated without knowledge of α or β).
Recently, the recovery problem for the SBM in the setting where the parameters and the
number of clusters are not known a priori was investigated in [93].
A similar problem is the planted dense subgraph problem, where there is a subgraph of size k (k
could be either deterministic or random) within a graph of size n. Every pair of nodes
inside the subgraph is connected with probability p, and otherwise with probability q.
In [94], it was shown that the computational hardness of community detection admits a phase
transition threshold beyond which recovery is impossible. In [95], the cavity method from
statistical physics was used to characterize the phase transition, and it was shown that above a
threshold, the hidden community can be detected by belief propagation.
10.1.1.4 Prior Work on Correlation Clustering
In correlation clustering, we are given a complete graph in which all edges are
labeled either +1 or −1. The goal is to partition the nodes of the graph in such a way that
the total number of disagreements, i.e., (i) the number of edges within a cluster with label −1 and
(ii) the number of edges between different clusters with label +1, is minimized [96]. This is
essentially equivalent to the Graph Cluster Editing Problem, in which one tries to transform a
graph into disjoint cliques using the minimum number of edge deletions and edge additions. In
[97], it was shown that the Cluster Editing problem is NP-complete by reduction from 3-Exact
3-Cover. The advantage of correlation clustering is that we do not need to specify the number
of clusters in advance.
Motivated by a document clustering application, the authors of [98] provide a constant-factor
approximation algorithm for the correlation clustering problem when the goal is minimizing the
number of disagreements. The algorithm is based on a combinatorial argument to bound
the number of errors: the idea is to first create a clustering consisting of "clean clusters"
and singletons, and then bound the number of mistakes within the non-singleton clusters by
counting erroneous triangles.
In [99], a randomized expected 3-approximation algorithm for correlation clustering was
provided. The algorithm, CC-PIVOT, proceeds by randomly picking a vertex in the graph,
forming a cluster from it and its positive neighbors, and removing the nodes of the cluster
from the graph. This procedure is then applied recursively to the rest of the graph until all
nodes in the graph are exhausted.
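The pivoting procedure just described fits in a few lines. Below is a minimal sketch: the "+" edges are passed explicitly, and every absent pair is implicitly labeled "−" (the input encoding is an assumption for illustration, not from [99]).

```python
import random

def cc_pivot(nodes, positive_edges, seed=0):
    """CC-PIVOT sketch: pick a random pivot, cluster it with its remaining
    '+' neighbors, remove the cluster, and recurse on the rest.
    positive_edges is a set of frozenset({u, v}) pairs labeled +1;
    all other pairs are implicitly labeled -1."""
    rng = random.Random(seed)
    remaining = set(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(sorted(remaining))
        cluster = {pivot} | {v for v in remaining
                             if frozenset((pivot, v)) in positive_edges}
        clusters.append(cluster)
        remaining -= cluster
    return clusters

# Two '+' pairs and no other positive labels: the pairs become the clusters.
clusters = cc_pivot([0, 1, 2, 3], {frozenset((0, 1)), frozenset((2, 3))})
```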
The correlation clustering problem can be formalized as an integer linear program [100].
Many approximation algorithms are obtained by LP relaxation of this integer program. For
instance, in [101], a 4-approximation algorithm was proposed which relies on first solving the
LP relaxation in polynomial time. From the solution of the LP, one can form
clusters by centering on a randomly picked node u, finding its near neighbors
N(u), and deciding whether to join them together based on comparing the average LP distance
between u and N(u) against a predefined threshold. In [102], a rounding scheme for the LP is
provided which achieves an approximation factor of 2.06, matching the integrality gap of the LP
relaxation of the CC problem. In [103], correlation clustering for general weighted graphs is
discussed, in which the edges of the graph have non-negative weights in addition to ±1 labels.
The authors provide an O(log n)-approximation algorithm based on LP rounding using the
region growing technique.
In correlation clustering, the cluster sizes in some cases tend to be very small. In [104], the
case where the cluster sizes are constrained is investigated.
The bipartite graph editing problem was studied in [105], where the author showed that the
problem is NP-hard and provided an 11-approximation polynomial-time algorithm which uses
linear programming relaxation and rounding. In [106], a randomized algorithm was provided
which achieves an expected 4-approximation factor.
10.1.2 Overview of Correlation Clustering
Let G = (V, E) be a graph, where V is the set of nodes of the graph and
E ⊆ V × V is the set of edges. Given a node v ∈ V, the set of neighbors of
v is denoted by N(v), where N(v) = {w ∈ V : (v, w) ∈ E}. A graph is called complete if
and only if E = V × V. A complete graph is sometimes called a clique. A graph G = (V, E)
is called a cluster graph if every connected component of G is a clique.
Let E△F = (E \ F) ∪ (F \ E) denote the symmetric set difference between E and F. Given
a graph G = (V, E) and F ⊂ V × V, we say F is an editing set for G if G′ = (V, E△F) is
a cluster graph. In other terms, E△F encapsulates adding and removing edges in the graph
in order to transform G into a cluster graph. Let F denote the set of all editing sets for a
given graph G. Note that F ≠ ∅.
Graph Editing Problem: Given a graph G = (V, E), find an editing set F∗ of minimal
cardinality, i.e.,

F∗ = arg min_{F ∈ F} |F|.
There is an equivalent way to define the Graph Editing Problem as a clustering problem.
Let C = {C1, · · · , Cm} be a partition of the set of nodes V of G. Given a pair (v, w) ∈ V × V,
we say that (v, w) is violated with respect to C if one of the following happens: (i) (v, w) ∈ E
but v ∈ Ci and w ∈ Cj for some i ≠ j; (ii) (v, w) ∉ E but v ∈ Ci and w ∈ Ci for some i.
Let yC(v, w) be the indicator function that is equal to 1 if (v, w) is violated with respect to
C and 0 otherwise. The cost of a clustering C is defined as

J(C) ≜ Σ_{(v,w) ∈ V×V} yC(v, w).        (10.1)
Graph Editing Problem (clustering formulation):
Given a graph G = (V, E), find the clustering C∗ of minimal cost,
i.e.,

C∗ = arg min_C J(C),

where the minimization is over the space of all partitions of V.
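The cost (10.1) translates directly into code. The sketch below counts violated unordered pairs v < w; summing over ordered pairs exactly as written in (10.1) simply doubles the value. The small example graph is an illustrative assumption.

```python
from itertools import combinations

def clustering_cost(nodes, edges, partition):
    """Cost J(C) of (10.1), counted over unordered pairs v < w: a pair is
    violated if it is an edge across clusters, or a non-edge inside a cluster.
    (Summing over ordered pairs as in (10.1) doubles this value.)"""
    label = {v: i for i, cluster in enumerate(partition) for v in cluster}
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for v, w in combinations(nodes, 2):
        same = label[v] == label[w]
        connected = frozenset((v, w)) in edge_set
        cost += (connected and not same) or (not connected and same)
    return cost

# Path-like graph 0-1, 1-2, 2-3 with partition {{0,1},{2,3}}:
# only the cross-cluster edge (1, 2) is violated.
cost = clustering_cost([0, 1, 2, 3], [(0, 1), (2, 3), (1, 2)], [{0, 1}, {2, 3}])
```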
We will use the CC-PIVOT algorithm of [99] to compare the performance of our algorithm.
As mentioned earlier, CC-PIVOT finds a clustering with expected cost at most 3 times the
cost associated with the optimal partition.
10.2 Simulation Results
Our algorithm creates a random walk on the nodes of a similarity graph. We start different,
coupled random walks from different nodes. We then adapt the coupling from the past approach to identify clusters before the chain mixes (rather than to sample from the stationary
distribution). The detailed description of the algorithm is outlined in Algorithm 9.2.
10.2.1 Stochastic Block Model
Let G(n, p, q) be the random graph model with the following properties. Let V =
{1, · · · , n} ≜ [n] be the set of nodes of G. Assume that there exists an underlying true
partition V = ∪_{i=1}^{k} Vi of the nodes such that Σ_{i=1}^{k} |Vi| = n. In this generative model, for all
w, v ∈ V, if v ∈ Vi and w ∈ Vj for some i ≠ j, then w is connected to v with probability q.
If v, w ∈ Vi for some i, then w is connected to v with probability p. All edge choices are
independent of each other, and E(G) denotes the set of edges of G.
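The generative model above is straightforward to sample; a minimal sketch (the function name and input encoding are illustrative assumptions):

```python
import random

def sample_sbm(block_sizes, p, q, seed=0):
    """Draw one graph from G(n, p, q): nodes in the same block are connected
    with probability p, nodes in different blocks with probability q."""
    rng = random.Random(seed)
    block_of = []                       # block_of[v] = index of v's true cluster
    for i, size in enumerate(block_sizes):
        block_of += [i] * size
    n = len(block_of)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):       # each unordered pair decided independently
            prob = p if block_of[u] == block_of[v] else q
            if rng.random() < prob:
                edges.append((u, v))
    return edges, block_of

# Degenerate sanity check: p = 1, q = 0 yields exactly the within-block cliques.
edges, block_of = sample_sbm([3, 3], 1.0, 0.0)
```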
10.2.1.1 Comparing the Cost
In order to implement and compare the results of CC-PIVOT and our algorithm, we first
partition [n] into blocks of different sizes (as depicted in the tables below) and connect nodes
within a block with probability p and nodes in different blocks with probability q. Once
this step is done, we relabel the nodes while preserving the structure: namely,
we draw a uniform permutation σ of [n] and relabel the nodes according to σ so that the edge
structure is preserved.
We consider different scenarios for the size and number of true clusters, as included
in the tables. For fixed n, we vary p, q, |Vi| and draw m samples from G(n, p, q). For each
sample, we run CC-PIVOT and our algorithm and record the associated cost given by
(10.1). Then, we average the costs over the m samples.
Construction of the Random Walk:
In order to use Algorithm 9.2 for the graph partitioning
problem, we first define a non-uniform random walk on the graph as follows.
We assign to every edge e = (v, w) ∈ E(G) a weight given by

w(e) = |N(v) ∩ N(w)|.        (10.2)

This reflects the fact that cross-cluster edges have smaller weight compared to in-cluster edges.
If the current state of the random walk is node v, then the walk continues by following an
edge into N(v) with probability proportional to the weight of that edge. Tables 10.1, 10.2
and 10.3 summarize the average cost of clustering for different model parameters. The cost
of the clusters found by our algorithm is smaller than that of CC-PIVOT in almost all cases.
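The common-neighbor weighting (10.2) can be turned into a transition matrix as sketched below. The fallback to a uniform step when every incident edge has zero weight is my assumption for sparse graphs (it is not stated in the text); the example graph (two triangles joined by a bridge) is also illustrative, and shows the bridge edge receiving zero weight, matching the claim that cross-cluster edges are down-weighted.

```python
import numpy as np

def common_neighbor_walk(A):
    """Transition matrix of the non-uniform walk: edge (v, w) gets weight
    |N(v) ∩ N(w)| as in (10.2); from v, follow an edge with probability
    proportional to its weight. Assumes no isolated nodes; falls back to a
    uniform step over neighbors when every incident edge has zero weight."""
    n = A.shape[0]
    W = (A @ A) * A      # (A @ A)[v, w] = number of common neighbors; mask to edges
    P = np.zeros((n, n))
    for v in range(n):
        row = W[v] if W[v].sum() > 0 else A[v]
        P[v] = row / row.sum()
    return P

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge (2, 3).
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
P = common_neighbor_walk(A)
```

Here the bridge (2, 3) has no common neighbors, so P[2, 3] = 0: the walk stays trapped inside its community, which is exactly the slow-mixing behavior the algorithm exploits.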
Remark
The values of p and q are chosen such that the resulting random graphs have
well-knit clusters. The emphasis is on comparing purely the cluster-graph edit distance obtained
by the proposed method with that obtained by CC-PIVOT. The next section explores the
actual community discovery capabilities of the algorithm.
Table 10.1: All clusters have the same size, with different (p, q); m = 10 samples were taken
uniformly at random from G(n, p, q).
Cluster-Size  Num-Cluster  p     q     Our-Cost  CCPivot-Cost
5             60           0.95  0.05  3020.3    3572.3
5             60           0.95  0.1   5638      6678.1
5             60           0.95  0.15  7579      9434.8
5             60           0.95  0.2   9803.4    11992.4
5             60           0.95  0.25  12190.1   14564.9
5             60           0.9   0.05  2989      3512.7
5             60           0.9   0.1   5396.1    6567.7
5             60           0.9   0.15  7651.9    9383.2
5             60           0.9   0.2   10074.6   11959.3
5             60           0.9   0.25  11946.1   14450.2
5             60           0.85  0.05  2939.2    3522.8
5             60           0.85  0.1   5351.4    6539.1
5             60           0.85  0.15  7769.2    9293.3
5             60           0.85  0.2   9852.9    11912.9
5             60           0.85  0.25  11903.3   14317.8
5             60           0.8   0.05  2958.2    3511.1
5             60           0.8   0.1   5249.1    6518.6
5             60           0.8   0.15  7696      9367.6
5             60           0.8   0.2   9962.9    11896.9
5             60           0.8   0.25  12038.6   14375.9
5             60           0.75  0.05  2991.5    3444.5
5             60           0.75  0.1   5244.7    6494.7
5             60           0.75  0.15  7512      9232.6
5             60           0.75  0.2   9659.4    11944.9
5             60           0.75  0.25  11888.7   14339.6
10            30           0.95  0.05  3382.5    4243.4
10            30           0.95  0.1   5900      7226.8
10            30           0.95  0.15  8666.3    10055.3
10            30           0.95  0.2   10425.2   12725.3
10            30           0.95  0.25  12458.9   14856.6
10            30           0.9   0.05  3355.6    4263.1
10            30           0.9   0.1   5854.9    7199
10            30           0.9   0.15  8162.1    9991.5
10            30           0.9   0.2   10433.5   12545.3
10            30           0.9   0.25  12535.5   14887
10            30           0.85  0.05  3342.3    4096.3
10            30           0.85  0.1   5831.9    7266.8
10            30           0.85  0.15  8122.5    9962.3
10            30           0.85  0.2   10318.5   12510.3
10            30           0.85  0.25  12374.8   14864.4
10            30           0.8   0.05  3319.9    4113.6
10            30           0.8   0.1   5748.9    7023.4
10            30           0.8   0.15  8047.9    9877.5
10            30           0.8   0.2   10332.8   12430.8
10            30           0.8   0.25  12557.9   14724.5
10            30           0.75  0.05  3371.9    4063
10            30           0.75  0.1   5760.3    6992.3
10            30           0.75  0.15  8214.1    9866.7
10            30           0.75  0.2   10193.4   12394.1
10            30           0.75  0.25  12758     14697.1
20            15           0.95  0.05  3743.1    4944.8
20            15           0.95  0.1   6766.1    8176.7
20            15           0.95  0.15  9353.2    11051.2
20            15           0.95  0.2   11466.9   13409.5
20            15           0.95  0.25  13538.8   15793.6
20            15           0.9   0.05  3986.2    4910.1
20            15           0.9   0.1   6861.4    8070
20            15           0.9   0.15  9146.3    11078.7
20            15           0.9   0.2   11491.6   13556
20            15           0.9   0.25  13327.9   15688
20            15           0.85  0.05  4059.8    5030.6
20            15           0.85  0.1   6656.5    8182.1
20            15           0.85  0.15  9252.1    10985.6
20            15           0.85  0.2   11176.5   13437.5
20            15           0.85  0.25  13154.3   15467.9
20            15           0.8   0.05  4083.4    4976.6
20            15           0.8   0.1   6646      8119.2
20            15           0.8   0.15  8990.3    10848.8
20            15           0.8   0.2   11377.2   13384.2
20            15           0.8   0.25  13495.6   15559.7
20            15           0.75  0.05  4101.3    4998
20            15           0.75  0.1   6537.8    8026.4
20            15           0.75  0.15  8870.2    10844.7
20            15           0.75  0.2   11014.7   13187.4
20            15           0.75  0.25  12939.7   15424.4
25            12           0.95  0.05  3821.3    5070
25            12           0.95  0.1   7223.5    8565.8
25            12           0.95  0.15  9839      11350.4
25            12           0.95  0.2   11844.5   13867.8
25            12           0.95  0.25  13893.8   16104.6
25            12           0.9   0.05  4197.4    5235.6
25            12           0.9   0.1   6990.1    8606.9
25            12           0.9   0.15  9518.3    11374.3
25            12           0.9   0.2   11927.8   14014.4
25            12           0.9   0.25  13799.1   16015.7
25            12           0.85  0.05  4522.7    5331.2
25            12           0.85  0.1   7100.2    8655.4
25            12           0.85  0.15  9606.8    11300.4
25            12           0.85  0.2   11796.6   13829
25            12           0.85  0.25  13581.8   15996.5
25            12           0.8   0.05  4529.9    5337.7
25            12           0.8   0.1   7153.5    8534
25            12           0.8   0.15  9302.7    11233
25            12           0.8   0.2   11643.3   13611
25            12           0.8   0.25  13444.6   15808.2
25            12           0.75  0.05  4469      5342.1
25            12           0.75  0.1   6963.6    8488.4
25            12           0.75  0.15  9468.7    11228.4
25            12           0.75  0.2   11842.3   13507.5
25            12           0.75  0.25  13337.4   15783.5
50            6            0.95  0.05  4749.2    5665.4
50            6            0.95  0.1   9515      9353
50            6            0.95  0.15  12120.9   12696.8
50            6            0.95  0.2   13919.9   14970.2
50            6            0.95  0.25  16212     17084.6
50            6            0.9   0.05  5175      6190.6
50            6            0.9   0.1   9331.2    9660.7
50            6            0.9   0.15  11920.4   12796
50            6            0.9   0.2   13891     14965.9
50            6            0.9   0.25  15857.2   16886.1
50            6            0.85  0.05  6226.9    6689
50            6            0.85  0.1   9275.5    9905.3
50            6            0.85  0.15  11717     12803.7
50            6            0.85  0.2   13932.1   15177.5
50            6            0.85  0.25  15593.1   17162.9
50            6            0.8   0.05  6381.7    6909.4
50            6            0.8   0.1   9265      9923.7
50            6            0.8   0.15  11261     12886.4
50            6            0.8   0.2   13492.5   15029.5
50            6            0.8   0.25  15410.1   17112.2
50            6            0.75  0.05  6576.4    6962.3
50            6            0.75  0.1   9027.3    10262.8
50            6            0.75  0.15  11228.3   12722
50            6            0.75  0.2   13125.8   15018
50            6            0.75  0.25  15157.4   16850.7
Table 10.2: All cluster have the same size. Here, n = 600 and m = 10 samples were taken
uniformly at random from G(n, p, q).
Cluster-Size  Num-Cluster  p    q     Our-Cost  CCPivot-Cost
20            30           0.9  0.05  13977.6   17569.8
20            30           0.9  0.1   23232.2   29718.3
20            30           0.9  0.2   41432     51590.5
20            30           0.8  0.05  13442.5   17277.4
20            30           0.8  0.1   22900.5   29426.5
20            30           0.8  0.2   41042.9   51225.3
20            30           0.7  0.05  13162.6   16861.7
20            30           0.7  0.1   22660.1   28816.1
20            30           0.7  0.2   40594     50626.8
30            20           0.9  0.05  15523.7   19053.6
30            20           0.9  0.1   25239.8   32383.3
30            20           0.9  0.2   43827.3   53531.6
30            20           0.8  0.05  15268.7   19123.2
30            20           0.8  0.1   24626.5   31612.9
30            20           0.8  0.2   42694.4   52860
30            20           0.7  0.05  14806.5   18927.2
30            20           0.7  0.1   23953.8   30778.5
30            20           0.7  0.2   43804.8   51836.4
60            10           0.9  0.05  22302.9   22466.3
60            10           0.9  0.1   30677.2   37018.8
60            10           0.9  0.2   49639.1   57397.6
60            10           0.8  0.05  21195.4   23661.8
60            10           0.8  0.1   29711.2   36251.1
60            10           0.8  0.2   47510.2   57262.2
60            10           0.7  0.05  19974.2   23781
60            10           0.7  0.1   28827.9   35849.5
60            10           0.7  0.2   47028     55923.3
75            8            0.9  0.05  21775.2   23835.4
75            8            0.9  0.1   33773.6   37830.4
75            8            0.9  0.2   51431.9   59096
75            8            0.8  0.05  24539.9   25434.4
75            8            0.8  0.1   32683.4   38469.7
75            8            0.8  0.2   49895.8   58606
75            8            0.7  0.05  22970.4   25653.1
75            8            0.7  0.1   31452.8   38101.2
75            8            0.7  0.2   48284.7   57460.1
100           6            0.9  0.05  27000.8   25469.8
100           6            0.9  0.1   38898.6   39732.2
100           6            0.9  0.2   55986.7   61159.9
100           6            0.8  0.05  27914.3   27993.8
100           6            0.8  0.1   37517.8   40637.8
100           6            0.8  0.2   53975.2   61695.5
100           6            0.7  0.05  27272.6   28663
100           6            0.7  0.1   35612.8   41015
100           6            0.7  0.2   51948.8   60161
Table 10.3: All clusters have the same size. Here, n = 1000 and m = 10 samples were taken
uniformly at random from G(n, p, q).
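The Our-Cost and CCPivot-Cost columns in these tables report a correlation-clustering objective. Assuming the standard unweighted disagreement count (edges whose endpoints are separated by the clustering, plus non-adjacent pairs placed in the same cluster), which is our reading rather than a definition stated in this chapter, the cost can be computed as:

```python
def disagreement_cost(adj, clusters):
    """Unweighted correlation-clustering disagreements of a clustering.

    Counts edges whose endpoints land in different clusters, plus
    non-adjacent pairs placed in the same cluster. This is our reading of
    the reported costs, not a definition taken from the chapter.
    """
    label = {v: c for c, cluster in enumerate(clusters) for v in cluster}
    nodes = sorted(adj)
    cost = 0
    for i, v in enumerate(nodes):
        for u in nodes[i + 1:]:
            # A pair disagrees when "edge present" and "same cluster" differ.
            if (u in adj[v]) != (label[v] == label[u]):
                cost += 1
    return cost
```

For example, clustering a triangle plus an isolated node as {0, 1, 2} and {3} incurs zero disagreements.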
Cluster-Size  Num-Cluster  p    q     Our-Cost  CCPivot-Cost
25            40           0.9  0.05   35171.9    46816.5
25            40           0.9  0.1    62572.3    81410.3
25            40           0.9  0.2   111486     140666
25            40           0.8  0.05   34702.6    45708.7
25            40           0.8  0.1    63333.5    80068.7
25            40           0.8  0.2   111505     140124
25            40           0.7  0.05   35055.2    44997.6
25            40           0.7  0.1    60453.7    78519
25            40           0.7  0.2   109507     137924
50            20           0.9  0.05   42940.6    54529.5
50            20           0.9  0.1    70140.5    90096.8
50            20           0.9  0.2   120374     149644
50            20           0.8  0.05   42513.5    54419.9
50            20           0.8  0.1    69067.5    89021
50            20           0.8  0.2   120766     146962
50            20           0.7  0.05   40984.8    53239.7
50            20           0.7  0.1    66568.8    86969.2
50            20           0.7  0.2   114722     145249
100           10           0.9  0.05   60183.6    62712.6
100           10           0.9  0.1    86070.8   100883
100           10           0.9  0.2   135560     161025
100           10           0.8  0.05   58923      65811.4
100           10           0.8  0.1    83953.2   101732
100           10           0.8  0.2   131234     159130
100           10           0.7  0.05   55848      66380.2
100           10           0.7  0.1    80304     100392
100           10           0.7  0.2   127532     156990
125           8            0.9  0.05   69053      66537.2
125           8            0.9  0.1    95253.4   106629
125           8            0.9  0.2   143383     166036
125           8            0.8  0.05   67542.6    70890.6
125           8            0.8  0.1    91870.3   107537
125           8            0.8  0.2   139450     165462
125           8            0.7  0.05   63423.2    72107.1
125           8            0.7  0.1    87507.7   106861
125           8            0.7  0.2   132962     161534
(a) Original community structure.
(b) Input graph to algorithm after permutation of nodes.
Figure 10.1: Bitmap of the original graph.
10.2.1.2 Finding Communities
For finding communities in the SBM, we use a scaled version of a random walk as follows. Given
a realization of G(n, p, q), the weight of an edge e = (v, u) ∈ E(G) is assigned by

    w(e) ≜ w(v, u) = f(|N(v) ∩ N(u)|),                             (10.3)

where f(x) = x^r for some r ∈ Z+. Then, for all u ∈ N(v), the transition probability of the
walk is given by

    P(u|v) = w(v, u) / Σ_{u′ ∈ N(v)} w(v, u′).                     (10.4)
The above scaling can be interpreted as letting neighbors with many common neighbors vote
more strongly on the transition probabilities. A similar scaling method (with an exponential
function) was used in [76] for image segmentation applications.
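As an illustration, the transition rule in (10.3) and (10.4) can be sketched as follows. This is a minimal Python sketch over an adjacency-set representation; the function name `transition_probs` and the uniform fallback for nodes whose edges all receive zero weight are our own assumptions, not part of the dissertation's implementation.

```python
def transition_probs(adj, v, r=2):
    """Transition probabilities of the scaled random walk out of node v.

    adj maps each node to the set of its neighbors. Following Eq. (10.3)
    with f(x) = x**r, the weight of edge (v, u) is |N(v) & N(u)|**r, and
    Eq. (10.4) normalizes these weights over N(v).
    """
    weights = {u: len(adj[v] & adj[u]) ** r for u in adj[v]}
    total = sum(weights.values())
    if total == 0:
        # All weights vanish (v shares no neighbors with any neighbor):
        # fall back to a uniform step so the walk stays well defined.
        return {u: 1.0 / len(adj[v]) for u in adj[v]}
    return {u: w / total for u, w in weights.items()}
```

For example, on the graph {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}, node 0 steps to nodes 1 and 2 with probability 1/2 each, since both edges have exactly one shared neighbor.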
We run a simple simulation on a graph with n = 300 nodes and k = 5 communities, each of
size |V_i| = 60. For this setup, with p = 0.55 and q = 0.55/8 ≈ 0.069, we obtain a sample as
depicted in Figure 10.1. The bitmap representation of the adjacency matrix is used to show the
ground-truth communities within the graph. The input graph to both algorithms, in which the
node labels are permuted uniformly at random, is depicted in Figure 10.1b.
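Such planted-partition samples can be generated directly from the definition of G(n, p, q). Below is a minimal sketch under our own choices of helper name (`sample_sbm`) and adjacency-set representation:

```python
import random

def sample_sbm(sizes, p, q, seed=0):
    """Sample a planted-partition (SBM) graph G(n, p, q).

    Two nodes in the same block are joined with probability p, two nodes
    in different blocks with probability q. Returns adjacency sets.
    """
    rng = random.Random(seed)
    # Block label of each node, e.g. sizes=[60]*5 gives 5 blocks of 60.
    labels = [b for b, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for u in range(v + 1, n):
            if rng.random() < (p if labels[v] == labels[u] else q):
                adj[v].add(u)
                adj[u].add(v)
    return adj
```

Permuting the node labels of such a sample yields the kind of input graph handed to both algorithms.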
Before Scaling: Figure 10.2 shows the result before any scaling is performed. Both methods
fail to detect the community structure.
(a) Our Algorithm without scaling.
(b) CC-PIVOT.
Figure 10.2: Comparing our algorithm with CC-PIVOT when no scaling is performed.
After Scaling: Figures 10.3 and 10.4 show the results after scaling for two different values
of r. As can be seen, by increasing the value of r, the structure can be detected by our
method, while CC-PIVOT remains unable to detect the communities.
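For reference, CC-PIVOT [99] clusters a graph by repeatedly choosing a random pivot and grouping it with its remaining neighbors. A minimal unweighted sketch (the function name and determinism choices are ours):

```python
import random

def cc_pivot(adj, seed=0):
    """CC-PIVOT of Ailon, Charikar and Newman [99] on an unweighted graph.

    Repeatedly pick a random still-unclustered node as pivot and put it in
    one cluster with all of its still-unclustered neighbors.
    """
    rng = random.Random(seed)
    unclustered = set(adj)
    clusters = []
    while unclustered:
        pivot = rng.choice(sorted(unclustered))  # sorted for determinism
        cluster = {pivot} | (adj[pivot] & unclustered)
        clusters.append(cluster)
        unclustered -= cluster
    return clusters
```

Because every neighbor of a low-degree pivot may already be clustered, this scheme readily produces the singleton clusters observed in the experiments.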
(a) Our Algorithm with r = 5.
(b) CC-PIVOT.
Figure 10.3: Comparing the scaled version of our algorithm with CC-PIVOT.
(a) Our Algorithm with r = 6.
(b) CC-PIVOT.
Figure 10.4: Comparing the scaled version of our algorithm with CC-PIVOT.
10.2.2 LFR Random Graphs
Stochastic Block Models are not entirely realistic: the communities usually have approximately
the same size and, furthermore, all vertices have the same expected degree. In realistic
networks, the degree distributions are usually skewed and the sizes of communities vary. We
therefore consider another class of random graphs, introduced in [107], known as LFR benchmark
graphs. In these models, the distributions of both the node degrees and the community sizes
follow power laws.
The construction is as follows. Let n be the number of nodes. First, a sequence of community
sizes distributed by a power law with exponent τ1 is generated. The degree of each node is
distributed by a power law with exponent τ2. Each node shares a fraction 1 − µ of its edges
with the members of its own community and a fraction µ of its edges with other communities.
Edges are connected similarly to the configuration model, in such a way that the degree
sequence is maintained.
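The first step of the construction, drawing community sizes from a power law, can be sketched as follows. The sampler below is an illustrative assumption (discrete sampling proportional to s^(−τ1), with any leftover absorbed into the last community); it is not the exact LFR generator of [107], which additionally matches a power-law degree sequence via configuration-model rewiring.

```python
import random

def power_law_sizes(n, tau, smin, smax, seed=0):
    """Draw community sizes proportional to s**(-tau) on [smin, smax]
    until they cover n nodes exactly."""
    rng = random.Random(seed)
    support = list(range(smin, smax + 1))
    weights = [s ** (-tau) for s in support]
    sizes, remaining = [], n
    while remaining >= smin:
        # Cap each draw by the remaining budget so sizes never exceed n.
        s = min(rng.choices(support, weights=weights)[0], remaining)
        sizes.append(s)
        remaining -= s
    if remaining:
        # Fewer than smin nodes left over: absorb them into the last community.
        sizes[-1] += remaining
    return sizes
```

By construction the sizes sum to n and each community has at least smin members.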
Figure 10.5 shows a realization of an LFR graph with n = 200 nodes, 6 communities, parameters
τ1 = 2, τ2 = 3 and µ = 0.25, and average degree 30. The random walk is constructed according
to (10.3) with r = 4. Figures 10.6 and 10.7 show the outputs of our algorithm and of
CC-PIVOT. For convenience, the communities are shown with different colors in a circular
layout. As can be seen, CC-PIVOT tends to create many singletons compared to our algorithm.
Observe that if, in our algorithm, we track the clusters up to the point before coalescence
happens, we can actually recover the communities, as depicted in Figure 10.8.
Figure 10.5: A realization of LFR network with n = 200 nodes and 6 communities.
Figure 10.6: The output of our algorithm when the cost is optimized.
Figure 10.7: The output of CC-PIVOT algorithm.
Figure 10.8: The output of our algorithm before coalescence happens.
Figure 10.9: The schedule of American football games during the 2000 season. Nodes represent
teams and edges represent games between teams.
10.2.3 Performance on Benchmark Networks
In this section, we use the proposed method for community detection on real-world benchmark networks that are widely used for testing algorithms [62].
10.2.3.1 American College Football
This is the network of American football games between Division IA colleges during the Fall
2000 regular season [66]. There are in total 115 teams and 615 edges in the graph; every edge
represents a game between its end nodes. Figures 10.9, 10.10 and 10.11 show the original
network and the clusters created by CC-PIVOT and by the proposed method. Teams that are in
the same conference are drawn with the same color. As can be seen, almost all teams are
clustered correctly by our method in Figure 10.11.
Figure 10.10: The output of CC-PIVOT algorithm.
Figure 10.11: The output of our algorithm.
Bibliography
[1] B. M. Fitingof, “Universal methods of coding for the case of unknown statistics,” in
Proceedings of the 5th Symposium on Information Theory, Moscow–Gorky, 1972, pp. 129–135.
[2] J. Shtarkov, “Coding of discrete sources with unknown statistics,” in Topics in Information Theory (Coll. Math. Soc. J. Bolyai, no. 16), I. Csiszár and P. Elias, Eds.
Amsterdam, The Netherlands: North Holland, 1977, pp. 559–574.
[3] R. Krichevsky and V. Trofimov, “The performance of universal coding,” IEEE Transactions on Information Theory, vol. 27, no. 2, pp. 199–207, March 1981.
[4] P. Laplace, Philosophical Essays on Probabilities, translated by A. Dale from the 5th
(1825) ed. Springer-Verlag, New York, 1995.
[5] W. Gale and K. Church, “What is wrong with adding one?” in Corpus-Based Research into
Language, N. Oostdijk and P. de Haan, Eds. Rodopi, Amsterdam, 1994, pp. 189–198.
[6] I. Good, “The population frequencies of species and the estimation of population parameters,” Biometrika, vol. 40, no. 3/4, pp. 237–264, December 1953.
[7] A. Orlitsky, N. Santhanam, and J. Zhang, “Always Good Turing: Asymptotically optimal
probability estimation,” in Proceedings of the 44th Annual Symposium on Foundations of
Computer Science, October 2003.
[8] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American
Mathematical Society, 2009.
[9] K. Farzan and D. A. Johns, “Coding schemes for chip-to-chip interconnect applications,”
IEEE Transactions on Very Large Scale Integration Systems, vol. 14, no. 4, pp. 393–406,
Apr 2006.
[10] D. J. Aldous, “Random walks on finite groups and rapidly mixing Markov chains,”
in Séminaire de Probabilités XVII - 1981/82, Springer Lecture Notes in Mathematics
986, 1983.
[11] J. G. Propp and D. B. Wilson, “Exact sampling with coupled Markov chains and
applications to statistical mechanics,” Random Structures and Algorithms, vol. 9, no.
1-2, pp. 223–252, 1996.
[12] R. Krichevsky, Universal Compression and Retrieval. Kluwer Academic Publishers, 1993.
[13] B. Y. Ryabko, “Compression-based methods for nonparametric density estimation,
on-line prediction, regression and classification for time series,” in Information Theory
Workshop, Porto, Portugal, 2008, pp. 271–275.
[14] B. Ryabko, “Prediction of random sequences and universal coding,” Problemy
Peredachi Informatsii, vol. 24, no. 2, pp. 3–14, 1988.
[15] ——, “A fast adaptive coding algorithm,” Problemy Peredachi Informatsii, vol. 26,
no. 4, pp. 305–317, 1990.
[16] J. Rissanen, “Universal coding, information, prediction, and estimation,” IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, July 1984.
[17] M. Weinberger, J. Rissanen, and M. Feder, “A universal finite memory source,” IEEE
Transactions on Information Theory, vol. 41, no. 3, pp. 643–652, May 1995.
[18] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, “The context-tree weighting
method: basic properties.” IEEE Transactions on Information Theory, vol. 41, no. 3,
pp. 653–664, 1995.
[19] T. J. Tjalkens, Y. Shtarkov, and F. Willems, “Sequential weighting algorithms for
multi-alphabet sources,” in 6th Joint Swedish-Russian International Workshop on Information Theory, August 1993, pp. 230–234.
[20] J. Kieffer, “A unified approach to weak universal source coding,” IEEE Transactions
on Information Theory, vol. 24, no. 6, pp. 674–682, November 1978.
[21] F. M. J. Willems, “The context-tree weighting method: Extensions,” IEEE Transactions on Information Theory, vol. 44, pp. 792–798, 1998.
[22] I. Csiszár and Z. Talata, “Context tree estimation for not necessarily finite memory
processes, via BIC and MDL,” IEEE Transactions on Information Theory, vol. 52, no. 3,
Mar 2006.
[23] A. Garivier, “Consistency of the unlimited BIC context tree estimator,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4630–4635, Sep 2006.
[24] A. Galves, V. Maume-Deschamps, and B. Schmitt, “Exponential inequalities for
VLMC empirical trees,” ESAIM: Probability and Statistics, vol. 12, pp. 219–229, Jan
2008.
[25] A. Garivier and F. Leonardi, “Context tree selection: A unifying view,” Stochastic
Processes and their Applications, vol. 121, no. 11, pp. 2488–2506, Nov 2011.
[26] J. Rissanen, “A universal data compression system,” IEEE Transactions on Information Theory, vol. 29, no. 5, pp. 656–664, Sep 1983.
[27] P. Bühlmann and A. Wyner, “Variable length Markov chains,” Annals of Statistics,
vol. 27, no. 2, pp. 480–583, 1999.
[28] G. Morvai and B. Weiss, “On sequential estimation and prediction for discrete time
series,” Stochastics and Dynamics, vol. 7, no. 4, pp. 417–437, 2007.
[29] I. Csiszár, “Large-scale typicality of Markov sample paths and consistency of MDL
order estimators,” IEEE Transactions on Information Theory, vol. 48, no. 6, pp. 1616–
1628, Jun 2002.
[30] I. Csiszár and P. C. Shields, “The consistency of the BIC Markov order estimator,”
Annals of Statistics, vol. 28, pp. 1601–1619, 2000.
[31] I. Csiszár and Z. Talata, “On rate of convergence of statistical estimation of stationary
ergodic processes,” IEEE Transactions on Information Theory, vol. 56, no. 8, pp.
3637–3641, Aug 2010.
[32] A. Galves and F. Leonardi, “Exponential inequalities for empirical unbounded context
trees,” In and Out of Equilibrium 2, vol. 60, pp. 257–269, 2008.
[33] G. Morvai and B. Weiss, “Order estimation of Markov chains,” IEEE Transactions on
Information Theory, vol. 51, no. 4, pp. 1496–1497, 2005.
[34] R. V. Handel, “On the minimal penalty for Markov order estimation,” Probability
theory and related fields, vol. 150, no. 3-4, pp. 709–738, 2011.
[35] N. Merhav, M. Gutman, and J. Ziv, “On the estimation of the order of a Markov chain
and universal data compression,” IEEE Transactions on Information Theory, vol. 35,
no. 5, pp. 1014–1019, 1989.
[36] L. Finesso, C.-C. Liu, and P. Narayan, “The optimal error exponent for Markov order
estimation,” IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1488–1497, 1996.
[37] G. Ciuperca and V. Girardin, “Estimation of the entropy rate of a countable Markov
chain,” Communications in Statistics - Theory and Methods, vol. 36, no. 14, pp. 2543–2557,
2007.
[38] V. Girardin and A. Sesboüé, “Comparative construction of plug-in estimators of the
entropy rate of two-state Markov chains,” Methodology and Computing in Applied
Probability, vol. 11, no. 2, pp. 181–200, 2009.
[39] Z. Zhang and X. Zhang, “A normal law for the plug-in estimator of entropy,” IEEE
Transactions on Information Theory, vol. 58, no. 5, pp. 2745–2747, 2012.
[40] H. Chang, “On convergence rate of the Shannon entropy rate of ergodic Markov chains
via sample-path simulation,” Statistics & Probability Letters, vol. 76, no. 12, pp. 1261–1264,
2006.
[41] G. H. Yari and Z. Nikooravesh, “Estimation of the entropy rate of ergodic Markov
chains,” Journal of Iranian Statistical Society, vol. 11, no. 1, pp. 75–85, 2012.
[42] A. D. Wyner and J. Ziv, “Some asymptotic properties of the entropy of a stationary
ergodic data source with applications to data compression,” IEEE Transactions on
Information Theory, vol. 35, no. 6, pp. 1250–1258, 1989.
[43] H. Cai, S. R. Kulkarni, and S. Verdú, “Universal entropy estimation via block sorting,”
IEEE Transactions on Information Theory, vol. 50, no. 7, pp. 1551–1561, 2004.
[44] Y. Gao, I. Kontoyiannis, and E. Bienenstock, “From the entropy to the statistical
structure of spike trains,” in Proceedings of IEEE Symposium on Information Theory,
Seattle, USA, 2006, pp. 645–649.
[45] I. Kontoyiannis, P. H. Algoet, Y. M. Suhov, and A. J. Wyner, “Nonparametric entropy
estimation for stationary processes and random fields, with applications to English
text,” IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1319–1327, 1998.
[46] I. Kontoyiannis and Y. M. Suhov, “Stationary entropy estimation via string matching,”
in Proceedings of the Data Compression Conference, 1996.
[47] M. Huber, “Perfect sampling using bounding chains,” Annals of Applied Probability,
vol. 4, no. 2, pp. 734–753, 2004.
[48] W. S. Kendall and J. Møller, “Perfect simulation using dominating processes on ordered
spaces, with application to locally stable point processes,” Advances in Applied
Probability, vol. 32, no. 3, pp. 844–865, 2000.
[49] M. Luby, D. Randall, and A. Sinclair, “Markov chain algorithms for planar lattice
structures (extended abstract),” in 36th Annual Symposium on Foundations of Computer Science, 1995, pp. 150–159.
[50] D. B. Wilson, “How to couple from the past using a read-once source of randomness,”
Random Structures and Algorithms, vol. 16, no. 1, pp. 85–113, 2000.
[51] W. Feller, An Introduction to Probability Theory and Its Applications, 2nd ed. John
Wiley and Sons, 1957, vol. 1.
[52] P. Ferrari and A. Galves, “Coupling and regeneration for stochastic processes,” Notes
for a minicourse presented in XIII Escuela Venezolana de Matematicas, 2000.
[53] V. Guruswami, “Rapidly mixing Markov chains: A comparison of techniques,” Writeup
at MIT Laboratory for Computer Science, 2000.
[54] T. Cover and J. Thomas, Elements of Information Theory. John Wiley and Sons, 1991.
[55] D. Duarte, A. Galves, and N. L. Garcia, “Markov approximation and consistent estimation of unbounded probabilistic suffix trees,” Bulletin of the Brazilian Mathematical
Society, vol. 37, no. 4, pp. 581–592, 2006.
[56] P. Collet, A. Galves, and F. Leonardi, “Random perturbations of stochastic processes
with unbounded variable length memory,” Electronic Journal of Probability, vol. 13,
pp. 1345–1361, 2008.
[57] B. Yu, “Rates of convergence for empirical processes of stationary mixing sequences,”
Annals of Probability, vol. 22, no. 1, pp. 94–116, 1994.
[58] J. R. Norris, Markov Chains. Cambridge Series in Statistical and Probabilistic Mathematics, 1998.
[59] J. G. Propp and D. Wilson, “Coupling from the past: a user's guide,” Microsurveys in
Discrete Probability, vol. 41, pp. 181–192, 1998.
[60] J. S. Rosenthal, A first look at rigorous probability theory. World Scientific, 2006.
[61] E. Seneta, Non-negative matrices and Markov chains. Springer, 2006.
[62] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3, pp.
75–174, 2010.
[63] P. Erdős and A. Rényi, “On random graphs I,” Publ. Math. Debrecen, vol. 6, pp.
290–297, 1959.
[64] ——, “On the evolution of random graphs,” Publ. Math. Inst. Hung. Acad. Sci, vol. 5,
pp. 17–61, 1960.
[65] M. E. Newman, D. J. Watts, and S. H. Strogatz, “Random graph models of social
networks,” Proceedings of the National Academy of Sciences, vol. 99, no. suppl 1, pp.
2566–2572, 2002.
[66] M. Girvan and M. E. Newman, “Community structure in social and biological networks,” Proceedings of the national academy of sciences, vol. 99, no. 12, pp. 7821–7826,
2002.
[67] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical review E, vol. 69, no. 2, p. 026113, 2004.
[68] J. Chen, O. R. Zaïane, and R. Goebel, “Detecting communities in social networks
using max-min modularity,” in SDM, vol. 3, no. 1. SIAM, 2009, pp. 20–24.
[69] M. E. Newman, “Modularity and community structure in networks,” Proceedings of
the National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
[70] ——, “Finding community structure in networks using the eigenvectors of matrices,”
Physical review E, vol. 74, no. 3, p. 036104, 2006.
[71] S. White and P. Smyth, “A spectral clustering approach to finding communities in
graphs,” in SDM, vol. 5. SIAM, 2005, pp. 76–84.
[72] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17,
no. 4, pp. 395–416, 2007.
[73] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an
algorithm,” Advances in neural information processing systems, vol. 2, pp. 849–856,
2002.
[74] R. Kannan, S. Vempala, and A. Vetta, “On clusterings: Good, bad and spectral,”
Journal of the ACM (JACM), vol. 51, no. 3, pp. 497–515, 2004.
[75] J. Shi and J. Malik, “Normalized cuts and image segmentation,” Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 22, no. 8, pp. 888–905, 2000.
[76] M. Meila and J. Shi, “Learning segmentation by random walks,” in Neural Information
Processing Systems, 2001.
[77] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,”
Social networks, vol. 5, no. 2, pp. 109–137, 1983.
[78] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Asymptotic analysis of the
stochastic block model for modular networks and its algorithmic applications,” Physical
Review E, vol. 84, no. 6, p. 066106, 2011.
[79] E. Abbe, A. S. Bandeira, and G. Hall, “Exact recovery in the stochastic block model,”
arXiv preprint arXiv:1405.3267, 2014.
[80] E. Mossel, J. Neeman, and A. Sly, “Reconstruction and estimation in the planted
partition model,” Probability Theory and Related Fields, pp. 1–31, 2014.
[81] ——,
“A proof of the block model threshold conjecture,”
arXiv preprint
arXiv:1311.4115, 2013.
[82] L. Massoulié, “Community detection thresholds and the weak Ramanujan property,”
in Proceedings of the 46th Annual ACM Symposium on Theory of Computing. ACM,
2014, pp. 694–703.
[83] P. Chin, A. Rao, and V. Vu, “Stochastic block model and community detection in
the sparse graphs: A spectral algorithm with optimal rate of recovery,” arXiv preprint
arXiv:1501.05021, 2015.
[84] E. Abbe and C. Sandon, “Community detection in general stochastic block
models:
fundamental limits and efficient recovery algorithms,” arXiv preprint
arXiv:1503.00609, 2015.
[85] B. Hajek, Y. Wu, and J. Xu, “Achieving exact cluster recovery threshold via semidefinite programming,” arXiv preprint arXiv:1412.6156, 2014.
[86] ——, “Achieving exact cluster recovery threshold via semidefinite programming: Extensions,” arXiv preprint arXiv:1502.07738, 2015.
[87] E. Mossel, J. Neeman, and A. Sly, “Consistency thresholds for the planted bisection
model,” arXiv preprint arXiv:1407.1591, 2014.
[88] Y. Chen and J. Xu, “Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices,” arXiv preprint
arXiv:1402.1267, 2014.
[89] Y. Chen, S. Sanghavi, and H. Xu, “Clustering sparse graphs,” in Advances in neural
information processing systems, 2012, pp. 2204–2212.
[90] ——, “Improved graph clustering,” Information Theory, IEEE Transactions on,
vol. 60, no. 10, pp. 6440–6455, 2014.
[91] A. Coja-Oghlan, “Graph partitioning via adaptive spectral techniques,” Combinatorics, Probability and Computing, vol. 19, no. 02, pp. 227–284, 2010.
[92] F. McSherry, “Spectral partitioning of random graphs,” in Proceedings of the 42nd
IEEE Symposium on Foundations of Computer Science, 2001, pp. 529–537.
[93] E. Abbe and C. Sandon, “Recovering communities in the general stochastic block
model without knowing the parameters,” arXiv preprint arXiv:1506.03729, 2015.
[94] B. Hajek, Y. Wu, and J. Xu, “Computational lower bounds for community detection
on random graphs,” arXiv preprint arXiv:1406.6625, 2014.
[95] A. Montanari, “Finding one community in a sparse graph,” arXiv preprint
arXiv:1502.05680, 2015.
[96] H. Becker, “A survey of correlation clustering,” Advanced Topics in Computational
Learning Theory, pp. 1–10, 2005.
[97] R. Shamir, R. Sharan, and D. Tsur, “Cluster graph modification problems,” Discrete
Applied Mathematics, vol. 144, no. 1, pp. 173–182, 2004.
[98] N. Bansal, A. Blum, and S. Chawla, “Correlation clustering,” Machine Learning,
vol. 56, pp. 89–113, 2004.
[99] N. Ailon, M. Charikar, and A. Newman, “Aggregating inconsistent information: ranking and clustering,” in Proceedings of the 37th Annual ACM Symposium on Theory of
Computing, Baltimore, MD, USA, May 2005, pp. 684–693.
[100] V. V. Vazirani, Approximation algorithms. Springer-Verlag, 2002.
[101] M. Charikar, V. Guruswami, and A. Wirth, “Clustering with qualitative information,”
in Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, Cambridge, MA, USA, October 2003, pp. 524–533.
[102] S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev, “Near optimal LP rounding
algorithm for correlation clustering on complete and complete k-partite graphs,”
arXiv preprint arXiv:1412.0681, 2014.
[103] E. D. Demaine, D. Emanuel, A. Fiat, and N. Immorlica, “Correlation clustering in
general weighted graphs,” Theoretical Computer Science, vol. 361, no. 2, pp. 172–187,
2006.
[104] G. J. Puleo and O. Milenkovic, “Correlation clustering with constrained cluster sizes
and extended weights bounds,” SIAM Journal on Optimization, vol. 25, no. 3, pp.
1857–1872, 2015.
[105] N. Amit, “The bicluster graph editing problem,” Ph.D. dissertation, Tel Aviv University, 2004.
[106] N. Ailon, N. Avigdor-Elgrabli, E. Liberty, and A. van Zuylen, “Improved approximation algorithms for bipartite correlation clustering,” SIAM Journal on Computing,
vol. 41, no. 5, pp. 1110–1121, 2012.
[107] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Physical review E, vol. 78, no. 4, p. 046110, 2008.