Convergence Analysis of Reweighted Sum-Product Algorithms
Tanya G. Roosta, Member, IEEE, Martin J. Wainwright, Member, IEEE, and Shankar S. Sastry, Member, IEEE
Abstract—Markov random fields are designed to represent structured dependencies among large collections of random variables, and are well suited to capturing the structure of real-world signals. Many fundamental tasks in signal processing (e.g., smoothing, denoising, and segmentation) require efficient methods for computing (approximate) marginal probabilities over subsets of nodes in the graph. The marginalization problem, though solvable in linear time for graphs without cycles, is computationally intractable for general graphs with cycles. This intractability motivates the use of approximate "message-passing" algorithms. This paper studies the convergence and stability properties of the family of reweighted sum-product algorithms, a generalization of the widely used sum-product or belief propagation algorithm, in which messages are adjusted with graph-dependent weights. For pairwise Markov random fields, we derive various conditions that are sufficient to ensure convergence, and also provide bounds on the geometric convergence rates. When specialized to the ordinary sum-product algorithm, these results provide a strengthening of previous analyses. We prove that some of our conditions are necessary and sufficient for subclasses of homogeneous models, but not for general models. Experimental simulations on various classes of graphs validate our theoretical results.
Index Terms—Approximate marginalization, belief propagation, convergence analysis, graphical models, Markov random
fields, sum-product algorithm.
I. INTRODUCTION

GRAPHICAL models provide a powerful framework for capturing the complex statistical dependencies exhibited by real-world signals. Accordingly, they play a central role in many disciplines, including statistical signal and image processing [1], [2], statistical machine learning [3], and computational biology. A core problem common to applications in all of these domains is the marginalization problem, namely, to compute marginal distributions over local subsets of random variables. For graphical models without cycles, including Markov chains and trees [see Fig. 1(a)], the marginalization problem is exactly solvable in linear time via the sum-product algorithm, which operates in a distributed manner by passing "messages" between nodes in the graph. This sum-product framework includes many well-known algorithms as special cases, among them the α–β or forward–backward algorithm for Markov chains, the peeling algorithm in bioinformatics, and the Kalman filter; see the review articles [1], [2], and [4] for further background on the sum-product algorithm and its uses in signal processing.

Manuscript received July 29, 2007; revised April 14, 2008. Published August 13, 2008 (projected). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Deniz Erdogmus. The work of M. J. Wainwright was supported by NSF grants DMS-0528488 and CAREER CCF-0545862. The work of T. Roosta was supported by TRUST (The Team for Research in Ubiquitous Secure Technology), which receives support from NSF grant CCF-0424422 and the following organizations: Cisco, ESCHER, HP, IBM, Intel, Microsoft, ORNL, Qualcomm, Pirelli, Sun, and Symantec.
T. G. Roosta and S. S. Sastry are with the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]; [email protected]).
M. J. Wainwright is with the Department of Electrical Engineering and Computer Sciences and the Department of Statistics, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2008.924136

Fig. 1. Examples of graphical models. (a) Marginalization can be performed in linear time on chain-structured graphs, or more generally on any tree (graph without cycles), such as those used in multi-resolution signal processing [1]. (b) Lattice-based model frequently used in image processing [21], for which the marginalization problem is intractable in general.
Although Markov chains/trees are tremendously useful,
many classes of real-world signals are best captured by
graphical models with cycles. [For instance, the lattice or
grid-structured model in Fig. 1(b) is widely used in computer
vision and statistical image processing.] At least in principle,
the nodes in any such graph with cycles can be clustered
into “supernodes,” thereby converting the original graph into
junction tree form [5], to which the sum-product algorithm can
be applied to obtain exact results. However, the cluster sizes
required by this junction tree formulation—and hence the computational complexity of the sum-product algorithm—grow
exponentially in the treewidth of the graph. For many classes of
graphs, among them the lattice model in Fig. 1(b), the treewidth
grows in an unbounded manner with graph size, so that the
junction tree approach rapidly becomes infeasible. Indeed,
the marginalization problem is known to be computationally
intractable for general graphical models.
This difficulty motivates the use of efficient algorithms
for computing approximations to the marginal probabilities.
In fact, one of the most successful approximate methods is
based on applying the sum-product updates to graphs with
cycles. Convergence and correctness, though guaranteed for
tree-structured graphs, are no longer ensured when the underlying graph has cycles. Nonetheless, this “loopy” form of the
sum-product algorithm has proven very successful in many
applications [1], [2], [4]. However, there remain a variety of
theoretical questions concerning the use of sum-product and
related message-passing algorithms for approximate marginalization. It is well known that the standard form of sum-product
message-passing is not guaranteed to converge, and in fact
may have multiple fixed points in certain regimes. Recent
work has shed some light on the fixed points and convergence
properties of the ordinary sum-product algorithm. Yedidia et
al. [6] showed that sum-product fixed points correspond to
local minima of an optimization problem known as the Bethe
variational principle. Tatikonda and Jordan [7] established an
elegant connection between the convergence of the ordinary
sum-product algorithm and the uniqueness of Gibbs measures
on the associated computation tree, and provided several sufficient conditions for convergence. Wainwright et al. [8] showed
that the sum-product algorithm can be understood as seeking
an alternative reparameterization of the distribution, and used
this to characterize the error in the approximation. Heskes [9]
discussed convergence and its relation to stability properties of
the Bethe variational problem. Other researchers [10], [11] have
used contraction arguments to provide sharper sufficient conditions for convergence of the standard sum-product algorithm.
Finally, several groups [12]–[15] have proposed modified algorithms for solving versions of the Bethe variational problem
with convergence guarantees, albeit at the price of increased
complexity.
In this paper, we study the broader class of reweighted
sum-product algorithms [16]–[19], including the ordinary
sum-product algorithm as a special case, in which messages
are adjusted by edge-based weights determined by the graph
structure. For suitable choices of these weights, the reweighted
sum-product algorithm is known to have a unique fixed point
for any graph and any interaction potentials [16]. An additional desirable property of reweighted sum-product is that
the message-passing updates tend to be more stable, as confirmed by experimental investigation [16], [18], [19]. This
algorithmic stability should be contrasted with the ordinary
sum-product algorithm, which can be highly unstable due to
phase transitions in the Bethe variational problem [6], [7].
Despite these encouraging empirical results, current theoretical
understanding of the stability and convergence properties of
reweighted message-passing remains incomplete.
The main contributions of this paper are a number of theoretical results characterizing the convergence properties of
reweighted sum-product algorithms, including the ordinary
sum-product updates as a special case. Beginning with the
simple case of homogeneous binary models, we provide sharp
guarantees for convergence, and prove that there always exists
a choice of edge weights for which the associated reweighted
sum-product algorithm converges. We then analyze more general inhomogeneous models, both for binary variables and the
general multinomial model, and provide sufficient conditions
for convergence of reweighted algorithms. Relative to the
bulk of past work, a notable feature of our analysis is that
it incorporates the benefits of making observations, whether
partial or noisy, of the underlying random variables in the
Markov random field to which message-passing is applied.
Intuitively, the convergence of message-passing algorithms
should be a function of both the strength of the interactions
between random variables and the local observations,
which tend to counteract the interaction terms. Indeed, when
specialized to the ordinary sum-product algorithm, our results
provide a strengthening of the best previously known convergence guarantees for sum-product [7], [9]–[11]. As pointed
out after the initial submission of this paper, independent work by Mooij and Kappen [20] yields a similar refinement for the case of the ordinary sum-product algorithm and binary variables; the result
given here applies more generally to reweighted sum-product,
as well as higher-order state spaces. As we show empirically,
the benefits of incorporating observations into convergence
analysis can be substantial, particularly in the regimes most
relevant to applications.
The remainder of this paper is organized as follows. In
Section II, we provide basic background on graphical models
(with cycles), and the class of reweighted sum-product algorithms that we study. Section III provides convergence
analysis for binary models, which we then extend to general
discrete models in Section IV. In Section V, we describe experimental results that illustrate our findings, and we conclude
in Section VI.
II. BACKGROUND
In this section, we provide some background on Markov
random fields, and message-passing algorithms, including the
reweighted sum-product that is the focus of this paper.
A. Graphical Models
Undirected graphical models, also known as Markov random fields, are based on associating a collection of random variables with the vertices of a graph. More precisely, an undirected graph G = (V, E) consists of a set of vertices V = {1, 2, ..., n} and a set of edges E ⊂ V × V joining pairs of vertices. Each random variable X_s is associated with a node s ∈ V, and the edges in the graph (or, more precisely, the absences of edges) encode Markov properties of the random vector X = (X_1, ..., X_n). These Markov properties are captured by a particular factorization of the probability distribution of the random vector X, which is guaranteed to break into a product of local functions on the cliques of the graph. (A graph clique is a subset of vertices that are all joined by edges.)
In this paper, we focus on discrete (multinomial) random variables X_s ∈ X_s := {0, 1, ..., m − 1} with distribution specified according to a pairwise Markov random field. Any such model has a probability distribution of the form

    p(x) = (1/Z) Π_{s∈V} ψ_s(x_s) Π_{(s,t)∈E} ψ_{st}(x_s, x_t).     (1)

Here the quantities ψ_s and ψ_{st} are potential functions that depend only on the value x_s, and the pair of values (x_s, x_t), respectively. Otherwise stated, each singleton potential ψ_s is a real-valued function of x_s, whose values can be represented as an m-vector, whereas each edge potential ψ_{st} is a real-valued mapping on the Cartesian product X_s × X_t, whose values can be represented as an m × m matrix. With this setup, the marginalization problem is to compute the singleton marginal distributions

    p(x_s) = Σ_{x′ : x′_s = x_s} p(x′)

and possibly higher-order marginal distributions (e.g., p(x_s, x_t)) as well. Note that if viewed naively, the summation defining p(x_s) involves an exponentially growing number of terms (m^{n−1}, to be precise).
B. Sum-Product Algorithms
The sum-product algorithm is an iterative algorithm for
computing either exact marginals (on trees), or approximate
marginals (for graphs with cycles). It operates in a distributed
manner, with nodes in the graph exchanging statistical information via a sequence of “message-passing” updates. For
tree-structured graphical models, the updates can be derived as
a form of non-serial dynamic programming, and are guaranteed
to converge and compute the correct marginal distributions
at each node. However, the updates are routinely applied to
more general graphs with cycles, which is the application
of interest in this paper. Here we describe the more general
family of reweighted sum-product algorithms, which include
the ordinary sum-product updates as a particular case.
In any sum-product algorithm, one message is passed in each direction of every edge (s, t) ∈ E. The message from node t to node s, denoted by M_{ts}(·), is a function of the possible states x_s ∈ X_s at node s. Consequently, in the discrete case, the message can be represented by an m-vector of possible function values. The family of reweighted sum-product algorithms is parameterized by a set of edge weights, with ρ_{st} ∈ (0, 1] associated with edge (s, t). Various choices of these edge weights have been proposed [16], [18], [19], and have different theoretical properties. The simplest case of all, namely setting ρ_{st} = 1 for all edges, recovers the ordinary sum-product algorithm. Given some fixed set of edge weights {ρ_{st}}, the reweighted sum-product updates are given by the recursion

    M_{ts}(x_s) ← κ Σ_{x′_t∈X_t} [ψ_{st}(x_s, x′_t)]^{1/ρ_{st}} ψ_t(x′_t) · ( Π_{v∈Γ(t)∖s} [M_{vt}(x′_t)]^{ρ_{vt}} ) / [M_{st}(x′_t)]^{1−ρ_{st}}     (2)

where κ denotes a positive normalization constant, and Γ(t) := {v ∈ V | (t, v) ∈ E} denotes the neighbors of node t in the graph. Typically, the message vector M_{ts} is normalized to unity after each iteration (i.e., Σ_{x_s} M_{ts}(x_s) = 1). Once the updates converge to some message fixed point M*, then the fixed point can be used to compute (approximate) marginal probabilities at each node s ∈ V via

    τ_s(x_s) ∝ ψ_s(x_s) Π_{v∈Γ(s)} [M*_{vs}(x_s)]^{ρ_{vs}}.

When the ordinary updates (ρ_{st} = 1) are applied to a tree-structured graph, it can be shown by induction that the algorithm converges after a finite number of steps. Moreover, a calculation using Bayes' rule shows that τ_s is equal to the desired marginal probability p(x_s). However, the sum-product algorithm is routinely applied to graphs with cycles, in which case the message updates (2) are not guaranteed to converge, and the quantities τ_s represent approximations to the true marginal distributions. Our focus in this paper is to determine conditions under which the reweighted sum-product message updates (2) are guaranteed to converge.
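To make the recursion concrete, here is a minimal NumPy sketch of the update (2) and the pseudomarginal computation; the dictionary-based data layout and the name reweighted_sum_product are our own illustrative choices rather than anything specified in the paper.

```python
import numpy as np

def reweighted_sum_product(psi_node, psi_edge, rho, max_iter=200, tol=1e-8):
    """Sketch of the reweighted updates (2).

    psi_node: dict s -> (m,) array of node potentials psi_s
    psi_edge: dict (s, t) -> (m, m) array with entries psi_st[x_s, x_t]
    rho:      dict (s, t) -> edge weight in (0, 1]; rho = 1 recovers ordinary BP
    """
    neighbors = {s: set() for s in psi_node}
    for (s, t) in psi_edge:
        neighbors[s].add(t)
        neighbors[t].add(s)

    def edge_pot(s, t):   # psi_st viewed as a matrix indexed by (x_s, x_t)
        return psi_edge[(s, t)] if (s, t) in psi_edge else psi_edge[(t, s)].T

    def weight(s, t):     # edge weights are symmetric in (s, t)
        return rho[(s, t)] if (s, t) in rho else rho[(t, s)]

    # messages M[(t, s)] from t to s, initialized uniform
    M = {(t, s): np.ones(len(psi_node[s])) / len(psi_node[s])
         for s in neighbors for t in neighbors[s]}

    for _ in range(max_iter):
        delta = 0.0
        for (t, s) in list(M):   # in-place (asynchronous) message sweep
            r = weight(s, t)
            # psi_t(x_t) * prod_{v in Gamma(t)\{s}} M_vt^{rho_vt} / M_st^{1 - rho_st}
            incoming = psi_node[t].copy()
            for v in neighbors[t] - {s}:
                incoming *= M[(v, t)] ** weight(v, t)
            incoming /= M[(s, t)] ** (1.0 - r)
            new = (edge_pot(s, t) ** (1.0 / r)) @ incoming   # sum over x_t
            new /= new.sum()                                 # normalize to unity
            delta = max(delta, np.abs(new - M[(t, s)]).max())
            M[(t, s)] = new
        if delta < tol:
            break

    # pseudomarginals: tau_s(x_s) propto psi_s(x_s) prod_v M_vs(x_s)^{rho_vs}
    tau = {}
    for s in psi_node:
        b = psi_node[s].copy()
        for v in neighbors[s]:
            b *= M[(v, s)] ** weight(v, s)
        tau[s] = b / b.sum()
    return tau, M
```

On a tree this reduces, for rho = 1, to the exact sum-product computation; on graphs with cycles it is precisely the iteration whose convergence the remainder of the paper analyzes.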
III. CONVERGENCE ANALYSIS
In this section, we describe and provide proofs of our main results on the convergence properties of the reweighted sum-product updates (2) when the messages belong to a binary state space, which we represent as X_s = {−1, +1}. In this special case, the general MRF distribution (1) can be simplified into the Ising model form

    p(x) ∝ exp( Σ_{s∈V} θ_s x_s + Σ_{(s,t)∈E} θ_{st} x_s x_t )     (3)

so that the model is parameterized by a single real number θ_s for each node, and a single real number θ_{st} for each edge.
A. Convergence for Binary Homogeneous Models
We begin by stating and proving some convergence conditions for a particularly simple class of models: homogeneous models on d-regular graphs. A graph is d-regular if each vertex has exactly d neighbors. Examples include single cycles (d = 2), and lattice models with toroidal boundary conditions (d = 4). In a homogeneous model, the edge parameters θ_{st} are all equal to a common value θ, and similarly the node parameters θ_s are all equal to a common value γ.

In order to state our convergence result, we first define, for any real numbers x and a, the function

    F(x; a) := tanh(a) (1 − tanh²(x)) / (1 − tanh²(a) tanh²(x)).     (4)

For any fixed a, the mapping x ↦ F(x; a) is symmetric about zero, and achieves its maximum value F(0; a) = tanh(a) at x = 0. Moreover, the function F is bounded in absolute value by tanh(|a|) < 1.
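As a quick numerical sanity check of these three properties (a sketch that assumes the form of F written in (4)):

```python
import numpy as np

def F(x, a):
    # F(x; a) = tanh(a) (1 - tanh^2 x) / (1 - tanh^2(a) tanh^2(x)), as in (4)
    tx, ta = np.tanh(x), np.tanh(a)
    return ta * (1 - tx**2) / (1 - ta**2 * tx**2)

xs = np.linspace(-10, 10, 2001)
for a in [0.25, 1.0, -2.0]:
    vals = F(xs, a)
    assert np.allclose(vals, F(-xs, a))                   # symmetric about zero
    assert np.abs(vals).max() <= np.tanh(abs(a)) + 1e-12  # bounded by tanh(|a|)
    assert np.isclose(F(0.0, a), np.tanh(a))              # extremum at x = 0
```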
Fig. 2. Plots of the contraction coefficient R(ρ; θ, γ) versus the edge strength θ. Each curve corresponds to a different choice of the observation potential γ. (a) For ρ = 1, the updates reduce to the standard sum-product algorithm; note that the transition from convergence to nonconvergence occurs at θ ≈ 0.3466 in the case of no observations (γ = 0). (b) Corresponding plots for reweighted sum-product with ρ = 0.50. Since dρ = (0.50) · 4 = 2, the contraction coefficient is always less than one in this case, as predicted by Proposition 1.

Proposition 1: Consider the reweighted sum-product algorithm with uniform weights ρ_{st} = ρ applied to a homogeneous binary model on a d-regular graph with an arbitrary choice of (θ, γ).
a) The reweighted updates (2) have a unique fixed point and converge as long as R(ρ; θ, γ) < 1, where

    R(ρ; θ, γ) := { (dρ − 1) tanh(|θ|/ρ),                       if |γ| ≤ (dρ − 1)|θ|/ρ
                  { (dρ − 1) F(|γ| − (dρ − 1)|θ|/ρ ; |θ|/ρ),    otherwise     (5)

with F defined in (4).
b) As a consequence, if dρ ≤ 2, the reweighted updates (2) converge for all finite (θ, γ).
c) Conversely, if dρ > 2, then there exist finite edge potentials θ for which the reweighted updates (2) have multiple fixed points.
Remarks: Consider the choice of edge weight ρ = 1, corresponding to the standard sum-product algorithm. If the graph is a single cycle (d = 2), Proposition 1(b) implies that the standard sum-product algorithm always converges, consistent with previous work on the single cycle case [7], [22]. For more general graphs with d > 2, convergence depends on the relation between the observation strength γ and the edge strength θ. For the case d = 4, corresponding for instance to a lattice model with toroidal boundary as in Fig. 1(b), Fig. 2(a) provides a plot of the coefficient R(1; θ, γ) as a function of the edge strength θ, for different choices of the observation potential γ. The curve marked with circles corresponds to γ = 0. Observe that it crosses the threshold R = 1 from convergence to non-convergence at the critical value θ = arctanh(1/3) ≈ 0.3466, corresponding with the classical result due to Bethe [23], and also confirmed in other analyses of standard sum-product [7], [10], [11]. The other curves correspond to increasing non-zero observation potentials γ > 0. Here, it is interesting to note that with γ > 0, Proposition 1 reveals that the standard sum-product algorithm continues to converge well beyond the classical breakdown point of the case without observations.

Fig. 2(b) shows the corresponding curves of R(0.50; θ, γ), corresponding to the reweighted sum-product algorithm with ρ = 0.50. Note that dρ = (0.50) · 4 = 2, so that, as predicted by Proposition 1(b), the contraction coefficient R remains below one for all values of θ and γ, meaning that the reweighted sum-product algorithm with ρ = 0.50 always converges for these graphs.
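The quantities in Proposition 1 are easy to evaluate numerically. The sketch below computes the contraction coefficient of (5) and also iterates the scalar update (6), which is derived in the proof that follows; the helper names are ours.

```python
import numpy as np

def F(x, a):
    tx, ta = np.tanh(x), np.tanh(a)
    return ta * (1 - tx**2) / (1 - ta**2 * tx**2)

def contraction_coeff(rho, theta, gamma, d=4):
    """R(rho; theta, gamma) of (5) for a homogeneous model on a d-regular graph."""
    c = d * rho - 1.0
    B = abs(theta) / rho          # admissible log messages satisfy |nu| <= B
    if abs(gamma) <= c * B:
        return c * np.tanh(B)
    return c * F(abs(gamma) - c * B, B)

def iterate_update(rho, theta, gamma, d=4, n_iter=500, nu0=0.1):
    """Iterate the scalar update (6):
    nu <- atanh(tanh(theta/rho) tanh(gamma + (d rho - 1) nu))."""
    nu = nu0
    for _ in range(n_iter):
        nu = np.arctanh(np.tanh(theta / rho)
                        * np.tanh(gamma + (d * rho - 1.0) * nu))
    return nu

# With rho = 1 and gamma = 0 on the 4-regular torus, R = 3 tanh(theta)
# crosses one exactly at theta = arctanh(1/3) ~ 0.3466.
for theta in (0.2, 0.3466, 0.5):
    print(theta, contraction_coeff(1.0, theta, 0.0))
```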
Proof of Proposition 1: Given the edge and node homogeneity of the model and the d-regularity of the graph, the message-passing updates can be completely characterized by a single log message ν := (1/2) log [M(+1)/M(−1)], and the update

    ν ← f(ν) := arctanh( tanh(θ/ρ) tanh( γ + (dρ − 1) ν ) ).     (6)

a) In order to prove the sufficiency of condition (5), we begin by observing that for any choice of ν, we have |f(ν)| ≤ |θ|/ρ, so that the message ν must belong to the admissible interval [−|θ|/ρ, +|θ|/ρ]. Next we compute and bound the derivative of f over this set of admissible messages. A straightforward calculation yields that

    f′(ν) = (dρ − 1) F( γ + (dρ − 1) ν ; θ/ρ )

where the function F was defined previously in (4). Note that for any fixed a, the function x ↦ F(x; a) achieves its maximum at x = 0. Consequently, the unconstrained maximum of |f′| is achieved at the point ν* satisfying γ + (dρ − 1) ν* = 0, with |f′(ν*)| = (dρ − 1) tanh(|θ|/ρ). Otherwise, if |ν*| > |θ|/ρ, or equivalently if |γ| > (dρ − 1)|θ|/ρ, then the constrained maximum is obtained at the boundary point of the admissible region closest to ν*, namely at the point ν = −sign(γ) |θ|/ρ. Overall, we conclude that for all admissible messages ν, we have

    |f′(ν)| ≤ { (dρ − 1) tanh(|θ|/ρ),                       if |γ| ≤ (dρ − 1)|θ|/ρ
              { (dρ − 1) F(|γ| − (dρ − 1)|θ|/ρ ; |θ|/ρ),    otherwise

so that |f′(ν)| ≤ R(ρ; θ, γ) as defined in the statement. Note that if R(ρ; θ, γ) < 1, the update is an iterated contraction, and hence converges [24].
b) Since tanh(|θ|/ρ) < 1 for all finite θ, the condition dρ ≤ 2, or equivalently dρ − 1 ≤ 1, implies that R(ρ; θ, γ) < 1, showing that the updates are strictly contractive, which implies convergence and uniqueness of the fixed point [24].
c) In our discussion below (following the statement of Theorem 2), we establish that the condition (5) is actually necessary for the special case of zero observations (γ = 0). Given dρ > 2 and γ = 0, we can always find a finite θ such that the condition (5) is violated, so that the algorithm does not converge.
B. Extension to Binary Inhomogeneous Models
We now turn to the generalization of the previous result to the case of inhomogeneous models, in which the node parameters θ_s and edge parameters θ_{st} may differ across nodes and edges, respectively. For each directed edge (t → s), define the quantity

    D_{ts} := |θ_t| − [ Σ_{v∈Γ(t)∖s} |θ_{vt}| + (1 − ρ_{st}) |θ_{st}|/ρ_{st} ]     (7)

and the weight

    w_{ts} := { tanh(|θ_{st}|/ρ_{st}),        if D_{ts} ≤ 0
              { F(D_{ts} ; |θ_{st}|/ρ_{st}),  otherwise     (8)

where the function F was previously defined in (4). Finally, define a 2|E| × 2|E| matrix M, with entries indexed by directed edges (t → s), (v → u), and of the form

    M_{(t→s),(v→u)} := { ρ_{vt} w_{ts},         if u = t and v ∈ Γ(t)∖{s}
                       { (1 − ρ_{st}) w_{ts},   if u = t and v = s
                       { 0,                     otherwise.     (9)
Theorem 2: For an arbitrary pairwise Markov random field over binary variables, if the spectral radius of the matrix M is less than one, then the reweighted sum-product algorithm converges, and the associated fixed point is unique.
When specialized to the case of uniform edge weights ρ_{st} = 1, Theorem 2 strengthens previous results, due independently to Ihler et al. [10] and Mooij and Kappen [11], on the ordinary sum-product algorithm. This earlier work provided conditions based on matrices that involved only terms of the form tanh(|θ_{st}|), as opposed to the smaller, observation-dependent weights w_{ts} = F(D_{ts}; |θ_{st}|) that our analysis yields once D_{ts} > 0. As a consequence, Theorem 2 can yield sharper estimates of convergence by incorporating the benefits of having observations. For the ordinary sum-product algorithm and binary Markov random fields, independent work by Mooij and Kappen [20] has also made similar refinements of earlier results. In addition to these consequences for the ordinary sum-product algorithm, our Theorem 2 also provides sufficient conditions for convergence of reweighted sum-product algorithms.
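The condition of Theorem 2 can be checked directly from the model parameters. The following sketch assembles the matrix M from the definitions (7)–(9) as given above and evaluates its spectral radius; the name contraction_matrix and the dictionary-based layout are our own choices.

```python
import numpy as np

def F(x, a):
    tx, ta = np.tanh(x), np.tanh(a)
    return ta * (1 - tx**2) / (1 - ta**2 * tx**2)

def contraction_matrix(theta_node, theta_edge, rho):
    """Matrix M of (9) for a binary pairwise MRF.

    theta_node: dict s -> theta_s; theta_edge: dict (s, t) -> theta_st
    rho:        dict (s, t) -> edge weight in (0, 1]
    """
    nbrs = {s: set() for s in theta_node}
    for (s, t) in theta_edge:
        nbrs[s].add(t)
        nbrs[t].add(s)
    th = lambda s, t: theta_edge[(s, t)] if (s, t) in theta_edge else theta_edge[(t, s)]
    rh = lambda s, t: rho[(s, t)] if (s, t) in rho else rho[(t, s)]

    dir_edges = [(t, s) for t in nbrs for s in nbrs[t]]   # directed edges (t -> s)
    idx = {e: i for i, e in enumerate(dir_edges)}
    M = np.zeros((len(dir_edges), len(dir_edges)))
    for (t, s) in dir_edges:
        B = abs(th(s, t)) / rh(s, t)
        # residual field D_ts of (7) and weight w_ts of (8)
        D = abs(theta_node[t]) - (sum(abs(th(v, t)) for v in nbrs[t] - {s})
                                  + (1.0 - rh(s, t)) * B)
        w = np.tanh(B) if D <= 0 else F(D, B)
        for v in nbrs[t] - {s}:
            M[idx[(t, s)], idx[(v, t)]] = rh(v, t) * w
        M[idx[(t, s)], idx[(s, t)]] = (1.0 - rh(s, t)) * w
    return M

# Theorem 2: convergence holds if max(abs(np.linalg.eigvals(M))) < 1.
```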
In order to illustrate the benefits of including observations in the convergence analysis, we conducted experiments on grid-structured graphical models in which a binary random vector, with a prior distribution of the form (3), is observed in Gaussian noise (see Section V-A for the complete details of the experimental setup). Fig. 3 provides summary illustrations of our findings, for the ordinary sum-product algorithm (ρ = 1) in panel (a), and reweighted sum-product (ρ = 0.50) in panel (b). Each plot shows the contraction coefficient predicted by Theorem 2 as a function of an edge strength parameter. Different curves show the effect of varying the noise variance σ² specifying the signal-to-noise ratio in the observation model (see Section V for the complete details). The extreme case σ² = +∞ corresponds to the case of no observations. Notice how the contraction coefficient steadily decreases as the observations become more informative, both for the ordinary and reweighted sum-product algorithms.
From past work, one sufficient (but not necessary) condition for uniqueness of the reweighted sum-product fixed point is convexity of the associated free energy [9], [16]. The contraction condition of Theorem 2 is also sufficient (but not necessary) to guarantee uniqueness of the fixed point. In general, these two sufficient conditions are incomparable, in that neither implies (nor is implied by) the other. For instance, Fig. 3(b) shows that the condition from Theorem 2 is in some cases weaker than the convexity argument. In this example, with the edge weights ρ_{st} = 0.50, the associated free energy is known to be convex [9], [16], so that the fixed point is always unique. However, the contraction bound from Theorem 2 guarantees uniqueness only up to some edge strength. On the other hand, the advantage of the contraction coefficient approach is illustrated by Figs. 2(a) and 3(a): for the grid graph, the Bethe free energy is not convex for any edge strength, yet the contraction coefficient still yields uniqueness for suitably small couplings. An intriguing possibility raised by Heskes [9] is whether uniqueness of the fixed point implies convergence of the updates. It turns out that this implication does not hold for the reweighted sum-product algorithm: in particular, there are weights ρ for which the optimization problem is strictly convex (and so has a unique global optimum), but the reweighted sum-product updates do not converge.
Fig. 3. Illustration of the benefits of observations. Plots of the contraction coefficient versus the edge strength. Each curve corresponds to a different setting of the noise variance σ², as indicated. (a) Ordinary sum-product algorithm (ρ = 1). The upper-most curve, labeled σ² = +∞, corresponds to bounds taken from previous work [7], [10], [11]. (b) Corresponding curves for the reweighted sum-product algorithm (ρ = 0.50).

A related question raised by a referee concerns the tightness of the sufficient conditions in Proposition 1 and Theorem 2: that is, to what extent are they also necessary conditions? Focusing on the conditions of Proposition 1, we can analyze this issue by studying the function f defined in the message update (6), and determining for what pairs (θ, γ) it has multiple fixed points. For clarity, we state the following.

Corollary 3: Regarding uniqueness of fixed points, the conditions of Proposition 1 are necessary and sufficient for γ = 0, but only sufficient for general γ ≠ 0.
Proof: For θ > 0 (or respectively θ < 0), the function f has (for any γ) the following general properties: it is monotonically increasing (decreasing) in ν, and converges to ±|θ|/ρ as ν → ±∞, as shown in the curves marked with circles in Fig. 4. Its fixed point structure is most easily visualized by plotting the difference function f(ν) − ν, as shown by the curves marked with x-symbols. Note that the x-curve in panel (a) crosses the horizontal zero line three times, corresponding to three fixed points, whereas the curve in panel (b) crosses only once, showing that the fixed point is unique. In contrast, the sufficient conditions of Proposition 1 and Theorem 2 are based on whether the magnitude of the derivative |f′| exceeds one, or equivalently whether the derivative of the difference function f(ν) − ν ever becomes zero. Since the x-curves in both panels (a) and (b) have flat portions, we see that the sufficient conditions do not hold in either case. In particular, panel (b) is an instance where the fixed point is unique, yet the conditions of Proposition 1 fail to hold, showing that they are not necessary in general. Whereas panel (b) involves γ > 0, panel (a) has zero observation potentials (γ = 0). In this case, the conditions of Proposition 1 are actually necessary and sufficient. To establish this claim rigorously, note that for γ = 0, the point ν = 0 is always a fixed point, and the maximum of f′ is achieved at ν = 0. If f′(0) = (dρ − 1) tanh(|θ|/ρ) > 1 (so that the condition in Proposition 1 is violated), then continuity implies that f(ν) > ν for all ν ∈ (0, ε), for some suitably small ε > 0. But since f(ν) ≤ |θ|/ρ < +∞, the curve must eventually cross the identity line again, implying that there is a second fixed point ν* > 0. By symmetry, there must also exist a third fixed point −ν*, as illustrated in Fig. 4(a). Therefore, the condition of Proposition 1 is actually necessary and sufficient for the case γ = 0.
C. Proof of Theorem 2

We begin by establishing a useful auxiliary result that plays a key role in this proof, as well as in other proofs in the sequel.

Lemma 4: For real numbers a and b, define the function

    G(x; a, b) := (1/2) log [ cosh(a + b + x) / cosh(a − b − x) ].     (10)

For each fixed b, we have sup_{x∈ℝ} |G(x; a, b)| = |a|.

Proof: Computing the derivative of G with respect to x, we have

    G′(x; a, b) = (1/2) [ tanh(a + b + x) + tanh(a − b − x) ] = F(b + x; a)

where the function F was previously defined in (4). Therefore, the function G is strictly increasing if a > 0 and strictly decreasing if a < 0. Consequently, the supremum is obtained by taking x → ±∞, and is equal to |a| as claimed.

With this lemma in hand, we begin by rewriting the message update (2) in a form more amenable to analysis. For each directed edge (t → s), define the log message ratio ν_{ts} := (1/2) log [M_{ts}(+1)/M_{ts}(−1)]. From the standard form of the updates, a few lines of algebra show that it is equivalent to update these log ratios via

    ν^{n+1}_{ts} = J_{ts}(ν^n)     (11)

where

    J_{ts}(ν) := (1/2) log [ cosh( θ_{st}/ρ_{st} + θ_t + h_{ts}(ν) ) / cosh( θ_{st}/ρ_{st} − θ_t − h_{ts}(ν) ) ],
    h_{ts}(ν) := Σ_{v∈Γ(t)∖s} ρ_{vt} ν_{vt} − (1 − ρ_{st}) ν_{st}.     (12)
Fig. 4. Illustration of the conditions in Proposition 1. (a) For zero observation potentials γ = 0, the conditions are necessary and sufficient: here there are three fixed points, and the condition of Proposition 1 is not satisfied. (b) For nonzero observation potentials γ > 0, Proposition 1 is sufficient but not necessary: in this case, the fixed point is still unique, yet the conditions of Proposition 1 are not satisfied.
A key property of the message update function J_{ts} is that it can be written as a function G(·; a, b) of the form (10), with a = θ_{st}/ρ_{st} and b = θ_t. Consequently, if we apply Lemma 4, we may conclude that |J_{ts}(ν)| ≤ |θ_{st}|/ρ_{st} for all ν, and consequently that |ν^n_{ts}| ≤ |θ_{st}|/ρ_{st} for all iterations n ≥ 1. Consequently, we may assume that the message vector ν belongs to the box of admissible messages defined by

    𝒞 := { ν ∈ ℝ^{2|E|} | |ν_{ts}| ≤ |θ_{st}|/ρ_{st} for all directed edges (t → s) }.     (13)

We now bound the derivative of the message-update equation over this set of admissible messages.

Lemma 5: For all ν ∈ 𝒞, the elements of the gradient ∇J_{ts}(ν) are bounded as

    | ∂J_{ts}/∂ν_{vt} (ν) | ≤ ρ_{vt} w_{ts} for all v ∈ Γ(t)∖{s},  and  | ∂J_{ts}/∂ν_{st} (ν) | ≤ (1 − ρ_{st}) w_{ts}

where the directed weights w_{ts} were defined previously in (8). All other gradient elements are zero.

See the Appendix for the proof. In order to exploit Lemma 5, for any iteration n ≥ 1, let us use the mean-value theorem to write

    ν^{n+1}_{ts} − ν^n_{ts} = J_{ts}(ν^n) − J_{ts}(ν^{n−1}) = [ ∇J_{ts}( λ ν^n + (1 − λ) ν^{n−1} ) ]ᵀ ( ν^n − ν^{n−1} )     (14)

where λ ∈ (0, 1). Since ν^n and ν^{n−1} both belong to the convex set 𝒞, so does the convex combination λ ν^n + (1 − λ) ν^{n−1}, and we can apply Lemma 5. Starting from (14), we have

    | ν^{n+1}_{ts} − ν^n_{ts} | ≤ Σ_{v∈Γ(t)∖s} ρ_{vt} w_{ts} | ν^n_{vt} − ν^{n−1}_{vt} | + (1 − ρ_{st}) w_{ts} | ν^n_{st} − ν^{n−1}_{st} |.     (15)

Since this bound holds for each directed edge, we have established that the vector of message differences Δ^n := | ν^{n+1} − ν^n | obeys the elementwise inequality Δ^n ⪯ M Δ^{n−1}, where the non-negative matrix M was defined previously in (9). By results on non-negative matrix recursions [25], if the spectral radius of M is less than 1, then the sequence {Δ^n} converges to zero. Thus, the sequence {ν^n} is a Cauchy sequence, and so must converge.
D. Explicit Conditions for Convergence

A drawback of Theorem 2 is that it requires computing the spectral radius of the 2|E| × 2|E| matrix M, which can be a nontrivial computation for large problems. Accordingly, we now specify some corollaries that provide conditions sufficient to ensure convergence of the reweighted sum-product algorithm. As in the work of Mooij and Kappen [11], the first two conditions follow by upper bounding the spectral radius by standard matrix norms. Conditions (c) and (d) are refinements that require further work.

Corollary 6: Convergence of reweighted sum-product is guaranteed by any of the following conditions.
a) Row sum condition:

    max_{(t→s)} w_{ts} [ Σ_{v∈Γ(t)∖s} ρ_{vt} + (1 − ρ_{st}) ] < 1.     (16)

b) Column sum condition:

    max_{(u→t)} [ ρ_{ut} Σ_{s∈Γ(t)∖u} w_{ts} + (1 − ρ_{ut}) w_{tu} ] < 1.     (17)

c) Reweighted norm condition:

    max_{(t→s)} [ Σ_{v∈Γ(t)∖s} ρ_{vt} w_{vt} + (1 − ρ_{st}) w_{st} ] < 1.     (18)

d) Pairwise neighborhood condition: the quantity

    max_{(t→s), (v→t) : v∈Γ(t)} [ C_{(t→s)}(M) ]^α [ C_{(v→t)}(M) ]^{1−α}     (19)

is less than one for some α ∈ [0, 1], where C_e(M) denotes the column sum of M indexed by the directed edge e.

Remarks: To put these results in perspective, if we specialize to ρ_{st} = 1 for all edges and use the weaker version of the weights, namely tanh(|θ_{st}|), that ignores the effects of observations, then the ℓ∞-norm condition (17) is equivalent to earlier results on the ordinary sum-product algorithm [10], [11]. In addition, one may observe that for the ordinary sum-product algorithm (where ρ_{st} = 1 for all edges), condition (18) is equivalent to the ℓ∞-condition (17). However, for the general reweighted algorithm, these two conditions are distinct.
Proof: Conditions (a) and (b) follow immediately from the fact that the spectral radius of M is upper bounded by any induced matrix norm [25], in particular by the maximum row sum and the maximum column sum. It remains to prove conditions (c) and (d) in the corollary statement.

(c) Defining the vector Δ^n := | ν^{n+1} − ν^n | of successive changes, from (15) we have

    Δ^n_{ts} ≤ w_{ts} [ Σ_{v∈Γ(t)∖s} ρ_{vt} Δ^{n−1}_{vt} + (1 − ρ_{st}) Δ^{n−1}_{st} ].     (20)

The previous step of the updates yields a similar equation, namely

    Δ^{n−1}_{vt} ≤ w_{vt} [ Σ_{u∈Γ(v)∖t} ρ_{uv} Δ^{n−2}_{uv} + (1 − ρ_{vt}) Δ^{n−2}_{vt} ].     (21)

Now let us define ‖Δ‖_w := max_{(t→s)} Δ_{ts}/w_{ts}, a norm on the vector of successive message changes. With this notation, we have Δ^{n−1}_{vt} ≤ w_{vt} ‖Δ^{n−1}‖_w for every directed edge (v → t). Substituting this bound into (20) and summing the weighted terms over all neighbors of t yields that Δ^n_{ts} is upper bounded by

    w_{ts} [ Σ_{v∈Γ(t)∖s} ρ_{vt} w_{vt} + (1 − ρ_{st}) w_{st} ] ‖Δ^{n−1}‖_w.

Finally, since the edge (t → s) was arbitrary, we can maximize over it, which proves that ‖Δ^n‖_w ≤ R_w ‖Δ^{n−1}‖_w, where R_w denotes the left-hand side of condition (18). Therefore, if R_w < 1, the updates are an iterated contraction in the ‖·‖_w norm, and hence converge by standard contraction results [24].

(d) Given a non-negative matrix A, let C_k(A) denote the column sum indexed by some element k. In general, it is known [25] that for any α ∈ [0, 1], the spectral radius of A is upper bounded by the quantity max_{k,l} C_k(A)^α C_l(A)^{1−α}, where k and l range over all column indices. A more refined result due to Kolotilina [26] asserts that if A is a sparse matrix, then one need only optimize over column index pairs (k, l) such that A_{kl} ≠ 0. For our problem, the matrix M is indexed by directed edges, and the entry M_{(t→s),(v→u)} is non-zero only if u = t and v ∈ Γ(t). Consequently, we can reduce the maximization over column sums to maximizing over directed edge pairs (t → s), (v → t) with v ∈ Γ(t), which yields the stated claim.
IV. CONVERGENCE FOR GENERAL DISCRETE MODELS

In this section, we describe how our results generalize to multinomial random variables, with the variable X_s at each node taking a total of m states in the space X_s = {0, 1, ..., m − 1}. Given our Markov assumptions, the distribution takes the factorized form

    p(x; θ) ∝ exp( Σ_{s∈V} θ_s(x_s) + Σ_{(s,t)∈E} θ_{st}(x_s, x_t) )     (22)

where each θ_s := {θ_s(j), j ∈ X_s} is a vector of m numbers, and each θ_{st} := {θ_{st}(j, k)} is an m × m matrix of numbers.

In our analysis, it will be convenient to work with an alternative parameter vector θ̄ that represents the same Markov random field as θ, given by

    θ̄_s(x_s) := θ_s(x_s) − θ_s(0) + Σ_{t∈Γ(s)} [ θ_{st}(x_s, 0) − θ_{st}(0, 0) ]
    θ̄_{st}(x_s, x_t) := θ_{st}(x_s, x_t) − θ_{st}(x_s, 0) − θ_{st}(0, x_t) + θ_{st}(0, 0).

This set of functions is a different parameterization of the distribution p(x; θ); indeed, it can be verified that the exponents defined by θ̄ and by θ in (22) differ only by a constant κ that is independent of x, so that p(x; θ̄) = p(x; θ). Moreover, note that θ̄_s(0) = 0 for all nodes s ∈ V, and θ̄_{st}(x_s, 0) = θ̄_{st}(0, x_t) = 0.

A. Convergence Result and Some Consequences

In order to state a result about convergence for multinomial Markov random fields, we require some preliminary definitions. For each directed edge (t → s) and states (j, k), define functions α_{ts;jk} and β_{ts;jk} of the vector ν of log messages

    (23a)
    (23b)

where

    (24a)
    (24b)

With these definitions, define for each directed edge (t → s) the non-negative weight

    (25)

where the function F was defined previously in (4), and the box of admissible vectors over which the supremum in (25) is taken is given by

    𝒞 := { ν | ν_{ts}(j) ∈ [ (1/ρ_{st}) min_{k∈X_t} θ̄_{st}(j, k), (1/ρ_{st}) max_{k∈X_t} θ̄_{st}(j, k) ] for all (t → s) and j }.     (26)

Finally, using the choice of weights in (25), we define the matrix M as before (see (9)).

Theorem 7: If the spectral radius of M is less than one, then the reweighted sum-product algorithm converges, and the associated fixed point is unique.

We provide the proof of Theorem 7 in the Appendix. Despite its notational complexity, Theorem 7 is simply a natural generalization of our earlier results for binary variables. When m = 2, note that the functions α_{ts;jk} and β_{ts;jk} are identically zero (since there are no states other than j and 0), so that the form of the weights (25) simplifies substantially. Moreover, as in our earlier development on the binary case, when specialized to ρ_{st} = 1, Theorem 7 provides a strengthening of previous results [10], [11]. In particular, we now show how these previous results can be recovered from Theorem 7 by ignoring the box constraints (26):

Corollary 8: The reweighted sum-product algorithm converges if the spectral radius of the matrix M̄, defined as in (9) but with the weights w̄_{ts}, is less than one     (27)

where the weight w̄_{ts} is given by

    w̄_{ts} := tanh( (1/(4ρ_{st})) max_{j,k,j′,k′} | θ̄_{st}(j, k) − θ̄_{st}(j, k′) − θ̄_{st}(j′, k) + θ̄_{st}(j′, k′) | ).

Proof: We begin by proving that w_{ts} ≤ w̄_{ts}. First of all, if we ignore the box constraints (26), then certainly the supremum defining w_{ts} in (25) can only increase, since for any fixed a, the function x ↦ F(x; a) is maximized at x = 0, with F(0; a) = tanh(a). Due to the monotonicity of tanh(a) in a, it now suffices to maximize the absolute value of the argument a over the unconstrained message vectors. Since this argument is defined in terms of α_{ts;jk} and β_{ts;jk}, we first bound the relevant difference of these functions. In one direction, we have

    (28)

In the other direction, we have

    (29)

Combining (28) and (29), we conclude that the difference is upper bounded by

    (30)

Therefore, we have proved that w_{ts} ≤ w̄_{ts}, where w̄_{ts} was defined in the corollary statement. Consequently, if we define a matrix M̄ as in (9) using the weights w̄_{ts}, we have M ⪯ M̄ in an elementwise sense, and therefore the spectral radius of M̄ is an upper bound on the spectral radius of M (see Bertsekas and Tsitsiklis [25]).

A challenge associated with verifying the sufficient condition in Theorem 7 is that, in principle, it requires solving a set of 2|E| optimization problems, each involving the (m − 1)-dimensional vector ν_{ts} restricted to a box. As pointed out by one of the referees, in the absence of additional structure (e.g., convexity), one might think of obtaining the global maximum by discretizing the (m − 1)-cube, albeit with prohibitive complexity (in particular, growing as (1/δ)^{m−1}, where δ is the discretization accuracy). However, there are many natural weakenings of the bound in Theorem 7, still stronger than Corollary 8, that can be computed efficiently. Below we state one such result.
Fig. 5. Illustration of the benefits of observations. Plots of the contraction coefficient from Corollary 9 versus the edge strength for a 3-state Potts model. Each curve corresponds to a different setting of the SNR parameter γ. (a) Ordinary sum-product algorithm (ρ = 1). The upper-most curve, labeled γ = 0, corresponds to the best bounds from previous work [10], [11], [20]. (b) Reweighted sum-product algorithm (ρ = 0.50).
Corollary 9: Define the edge weights

    (31)

where the weights w̄_{ts} are defined in Corollary 8. Then the reweighted sum-product algorithm converges if the spectral radius of the matrix defined as in (9) with these weights is less than one.

Proof: The proof of this result is analogous to that of Corollary 8: we weaken the bound from Theorem 7 by optimizing separately over its arguments, which, in turn, is upper bounded by the quantity in (31), where we have used the fact that F(x; a) is increasing in a (for any fixed x), together with the bound (28).

The advantage of Corollary 9 is that it is relatively straightforward to compute the weights (31): in particular, for a fixed pair (j, k), we need to solve the maximization problem over ν in (31). Since F achieves its unconstrained maximum at zero, and decreases as its argument moves away from zero, it is equivalent to minimize the absolute value of the argument over the admissible box. In order to do so, it suffices, by the definition of the argument, to minimize and maximize the underlying function, which takes the form of a constant minus the sum α_{ts;jk} + β_{ts;jk}, over the admissible box, and then take the minimum of the absolute values of these two quantities. Since both α_{ts;jk} and β_{ts;jk} are convex and differentiable functions of ν, this underlying function is concave and differentiable, so that the maximum can be computed with standard gradient methods. On the other hand, the minimum of a concave function over a convex set is always achieved at an extreme point [27], which in this case are simply the vertices of the admissible cube. Therefore, each weight can be computed in time that is at worst O(2^{m−1}), corresponding to the number of vertices of the (m − 1)-cube.
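The vertex-enumeration step in this argument is straightforward to implement. The generic sketch below (function name ours) minimizes a concave function over a box by exhaustively checking its 2^d vertices, which is exactly the O(2^{m−1}) computation described above for d = m − 1.

```python
import itertools
import numpy as np

def min_over_box(g, lower, upper):
    """Minimize a concave g over the box [lower, upper]: by [27], the minimum
    is attained at an extreme point, i.e., at one of the 2^d vertices."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    best = np.inf
    for bits in itertools.product((0, 1), repeat=len(lower)):
        vertex = np.where(np.array(bits) == 1, upper, lower)
        best = min(best, g(vertex))
    return best

# Example: for the concave g(v) = -sum(v^2) on [0, 1]^3, the minimum -3
# is attained at the vertex (1, 1, 1).
print(min_over_box(lambda v: -np.sum(v**2), np.zeros(3), np.ones(3)))
```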
If the graphical model has very weak observations (i.e., θ̄_s(j) ≈ 0 uniformly for all states j), then the observation-dependent conditions provided in Theorem 7 and Corollary 9 would be expected to provide little to no benefit over Corollary 8. However, as with the earlier results on binary models (see Fig. 3), the benefits can be substantial when the model has stronger observations, as would be common in applications. To be concrete, let us consider a particular type of multi-state discrete model, known as the Potts model, in which the edge potentials are of the form

    θ_{st}(j, k) := { θ,   if j = k
                    { 0,   otherwise

for some edge strength parameter θ. We then suppose that the node potential vector at each node takes the form θ_s = (γ, 0, ..., 0), for a signal-to-noise parameter γ to be chosen. Fig. 5 illustrates the resulting contraction coefficients predicted by Corollary 9, for both the ordinary sum-product updates (ρ = 1) in panel (a), and the reweighted sum-product algorithm with ρ = 0.50 in panel (b). As would be expected, the results are qualitatively similar to those from the binary state models in Fig. 3.
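For illustration, here is a toy instance of this Potts construction wired up for the reweighted_sum_product sketch from Section II-B; the graph size and parameter values are arbitrary choices for the example, not the paper's experimental settings.

```python
import numpy as np

m, theta, gamma = 3, 0.7, 0.5          # states, edge strength, SNR parameter
nodes = range(6)                        # a 6-cycle as a toy graph
edges = [(s, (s + 1) % 6) for s in nodes]

# Potts edge potentials: psi_st(j, k) = exp(theta) if j == k, else 1
psi_edge = {e: np.exp(theta * np.eye(m)) for e in edges}
# node potentials theta_s = (gamma, 0, ..., 0), i.e., psi_s = exp(theta_s)
psi_node = {s: np.exp(gamma * (np.arange(m) == 0)) for s in nodes}
rho = {e: 0.5 for e in edges}           # uniform reweighting, rho = 0.50

tau, M = reweighted_sum_product(psi_node, psi_edge, rho)
```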
V. EXPERIMENTAL RESULTS
In this section, we present the results of some additional simulations to illustrate and support our theoretical findings.
Fig. 6. Empirical rates of convergence of the reweighted sum-product algorithm as compared to the rates predicted by the symmetric and asymmetric bounds from Theorem 2.
A. Dependence on Signal-to-Noise Ratio
We begin by describing the experimental setup used to generate the plots in Fig. 3, which illustrate the effect of increased signal-to-noise ratio (SNR) on convergence bounds. In these simulations, the random vector X ∈ {−1, +1}^n is posited to have a prior distribution of the form (3), with the edge parameters θ_{st} set uniformly to some fixed number θ, and symmetric node potentials θ_s = 0. Now suppose that we make a noisy observation of the random vector X, say of the form Y = X + W, where W ∼ N(0, σ² I_n), so that we have a conditional distribution of the form p(y | x) ∝ exp( −‖y − x‖² / (2σ²) ). We then examined the convergence behavior of both ordinary and reweighted sum-product algorithms for the posterior distribution p(x | y) ∝ p(x) p(y | x).
The results in Fig. 3 were obtained from a two-dimensional grid, by varying the observation noise variance σ² from σ² = +∞, corresponding to no observations, down to low noise levels. For any fixed setting of σ², each curve plots the average of the spectral radius bound from Theorem 2 over 20 trials versus the edge strength parameter θ. Note how the convergence guarantees are substantially strengthened, relative to the case of zero SNR, as the benefits of observations are incorporated.
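Under this observation model, the likelihood folds into the node parameters of the Ising posterior in closed form: with x_s ∈ {−1, +1}, the x_s-dependent part of log p(y_s | x_s) is y_s x_s / σ², so θ_s = y_s / σ². The short sketch below records this computation; the helper name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_node_potentials(x_true, sigma2):
    """Observe y = x + w with w ~ N(0, sigma2 I) and return the node
    parameters theta_s = y_s / sigma2 of the posterior p(x | y)."""
    y = x_true + np.sqrt(sigma2) * rng.standard_normal(len(x_true))
    return y / sigma2

x_true = rng.choice([-1.0, 1.0], size=64)   # e.g., an 8 x 8 grid, flattened
theta_node = posterior_node_potentials(x_true, sigma2=1.0)
```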
B. Convergence Rates
We now turn to a comparison of the empirical convergence rates of the reweighted sum-product algorithm with the theoretical upper bounds provided by the inductive norm (18) in Corollary 6. We have performed a large number of simulations for different values of the number of nodes, edge weights ρ, node potentials, edge potentials, and message space sizes. Fig. 6 compares the convergence rates predicted by Theorem 2 to the empirical rates in both the symmetric and asymmetric settings. The symmetric case corresponds to computing the weights w_{ts} while ignoring the observation potentials, so that the overall matrix M is symmetric in the edges (i.e., w_{ts} = w_{st}). The asymmetric case explicitly incorporates the observation potentials, and leads to bounds that are as good as or better than those of the symmetric case. Fig. 6 illustrates the benefits of including observations in the convergence analysis. Perhaps most strikingly, panel (b) shows a case where the symmetric bound predicts divergence of the algorithm, whereas the asymmetric bound predicts convergence.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have studied the convergence and stability
properties of the family of reweighted sum-product algorithms
in general pairwise Markov random fields. For homogeneous
models as well as more general inhomogeneous models, we derived a set of sufficient conditions that ensure convergence, and
provided upper bounds on the (geometric) convergence rates. We
demonstrated that these sufficient conditions are necessary for
homogeneous models without observations, but not in general.
We also provided simulation results to complement the theoretical results presented.
There are a number of open questions associated with the
results presented here. Even though we have established the
benefits of including observation potentials, the conditions provided are still somewhat conservative, since they require that
the message updates be contractive at every update, as opposed
to requiring that they be contractive in some suitably averaged sense, e.g., when averaged over multiple iterations, or over different updating sequences. An interesting direction would be to
derive sharper “average-case” conditions for message-passing
convergence. Another open direction concerns analysis of the
effect of damped updates, which (from empirical results) can
improve convergence behavior, especially for reweighted sumproduct algorithms.
APPENDIX

Proof of Lemma 5: Setting

    u_{ts}(ν) := θ_t + Σ_{v∈Γ(t)∖s} ρ_{vt} ν_{vt} − (1 − ρ_{st}) ν_{st}

we compute via the chain rule

    ∂J_{ts}/∂ν_{vt} = ρ_{vt} (∂J_{ts}/∂u_{ts})  for v ∈ Γ(t)∖{s},  and  ∂J_{ts}/∂ν_{st} = −(1 − ρ_{st}) (∂J_{ts}/∂u_{ts})

so that it suffices to upper bound |∂J_{ts}/∂u_{ts}|. Computing this partial derivative from the message update (11) yields that ∂J_{ts}/∂u_{ts} is equal to

    (1/2) [ tanh( θ_{st}/ρ_{st} + u_{ts} ) + tanh( θ_{st}/ρ_{st} − u_{ts} ) ]

which we recognize as F(u_{ts}; θ_{st}/ρ_{st}), where the function F was previously defined in (4). Since the message vector ν must belong to the box (13) of admissible messages, the quantity u_{ts} must satisfy the bound

    |u_{ts}| ≥ |θ_t| − [ Σ_{v∈Γ(t)∖s} |θ_{vt}| + (1 − ρ_{st}) |θ_{st}|/ρ_{st} ] = D_{ts}.

For any fixed a, the function x ↦ F(x; a) achieves its maximal value tanh(a) at x = 0, and decreases as |x| grows. Noting that by its definition (7) the quantity D_{ts} lower bounds |u_{ts}|, we conclude that

    |∂J_{ts}/∂u_{ts}| ≤ { F(D_{ts}; |θ_{st}|/ρ_{st}),    if D_{ts} > 0
                       { tanh(|θ_{st}|/ρ_{st}),         otherwise

where the right-hand sides coincide with the weight w_{ts} previously defined in (8). Combining with the chain rule bounds above completes the proof.
Proof of Theorem 7: We begin by parameterizing the reweighted sum-product messages in terms of the log ratios ν_{ts}(j) := log [ M_{ts}(j) / M_{ts}(0) ]. For each j ∈ X_s ∖ {0}, the message updates (2) can be rewritten, following some straightforward algebra, in terms of these log messages and the modified potentials θ̄ as

    (32)

Analogously to the proof of Lemma 4, each log message ν_{ts}(j) remains bounded for all iterations, so that each vector ν_{ts} must belong to the admissible box (26).

As in our earlier proof, we now seek to bound the partial derivatives of the message update. By the chain rule, it suffices to bound the partial derivative of each component j of the update with respect to each message index k. Computing this partial derivative, isolating the term involving the functions α_{ts;jk} and β_{ts;jk}, and further simplifying yields an expression that we recognize in terms of the function F previously defined in (4). Using the monotonicity properties of F over the admissible box (26), we obtain the bound

    (33)

The claim follows by noting that, as defined, the right-hand side of (33) is bounded by the weights of (25), so that the argument used to prove Theorem 2 applies verbatim.
ACKNOWLEDGMENT
The authors would like to thank the anonymous referees for
helpful comments that helped to improve the manuscript.
REFERENCES
[1] A. S. Willsky, “Multiresolution Markov models for signal and image
processing,” Proc. IEEE, vol. 90, no. 8, pp. 1396–1458, 2002.
[2] H. A. Loeliger, “An introduction to factor graphs,” IEEE Signal
Process. Mag., vol. 21, pp. 28–41, 2004.
[3] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential
families, and variational inference,” Dept. Statistics, Univ. of California, Berkeley, Tech. Rep. 649, Sep. 2003.
[4] F. R. Kschischang, “Codes defined on graphs,” IEEE Commun. Mag., vol. 41, no. 8, pp. 118–125, Aug. 2003.
[5] S. L. Lauritzen and D. J. Spiegelhalter, “Local computations with probabilities on graphical structures and their application to expert systems
(with discussion),” J. Roy. Stat. Soc. B, vol. 50, pp. 155–224, Jan. 1988.
[6] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free energy
approximations and generalized belief propagation algorithms,” IEEE
Trans. Inf. Theory, vol. 51, no. 7, pp. 2282–2312, Jul. 2005.
[7] S. Tatikonda and M. I. Jordan, “Loopy belief propagation and Gibbs
measures,” in Proc. Uncertainty in Artificial Intelligence, Aug. 2002,
vol. 18, pp. 493–500.
[8] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “Tree-based reparameterization framework for analysis of sum-product and related algorithms,” IEEE Trans. Inf. Theory, vol. 49, no. 5, pp. 1120–1146, May
2003.
[9] T. Heskes, “On the uniqueness of loopy belief propagation fixed
points,” Neural Comput., vol. 16, no. 11, pp. 2379–2413, 2004.
[10] A. T. Ihler, J. W. Fisher, III, and A. S. Willsky, “Loopy belief propagation: Convergence and effects of message errors,” J. Mach. Learn.
Res., vol. 6, pp. 905–936, 2005.
[11] J. M. Mooij and H. J. Kappen, “Sufficient conditions for convergence
of loopy belief propagation,” in Proc. Uncertainty in Artificial Intelligence, Jul. 2005, vol. 21, pp. 396–403.
[12] A. Yuille, “CCCP algorithms to minimize the Bethe and Kikuchi
free energies: Convergent alternatives to belief propagation,” Neural
Comput., vol. 14, pp. 1691–1722, 2002.
[13] M. Welling and Y. Teh, “Belief optimization: A stable alternative to
loopy belief propagation,” in Proc. Uncertainty in Artificial Intelligence, Jul. 2001, pp. 554–561.
[14] T. Heskes, K. Albers, and B. Kappen, “Approximate inference and constrained optimization,” in Proc. Uncertainty in Artificial Intelligence,
Jul. 2003, vol. 13, pp. 313–320.
[15] A. Globerson and T. Jaakkola, “Convergent propagation algorithms
via oriented trees,” in Proc. Uncertainty in Artificial Intelligence, Vancouver, Canada, Jul. 2007.
[16] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “A new class of
upper bounds on the log partition function,” IEEE Trans. Inf. Theory,
vol. 51, no. 7, pp. 2313–2335, Jul. 2005.
[17] W. Wiegerinck and T. Heskes, “Fractional belief propagation,” in
NIPS, 2002, vol. 12, pp. 438–445.
[18] W. Wiegerinck, “Approximations with reweighted generalized belief
propagation,” presented at the Workshop Artificial Intelligence and Statistics, Savannah Hotel, Barbados, Jan. 6–8, 2005.
[19] A. Levin and Y. Weiss, “Learning to combine bottom-up and top-down
segmentation,” in Eur. Conf. Computer Vision (ECCV), Jun. 2006.
[20] J. M. Mooij and H. J. Kappen, “Sufficient conditions for convergence of the sum-product algorithm,” Radboud University Nijmegen, Nijmegen, The Netherlands, Tech. Rep. arxiv.org/abs/cs/0504030v2, revised May 2007.
[21] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions,
and the Bayesian restoration of images,” IEEE Trans. Pattern. Anal.
Mach. Intell., vol. 6, no. 6, pp. 721–741, Nov. 1984.
[22] Y. Weiss, “Correctness of local probability propagation in graphical
models with loops,” Neural Comput., vol. 12, pp. 1–41, 2000.
[23] H. A. Bethe, “Statistical theory of superlattices,” Proc. Roy. Soc. London A, vol. 150, no. 871, pp. 552–575, 1935.
[24] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear
Equations in Several Variables, ser. Classics in Applied Mathematics. New York: SIAM, 2000.
[25] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Boston, MA: Athena Scientific, 1997.
[26] L. Y. Kolotilina, “Bounds for the singular values of a matrix involving
its sparsity pattern,” J. Math. Sci., vol. 137, pp. 4794–4800, 2006.
[27] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ. Press, 1970.
Tanya G. Roosta (M’07) received the B.S. degree in
electrical engineering and computer science from the
University of California at Berkeley (UC Berkeley)
in 2000, as well as the M.S. degree in electrical
engineering and computer science, the M.A. degree
in statistics, and the Ph.D. degree in electrical
engineering and computer science also from UC
Berkeley in 2004, 2008, and 2008, respectively.
She is currently a Software Engineer at Cisco Systems. Her research interests include sensor network
security, fault detection, data integrity, reputation systems, sensor correlation modeling, privacy issues associated with the application
of sensors at home and health care, and sensor networks used in critical infrastructure. Her other research interests include robust statistical methods, outlier
detection models, statistical modeling and model validation, and the application
of game theory to sensor network design.
Dr. Roosta received the three-year National Science Foundation fellowship
for her graduate studies.
Martin J. Wainwright (M’03) received the Ph.D.
degree in electrical engineering and computer
science (EECS) from the Massachusetts Institute of
Technology (MIT), Cambridge.
He is currently an Assistant Professor at the
University of California at Berkeley, with a joint
appointment between the Department of Statistics
and the Department of Electrical Engineering
and Computer Sciences. His research interests
include statistical signal processing, coding and
information theory, statistical machine learning, and
high-dimensional statistics.
Dr. Wainwright has been awarded an Alfred P. Sloan Foundation Fellowship,
an NSF CAREER Award, the George M. Sprowls Prize for his dissertation research (EECS Department, MIT), a Natural Sciences and Engineering Research
Council of Canada 1967 Fellowship, and several outstanding conference paper
awards.
Shankar S. Sastry (M’94) received the B.Tech.
degree from the Indian Institute of Technology,
Bombay, in 1977 and the M.S. degree in electrical
engineering and computer science, the M.A. degree
in mathematics, and the Ph.D. degree in electrical
engineering and computer science from the University of California, Berkeley, in 1979, 1980, and
1981, respectively.
Prior to joining the Electrical Engineering and
Computer Science (EECS) faculty in 1983, he
was a Professor at the Massachusetts Institute of
Technology, Cambridge. He was formerly the Director of CITRIS (Center for
Information Technology Research in the Interest of Society) and the Banatao
Institute @ CITRIS Berkeley. He served as Chair of the EECS Department at
the University of California, Berkeley, from January 2001 through June 2004.
In 2000, he served as Director of the Information Technology Office at DARPA.
From 1996 to 1999, he was the Director of the Electronics Research Laboratory
at Berkeley, an organized research unit on the Berkeley campus conducting
research in computer sciences and all aspects of electrical engineering. He
is the NEC Distinguished Professor of Electrical Engineering and Computer
Sciences and holds faculty appointments in the Departments of Bioengineering,
EECS, and Mechanical Engineering. He is currently Dean of the College of
Engineering.