A Tutorial on Graphical Models

Vincent Y. F. Tan ([email protected])

April 18, 2012

Abstract

This set of notes serves the objective of introducing the reader to the basics of probabilistic graphical models. We begin with a brief review of elementary probability theory. We then discuss the various classes of graphical models, namely Bayesian networks and Markov random fields. Following that, we detail the main inference algorithm in graphical modeling: the sum-product (or belief propagation) algorithm. Finally, we expose the reader to the simplest algorithm for approximating or learning the structure of such models from data: the Chow-Liu algorithm. The exposition in this set of notes is based largely on Bishop's excellent text on machine learning [Bis08].

1 Basic Probability Theory

Because all of modern statistical machine learning deals with uncertainty, it seems appropriate to start off by reminding the reader of the definitions of events, probabilities, joint probabilities and conditional probabilities. For more details, the reader is encouraged to consult Bertsekas and Tsitsiklis [BT02]. Let us motivate probability by considering an example from Bishop [Bis08].

Example 1. There is a red box and a blue box. In the red box there are a total of 8 fruits, 2 apples and 6 oranges. In the blue box there are a total of 4 fruits, 3 apples and 1 orange. The probability of selecting the red box is Pr(B = r) = 2/5 and the probability of selecting the blue box is Pr(B = b) = 3/5. Having selected a box, selecting any item within the box is equally likely. Some of the questions we would like to ask include: What is the probability that we select an orange? Given that we have selected an orange, what is the probability that we chose it from the blue box?

1.1 Joint and Conditional Probabilities

Now, let us consider a more general example involving two random variables X and Y. Suppose, as in the example above, the random variables are only permitted to take on finitely many values. So X can only take on values in the finite set $\mathcal{X} = \{x_1, \ldots, x_M\}$ and Y takes on values in $\mathcal{Y} = \{y_1, \ldots, y_L\}$. Consider N trials and let the number of trials for which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Then, if N is large, we can assume that
$$\Pr(X = x_i, Y = y_j) = \frac{n_{ij}}{N}.$$
What is the probability that $X = x_i$? We simply sum up those $n_{ij}$'s for which the first index equals i. In other words,
$$\Pr(X = x_i) = \frac{c_i}{N}$$
where clearly $c_i := \sum_{j=1}^{L} n_{ij}$. Expressed slightly differently, we have
$$\Pr(X = x_i) = \sum_{j=1}^{L} \Pr(X = x_i, Y = y_j).$$
This is the important sum rule of probability; it will be used extensively in inference in graphical models, so the reader is urged to internalize it. Similarly, the marginal probability that $Y = y_j$ is
$$\Pr(Y = y_j) = \sum_{i=1}^{M} \Pr(X = x_i, Y = y_j).$$
Now, we introduce the important notion of conditional probabilities. Given that $X = x_i$, what is the probability that $Y = y_j$? Clearly,
$$\Pr(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}.$$
But note also that
$$\Pr(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = \Pr(Y = y_j \mid X = x_i)\,\Pr(X = x_i).$$
We have derived the important product rule of probability. The rules of probability are summarized as follows:
$$\Pr(X = x) = \sum_{y \in \mathcal{Y}} \Pr(X = x, Y = y) \quad (1)$$
$$\Pr(X = x, Y = y) = \Pr(Y = y \mid X = x)\,\Pr(X = x) \quad (2)$$
A note about notation. We will usually write $p_X(x) := \Pr(X = x)$, or simply denote this value as p(x) when the random variable is clear from the context. The function p(x) is known as the probability mass function or pmf. Similarly, the joint pmf of random variables X and Y is denoted as p(x, y).
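The sum and product rules can be made concrete in a few lines of code. The following is a minimal sketch, assuming an arbitrary table of counts $n_{ij}$ (the numbers are illustrative, not taken from the text); the joint pmf, the marginal and the conditional are exactly the ratios derived above.

```python
import numpy as np

# A minimal sketch of the sum and product rules on a table of counts n_ij.
n = np.array([[30, 10],
              [20, 40]], dtype=float)   # n[i, j] = # trials with X = x_i, Y = y_j
N = n.sum()

joint = n / N                            # Pr(X = x_i, Y = y_j) = n_ij / N
p_x = joint.sum(axis=1)                  # sum rule (1): Pr(X = x_i) = c_i / N
p_y_given_x = joint / p_x[:, None]       # Pr(Y = y_j | X = x_i) = n_ij / c_i

# Product rule (2): the joint must factor as Pr(Y | X) Pr(X).
assert np.allclose(joint, p_y_given_x * p_x[:, None])
```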
Finally, the conditional pmf will be denoted as p(y|x). By combining the sum rule in (1) and the product rule in (2), we can derive Bayes' rule:
$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y' \in \mathcal{Y}} p(x \mid y')\, p(y')}. \quad (3)$$
This is a central relationship in pattern recognition, machine learning and statistical physics. Note that we have "inverted the causal relationship" between X and Y: on the left, Y "depends on" X, while on the right, we have expressed the same relationship in terms of the causal dependence of X on Y. Bayes' theorem can be written alternatively as
$$p(x \mid y) \propto p(y \mid x)\, p(x),$$
where ∝ denotes equality up to a constant (not depending on x). If x designates an unknown variable, something we would like to infer, and p(x) is its prior probability, then p(x|y) denotes the posterior probability, the belief we have about x after we know that Y = y. In the parlance of statistical inference,

posterior ∝ likelihood × prior.

We now use this relation to solve:

Exercise 1. Let F be the random variable denoting the fruit chosen. Using the sum, product and Bayes' rules, verify that the answers to the questions in Example 1 are Pr(F = o) = 9/20 and Pr(B = b|F = o) = 1/3, respectively.

Note the following. Prior to having any additional information about which fruit we chose, the prior probability of choosing from the blue box is Pr(B = b) = 3/5. However, once we know the identity of the fruit we chose, say an orange, the posterior probability of having chosen from the blue box is Pr(B = b|F = o) = 1/3. This is the simplest non-trivial example of statistical inference. Intuitively, this is true because the blue box contains far fewer oranges, so knowing that we chose an orange biases our belief about the box we chose from.

[Figure 1: Two examples of graphs. On the left, we have an undirected graph, while on the right, we have a directed graph.]

1.2 Independence and Conditional Independence

What does it mean for two random variables X and Y to be independent? This can be expressed in a variety of ways. Two random variables X and Y are independent if their joint distribution $p_{X,Y}(x, y) := \Pr(X = x, Y = y)$ factorizes, i.e.,
$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y), \quad (4)$$
for every x ∈ X and y ∈ Y. Note from (4) that if X and Y are independent, then
$$p_{X|Y}(x \mid y) = p_X(x) \quad (5)$$
for every x ∈ X and y ∈ Y, provided that $p_Y(y) > 0$. Intuitively, (5) means that knowledge that Y = y tells you no additional information about X, hence the nomenclature independence. In the literature, the statement that X is independent of Y is denoted X ⊥⊥ Y.

Similarly, X and Y are conditionally independent given Z if the conditional distribution of X and Y given Z factorizes:
$$p_{X,Y|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z), \quad (6)$$
for every x ∈ X, y ∈ Y and z ∈ Z. Equation (6) implies that X and Y are conditionally independent given Z if
$$p_{X|Y,Z}(x \mid y, z) = p_{X|Z}(x \mid z) \quad (7)$$
for every x ∈ X, y ∈ Y and z ∈ Z, provided $p_{Y|Z}(y \mid z) > 0$. Equation (7) says that, as far as X is concerned, knowing Y = y in addition to Z = z is just as good as knowing Z = z alone. If X is conditionally independent of Y given Z, we denote this statement as X ⊥⊥ Y | Z or, alluding to later parts of this tutorial, as the Markov chain X − Z − Y. So Z separates X and Y.

Exercise 2. If X − Z − Y is a Markov chain, is it true that for all subsets A ⊂ X, B ⊂ Y, C ⊂ Z, we have Pr(X ∈ A | Y ∈ B, Z ∈ C) = Pr(X ∈ A | Z ∈ C)? Prove or provide a counterexample.
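The numbers claimed in Exercise 1 can be checked directly. The following is a sketch assuming the box/fruit model of Example 1; rows index the box B ∈ {r, b} and columns index the fruit F ∈ {a, o}.

```python
import numpy as np

# A sketch of Exercise 1 for the box/fruit model of Example 1.
p_box = np.array([2/5, 3/5])                   # prior Pr(B = r), Pr(B = b)
p_fruit_given_box = np.array([[2/8, 6/8],      # Pr(F | B = r)
                              [3/4, 1/4]])     # Pr(F | B = b)

joint = p_box[:, None] * p_fruit_given_box     # product rule: Pr(B, F)
p_orange = joint[:, 1].sum()                   # sum rule: Pr(F = o) = 9/20
p_blue_given_orange = joint[1, 1] / p_orange   # Bayes: Pr(B = b | F = o) = 1/3

print(p_orange, p_blue_given_orange)           # 0.45 0.3333...
```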
2 Graphical Models

Graphical models provide a simple way to visualize the structure of a probabilistic model. For example, from the graph structure alone, we would like to easily read off the set of conditional independence relationships between a set of random variables. How do we formalize this notion, and how can we use the graph structure to perform inference? For example, our collection of random variables could be X1, ..., X9. Given that someone observed X1 = x1, what can we say about the posterior distribution of X9? Can we compute it fairly easily? These are questions we will attempt to answer in the next two sections. But first we introduce the notions of graphs, Bayesian networks, undirected graphical models (Markov random fields) and factor graphs.

2.1 Graphs

A graph G = (V, E) is a data structure consisting of a node (or vertex) set V and an edge set E. The nodes are connected by links (also called arcs or edges) in E.

Let us consider the examples in Fig. 1. Both graphs have 4 nodes, labelled 1, 2, 3 and 4; the vertex set is thus V = {1, 2, 3, 4}. For the undirected graph on the left, a cycle graph, the edge set is E = {(1, 2), (2, 3), (3, 4), (1, 4)}. Since the graph is undirected, there is no need to order the nodes within each edge in E; we could equally well have written E = {(2, 1), (3, 2), (4, 3), (4, 1)}. For the directed graph on the right, we need to pay more attention. The edge set is E = {(1, 2), (1, 4), (4, 3)}. The second coordinate denotes the node towards which the arrow points; thus (1, 2) is not the same as (2, 1).

2.2 Bayesian Networks

We now introduce perhaps the most important class of graphical models, known as Bayesian networks. This class of models is better suited (than Markov random fields) to expressing causal relationships between random variables. For more details, the reader may consult the excellent text by Turing Award winner J. Pearl [Pea88].

Consider N random variables X1, ..., XN, each taking values in a common finite alphabet X. By repeated application of the product rule, we have
$$\Pr(X_1 = x_1, \ldots, X_N = x_N) = p(x_1, \ldots, x_N) = p(x_N \mid x_1, \ldots, x_{N-1}) \cdots p(x_2 \mid x_1)\, p(x_1).$$
We can depict this relationship graphically as in the left of Fig. 2. In this example, X1 and X2 are known as the parents of X3, and X1 is the single parent of X2. Conversely, X3 is a child of X1 and X2, and so on. This is an example of a Bayesian network. For Bayesian networks, the absence of links conveys information about conditional independence relationships. For the more complicated example shown on the right of Fig. 2, we have the factorization
$$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3). \quad (8)$$

[Figure 2: Left: a simple Bayesian network reflecting the factorization p(x1, x2, x3) = p(x3 | x1, x2) p(x2 | x1) p(x1). Right: a more complicated example reflecting the factorization in (8).]

At this point, we introduce the following convenient notation. Given a subset of nodes U ⊂ V, $X_U = \{X_j : j \in U\}$ is the collection of random variables indexed by U. Similarly, $x_U = \{x_j : j \in U\}$ is a particular realization of $X_U$. Note that $X = X_V = (X_1, \ldots, X_N)$. Using this notation, we can write the factorization of the joint distribution of X1, ..., XN as
$$\Pr(X = x) = p(x) = p(x_1, \ldots, x_N) = \prod_{k=1}^{N} p(x_k \mid x_{\mathrm{pa}_k}) \quad (9)$$
where $\mathrm{pa}_k$ denotes the set of parents of node k. For example, in the right plot of Fig. 2, the set of parents of node 4 is {1, 2, 3}, so the corresponding factor in (9) is p(x4 | x1, x2, x3).
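The factorization (9) translates directly into code: a Bayesian network is just a set of parent lists together with a conditional probability table (CPT) per node. Below is a hedged sketch for the network on the right of Fig. 2; only the parent structure comes from the figure, and the CPT entries are random placeholders over binary variables.

```python
import numpy as np
from itertools import product

# A sketch of evaluating (9) for the network on the right of Fig. 2.
rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """A random CPT of shape (2,)*n_parents + (2,), normalized over the last axis."""
    t = rng.random((2,) * n_parents + (2,))
    return t / t.sum(axis=-1, keepdims=True)

parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3]}   # pa_k from Fig. 2 (right)
cpt = {k: random_cpt(len(pa)) for k, pa in parents.items()}

def joint(x):
    """Evaluate p(x) = prod_k p(x_k | x_pa_k) at x = (x1, ..., x5)."""
    p = 1.0
    for k, pa in parents.items():
        idx = tuple(x[j - 1] for j in pa) + (x[k - 1],)
        p *= cpt[k][idx]
    return p

# Sanity check: the factorized joint sums to one over all 2^5 configurations.
assert abs(sum(joint(x) for x in product([0, 1], repeat=5)) - 1.0) < 1e-10
```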
2.2.1 Reduction in the Number of Parameters

What is the advantage of the factorization of the joint distribution in (9)? The primary advantage is the reduction in the number of parameters needed to describe a potentially very complex joint distribution. Consider the case of just two random variables X1 and X2 taking values in the common alphabet X := {0, 1}. The marginal distributions of X1 and X2 can be expressed as
$$\Pr(X_1 = x_1) = \mu_1^{x_1} (1 - \mu_1)^{1 - x_1}, \qquad \Pr(X_2 = x_2) = \mu_2^{x_2} (1 - \mu_2)^{1 - x_2}, \quad (10)$$
where $\mu_j = \Pr(X_j = 1)$ for j = 1, 2. Now, if we consider X1 − X2 (i.e., the joint distribution does not factorize as p(x1)p(x2)), then
$$\Pr(X_1 = x_1, X_2 = x_2) = \prod_{k=0}^{1} \prod_{l=0}^{1} \mu_{kl}^{\,x_{1k} x_{2l}}, \quad (11)$$
where $x_{1k} := 1\{x_1 = k\}$, $x_{2l} := 1\{x_2 = l\}$ and $\mu_{kl} = \Pr(X_1 = k, X_2 = l)$. Because $\sum_{k,l} \mu_{kl} = 1$, we require 3 parameters to describe the joint distribution in (11). Another way to see this is to write p(x1, x2) as p(x1) p(x2 | x1): the marginal distribution is governed by a single parameter $\mu_1 = \Pr(X_1 = 1)$, and the conditional distribution by the two parameters p(x2 = 0 | x1 = 0) and p(x2 = 0 | x1 = 1), making a total of 3 parameters. In contrast, if we describe X1 and X2 separately as in (10), we only require 2 parameters.

More generally, for N binary variables, each represented by a node, if the graph is fully connected (no factorization properties implied), then in general we need $2^N - 1$ parameters to describe the joint distribution. This is exponential in N, making computations virtually infeasible if N is large. However, if the model factorizes as in the chain in Fig. 3, then the reader is invited to check that only 2N − 1 parameters are required to describe p(x): one for p(x1) and two for each conditional p(x_k | x_{k−1}). The number of parameters is now linear in N and, as we will see, inference on a chain is exceedingly easy!

[Figure 3: The Markov chain reflecting the factorization p(x) = p(x1) p(x2 | x1) ... p(xN | xN−1).]
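The two parameter counts just derived are easy to tabulate; the tiny sketch below simply evaluates them side by side for a few values of N.

```python
# Parameter counts from Section 2.2.1 for N binary variables.
def n_params_full(N: int) -> int:
    return 2 ** N - 1          # one probability per joint state, minus normalization

def n_params_chain(N: int) -> int:
    return 1 + 2 * (N - 1)     # p(x1): 1 parameter; each p(x_k | x_{k-1}): 2

for N in (2, 3, 10, 30):
    print(N, n_params_full(N), n_params_chain(N))
```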
2.2.2 Conditional Independence Revisited and Explaining Away

Recall that two random variables A and B are said to be conditionally independent given C if Pr(A = a, B = b | C = c) = Pr(A = a | C = c) Pr(B = b | C = c); in short, p(a, b | c) = p(a | c) p(b | c). We now consider three different Bayesian networks describing various factorizations of the joint distribution of A, B and C. See Fig. 4.

[Figure 4: Three different factorizations of the distribution p(a, b, c).]

Consider the factorization in Fig. 4(a). According to (9), the joint distribution factorizes as
$$p(a, b, c) = p(c)\, p(a \mid c)\, p(b \mid c).$$
Does this imply that A ⊥⊥ B? Consider
$$p(a, b) = \sum_c p(a, b, c) = \sum_c p(c)\, p(a \mid c)\, p(b \mid c) \neq p(a)\, p(b)$$
where the first equality follows from the sum rule. Clearly, we cannot claim that p(a, b) = p(a) p(b) in general; hence A is not independent of B. But if we condition on C, we have
$$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = p(a \mid c)\, p(b \mid c)$$
where the first equality follows from the product rule. Hence A ⊥⊥ B | C. The key intuition here is that when we condition on C, a tail-to-tail node, the path from A to B is blocked, rendering A and B conditionally independent.

Consider the factorization in Fig. 4(b). It is easy to check that A and B are dependent in general. However, we also have the relation A ⊥⊥ B | C. Here the intuition is that node C is head-to-tail with respect to the path from A to B: if C is observed, the path is blocked, rendering A and B conditionally independent.

The trickiest case is that in Fig. 4(c). From (9), the joint distribution factorizes as
$$p(a, b, c) = p(a)\, p(b)\, p(c \mid a, b).$$
Summing both sides over c yields
$$p(a, b) = \sum_c p(a)\, p(b)\, p(c \mid a, b) = p(a)\, p(b) \sum_c p(c \mid a, b) = p(a)\, p(b),$$
where the second equality follows from the fact that p(a) and p(b) do not depend on c, and the third from the fact that probabilities sum to unity. Thus we can conclude that A ⊥⊥ B. However, given C, we have
$$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(b)\, p(c \mid a, b)}{p(c)} \neq p(a \mid c)\, p(b \mid c),$$
so A is not conditionally independent of B given C. The intuition here is that node C is a head-to-head node: when we condition on C, the path becomes unblocked, rendering A and B dependent; when C is not observed, the path is blocked, and A and B are independent.

[Figure 5: The graph used for Exercise 3. B and F are parents of G.]

Exercise 3. Consider the Bayesian network in Fig. 5, where B, F and G are binary random variables. The variable B represents whether the battery is charged (1 if charged, 0 otherwise), F represents the state of the fuel tank (1 if full, 0 if empty), and G represents the state of the electric gauge (1 if it reads well, 0 otherwise). The following are known:

Pr(B = 1) = 0.9, Pr(F = 1) = 0.9,
Pr(G = 1 | B = 1, F = 1) = 0.8, Pr(G = 1 | B = 1, F = 0) = 0.2,
Pr(G = 1 | B = 0, F = 1) = 0.2, Pr(G = 1 | B = 0, F = 0) = 0.1.

Suppose we observe that G = 0; compute the posterior probability that the fuel tank is empty, i.e., Pr(F = 0 | G = 0). Suppose that, in addition to observing G = 0, we also observe that B = 0; compute the posterior probability that the fuel tank is empty given that the battery is also flat, i.e., Pr(F = 0 | G = 0, B = 0).

If we have done the computations correctly, we find that the probability that the tank is empty has decreased from 0.257 to 0.111 as a result of learning that the battery is flat (B = 0), in addition to knowing that the gauge reads badly (G = 0). Observe that finding out that the battery is flat explains away the observation that the gauge reads badly. In other words, due to the structure in Fig. 5, F and B have become dependent as a result of observing G. This peculiar phenomenon is known as explaining away.
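The two posteriors in Exercise 3 follow from the sum, product and Bayes' rules alone, so they are easy to verify numerically. The following is a sketch using the CPTs given in the exercise; index 0 and 1 are the two states of each variable.

```python
import numpy as np

# A sketch of Exercise 3 with the tables given in the text.
p_B = np.array([0.1, 0.9])                 # Pr(B = 0), Pr(B = 1)
p_F = np.array([0.1, 0.9])                 # Pr(F = 0), Pr(F = 1)
p_G1 = np.array([[0.1, 0.2],               # Pr(G = 1 | B = b, F = f), rows index b
                 [0.2, 0.8]])

# Joint over (B, F, G) via the factorization p(b) p(f) p(g | b, f).
joint = np.empty((2, 2, 2))
for b in range(2):
    for f in range(2):
        joint[b, f, 1] = p_B[b] * p_F[f] * p_G1[b, f]
        joint[b, f, 0] = p_B[b] * p_F[f] * (1 - p_G1[b, f])

p_F0_given_G0 = joint[:, 0, 0].sum() / joint[:, :, 0].sum()    # ~0.257
p_F0_given_G0_B0 = joint[0, 0, 0] / joint[0, :, 0].sum()       # ~0.111
print(p_F0_given_G0, p_F0_given_G0_B0)
```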
2.2.3 d-separation

We would like to be able to tell, by a quick examination of the directed graph structure, whether a subset of variables is conditionally independent of another subset given a third. To do so, we need to introduce the notion of d-separation for general Bayesian networks, which generalizes the observations of the previous section. The rules are as follows. Let A, B, C ⊂ V be non-intersecting subsets of the vertex set V (their union need not be all of V). Consider all paths from any node in A to any node in B. Any such path is said to be blocked if it includes a node at which either

1. the arrows on the path meet tail-to-tail or head-to-tail, and the node belongs to C, or
2. the arrows meet head-to-head, and neither the node nor any of its descendants belongs to C.

If all paths are blocked, then A is said to be d-separated from B by C.

At this point we introduce a new convention: if a random variable is observed, we shade its node in the graph; if it is unobserved, the node remains uncolored. For example, in the left graph of Fig. 6, node C is observed and the rest are unobserved.

[Figure 6: Examples of d-separation. In the left graph node C is observed; in the right graph node F is observed.]

Consider the graphical model on the left of Fig. 6. Is it true that A is independent of B given C? The path from A to B passes through node E, a head-to-head node, and E is a parent of an observed node (node C). Thus, by the second criterion in the definition of blockedness above, the path from A to B is not blocked. Hence we cannot say that A is independent of B given C in general. For the graphical model on the right of Fig. 6, however, node F blocks the path from A to B, since it is a tail-to-tail node and F is observed. Thus A ⊥⊥ B | F.

2.2.4 Markov Blanket

As a final remark on the topic of Bayesian networks, it is insightful to ask for the minimal set of nodes that separates a node from the rest of the graph. More precisely, for a node v ∈ V, we would like to find the smallest set S such that $p(x_v \mid x_S) = p(x_v \mid x_{V \setminus v})$. In other words, conditioned on $X_S = x_S$, $X_v$ is independent of everything else in the graph. The set S is known as the Markov blanket of node v.

Consider a joint distribution with the factorization in (9), and take v = 1 for convenience:
$$p(x_1 \mid x_{V \setminus 1}) = \frac{p(x_1, \ldots, x_N)}{p(x_2, \ldots, x_N)} = \frac{\prod_{k=1}^{N} p(x_k \mid x_{\mathrm{pa}_k})}{\sum_{x_1} \prod_{k=1}^{N} p(x_k \mid x_{\mathrm{pa}_k})}.$$
Any factor that does not depend on x1 can be taken out of the sum in the denominator and cancels against the same term in the numerator. Which terms depend on x1? Certainly $p(x_1 \mid x_{\mathrm{pa}_1})$, and also every factor in which node 1 appears as a parent. In particular, if x1 and x2 have a common child, say $x_{\mathrm{ch}}$, then there is a factor $p(x_{\mathrm{ch}} \mid x_1, x_2)$ involving both. Thus the Markov blanket of node 1 is precisely the union of its parents, its children and its co-parents (nodes that share a common child with it):
$$\mathrm{MB}_1 = \mathrm{pa}_1 \cup \mathrm{children}(1) \cup \mathrm{coparents}(1).$$

[Figure 7: Example for the illustration of Markov blankets.]

Example 2. Consider the graph in Fig. 7. The Markov blanket of node C is MB_C = {A, E, D, B}. Note that D is a child of C and D has another parent, namely A. Thus A is a co-parent of C and hence A is included in the Markov blanket of C. What is the Markov blanket of node F?
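The set union just derived is straightforward to compute from parent lists. Below is a sketch; the example DAG is a placeholder, not the graph of Fig. 7.

```python
# Markov blanket from parent lists, per Section 2.2.4:
# MB(v) = parents(v) U children(v) U co-parents(v).
parents = {"A": [], "B": [], "E": [],
           "C": ["E"], "D": ["A", "C"], "G": ["B", "C"]}   # placeholder DAG

def markov_blanket(v, parents):
    children = [u for u, pa in parents.items() if v in pa]
    coparents = {w for u in children for w in parents[u] if w != v}
    return set(parents[v]) | set(children) | coparents

print(markov_blanket("C", parents))   # {'A', 'B', 'D', 'E', 'G'} (in some order)
```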
2.3 Markov Random Fields or Undirected Graphical Models

In contrast to Bayesian networks, the graph that encodes the set of conditional independencies in a Markov random field is undirected: the links do not carry arrows. The conditional independence properties can be read off more easily than in Bayesian networks. Consider again the Markov chain A − C − B. As mentioned previously, for this Markov chain, A is conditionally independent of B given C. We can generalize this to arbitrary undirected graphs. Consider the graph shown in Fig. 8 and suppose that A = {a1, a2}, B = {b1, b2} and C = {c1, c2}. Because every path from any node in A to any node in B passes through at least one node in C, it is true that $X_A$ is conditionally independent of $X_B$ given $X_C$. In this case we say that C separates A and B.

[Figure 8: A Markov random field. Here (X_{a1}, X_{a2}) ⊥⊥ (X_{b1}, X_{b2}) | (X_{c1}, X_{c2}).]

Another way to say the same thing is the following. Consider node 1 for simplicity, and let ne(1) := {j : (1, j) ∈ E} be the set of neighbors of node 1, i.e., those nodes adjacent to node 1. Then it follows from the above observation that $X_1$ is conditionally independent of $X_{V \setminus (\{1\} \cup \mathrm{ne}(1))}$ given $X_{\mathrm{ne}(1)}$. Simply put, conditioned on its neighborhood, node 1 is independent of all other nodes in the graph. For example, in Fig. 8, conditioned on $X_{a_1}$, $X_{c_1}$ and $X_{c_2}$, the random variable $X_d$ is independent of $X_{a_2}$, $X_{b_1}$, $X_{b_2}$ and $X_e$. In other words, for undirected graphical models, the Markov blanket of any node is simply its neighborhood: the nodes it is adjacent to, which separate it from the rest of the graph.

2.3.1 Factorization Properties

We now describe how the joint distribution of an undirected graphical model factorizes. As a motivating example, consider a model in which $x_i$ and $x_j$ are not neighbors in the graph. Then
$$p(x_i, x_j \mid x_{V \setminus \{i,j\}}) = p(x_i \mid x_{V \setminus \{i,j\}})\, p(x_j \mid x_{V \setminus \{i,j\}}).$$
That is, $X_i$ and $X_j$ are conditionally independent given the rest of the graph. The factorization of the joint distribution should reflect this by never placing $x_i$ and $x_j$ in the same factor.

To proceed, we have to introduce some graph-theoretic notions. Given an undirected graph, a clique is a subset of nodes such that there is a link between every two nodes in the subset. A maximal clique is a clique such that the inclusion of any other node would render it no longer a clique. It is a fact that every clique is a subset of some maximal clique. Informally speaking, maximal cliques are fully connected subsets of nodes that cannot be "enlarged" any further.

Example 3. For the undirected graph in Fig. 9, {1, 2} is a clique but not a maximal clique, since the addition of node 3 still results in a clique. The maximal cliques of this graph are C1 = {1, 2, 3} and C2 = {1, 3, 4}.

[Figure 9: Cliques and maximal cliques.]

For an undirected graph G = (V, E), let C be the set of maximal cliques of G. The joint distribution of a Markov random field can be written as a product of potential functions $\psi_C : \mathcal{X}^{|C|} \to [0, \infty)$ defined on the maximal cliques of the graph, i.e.,
$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C). \quad (12)$$
If p(x) takes the form in (12), it is known as a Gibbs random field. The constant Z is known as the partition function, an object of central importance in statistical physics. It is defined so that the pmf p(x) sums to unity:
$$Z = \sum_{x \in \mathcal{X}^{|V|}} \prod_{C \in \mathcal{C}} \psi_C(x_C).$$
Note that the domain of the potential function $\psi_C(x_C)$ is simply the Cartesian product of the domains of the random variables in the maximal clique C. Note also that the potential functions are not restricted to be strictly positive; there can be states $x_C$ for which $\psi_C(x_C) = 0$.

The evaluation of the partition function is generally intractable: we need to sum over $|\mathcal{X}|^{|V|}$ states, which is exponential in |V|! Researchers have devoted lifetimes to deriving good bounds on the partition function for graphical models with special structure. The logarithm of the partition function is also intimately related to the cumulant generating function.

A natural question is: what is the relation between the factorization in (12) and the set of conditional independence statements we can make? We have the following fundamental result in the theory of graphical models.

Theorem 1 (Hammersley-Clifford [HC70]). Suppose that $\psi_C(x_C) > 0$ for all $x_C \in \mathcal{X}^{|C|}$ and all C ∈ C. Let UI be the set of distributions consistent with the set of conditional independence statements implied by an undirected graph G. Let UF be the set of distributions expressible via the factorization in (12), where C is the set of maximal cliques of G. Then UI = UF.
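The factorization (12) and the exponential cost of the partition function can both be seen in a brute-force sketch. The following assumes the graph of Fig. 9 (maximal cliques {1, 2, 3} and {1, 3, 4}) with binary variables; the potential tables are arbitrary positive numbers, not taken from the text.

```python
import numpy as np
from itertools import product

# A brute-force sketch of the Gibbs factorization (12) for the graph of Fig. 9.
rng = np.random.default_rng(1)
psi_123 = rng.random((2, 2, 2)) + 0.1     # psi_{C1}(x1, x2, x3) > 0
psi_134 = rng.random((2, 2, 2)) + 0.1     # psi_{C2}(x1, x3, x4) > 0

def unnormalized(x1, x2, x3, x4):
    return psi_123[x1, x2, x3] * psi_134[x1, x3, x4]

# Z sums the product of clique potentials over all |X|^|V| = 16 states;
# this enumeration is exactly the exponential cost the text warns about.
Z = sum(unnormalized(*x) for x in product([0, 1], repeat=4))
p = lambda *x: unnormalized(*x) / Z
assert abs(sum(p(*x) for x in product([0, 1], repeat=4)) - 1.0) < 1e-12
```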
Since the Hammersley-Clifford theorem requires the positivity of all the potential functions, it is convenient to parametrize them as $\psi_C(x_C) = \exp[-E_C(x_C)]$, where the $E_C : \mathcal{X}^{|C|} \to \mathbb{R}$ are called energy functions. Parametrized in this fashion, the probability distribution in (12) can be expressed as
$$p(x) = \frac{1}{Z} \exp\left[-\sum_{C \in \mathcal{C}} E_C(x_C)\right]$$
and is called a Boltzmann distribution. The negative log-likelihood is the sum of the energies associated with each maximal clique, up to a constant:
$$-\log p(x) = \sum_{C \in \mathcal{C}} E_C(x_C) + \log Z.$$
Hence the intuition is that if the energy of the system is low (close to equilibrium, or the ground state), the probability is high, and vice versa.

[Figure 10: Image processing example.]

Example 4. Markov random fields are used extensively in image processing; see the canonical model in Fig. 10. One way to model a black-and-white image is as follows. Each pixel $x_i$ can take on only one of the two values in X = {−1, +1}. These values are corrupted by noise, and what we observe are the $y_i$'s, given by
$$y_i = \begin{cases} x_i & \text{with probability } q, \\ -x_i & \text{with probability } 1 - q. \end{cases}$$
We observe the $y_i$'s (which is why they are shaded in Fig. 10). Let $x_i$ and $x_j$ be neighboring (unobserved) pixels. Clearly, the maximal cliques are of the form $\{x_i, x_j\}$ and $\{x_i, y_i\}$. Let us specify plausible energy functions compatible with these maximal cliques. We want neighbouring pixels to be similar with high probability, so we set $E(x_i, x_j) = -\beta x_i x_j$ for some β > 0. We want the observation $y_i$ to be correlated with the underlying pixel $x_i$, so we set $E(x_i, y_i) = -\gamma x_i y_i$ for some γ > 0. We may also believe that the image consists of mostly −1's, so we set $E(x_i) = h x_i$ for some h > 0. In sum, the energy function to be minimized over x can be expressed as
$$E(x) = h \sum_{i \in V_{\mathrm{grid}}} x_i \;-\; \beta \sum_{(i,j) \in E_{\mathrm{grid}}} x_i x_j \;-\; \gamma \sum_{i \in V_{\mathrm{grid}}} x_i y_i,$$
where $V_{\mathrm{grid}}$ and $E_{\mathrm{grid}}$ denote the vertex set and the edge set of the grid, respectively. Minimizing this energy functional can be done using standard techniques such as Gibbs sampling [GG84]; this is outside the scope of the tutorial.
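The energy in Example 4 is simple to evaluate on a grid. Below is a sketch assuming 4-nearest-neighbour edges; the values of h, β and γ are illustrative choices, not taken from the text.

```python
import numpy as np

# A sketch of the image-model energy of Example 4 on a small grid.
h, beta, gamma = 0.1, 1.0, 2.0

def energy(x, y):
    """E(x) = h*sum x_i - beta*sum_{(i,j)} x_i x_j - gamma*sum x_i y_i,
    for 2-D arrays x, y with entries in {-1, +1} and 4-neighbour edges."""
    pair = (x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum()
    return h * x.sum() - beta * pair - gamma * (x * y).sum()

rng = np.random.default_rng(2)
x_true = np.ones((8, 8), dtype=int)
flip = rng.random((8, 8)) < 0.1               # noise: flip with probability 1 - q = 0.1
y = np.where(flip, -x_true, x_true)
print(energy(x_true, y), energy(-x_true, y))  # the clean image has lower energy
```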
2.3.2 Relation of Undirected Graphical Models to Directed Graphical Models

If we have a directed model as in Fig. 7, how do we come up with a Markov random field representation of it that is, in a sense, minimal? We perform the following two operations. First, for every node with two or more parents, create an undirected edge between every pair of parents not already connected. Second, turn each directed edge into an undirected one by removing the arrows. This process is called moralizing the graph, or marrying the parents. See Fig. 11 for an example of this procedure.

[Figure 11: Moralizing a directed graph to form a Markov random field. The parents have been married.]

We claim that the undirected graph created by the moralization process is a Markov random field for the same distribution. A proof sketch goes as follows. Consider the factorization of Bayesian networks in (9) and let v be any node. As we have previously mentioned, $x_v$ appears in the factor $p(x_v \mid x_{\mathrm{pa}_v})$ and in other factors involving its co-parents and children. The node v is already connected to its parents and its children in the directed graph (of course, we still need to remove the arrows). The only remaining case is when v has a child u that also has another parent w. For example, if $\mathrm{pa}_u = \{v, w\}$, then the factor $p(x_u \mid x_v, x_w)$ has to be taken into account to "match" the two factorizations in (9) and (12): this factor may not simplify, so we must include an edge between v and w, which may not already exist in the directed graph. This is precisely what moralization does.

Note that the resulting undirected graph may not express the same information (independences and conditional independences) as the directed one. Take the battery example in Fig. 5. If we moralize this directed graph, we get an undirected triangle: a three-node graph that is fully connected. From the undirected triangle, however, we can no longer infer that F and B are in fact independent.
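The moralization procedure just described is a two-line graph transformation. Below is a sketch operating on parent lists; the example DAG is a placeholder.

```python
from itertools import combinations

# A sketch of moralization: marry unconnected co-parents, then drop arrows.
def moralize(parents):
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # directed edge p -> child, made undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):      # marry every pair of parents
            edges.add(frozenset((p, q)))
    return edges

dag = {"C": ["A", "B"], "D": ["B"], "E": ["C", "D"]}   # placeholder DAG
print(sorted(tuple(sorted(e)) for e in moralize(dag)))
# [('A','B'), ('A','C'), ('B','C'), ('B','D'), ('C','D'), ('C','E'), ('D','E')]
```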
Exercise 4. Check that if p(x) is a joint distribution that can be represented as a tree-structured directed graphical model (Bayesian network) with no head-to-head nodes, then p(x) can be represented by an undirected tree-structured graph G = (V, E), and that
$$p(x) = \prod_{i \in V} p(x_i) \prod_{(i,j) \in E} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}.$$
Hint: this follows directly from the Bayesian network factorization in (9), noting that each node, except the root, has exactly one parent.

3 Inference in Graphical Models

In this section, we detail the sum-product algorithm, the most common algorithm used in inference tasks. Let us motivate the problem of inference using a simple example. We have an "unknown" random variable X, which is correlated with another random variable Y. They have joint distribution p(x, y) = p(x) p(y | x), drawn as the graphical model on the left of Fig. 12. Recall that p(x) is the prior distribution of X and p(y | x) is the likelihood model. We observe the value of Y, say y ∈ Y (reflected by the shading in the middle graph of Fig. 12), and would like to find the posterior distribution of X. To do so, we apply Bayes' rule to obtain
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}.$$
Hence, the joint distribution is now expressed in terms of p(y), the evidence, and p(x | y), the posterior distribution. This is reflected in the right graph of Fig. 12: essentially, the "causality" relationship between X and Y has been "inverted". The inference question is then: given some observed variables, what are the posterior distributions over the others? Can we compute these posterior distributions as efficiently as possible?

[Figure 12: Simple inference example.]

3.1 Inference on a Chain

Let us consider the Markov chain
$$X_1 \to X_2 \to X_3 \to \cdots \to X_{N-1} \to X_N.$$
The directed version above is equivalent to the undirected one,
$$X_1 - X_2 - X_3 - \cdots - X_{N-1} - X_N,$$
since there are no co-parents to marry: each child has exactly one parent. In the undirected form, the joint distribution takes the form
$$p(x_1, \ldots, x_N) = \frac{1}{Z}\, \psi_{1,2}(x_1, x_2)\, \psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N). \quad (13)$$
Consider the case where each random variable takes values in the same finite alphabet X = {1, ..., K}. Then the joint distribution requires roughly (N − 1)K² parameters to describe. As we have seen, this is a substantial reduction compared to not assuming Markovianity between the variables.

Let us consider the problem of finding p(x_n) for some 1 ≤ n ≤ N. Here, we assume that there are no observations (the Y variable of the simple example above is deterministic). Then, by the sum rule,
$$p(x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} \sum_{x_{n+1}} \cdots \sum_{x_N} p(x). \quad (14)$$
Performed naively, this computation requires summing over O(K^N) states, i.e., the computational complexity is exponential in the length of the chain. Surely there must be a more computationally efficient way to compute p(x_n) given the Markov structure! By appealing to the factorization in (13), we can rewrite (14) as follows:
$$p(x_n) = \frac{1}{Z} \left[\sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \left[\sum_{x_2} \psi_{2,3}(x_2, x_3) \left[\sum_{x_1} \psi_{1,2}(x_1, x_2)\right]\right]\right] \times \left[\sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \left[\sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N)\right]\right]. \quad (15)$$
It is easy to check that the total cost is O(N K²), which is linear in the length of the chain! Note that (15) can be written as
$$p(x_n) = \frac{1}{Z}\, \mu_\alpha(x_n)\, \mu_\beta(x_n) \quad (16)$$
where the messages are
$$\mu_\alpha(x_n) := \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \sum_{x_2} \psi_{2,3}(x_2, x_3) \sum_{x_1} \psi_{1,2}(x_1, x_2),$$
$$\mu_\beta(x_n) := \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N).$$
The function $\mu_\alpha(x_n)$ is known as the forward message, while $\mu_\beta(x_n)$ is known as the backward message. Note that the messages can be computed recursively:
$$\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}), \qquad \mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1}). \quad (17)$$
Furthermore, while the normalization constant (partition function) Z of the joint distribution in (13) is generally intractable to compute, in (16) we only need O(K) operations to compute Z, by summing $\mu_\alpha(x_n)\,\mu_\beta(x_n)$ over $x_n$, which is tractable. The recursions in (17) are known as the Chapman-Kolmogorov equations; these are encountered frequently in Markov chain theory.

It is easy to see that this idea extends to general tree-structured undirected graphical models, i.e., models with no loops. It can also be extended to compute pairwise marginals, say p(x1, x9). Lastly, note that to compute all marginals we do not have to repeat the whole process, since many of the forward and backward messages can be recycled.
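The recursions (17) and the marginal formula (16) fit in a short function. The following is a sketch on a chain of N binary nodes with random positive pairwise potentials (illustrative, not from the text), checked against brute-force marginalization of the O(K^N) joint.

```python
import numpy as np
from itertools import product

# Forward/backward recursions (17) on a chain; psi[k] is the potential for edge (k, k+1).
rng = np.random.default_rng(3)
N, K = 6, 2
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]

def marginal(n):
    """p(x_n) via (16): mu_alpha(x_n) * mu_beta(x_n) / Z, with 0-indexed n."""
    alpha = np.ones(K)
    for k in range(n):                       # forward pass, edges left of n
        alpha = psi[k].T @ alpha             # sum_{x_k} psi(x_k, x_{k+1}) mu_alpha(x_k)
    beta = np.ones(K)
    for k in range(N - 2, n - 1, -1):        # backward pass, edges right of n
        beta = psi[k] @ beta                 # sum_{x_{k+1}} psi(x_k, x_{k+1}) mu_beta(x_{k+1})
    m = alpha * beta
    return m / m.sum()                       # normalizing computes Z implicitly, O(K)

# Check against brute-force marginalization of the full joint.
joint = np.zeros((K,) * N)
for x in product(range(K), repeat=N):
    joint[x] = np.prod([psi[k][x[k], x[k + 1]] for k in range(N - 1)])
joint /= joint.sum()
for n in range(N):
    axes = tuple(i for i in range(N) if i != n)
    assert np.allclose(marginal(n), joint.sum(axis=axes))
```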
3.2 Factor Graphs

We take a short detour here and describe a very useful representation of a probability distribution, in addition to the two we have already encountered. Given an undirected graph, we can draw its corresponding factor graph, which is a more convenient form in which to present the sum-product algorithm below. Each maximal-clique potential $\psi_C(x_C)$ in the factorization $p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$ is assigned a factor $f_C(x_C)$, drawn as a square node in the graph. This gives us a bipartite graph, one in which there are two sets of nodes and edges only run from one set to the other, and the joint probability distribution can be expressed as
$$p(x) = \prod_{C \in \mathcal{C}} f_C(x_C). \quad (18)$$
The factor nodes (squares) are denoted using f's and the variable nodes (circles) by the original x's.

Example 5. Consider the fully connected undirected graph on the three nodes X1, X2 and X3. The joint distribution can be expressed as
$$p(x) = f_a(x_1, x_2)\, f_b(x_1, x_3)\, f_c(x_2, x_3).$$
The corresponding factor graph is sketched in Fig. 13 (left). If, for example, X2 − X1 − X3 forms a Markov chain in that order, then $f_c$ and its edges can be removed, and we have the factor graph in Fig. 13 (right).

[Figure 13: The factor graph of a fully connected distribution (left) and of the Markov chain X2 − X1 − X3 (right).]

Note that if the original undirected graph is a tree, the resulting bipartite factor graph has no loops, i.e., it is also a tree. Check!

3.3 The Sum-Product Algorithm

We come to perhaps the most important point of this set of lecture notes: the presentation of the sum-product algorithm [KFL01]. This algorithm is also known in the machine learning community as the belief propagation algorithm. Suppose, as in the inference-on-a-chain example, that Y = ∅, i.e., there are no observations; we discuss how to relax this assumption at the end of this section. Suppose we want to compute the marginal p(x) of a single variable x. We consider the factor graph of the joint distribution $p(\mathbf{x})$, the part of which containing x is shown in Fig. 14. Then, by the sum rule,
$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}). \quad (19)$$
Now, we can use the factor graph representation in (18) to rewrite the joint distribution as
$$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, \mathbf{x}_s) \quad (20)$$
where ne(x) is the set of neighbors of x (they are factor nodes) and $\mathbf{x}_s$ is the set of variables in the subtree connected to variable x via factor node s; see Fig. 14. We use the different notation $F_s$ to distinguish these from the factors in (18) and to emphasize that the factor graph is rooted at x, the first argument of the factor $F_s$. More precisely, $F_s(x, \mathbf{x}_s)$ is the product of all the factors in the group associated with factor $f_s$ in the subtree connected to variable x via factor node s; see (23) below.

[Figure 14: Illustration of (20). Note that $\mathbf{x}_s := \{x_1, \mathbf{x}_{s_1}, \ldots, x_M, \mathbf{x}_{s_M}\}$ (a recursive definition), ne(x) = {f_s, f_t, f_u} and ne(f_s) = {x, x1, ..., xM}.]

Substituting (20) into (19) yields
$$p(x) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, \mathbf{x}_s) = \prod_{s \in \mathrm{ne}(x)} \left[\sum_{\mathbf{x}_s} F_s(x, \mathbf{x}_s)\right] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x) \quad (21)$$
where the message from factor $f_s$ to node x is defined as
$$\mu_{f_s \to x}(x) := \sum_{\mathbf{x}_s} F_s(x, \mathbf{x}_s). \quad (22)$$
Now let
$$F_s(x, \mathbf{x}_s) := f_s(x, x_1, \ldots, x_M)\, G_1(x_1, \mathbf{x}_{s_1}) \cdots G_M(x_M, \mathbf{x}_{s_M}). \quad (23)$$
This is simply the product of the local factor $f_s(x, x_1, \ldots, x_M)$ and the factors associated with the subtrees rooted at the $x_m$, namely $G_m(x_m, \mathbf{x}_{s_m})$ for 1 ≤ m ≤ M. Note that the subtree rooted at x itself is excluded from the product in (23). Substituting (23) into (22) yields
$$\mu_{f_s \to x}(x) = \sum_{x_1, \ldots, x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \sum_{\mathbf{x}_{s_m}} G_m(x_m, \mathbf{x}_{s_m}) = \sum_{x_1, \ldots, x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m) \quad (24)$$
where the message from node $x_m$ to factor $f_s$ is defined as
$$\mu_{x_m \to f_s}(x_m) := \sum_{\mathbf{x}_{s_m}} G_m(x_m, \mathbf{x}_{s_m}). \quad (25)$$
We can set up a recursion by writing the $G_m$'s in terms of the $F_l$'s:
$$G_m(x_m, \mathbf{x}_{s_m}) = \prod_{l \in \mathrm{ne}(x_m) \setminus s} F_l(x_m, \mathbf{x}_{m_l}). \quad (26)$$
The $F_l(x_m, \mathbf{x}_{m_l})$'s again represent subtrees of the original tree graph; see Fig. 15. Now let us express the message sent from variable node $x_m$ to factor node $f_s$ more explicitly. Combining (25) and (26), we have
$$\mu_{x_m \to f_s}(x_m) = \sum_{\mathbf{x}_{s_m}} \prod_{l \in \mathrm{ne}(x_m) \setminus s} F_l(x_m, \mathbf{x}_{m_l}) = \prod_{l \in \mathrm{ne}(x_m) \setminus s} \left[\sum_{\mathbf{x}_{m_l}} F_l(x_m, \mathbf{x}_{m_l})\right] = \prod_{l \in \mathrm{ne}(x_m) \setminus s} \mu_{f_l \to x_m}(x_m). \quad (27)$$
See the message passing in Fig. 15.

[Figure 15: Illustration of (26) and (27).]

Putting (24) and (27) together, we see that the sum-product algorithm proceeds by recursively applying the following message-passing rules:
$$\mu_{x \to f_s}(x) = \prod_{l \in \mathrm{ne}(x) \setminus s} \mu_{f_l \to x}(x) \quad (28)$$
$$\mu_{f_s \to x}(x) = \sum_{x_1, \ldots, x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m). \quad (29)$$
The messages are passed from the leaves to the root and then back from the root to all the leaves. The marginal of x can then be obtained by taking the product of all the incoming factor messages at node x, as in (21), i.e.,
$$p(x) \propto \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x). \quad (30)$$
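Before walking through the worked example below, it may help to see (28) and (29) rendered directly in code. The following is a sketch for the small factor graph of Fig. 16 (the one used in the example that follows), assuming binary variables and random positive factor tables; the schedule is leaves-to-root-and-back with root x3, and the result is checked against brute-force marginalization.

```python
import numpy as np
from itertools import product

# Sum-product updates (28)-(29) on the factor graph of Fig. 16.
rng = np.random.default_rng(4)
fa, fb, fc = (rng.random((2, 2)) + 0.1 for _ in range(3))  # fa[x1,x2], fb[x2,x3], fc[x2,x4]

msg = {}
# Leaves to the root x3.
msg[("x1", "fa")] = np.ones(2)
msg[("fa", "x2")] = fa.T @ msg[("x1", "fa")]                # (29): sum_{x1} fa(x1,x2)
msg[("x4", "fc")] = np.ones(2)
msg[("fc", "x2")] = fc @ msg[("x4", "fc")]                  # (29): sum_{x4} fc(x2,x4)
msg[("x2", "fb")] = msg[("fa", "x2")] * msg[("fc", "x2")]   # (28)
msg[("fb", "x3")] = fb.T @ msg[("x2", "fb")]                # (29): sum_{x2} fb(x2,x3) mu
# Root back to the leaves.
msg[("x3", "fb")] = np.ones(2)
msg[("fb", "x2")] = fb @ msg[("x3", "fb")]
msg[("x2", "fa")] = msg[("fb", "x2")] * msg[("fc", "x2")]
msg[("fa", "x1")] = fa @ msg[("x2", "fa")]
msg[("x2", "fc")] = msg[("fa", "x2")] * msg[("fb", "x2")]
msg[("fc", "x4")] = fc.T @ msg[("x2", "fc")]

m2 = msg[("fa", "x2")] * msg[("fb", "x2")] * msg[("fc", "x2")]
p2 = m2 / m2.sum()                                          # (30), normalized

# Brute-force check of p(x2).
joint = np.zeros((2, 2, 2, 2))
for x1, x2, x3, x4 in product(range(2), repeat=4):
    joint[x1, x2, x3, x4] = fa[x1, x2] * fb[x2, x3] * fc[x2, x4]
joint /= joint.sum()
assert np.allclose(p2, joint.sum(axis=(0, 2, 3)))
```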
Phew... It seems apt to now provide an example to see how all these equations are used in practice. Consider the factor graph shown in Fig. 16, a 4-node graph whose joint distribution factorizes as
$$p(\mathbf{x}) = f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4).$$

[Figure 16: A simple factor graph used to illustrate the sum-product algorithm.]

In order to apply the sum-product algorithm, let us designate x3 as the root of the tree. Relative to node x3, the tree has two leaf nodes, x1 and x4. Starting with the leaf nodes, we have the sequence of messages
$$\mu_{x_1 \to f_a}(x_1) = 1,$$
$$\mu_{f_a \to x_2}(x_2) = \sum_{x_1} f_a(x_1, x_2),$$
$$\mu_{x_4 \to f_c}(x_4) = 1,$$
$$\mu_{f_c \to x_2}(x_2) = \sum_{x_4} f_c(x_2, x_4),$$
$$\mu_{x_2 \to f_b}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2),$$
$$\mu_{f_b \to x_3}(x_3) = \sum_{x_2} f_b(x_2, x_3)\, \mu_{x_2 \to f_b}(x_2).$$
Note that these equations are specializations of (28) and (29). Once these messages have been passed from the leaves to the root, we send messages from the root back to the leaves to complete the inference:
$$\mu_{x_3 \to f_b}(x_3) = 1,$$
$$\mu_{f_b \to x_2}(x_2) = \sum_{x_3} f_b(x_2, x_3),$$
$$\mu_{x_2 \to f_a}(x_2) = \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2),$$
$$\mu_{f_a \to x_1}(x_1) = \sum_{x_2} f_a(x_1, x_2)\, \mu_{x_2 \to f_a}(x_2),$$
$$\mu_{x_2 \to f_c}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2),$$
$$\mu_{f_c \to x_4}(x_4) = \sum_{x_2} f_c(x_2, x_4)\, \mu_{x_2 \to f_c}(x_2).$$
Let us verify that the marginal p(x2) is evaluated correctly using the above equations:
$$p(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2) = \left[\sum_{x_1} f_a(x_1, x_2)\right] \left[\sum_{x_3} f_b(x_2, x_3)\right] \left[\sum_{x_4} f_c(x_2, x_4)\right] = \sum_{x_1, x_3, x_4} f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4) = \sum_{x_1, x_3, x_4} p(\mathbf{x}),$$
which is the sum rule, as required.

Exercise 5. Check that p(x1), p(x3) and p(x4) can also be computed correctly using the above procedure.

Exercise 6. If the undirected graph is a tree, what is the complexity of belief propagation in terms of the number of states |X| of each random variable and the number of nodes N?

It is a well-known and easily verifiable fact that if the original graph is a tree (as in the example above), the sum-product algorithm converges and gives the correct marginals. If the original graph is not a tree and we run the sum-product algorithm disregarding the presence of loops (a procedure called loopy belief propagation), then it may not converge, and even if it does, it may not give the correct marginals. This is a research area of intense interest even as we speak. The jointly Gaussian case is very well studied and, in fact, it is easy to check whether belief propagation "works" [MJW06]. There is also a huge body of literature on approximate inference using, for example, variational methods [WJ08] or sampling methods [GG84].

If some variables are observed, say $v = \hat{v}$, belief propagation can be modified accordingly. Write $\mathbf{x} = (h, v)$, where h are the hidden variables. Multiplying the joint distribution by $\prod_j 1\{v_j = \hat{v}_j\}$, we consider
$$\tilde{p}(\mathbf{x}) = p(\mathbf{x}) \prod_j 1\{v_j = \hat{v}_j\}.$$
This is simply $p(h, v = \hat{v})$, an unnormalized version of $p(h \mid v = \hat{v})$. Using the sum-product algorithm detailed previously, we can then easily compute any marginal $p(h_i \mid v = \hat{v})$ up to a normalization constant.
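Conditioning by multiplying in indicators amounts, in code, to zeroing out the factor entries that disagree with the evidence. Below is a sketch that clamps x4 = 1 in the Fig. 16 model (the clamped value and factor tables are illustrative).

```python
import numpy as np

# Clamping evidence: multiply the joint by 1{x4 = 1} by zeroing entries of fc.
rng = np.random.default_rng(5)
fa, fb, fc = (rng.random((2, 2)) + 0.1 for _ in range(3))

fc_clamped = fc.copy()
fc_clamped[:, 0] = 0.0        # kill x4 = 0; equivalent to multiplying by 1{x4 = 1}

# Running the same sum-product sweeps with fc_clamped yields the conditional
# marginals p(x_i | x4 = 1) after normalization; for x2, directly:
m = (fa.T @ np.ones(2)) * (fb @ np.ones(2)) * fc_clamped[:, 1]
print(m / m.sum())            # p(x2 | x4 = 1)
```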
3.4 Variants of the Sum-Product Algorithm

Observe that the only property used in the above simplification of the computation of p(x) is the distributivity of multiplication over addition. More precisely, we used the fact that for three real numbers a, b, c,
$$a \cdot (b + c) = a \cdot b + a \cdot c.$$
It turns out that any pair of operations satisfying this distributivity property admits an algorithm analogous to the sum-product algorithm. For example, the max-product (or min-sum) algorithm solves the related problem of maximization, or most probable explanation. Instead of computing a marginal, the goal here is to find the configuration x that maximizes the global function (i.e., the most probable values in a probabilistic setting), defined via the arg max:
$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \log p(\mathbf{x}).$$
An algorithm that solves this problem is nearly identical to belief propagation, with the sums in the definitions replaced by maximizations. This works because max{ab, ac} = a max{b, c} for a ≥ 0.

4 Learning of Graphical Models

In this section, we address another important question in the study of graphical models. Suppose we have a very complex joint distribution q(x) that we believe is "close" to a simpler distribution p(x), perhaps one that has few edges (i.e., a sparse graphical model). How do we approximate q(x) by such a p(x) in a principled manner? This brings us to the realm of learning graphical models. We can learn a simpler, sparser model either from a complex model or from data; the latter is more relevant in practice. We focus on the former, but toward the end we detail how to fit a simple tree-structured model to data using the Chow-Liu algorithm [CL68].

4.1 A Crash Course in Information Theory

It was alluded to previously that we would like to learn a simple model that is, in some sense, close to a complex model. To measure closeness of distributions, we need some tools from information theory [CT06]. The exposition here cannot do justice to the beauty of information theory; for a flavor of the results, the reader is strongly encouraged to skim through Cover and Thomas [CT06].

For a random variable X taking values in a finite set X with distribution p(x), the entropy is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \quad (31)$$
where the log is to the base 2, so the unit of entropy is bits. The entropy measures the amount of uncertainty in X. For two random variables X and Y with joint distribution $p_{X,Y}$, the joint entropy is
$$H(X, Y) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log p_{X,Y}(x, y). \quad (32)$$
The conditional entropy of X given Y is the remaining uncertainty in X once Y is known. It is defined as
$$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log p_{X|Y}(x \mid y). \quad (33)$$
The mutual information between X and Y with joint distribution $p_{X,Y}$ is denoted as
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)}. \quad (34)$$
The mutual information measures the reduction in uncertainty about X once Y is known. When we want to make the dependence of these information-theoretic quantities on the underlying distribution explicit, we append a subscript to the notation. For example, $I_{p_{X,Y}}(X; Y)$ denotes the mutual information in (34) computed from the joint distribution $p_{X,Y}$.

Lemma 2. The entropies defined in (31)–(33) are non-negative.

Lemma 3. The mutual information defined in (34) is non-negative.

Exercise 7. Determine when the entropies in (31)–(33) are equal to zero. Determine when the mutual information in (34) equals zero.

Given two distributions p and q with common support X, the Kullback-Leibler divergence or relative entropy is defined as
$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \quad (35)$$
The relative entropy is a fundamental measure of the distance between the two distributions p and q. In fact, we have:

Lemma 4. The relative entropy in (35) is non-negative. It is zero if and only if p(x) = q(x) for all x ∈ X.

It is not too hard to see that the mutual information in (34) is equal to the relative entropy between the joint distribution $p_{X,Y}$ and the product of the marginals $p_X \circ p_Y$, i.e., $I(X; Y) = D(p_{X,Y} \,\|\, p_X \circ p_Y)$. This shows that mutual information measures the degree of dependence between the random variables X and Y: if they are independent, I(X; Y) = 0; otherwise, I(X; Y) is strictly positive.
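The quantities (31)–(35) reduce to a few array operations for discrete pmfs. The following sketch uses base-2 logs (bits) and treats 0 log 0 as 0; the joint table is illustrative.

```python
import numpy as np

# Entropy, conditional entropy, mutual information and KL for a discrete joint.
def xlogy(p, q):
    """Elementwise p * log2(q), with the convention 0 * log2(0) = 0."""
    out = np.zeros_like(p, dtype=float)
    nz = p > 0
    out[nz] = p[nz] * np.log2(q[nz])
    return out

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_X  = -xlogy(p_x, p_x).sum()                       # (31)
H_Y  = -xlogy(p_y, p_y).sum()
H_XY = -xlogy(p_xy, p_xy).sum()                     # (32)
H_X_given_Y = H_XY - H_Y                            # (33)
I_XY = H_X - H_X_given_Y                            # (34)

kl = xlogy(p_xy, p_xy / np.outer(p_x, p_y)).sum()   # D(p_XY || p_X o p_Y), cf. (35)
assert np.isclose(I_XY, kl)                          # (34) as a KL divergence
```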
4.2 The Chow-Liu Algorithm

Suppose we have a very complicated multivariate undirected graphical model q(x), perhaps fully connected, where x = (x1, ..., xN). For the sake of tractability, we would like to approximate it with a simpler undirected graphical model. As we have seen, the family of tree-structured distributions admits tractable exact inference. So it is natural to approximate q(x) with a tree-structured distribution by minimizing the relative entropy between q(x) and the approximating distribution:
$$\hat{p} = \operatorname*{arg\,min}_{p \in \mathcal{T}_N} D(q \,\|\, p), \quad (36)$$
where $\mathcal{T}_N$ is the set of all tree-structured undirected graphical models on N nodes. See Fig. 17.

[Figure 17: Approximating a complex distribution, say a fully connected one (left), with a simpler tree-structured one (right).]

The problem in (36) is potentially very computationally intensive, since the number of undirected trees on N nodes is $N^{N-2}$. This number is super-exponential in the number of nodes N, so a naïve search for the tree-structured distribution that minimizes the relative entropy in (36) is clearly infeasible.

Enter Chow and Liu [CL68] in 1968. Through some fairly rudimentary manipulations of information-theoretic quantities (detailed in the next subsection), they showed that the tree structure of $\hat{p}$ in (36), denoted $T_{\hat{p}}$, is given by
$$T_{\hat{p}} = \operatorname*{arg\,max}_{T \in \mathcal{T}_N} \sum_{(i,j) \in T} I_q(X_i; X_j), \quad (37)$$
where $\mathcal{T}_N$ here denotes the set of all undirected trees on N nodes and $I_q(X_i; X_j)$ is the mutual information between $X_i$ and $X_j$ computed from the dense graphical model q. The optimization problem in (37) is a simple maximum-weight spanning tree problem, which can be solved efficiently using, for example, Kruskal's algorithm [Kru56]. So instead of having to search through all $N^{N-2}$ tree structures in (36), we can simply solve a maximum-weight spanning tree problem to find the optimal tree structure in time polynomial in N.

What if we instead have data $\mathcal{D} := \{x^1, \ldots, x^n\}$, drawn independently and identically from some distribution q(x), and we would like to learn (an approximate version of) q(x)? Suppose the random variables take values in some finite alphabet X. Then a natural first step is to compute the empirical distribution (or the type) of the data,
$$\hat{q}(x; \mathcal{D}) = \frac{1}{n} \sum_{k=1}^{n} 1\{x = x^k\},$$
i.e., the normalized frequencies of the occurrences of x in the data. Second, we perform the optimization in (37) with $\hat{q}$ in place of q to obtain a sparser, tree-structured distribution that approximates the generating distribution q(x). It turns out that computing the empirical distribution and then running the maximum-weight spanning tree procedure yields the maximum-likelihood tree distribution, but we will not derive this here. Recent work by the author in his Ph.D. thesis answered the question of how many samples n one needs to obtain a "good" fit of the model to the data [TATW11]. The Chow-Liu algorithm can also be applied to continuous data, and its performance can be analyzed as in [TAW10].

We note that (37) only provides us with the approximating structure. To complete the process of finding an approximating distribution, we have to fit the parameters as well; that is, we must find $\hat{p}(x_i, x_j)$ for all (i, j) in the edge set. This is an easy process and can be done using maximum-likelihood procedures. See [TATW11] for more details.
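The data-driven version of the procedure fits in a single short function: estimate pairwise empirical distributions, compute empirical mutual informations, and run a maximum-weight spanning tree. Below is a sketch on binary data; the toy data-generating process at the bottom is an assumption for illustration, and the spanning tree is found with Kruskal's algorithm over edges sorted by decreasing weight (natural logs are used, since the base does not affect the arg max).

```python
import numpy as np
from itertools import combinations

def mutual_information(joint):
    """Empirical I(X_i; X_j) from a 2x2 joint table, in nats."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum()

def chow_liu(data):
    """Chow-Liu tree (37) for binary data of shape (n_samples, N)."""
    n, N = data.shape
    weights = {}
    for i, j in combinations(range(N), 2):
        joint = np.zeros((2, 2))
        for a in (0, 1):
            for b in (0, 1):
                joint[a, b] = np.mean((data[:, i] == a) & (data[:, j] == b))
        weights[(i, j)] = mutual_information(joint)

    parent = list(range(N))                      # union-find for Kruskal
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                             # adding (i, j) keeps the graph a tree
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: X0 drives X1, which drives X2; X3 is independent noise.
rng = np.random.default_rng(6)
x0 = rng.integers(0, 2, 2000)
x1 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
x2 = np.where(rng.random(2000) < 0.9, x1, 1 - x1)
x3 = rng.integers(0, 2, 2000)
print(chow_liu(np.column_stack([x0, x1, x2, x3])))  # expect (0,1) and (1,2) chosen first
```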
4.3 Derivation of the Chow-Liu Algorithm

From Exercise 4, we know that if $p(x) \in \mathcal{T}_N$, it can be expressed as
$$p(x) = \prod_{i \in V} p(x_i) \prod_{(i,j) \in E} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}. \quad (38)$$
Also note that the minimization over p in (36) is equivalent to the maximization
$$\hat{p} = \operatorname*{arg\,max}_{p \in \mathcal{T}_N} \sum_{x \in \mathcal{X}^N} q(x) \log p(x). \quad (39)$$
This is due to the definition of the relative entropy in (35): the remaining term $\sum_x q(x) \log q(x)$ does not depend on p. Now, we substitute the tree factorization in (38) into the objective function in (39), giving
$$\sum_{x \in \mathcal{X}^N} q(x) \log p(x) = \sum_{x \in \mathcal{X}^N} q(x) \left[\sum_{i \in V} \log p(x_i) + \sum_{(i,j) \in E} \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\right] = \sum_{i \in V} \sum_{x_i \in \mathcal{X}} q(x_i) \log p(x_i) + \sum_{(i,j) \in E} \sum_{(x_i, x_j) \in \mathcal{X}^2} q(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}. \quad (40)$$
Now it can easily be seen that, for a given structure E, the p that maximizes the above expression is the one that matches the marginals, i.e., $p(x_i) = q(x_i)$ for all i ∈ V and $p(x_i, x_j) = q(x_i, x_j)$ for all (i, j) ∈ E. With this choice, the expression in (40) can be written in terms of information-theoretic quantities as
$$-\sum_{i \in V} H_q(X_i) + \sum_{(i,j) \in E} I_q(X_i; X_j),$$
where we used the definitions of entropy and mutual information. The first term above is constant over all possible tree structures. Thus, the tree (V, E) that maximizes the sum of mutual informations $\sum_{(i,j) \in E} I_q(X_i; X_j)$ is precisely the one that minimizes the Kullback-Leibler divergence in (36), which is the claim in (37).

5 Other Textbooks

This set of notes has hopefully given the reader an insight into the modeling power of graphical models. We largely followed the exposition in Chapter 8 of Bishop [Bis08], which is the most basic text on graphical modeling. The reader who is interested in learning more can consult the monograph by Wainwright and Jordan [WJ08] or the recently published book by Koller and Friedman [KF09]. These textbooks require significantly more mathematical maturity, but they expose the reader to the state of the art in research on graphical modeling.

References

[Bis08] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2008.
[BT02] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. Athena Scientific, 1st edition, 2002.
[CL68] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Inf. Th., 14(3):462–467, May 1968.
[CT06] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[GG84] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 5:721–741, Jun 1984.
[HC70] J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished, 1970.
[KF09] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning). The MIT Press, 2009.
[KFL01] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. on Inf. Th., 47(2):498–519, Feb 2001.
[Kru56] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), Feb 1956.
[MJW06] D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, pages 2031–2064, Jul 2006.
[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd edition, 1988.
[TATW11] V. Y. F. Tan, A. Anandkumar, L. Tong, and A. S. Willsky. A large-deviation analysis for the maximum-likelihood learning of Markov tree structures. IEEE Trans. on Inf. Th., 57(3):1714–1735, Mar 2011.
[TAW10] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. Learning Gaussian tree models: Analysis of error exponents and extremal structures. IEEE Trans. on Sig. Proc., 58(5):2701–2714, May 2010.
[WJ08] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference, volume 1 of Foundations and Trends in Machine Learning. Now Publishers Inc, 2008.