A Tutorial on Graphical Models

Vincent Y. F. Tan ([email protected])

April 18, 2012

Abstract

This set of notes serves the objective of introducing the reader to the basics of probabilistic graphical models. We begin with a brief review of elementary probability theory. We then discuss the various classes of graphical models, namely Bayesian networks and Markov random fields. Following that, we detail the main inference algorithm in graphical modeling: the sum-product (or belief propagation) algorithm. Finally, we expose the reader to the simplest algorithm for approximating or learning the structure of such models from data: the Chow-Liu algorithm. The exposition in this set of notes is based largely on Bishop's excellent text on machine learning [Bis08].

1 Basic Probability Theory

Because all of modern statistical machine learning deals with uncertainty, it seems appropriate to start off by reminding the reader of the definitions of events, probabilities, joint probabilities and conditional probabilities. For more details, the reader is encouraged to consult Bertsekas and Tsitsiklis [BT02]. Let us motivate probability by considering an example from Bishop [Bis08].

Example 1. There is a red box and a blue box. In the red box there are a total of 8 fruits, 2 apples and 6 oranges. In the blue box there are a total of 4 fruits, 3 apples and 1 orange. The probability of selecting the red box is Pr(B = r) = 2/5 and the probability of selecting the blue box is Pr(B = b) = 3/5. Having selected a box, selecting any item within the box is equally likely. Some of the questions we would like to ask include: What is the probability that we select an orange? Given that we have selected an orange, what is the probability that we chose it from the blue box?

1.1 Joint and Conditional Probabilities

Now, let us consider a more general example involving two random variables X and Y. Suppose, as in the example above, the random variables are only permitted to take on finitely many values. So X can only take on values in the finite set $\mathcal{X} = \{x_1, \ldots, x_M\}$ and Y takes on values in $\mathcal{Y} = \{y_1, \ldots, y_L\}$. Consider N trials and let the number of trials for which $X = x_i$ and $Y = y_j$ be $n_{ij}$. Then, if N is large, we can assume that
$$\Pr(X = x_i, Y = y_j) = \frac{n_{ij}}{N}.$$
What is the probability that $X = x_i$? We simply sum up those $n_{ij}$'s for which the first index equals i. In other words,
$$\Pr(X = x_i) = \frac{c_i}{N}$$
where clearly $c_i := \sum_{j=1}^{L} n_{ij}$. Expressed slightly differently, we have
$$\Pr(X = x_i) = \sum_{j=1}^{L} \Pr(X = x_i, Y = y_j).$$
This is the important sum rule of probability; it will be used extensively in inference in graphical models, so the reader is urged to internalize it. Similarly, the marginal probability that $Y = y_j$ is
$$\Pr(Y = y_j) = \sum_{i=1}^{M} \Pr(X = x_i, Y = y_j).$$
Now, we introduce the important notion of conditional probabilities. Given that $X = x_i$, what is the probability that $Y = y_j$? Clearly,
$$\Pr(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i}.$$
But note also that
$$\Pr(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = \Pr(Y = y_j \mid X = x_i)\,\Pr(X = x_i).$$
We have derived the important product rule of probability. The rules of probability are summarized as follows:
$$\Pr(X = x) = \sum_{y \in \mathcal{Y}} \Pr(X = x, Y = y) \quad (1)$$
$$\Pr(X = x, Y = y) = \Pr(Y = y \mid X = x)\,\Pr(X = x) \quad (2)$$
A note about notation. We will usually write $p_X(x) := \Pr(X = x)$, or simply denote this value as p(x) when the random variable is clear from the context. The function p(x) is known as the probability mass function or pmf. Similarly, the joint pmf of random variables X and Y is denoted as p(x, y).
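The sum and product rules can be made concrete in a few lines of code. The following is a minimal sketch, assuming an arbitrary table of counts $n_{ij}$ (the numbers are illustrative, not taken from the text); the joint pmf, the marginal and the conditional are exactly the ratios derived above.

```python
import numpy as np

# A minimal sketch of the sum and product rules on a table of counts n_ij.
n = np.array([[30, 10],
              [20, 40]], dtype=float)   # n[i, j] = # trials with X = x_i, Y = y_j
N = n.sum()

joint = n / N                            # Pr(X = x_i, Y = y_j) = n_ij / N
p_x = joint.sum(axis=1)                  # sum rule (1): Pr(X = x_i) = c_i / N
p_y_given_x = joint / p_x[:, None]       # Pr(Y = y_j | X = x_i) = n_ij / c_i

# Product rule (2): the joint must factor as Pr(Y | X) Pr(X).
assert np.allclose(joint, p_y_given_x * p_x[:, None])
```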
Finally, the conditional pmf will be denoted as p(y|x). By combining the sum rule in (1) and the product rule in (2), we can derive Bayes' rule:
$$p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y' \in \mathcal{Y}} p(x \mid y')\, p(y')}. \quad (3)$$
This is a central relationship in pattern recognition, machine learning and statistical physics. Note that we have "inverted the causal relationship" between X and Y: on the left, Y "depends on" X, while on the right, we have expressed the same relationship in terms of the causal dependence of X on Y. Bayes' theorem can be written alternatively as
$$p(x \mid y) \propto p(y \mid x)\, p(x),$$
where ∝ denotes equality up to a constant (not depending on x). If x designates an unknown variable, something we would like to infer, and p(x) is its prior probability, then p(x|y) denotes the posterior probability, the belief we have about x after we know that Y = y. In the parlance of statistical inference,

posterior ∝ likelihood × prior.

We now use this relation to solve:

Exercise 1. Let F be the random variable denoting the fruit chosen. Using the sum, product and Bayes' rules, verify that the answers to the questions in Example 1 are Pr(F = o) = 9/20 and Pr(B = b|F = o) = 1/3, respectively.

Note the following. Prior to having any additional information about which fruit we chose, the prior probability of choosing from the blue box is Pr(B = b) = 3/5. However, once we know the identity of the fruit we chose, say an orange, the posterior probability of having chosen from the blue box is Pr(B = b|F = o) = 1/3. This is the simplest non-trivial example of statistical inference. Intuitively, this is true because the blue box contains far fewer oranges, so knowing that we chose an orange biases our belief about the box we chose from.

[Figure 1: Two examples of graphs. On the left, we have an undirected graph, while on the right, we have a directed graph.]

1.2 Independence and Conditional Independence

What does it mean for two random variables X and Y to be independent? This can be expressed in a variety of ways. Two random variables X and Y are independent if their joint distribution $p_{X,Y}(x, y) := \Pr(X = x, Y = y)$ factorizes, i.e.,
$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y), \quad (4)$$
for every x ∈ X and y ∈ Y. Note from (4) that if X and Y are independent, then
$$p_{X|Y}(x \mid y) = p_X(x) \quad (5)$$
for every x ∈ X and y ∈ Y, provided that $p_Y(y) > 0$. Intuitively, (5) means that knowledge that Y = y tells you no additional information about X, hence the nomenclature independence. In the literature, the statement that X is independent of Y is denoted X ⊥⊥ Y.

Similarly, X and Y are conditionally independent given Z if the conditional distribution of X and Y given Z factorizes:
$$p_{X,Y|Z}(x, y \mid z) = p_{X|Z}(x \mid z)\, p_{Y|Z}(y \mid z), \quad (6)$$
for every x ∈ X, y ∈ Y and z ∈ Z. Equation (6) implies that X and Y are conditionally independent given Z if
$$p_{X|Y,Z}(x \mid y, z) = p_{X|Z}(x \mid z) \quad (7)$$
for every x ∈ X, y ∈ Y and z ∈ Z, provided $p_{Y|Z}(y \mid z) > 0$. Equation (7) says that, as far as X is concerned, knowing Y = y in addition to Z = z is just as good as knowing Z = z alone. If X is conditionally independent of Y given Z, we denote this statement as X ⊥⊥ Y | Z or, alluding to later parts of this tutorial, as the Markov chain X − Z − Y. So Z separates X and Y.

Exercise 2. If X − Z − Y is a Markov chain, is it true that for all subsets A ⊂ X, B ⊂ Y, C ⊂ Z, we have Pr(X ∈ A | Y ∈ B, Z ∈ C) = Pr(X ∈ A | Z ∈ C)? Prove or provide a counterexample.
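The numbers claimed in Exercise 1 can be checked directly. The following is a sketch assuming the box/fruit model of Example 1; rows index the box B ∈ {r, b} and columns index the fruit F ∈ {a, o}.

```python
import numpy as np

# A sketch of Exercise 1 for the box/fruit model of Example 1.
p_box = np.array([2/5, 3/5])                   # prior Pr(B = r), Pr(B = b)
p_fruit_given_box = np.array([[2/8, 6/8],      # Pr(F | B = r)
                              [3/4, 1/4]])     # Pr(F | B = b)

joint = p_box[:, None] * p_fruit_given_box     # product rule: Pr(B, F)
p_orange = joint[:, 1].sum()                   # sum rule: Pr(F = o) = 9/20
p_blue_given_orange = joint[1, 1] / p_orange   # Bayes: Pr(B = b | F = o) = 1/3

print(p_orange, p_blue_given_orange)           # 0.45 0.3333...
```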
2 Graphical Models

Graphical models provide a simple way to visualize the structure of a probabilistic model. For example, from the graph structure alone, we would like to easily read off the set of conditional independence relationships between a set of random variables. How do we formalize this notion, and how can we use the graph structure to perform inference? For example, our collection of random variables could be X1, ..., X9. Given that someone observed X1 = x1, what can we say about the posterior distribution of X9? Can we compute it fairly easily? These are questions we will attempt to answer in the next two sections. But first we introduce the notions of graphs, Bayesian networks, undirected graphical models (Markov random fields) and factor graphs.

2.1 Graphs

A graph G = (V, E) is a data structure consisting of a node (or vertex) set V and an edge set E. The nodes are connected by links (also called arcs or edges) in E.

Let us consider the examples in Fig. 1. Both graphs have 4 nodes, labelled 1, 2, 3 and 4; the vertex set is thus V = {1, 2, 3, 4}. For the undirected graph on the left, a cycle graph, the edge set is E = {(1, 2), (2, 3), (3, 4), (1, 4)}. Since the graph is undirected, there is no need to order the nodes within each edge in E; we could equally well have written E = {(2, 1), (3, 2), (4, 3), (4, 1)}. For the directed graph on the right, we need to pay more attention. The edge set is E = {(1, 2), (1, 4), (4, 3)}. The second coordinate denotes the node towards which the arrow points; thus (1, 2) is not the same as (2, 1).

2.2 Bayesian Networks

We now introduce perhaps the most important class of graphical models, known as Bayesian networks. This class of models is better suited (than Markov random fields) to expressing causal relationships between random variables. For more details, the reader may consult the excellent text by Turing Award winner J. Pearl [Pea88].

Consider N random variables X1, ..., XN, each taking values in a common finite alphabet X. By repeated application of the product rule, we have
$$\Pr(X_1 = x_1, \ldots, X_N = x_N) = p(x_1, \ldots, x_N) = p(x_N \mid x_1, \ldots, x_{N-1}) \cdots p(x_2 \mid x_1)\, p(x_1).$$
We can depict this relationship graphically as in the left of Fig. 2. In this example, X1 and X2 are known as the parents of X3, and X1 is the single parent of X2. Conversely, X3 is a child of X1 and X2, and so on. This is an example of a Bayesian network. For Bayesian networks, the absence of links conveys information about conditional independence relationships. For the more complicated example shown on the right of Fig. 2, we have the factorization
$$p(x_1, x_2, x_3, x_4, x_5) = p(x_1)\, p(x_2)\, p(x_3)\, p(x_4 \mid x_1, x_2, x_3)\, p(x_5 \mid x_1, x_3). \quad (8)$$

[Figure 2: Left: a simple Bayesian network reflecting the factorization p(x1, x2, x3) = p(x3 | x1, x2) p(x2 | x1) p(x1). Right: a more complicated example reflecting the factorization in (8).]

At this point, we introduce the following convenient notation. Given a subset of nodes U ⊂ V, $X_U = \{X_j : j \in U\}$ is the collection of random variables indexed by U. Similarly, $x_U = \{x_j : j \in U\}$ is a particular realization of $X_U$. Note that $X = X_V = (X_1, \ldots, X_N)$. Using this notation, we can write the factorization of the joint distribution of X1, ..., XN as
$$\Pr(X = x) = p(x) = p(x_1, \ldots, x_N) = \prod_{k=1}^{N} p(x_k \mid x_{\mathrm{pa}_k}) \quad (9)$$
where $\mathrm{pa}_k$ denotes the set of parents of node k. For example, in the right plot of Fig. 2, the set of parents of node 4 is {1, 2, 3}, so the corresponding factor in (9) is p(x4 | x1, x2, x3).
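The factorization (9) translates directly into code: a Bayesian network is just a set of parent lists together with a conditional probability table (CPT) per node. Below is a hedged sketch for the network on the right of Fig. 2; only the parent structure comes from the figure, and the CPT entries are random placeholders over binary variables.

```python
import numpy as np
from itertools import product

# A sketch of evaluating (9) for the network on the right of Fig. 2.
rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """A random CPT of shape (2,)*n_parents + (2,), normalized over the last axis."""
    t = rng.random((2,) * n_parents + (2,))
    return t / t.sum(axis=-1, keepdims=True)

parents = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3]}   # pa_k from Fig. 2 (right)
cpt = {k: random_cpt(len(pa)) for k, pa in parents.items()}

def joint(x):
    """Evaluate p(x) = prod_k p(x_k | x_pa_k) at x = (x1, ..., x5)."""
    p = 1.0
    for k, pa in parents.items():
        idx = tuple(x[j - 1] for j in pa) + (x[k - 1],)
        p *= cpt[k][idx]
    return p

# Sanity check: the factorized joint sums to one over all 2^5 configurations.
assert abs(sum(joint(x) for x in product([0, 1], repeat=5)) - 1.0) < 1e-10
```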
2.2.1 Reduction in the Number of Parameters

What is the advantage of the factorization of the joint distribution in (9)? The primary advantage is the reduction in the number of parameters needed to describe a potentially very complex joint distribution. Consider the case of just two random variables X1 and X2 taking values in the common alphabet X := {0, 1}. The marginal distributions of X1 and X2 can be expressed as
$$\Pr(X_1 = x_1) = \mu_1^{x_1} (1 - \mu_1)^{1 - x_1}, \qquad \Pr(X_2 = x_2) = \mu_2^{x_2} (1 - \mu_2)^{1 - x_2}, \quad (10)$$
where $\mu_j = \Pr(X_j = 1)$ for j = 1, 2. Now, if we consider X1 − X2 (i.e., the joint distribution does not factorize as p(x1)p(x2)), then
$$\Pr(X_1 = x_1, X_2 = x_2) = \prod_{k=0}^{1} \prod_{l=0}^{1} \mu_{kl}^{\,x_{1k} x_{2l}}, \quad (11)$$
where $x_{1k} := 1\{x_1 = k\}$, $x_{2l} := 1\{x_2 = l\}$ and $\mu_{kl} = \Pr(X_1 = k, X_2 = l)$. Because $\sum_{k,l} \mu_{kl} = 1$, we require 3 parameters to describe the joint distribution in (11). Another way to see this is to write p(x1, x2) as p(x1) p(x2 | x1): the marginal distribution is governed by a single parameter $\mu_1 = \Pr(X_1 = 1)$, and the conditional distribution by the two parameters p(x2 = 0 | x1 = 0) and p(x2 = 0 | x1 = 1), making a total of 3 parameters. In contrast, if we describe X1 and X2 separately as in (10), we only require 2 parameters.

More generally, for N binary variables, each represented by a node, if the graph is fully connected (no factorization properties implied), then in general we need $2^N - 1$ parameters to describe the joint distribution. This is exponential in N, making computations virtually infeasible if N is large. However, if the model factorizes as in the chain in Fig. 3, then the reader is invited to check that only 2N − 1 parameters are required to describe p(x): one for p(x1) and two for each conditional p(x_k | x_{k−1}). The number of parameters is now linear in N and, as we will see, inference on a chain is exceedingly easy!

[Figure 3: The Markov chain reflecting the factorization p(x) = p(x1) p(x2 | x1) ... p(xN | xN−1).]
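The two parameter counts just derived are easy to tabulate; the tiny sketch below simply evaluates them side by side for a few values of N.

```python
# Parameter counts from Section 2.2.1 for N binary variables.
def n_params_full(N: int) -> int:
    return 2 ** N - 1          # one probability per joint state, minus normalization

def n_params_chain(N: int) -> int:
    return 1 + 2 * (N - 1)     # p(x1): 1 parameter; each p(x_k | x_{k-1}): 2

for N in (2, 3, 10, 30):
    print(N, n_params_full(N), n_params_chain(N))
```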
2.2.2 Conditional Independence Revisited and Explaining Away

Recall that two random variables A and B are said to be conditionally independent given C if Pr(A = a, B = b | C = c) = Pr(A = a | C = c) Pr(B = b | C = c); in short, p(a, b | c) = p(a | c) p(b | c). We now consider three different Bayesian networks describing various factorizations of the joint distribution of A, B and C. See Fig. 4.

[Figure 4: Three different factorizations of the distribution p(a, b, c).]

Consider the factorization in Fig. 4(a). According to (9), the joint distribution factorizes as
$$p(a, b, c) = p(c)\, p(a \mid c)\, p(b \mid c).$$
Does this imply that A ⊥⊥ B? Consider
$$p(a, b) = \sum_c p(a, b, c) = \sum_c p(c)\, p(a \mid c)\, p(b \mid c) \neq p(a)\, p(b)$$
where the first equality follows from the sum rule. Clearly, we cannot claim that p(a, b) = p(a) p(b) in general; hence A is not independent of B. But if we condition on C, we have
$$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = p(a \mid c)\, p(b \mid c)$$
where the first equality follows from the product rule. Hence A ⊥⊥ B | C. The key intuition here is that when we condition on C, a tail-to-tail node, the path from A to B is blocked, rendering A and B conditionally independent.

Consider the factorization in Fig. 4(b). It is easy to check that A and B are dependent in general. However, we also have the relation A ⊥⊥ B | C. Here the intuition is that node C is head-to-tail with respect to the path from A to B: if C is observed, the path is blocked, rendering A and B conditionally independent.

The trickiest case is that in Fig. 4(c). From (9), the joint distribution factorizes as
$$p(a, b, c) = p(a)\, p(b)\, p(c \mid a, b).$$
Summing both sides over c yields
$$p(a, b) = \sum_c p(a)\, p(b)\, p(c \mid a, b) = p(a)\, p(b) \sum_c p(c \mid a, b) = p(a)\, p(b),$$
where the second equality follows from the fact that p(a) and p(b) do not depend on c, and the third from the fact that probabilities sum to unity. Thus we can conclude that A ⊥⊥ B. However, given C, we have
$$p(a, b \mid c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)\, p(b)\, p(c \mid a, b)}{p(c)} \neq p(a \mid c)\, p(b \mid c),$$
so A is not conditionally independent of B given C. The intuition here is that node C is a head-to-head node: when we condition on C, the path becomes unblocked, rendering A and B dependent; when C is not observed, the path is blocked, and A and B are independent.

[Figure 5: The graph used for Exercise 3. B and F are parents of G.]

Exercise 3. Consider the Bayesian network in Fig. 5, where B, F and G are binary random variables. The variable B represents whether the battery is charged (1 if charged, 0 otherwise), F represents the state of the fuel tank (1 if full, 0 if empty), and G represents the state of the electric gauge (1 if it reads well, 0 otherwise). The following are known:

Pr(B = 1) = 0.9, Pr(F = 1) = 0.9,
Pr(G = 1 | B = 1, F = 1) = 0.8, Pr(G = 1 | B = 1, F = 0) = 0.2,
Pr(G = 1 | B = 0, F = 1) = 0.2, Pr(G = 1 | B = 0, F = 0) = 0.1.

Suppose we observe that G = 0; compute the posterior probability that the fuel tank is empty, i.e., Pr(F = 0 | G = 0). Suppose that, in addition to observing G = 0, we also observe that B = 0; compute the posterior probability that the fuel tank is empty given that the battery is also flat, i.e., Pr(F = 0 | G = 0, B = 0).

If we have done the computations correctly, we find that the probability that the tank is empty has decreased from 0.257 to 0.111 as a result of learning that the battery is flat (B = 0), in addition to knowing that the gauge reads badly (G = 0). Observe that finding out that the battery is flat explains away the observation that the gauge reads badly. In other words, due to the structure in Fig. 5, F and B have become dependent as a result of observing G. This peculiar phenomenon is known as explaining away.
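The two posteriors in Exercise 3 follow from the sum, product and Bayes' rules alone, so they are easy to verify numerically. The following is a sketch using the CPTs given in the exercise; index 0 and 1 are the two states of each variable.

```python
import numpy as np

# A sketch of Exercise 3 with the tables given in the text.
p_B = np.array([0.1, 0.9])                 # Pr(B = 0), Pr(B = 1)
p_F = np.array([0.1, 0.9])                 # Pr(F = 0), Pr(F = 1)
p_G1 = np.array([[0.1, 0.2],               # Pr(G = 1 | B = b, F = f), rows index b
                 [0.2, 0.8]])

# Joint over (B, F, G) via the factorization p(b) p(f) p(g | b, f).
joint = np.empty((2, 2, 2))
for b in range(2):
    for f in range(2):
        joint[b, f, 1] = p_B[b] * p_F[f] * p_G1[b, f]
        joint[b, f, 0] = p_B[b] * p_F[f] * (1 - p_G1[b, f])

p_F0_given_G0 = joint[:, 0, 0].sum() / joint[:, :, 0].sum()    # ~0.257
p_F0_given_G0_B0 = joint[0, 0, 0] / joint[0, :, 0].sum()       # ~0.111
print(p_F0_given_G0, p_F0_given_G0_B0)
```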
2.2.3 d-separation

We would like to be able to tell, by a quick examination of the directed graph structure, whether a subset of variables is conditionally independent of another subset given a third. To do so, we need to introduce the notion of d-separation for general Bayesian networks, which generalizes the observations of the previous section. The rules are as follows. Let A, B, C ⊂ V be non-intersecting subsets of the vertex set V (their union need not be all of V). Consider all paths from any node in A to any node in B. Any such path is said to be blocked if it includes a node at which either

1. the arrows on the path meet tail-to-tail or head-to-tail, and the node belongs to C, or
2. the arrows meet head-to-head, and neither the node nor any of its descendants belongs to C.

If all paths are blocked, then A is said to be d-separated from B by C.

At this point we introduce a new convention: if a random variable is observed, we shade its node in the graph; if it is unobserved, the node remains uncolored. For example, in the left graph of Fig. 6, node C is observed and the rest are unobserved.

[Figure 6: Examples of d-separation. In the left graph node C is observed; in the right graph node F is observed.]

Consider the graphical model on the left of Fig. 6. Is it true that A is independent of B given C? The path from A to B passes through node E, a head-to-head node, and E is a parent of an observed node (node C). Thus, by the second criterion in the definition of blockedness above, the path from A to B is not blocked. Hence we cannot say that A is independent of B given C in general. For the graphical model on the right of Fig. 6, however, node F blocks the path from A to B, since it is a tail-to-tail node and F is observed. Thus A ⊥⊥ B | F.

2.2.4 Markov Blanket

As a final remark on the topic of Bayesian networks, it is insightful to ask for the minimal set of nodes that separates a node from the rest of the graph. More precisely, for a node v ∈ V, we would like to find the smallest set S such that $p(x_v \mid x_S) = p(x_v \mid x_{V \setminus v})$. In other words, conditioned on $X_S = x_S$, $X_v$ is independent of everything else in the graph. The set S is known as the Markov blanket of node v.

Consider a joint distribution with the factorization in (9), and take v = 1 for convenience:
$$p(x_1 \mid x_{V \setminus 1}) = \frac{p(x_1, \ldots, x_N)}{p(x_2, \ldots, x_N)} = \frac{\prod_{k=1}^{N} p(x_k \mid x_{\mathrm{pa}_k})}{\sum_{x_1} \prod_{k=1}^{N} p(x_k \mid x_{\mathrm{pa}_k})}.$$
Any factor that does not depend on x1 can be taken out of the sum in the denominator and cancels against the same term in the numerator. Which terms depend on x1? Certainly $p(x_1 \mid x_{\mathrm{pa}_1})$, and also every factor in which node 1 appears as a parent. In particular, if x1 and x2 have a common child, say $x_{\mathrm{ch}}$, then there is a factor $p(x_{\mathrm{ch}} \mid x_1, x_2)$ involving both. Thus the Markov blanket of node 1 is precisely the union of its parents, its children and its co-parents (nodes that share a common child with it):
$$\mathrm{MB}_1 = \mathrm{pa}_1 \cup \mathrm{children}(1) \cup \mathrm{coparents}(1).$$

[Figure 7: Example for the illustration of Markov blankets.]

Example 2. Consider the graph in Fig. 7. The Markov blanket of node C is MB_C = {A, E, D, B}. Note that D is a child of C and D has another parent, namely A. Thus A is a co-parent of C and hence A is included in the Markov blanket of C. What is the Markov blanket of node F?
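The set union just derived is straightforward to compute from parent lists. Below is a sketch; the example DAG is a placeholder, not the graph of Fig. 7.

```python
# Markov blanket from parent lists, per Section 2.2.4:
# MB(v) = parents(v) U children(v) U co-parents(v).
parents = {"A": [], "B": [], "E": [],
           "C": ["E"], "D": ["A", "C"], "G": ["B", "C"]}   # placeholder DAG

def markov_blanket(v, parents):
    children = [u for u, pa in parents.items() if v in pa]
    coparents = {w for u in children for w in parents[u] if w != v}
    return set(parents[v]) | set(children) | coparents

print(markov_blanket("C", parents))   # {'A', 'B', 'D', 'E', 'G'} (in some order)
```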
2.3 Markov Random Fields or Undirected Graphical Models

In contrast to Bayesian networks, the graph that encodes the set of conditional independencies in a Markov random field is undirected: the links do not carry arrows. The conditional independence properties can be read off more easily than in Bayesian networks. Consider again the Markov chain A − C − B. As mentioned previously, for this Markov chain, A is conditionally independent of B given C. We can generalize this to arbitrary undirected graphs. Consider the graph shown in Fig. 8 and suppose that A = {a1, a2}, B = {b1, b2} and C = {c1, c2}. Because every path from any node in A to any node in B passes through at least one node in C, it is true that $X_A$ is conditionally independent of $X_B$ given $X_C$. In this case we say that C separates A and B.

[Figure 8: A Markov random field. Here (X_{a1}, X_{a2}) ⊥⊥ (X_{b1}, X_{b2}) | (X_{c1}, X_{c2}).]

Another way to say the same thing is the following. Consider node 1 for simplicity, and let ne(1) := {j : (1, j) ∈ E} be the set of neighbors of node 1, i.e., those nodes adjacent to node 1. Then it follows from the above observation that $X_1$ is conditionally independent of $X_{V \setminus (\{1\} \cup \mathrm{ne}(1))}$ given $X_{\mathrm{ne}(1)}$. Simply put, conditioned on its neighborhood, node 1 is independent of all other nodes in the graph. For example, in Fig. 8, conditioned on $X_{a_1}$, $X_{c_1}$ and $X_{c_2}$, the random variable $X_d$ is independent of $X_{a_2}$, $X_{b_1}$, $X_{b_2}$ and $X_e$. In other words, for undirected graphical models, the Markov blanket of any node is simply its neighborhood: the nodes it is adjacent to, which separate it from the rest of the graph.

2.3.1 Factorization Properties

We now describe how the joint distribution of an undirected graphical model factorizes. As a motivating example, consider a model in which $x_i$ and $x_j$ are not neighbors in the graph. Then
$$p(x_i, x_j \mid x_{V \setminus \{i,j\}}) = p(x_i \mid x_{V \setminus \{i,j\}})\, p(x_j \mid x_{V \setminus \{i,j\}}).$$
That is, $X_i$ and $X_j$ are conditionally independent given the rest of the graph. The factorization of the joint distribution should reflect this by never placing $x_i$ and $x_j$ in the same factor.

To proceed, we have to introduce some graph-theoretic notions. Given an undirected graph, a clique is a subset of nodes such that there is a link between every two nodes in the subset. A maximal clique is a clique such that the inclusion of any other node would render it no longer a clique. It is a fact that every clique is a subset of some maximal clique. Informally speaking, maximal cliques are fully connected subsets of nodes that cannot be "enlarged" any further.

Example 3. For the undirected graph in Fig. 9, {1, 2} is a clique but not a maximal clique, since the addition of node 3 still results in a clique. The maximal cliques of this graph are C1 = {1, 2, 3} and C2 = {1, 3, 4}.

[Figure 9: Cliques and maximal cliques.]

For an undirected graph G = (V, E), let C be the set of maximal cliques of G. The joint distribution of a Markov random field can be written as a product of potential functions $\psi_C : \mathcal{X}^{|C|} \to [0, \infty)$ defined on the maximal cliques of the graph, i.e.,
$$p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C). \quad (12)$$
If p(x) takes the form in (12), it is known as a Gibbs random field. The constant Z is known as the partition function, an object of central importance in statistical physics. It is defined so that the pmf p(x) sums to unity:
$$Z = \sum_{x \in \mathcal{X}^{|V|}} \prod_{C \in \mathcal{C}} \psi_C(x_C).$$
Note that the domain of the potential function $\psi_C(x_C)$ is simply the Cartesian product of the domains of the random variables in the maximal clique C. Note also that the potential functions are not restricted to be strictly positive; there can be states $x_C$ for which $\psi_C(x_C) = 0$.

The evaluation of the partition function is generally intractable: we need to sum over $|\mathcal{X}|^{|V|}$ states, which is exponential in |V|! Researchers have devoted lifetimes to deriving good bounds on the partition function for graphical models with special structure. The logarithm of the partition function is also intimately related to the cumulant generating function.

A natural question is: what is the relation between the factorization in (12) and the set of conditional independence statements we can make? We have the following fundamental result in the theory of graphical models.

Theorem 1 (Hammersley-Clifford [HC70]). Suppose that $\psi_C(x_C) > 0$ for all $x_C \in \mathcal{X}^{|C|}$ and all C ∈ C. Let UI be the set of distributions consistent with the set of conditional independence statements implied by an undirected graph G. Let UF be the set of distributions expressible via the factorization in (12), where C is the set of maximal cliques of G. Then UI = UF.
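The factorization (12) and the exponential cost of the partition function can both be seen in a brute-force sketch. The following assumes the graph of Fig. 9 (maximal cliques {1, 2, 3} and {1, 3, 4}) with binary variables; the potential tables are arbitrary positive numbers, not taken from the text.

```python
import numpy as np
from itertools import product

# A brute-force sketch of the Gibbs factorization (12) for the graph of Fig. 9.
rng = np.random.default_rng(1)
psi_123 = rng.random((2, 2, 2)) + 0.1     # psi_{C1}(x1, x2, x3) > 0
psi_134 = rng.random((2, 2, 2)) + 0.1     # psi_{C2}(x1, x3, x4) > 0

def unnormalized(x1, x2, x3, x4):
    return psi_123[x1, x2, x3] * psi_134[x1, x3, x4]

# Z sums the product of clique potentials over all |X|^|V| = 16 states;
# this enumeration is exactly the exponential cost the text warns about.
Z = sum(unnormalized(*x) for x in product([0, 1], repeat=4))
p = lambda *x: unnormalized(*x) / Z
assert abs(sum(p(*x) for x in product([0, 1], repeat=4)) - 1.0) < 1e-12
```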
Since the Hammersley-Clifford theorem requires the positivity of all the potential functions, it is convenient to parametrize them as $\psi_C(x_C) = \exp[-E_C(x_C)]$, where the $E_C : \mathcal{X}^{|C|} \to \mathbb{R}$ are called energy functions. Parametrized in this fashion, the probability distribution in (12) can be expressed as
$$p(x) = \frac{1}{Z} \exp\left[-\sum_{C \in \mathcal{C}} E_C(x_C)\right]$$
and is called a Boltzmann distribution. The negative log-likelihood is the sum of the energies associated with each maximal clique, up to a constant:
$$-\log p(x) = \sum_{C \in \mathcal{C}} E_C(x_C) + \log Z.$$
Hence the intuition is that if the energy of the system is low (close to equilibrium, or the ground state), the probability is high, and vice versa.

[Figure 10: Image processing example.]

Example 4. Markov random fields are used extensively in image processing; see the canonical model in Fig. 10. One way to model a black-and-white image is as follows. Each pixel $x_i$ can take on only one of the two values in X = {−1, +1}. These values are corrupted by noise, and what we observe are the $y_i$'s, given by
$$y_i = \begin{cases} x_i & \text{with probability } q, \\ -x_i & \text{with probability } 1 - q. \end{cases}$$
We observe the $y_i$'s (which is why they are shaded in Fig. 10). Let $x_i$ and $x_j$ be neighboring (unobserved) pixels. Clearly, the maximal cliques are of the form $\{x_i, x_j\}$ and $\{x_i, y_i\}$. Let us specify plausible energy functions compatible with these maximal cliques. We want neighbouring pixels to be similar with high probability, so we set $E(x_i, x_j) = -\beta x_i x_j$ for some β > 0. We want the observation $y_i$ to be correlated with the underlying pixel $x_i$, so we set $E(x_i, y_i) = -\gamma x_i y_i$ for some γ > 0. We may also believe that the image consists of mostly −1's, so we set $E(x_i) = h x_i$ for some h > 0. In sum, the energy function to be minimized over x can be expressed as
$$E(x) = h \sum_{i \in V_{\mathrm{grid}}} x_i \;-\; \beta \sum_{(i,j) \in E_{\mathrm{grid}}} x_i x_j \;-\; \gamma \sum_{i \in V_{\mathrm{grid}}} x_i y_i,$$
where $V_{\mathrm{grid}}$ and $E_{\mathrm{grid}}$ denote the vertex set and the edge set of the grid, respectively. Minimizing this energy functional can be done using standard techniques such as Gibbs sampling [GG84]; this is outside the scope of the tutorial.
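The energy in Example 4 is simple to evaluate on a grid. Below is a sketch assuming 4-nearest-neighbour edges; the values of h, β and γ are illustrative choices, not taken from the text.

```python
import numpy as np

# A sketch of the image-model energy of Example 4 on a small grid.
h, beta, gamma = 0.1, 1.0, 2.0

def energy(x, y):
    """E(x) = h*sum x_i - beta*sum_{(i,j)} x_i x_j - gamma*sum x_i y_i,
    for 2-D arrays x, y with entries in {-1, +1} and 4-neighbour edges."""
    pair = (x[:, :-1] * x[:, 1:]).sum() + (x[:-1, :] * x[1:, :]).sum()
    return h * x.sum() - beta * pair - gamma * (x * y).sum()

rng = np.random.default_rng(2)
x_true = np.ones((8, 8), dtype=int)
flip = rng.random((8, 8)) < 0.1               # noise: flip with probability 1 - q = 0.1
y = np.where(flip, -x_true, x_true)
print(energy(x_true, y), energy(-x_true, y))  # the clean image has lower energy
```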
2.3.2 Relation of Undirected Graphical Models to Directed Graphical Models

If we have a directed model as in Fig. 7, how do we come up with a Markov random field representation of it that is, in a sense, minimal? We perform the following two operations. First, for every node with two or more parents, create an undirected edge between every pair of parents not already connected. Second, turn each directed edge into an undirected one by removing the arrows. This process is called moralizing the graph, or marrying the parents. See Fig. 11 for an example of this procedure.

[Figure 11: Moralizing a directed graph to form a Markov random field. The parents have been married.]

We claim that the undirected graph created by the moralization process is a Markov random field for the same distribution. A proof sketch goes as follows. Consider the factorization of Bayesian networks in (9) and let v be any node. As we have previously mentioned, $x_v$ appears in the factor $p(x_v \mid x_{\mathrm{pa}_v})$ and in other factors involving its co-parents and children. The node v is already connected to its parents and its children in the directed graph (of course, we still need to remove the arrows). The only remaining case is when v has a child u that also has another parent w. For example, if $\mathrm{pa}_u = \{v, w\}$, then the factor $p(x_u \mid x_v, x_w)$ has to be taken into account to "match" the two factorizations in (9) and (12): this factor may not simplify, so we must include an edge between v and w, which may not already exist in the directed graph. This is precisely what moralization does.

Note that the resulting undirected graph may not express the same information (independences and conditional independences) as the directed one. Take the battery example in Fig. 5. If we moralize this directed graph, we get an undirected triangle: a three-node graph that is fully connected. From the undirected triangle, however, we can no longer infer that F and B are in fact independent.
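The moralization procedure just described is a two-line graph transformation. Below is a sketch operating on parent lists; the example DAG is a placeholder.

```python
from itertools import combinations

# A sketch of moralization: marry unconnected co-parents, then drop arrows.
def moralize(parents):
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # directed edge p -> child, made undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):      # marry every pair of parents
            edges.add(frozenset((p, q)))
    return edges

dag = {"C": ["A", "B"], "D": ["B"], "E": ["C", "D"]}   # placeholder DAG
print(sorted(tuple(sorted(e)) for e in moralize(dag)))
# [('A','B'), ('A','C'), ('B','C'), ('B','D'), ('C','D'), ('C','E'), ('D','E')]
```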
Exercise 4. Check that if p(x) is a joint distribution that can be represented as a tree-structured directed graphical model (Bayesian network) with no head-to-head nodes, then p(x) can be represented by an undirected tree-structured graph G = (V, E), and that
$$p(x) = \prod_{i \in V} p(x_i) \prod_{(i,j) \in E} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}.$$
Hint: this follows directly from the Bayesian network factorization in (9), noting that each node, except the root, has exactly one parent.

3 Inference in Graphical Models

In this section, we detail the sum-product algorithm, the most common algorithm used in inference tasks. Let us motivate the problem of inference using a simple example. We have an "unknown" random variable X, which is correlated with another random variable Y. They have joint distribution p(x, y) = p(x) p(y | x), drawn as the graphical model on the left of Fig. 12. Recall that p(x) is the prior distribution of X and p(y | x) is the likelihood model. We observe the value of Y, say y ∈ Y (reflected by the shading in the middle graph of Fig. 12), and would like to find the posterior distribution of X. To do so, we apply Bayes' rule to obtain
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}.$$
Hence, the joint distribution is now expressed in terms of p(y), the evidence, and p(x | y), the posterior distribution. This is reflected in the right graph of Fig. 12: essentially, the "causality" relationship between X and Y has been "inverted". The inference question is then: given some observed variables, what are the posterior distributions over the others? Can we compute these posterior distributions as efficiently as possible?

[Figure 12: Simple inference example.]

3.1 Inference on a Chain

Let us consider the Markov chain
$$X_1 \to X_2 \to X_3 \to \cdots \to X_{N-1} \to X_N.$$
The directed version above is equivalent to the undirected one,
$$X_1 - X_2 - X_3 - \cdots - X_{N-1} - X_N,$$
since there are no co-parents to marry: each child has exactly one parent. In the undirected form, the joint distribution takes the form
$$p(x_1, \ldots, x_N) = \frac{1}{Z}\, \psi_{1,2}(x_1, x_2)\, \psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N). \quad (13)$$
Consider the case where each random variable takes values in the same finite alphabet X = {1, ..., K}. Then the joint distribution requires roughly (N − 1)K² parameters to describe. As we have seen, this is a substantial reduction compared to not assuming Markovianity between the variables.

Let us consider the problem of finding p(x_n) for some 1 ≤ n ≤ N. Here, we assume that there are no observations (the Y variable of the simple example above is deterministic). Then, by the sum rule,
$$p(x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} \sum_{x_{n+1}} \cdots \sum_{x_N} p(x). \quad (14)$$
Performed naively, this computation requires summing over O(K^N) states, i.e., the computational complexity is exponential in the length of the chain. Surely there must be a more computationally efficient way to compute p(x_n) given the Markov structure! By appealing to the factorization in (13), we can rewrite (14) as follows:
$$p(x_n) = \frac{1}{Z} \left[\sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \left[\sum_{x_2} \psi_{2,3}(x_2, x_3) \left[\sum_{x_1} \psi_{1,2}(x_1, x_2)\right]\right]\right] \times \left[\sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \left[\sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N)\right]\right]. \quad (15)$$
It is easy to check that the total cost is O(N K²), which is linear in the length of the chain! Note that (15) can be written as
$$p(x_n) = \frac{1}{Z}\, \mu_\alpha(x_n)\, \mu_\beta(x_n) \quad (16)$$
where the messages are
$$\mu_\alpha(x_n) := \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \sum_{x_2} \psi_{2,3}(x_2, x_3) \sum_{x_1} \psi_{1,2}(x_1, x_2),$$
$$\mu_\beta(x_n) := \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N).$$
The function $\mu_\alpha(x_n)$ is known as the forward message, while $\mu_\beta(x_n)$ is known as the backward message. Note that the messages can be computed recursively:
$$\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\, \mu_\alpha(x_{n-1}), \qquad \mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\, \mu_\beta(x_{n+1}). \quad (17)$$
Furthermore, while the normalization constant (partition function) Z of the joint distribution in (13) is generally intractable to compute, in (16) we only need O(K) operations to compute Z, by summing $\mu_\alpha(x_n)\,\mu_\beta(x_n)$ over $x_n$, which is tractable. The recursions in (17) are known as the Chapman-Kolmogorov equations; these are encountered frequently in Markov chain theory.

It is easy to see that this idea extends to general tree-structured undirected graphical models, i.e., models with no loops. It can also be extended to compute pairwise marginals, say p(x1, x9). Lastly, note that to compute all marginals we do not have to repeat the whole process, since many of the forward and backward messages can be recycled.
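The recursions (17) and the marginal formula (16) fit in a short function. The following is a sketch on a chain of N binary nodes with random positive pairwise potentials (illustrative, not from the text), checked against brute-force marginalization of the O(K^N) joint.

```python
import numpy as np
from itertools import product

# Forward/backward recursions (17) on a chain; psi[k] is the potential for edge (k, k+1).
rng = np.random.default_rng(3)
N, K = 6, 2
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]

def marginal(n):
    """p(x_n) via (16): mu_alpha(x_n) * mu_beta(x_n) / Z, with 0-indexed n."""
    alpha = np.ones(K)
    for k in range(n):                       # forward pass, edges left of n
        alpha = psi[k].T @ alpha             # sum_{x_k} psi(x_k, x_{k+1}) mu_alpha(x_k)
    beta = np.ones(K)
    for k in range(N - 2, n - 1, -1):        # backward pass, edges right of n
        beta = psi[k] @ beta                 # sum_{x_{k+1}} psi(x_k, x_{k+1}) mu_beta(x_{k+1})
    m = alpha * beta
    return m / m.sum()                       # normalizing computes Z implicitly, O(K)

# Check against brute-force marginalization of the full joint.
joint = np.zeros((K,) * N)
for x in product(range(K), repeat=N):
    joint[x] = np.prod([psi[k][x[k], x[k + 1]] for k in range(N - 1)])
joint /= joint.sum()
for n in range(N):
    axes = tuple(i for i in range(N) if i != n)
    assert np.allclose(marginal(n), joint.sum(axis=axes))
```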
3.2 Factor Graphs

We take a short detour here and describe a very useful representation of a probability distribution, in addition to the two we have already encountered. Given an undirected graph, we can draw its corresponding factor graph, which is a more convenient form in which to present the sum-product algorithm below. Each maximal-clique potential $\psi_C(x_C)$ in the factorization $p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C)$ is assigned a factor $f_C(x_C)$, drawn as a square node in the graph. This gives us a bipartite graph, one in which there are two sets of nodes and edges only run from one set to the other, and the joint probability distribution can be expressed as
$$p(x) = \prod_{C \in \mathcal{C}} f_C(x_C). \quad (18)$$
The factor nodes (squares) are denoted using f's and the variable nodes (circles) by the original x's.

Example 5. Consider the fully connected undirected graph on the three nodes X1, X2 and X3. The joint distribution can be expressed as
$$p(x) = f_a(x_1, x_2)\, f_b(x_1, x_3)\, f_c(x_2, x_3).$$
The corresponding factor graph is sketched in Fig. 13 (left). If, for example, X2 − X1 − X3 forms a Markov chain in that order, then $f_c$ and its edges can be removed, and we have the factor graph in Fig. 13 (right).

[Figure 13: The factor graph of a fully connected distribution (left) and of the Markov chain X2 − X1 − X3 (right).]

Note that if the original undirected graph is a tree, the resulting bipartite factor graph has no loops, i.e., it is also a tree. Check!

3.3 The Sum-Product Algorithm

We come to perhaps the most important point of this set of lecture notes: the presentation of the sum-product algorithm [KFL01]. This algorithm is also known in the machine learning community as the belief propagation algorithm. Suppose, as in the inference-on-a-chain example, that Y = ∅, i.e., there are no observations; we discuss how to relax this assumption at the end of this section. Suppose we want to compute the marginal p(x) of a single variable x. We consider the factor graph of the joint distribution $p(\mathbf{x})$, the part of which containing x is shown in Fig. 14. Then, by the sum rule,
$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}). \quad (19)$$
Now, we can use the factor graph representation in (18) to rewrite the joint distribution as
$$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, \mathbf{x}_s) \quad (20)$$
where ne(x) is the set of neighbors of x (they are factor nodes) and $\mathbf{x}_s$ is the set of variables in the subtree connected to variable x via factor node s; see Fig. 14. We use the different notation $F_s$ to distinguish these from the factors in (18) and to emphasize that the factor graph is rooted at x, the first argument of the factor $F_s$. More precisely, $F_s(x, \mathbf{x}_s)$ is the product of all the factors in the group associated with factor $f_s$ in the subtree connected to variable x via factor node s; see (23) below.

[Figure 14: Illustration of (20). Note that $\mathbf{x}_s := \{x_1, \mathbf{x}_{s_1}, \ldots, x_M, \mathbf{x}_{s_M}\}$ (a recursive definition), ne(x) = {f_s, f_t, f_u} and ne(f_s) = {x, x1, ..., xM}.]

Substituting (20) into (19) yields
$$p(x) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, \mathbf{x}_s) = \prod_{s \in \mathrm{ne}(x)} \left[\sum_{\mathbf{x}_s} F_s(x, \mathbf{x}_s)\right] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x) \quad (21)$$
where the message from factor $f_s$ to node x is defined as
$$\mu_{f_s \to x}(x) := \sum_{\mathbf{x}_s} F_s(x, \mathbf{x}_s). \quad (22)$$
Now let
$$F_s(x, \mathbf{x}_s) := f_s(x, x_1, \ldots, x_M)\, G_1(x_1, \mathbf{x}_{s_1}) \cdots G_M(x_M, \mathbf{x}_{s_M}). \quad (23)$$
This is simply the product of the local factor $f_s(x, x_1, \ldots, x_M)$ and the factors associated with the subtrees rooted at the $x_m$, namely $G_m(x_m, \mathbf{x}_{s_m})$ for 1 ≤ m ≤ M. Note that the subtree rooted at x itself is excluded from the product in (23). Substituting (23) into (22) yields
$$\mu_{f_s \to x}(x) = \sum_{x_1, \ldots, x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \sum_{\mathbf{x}_{s_m}} G_m(x_m, \mathbf{x}_{s_m}) = \sum_{x_1, \ldots, x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m) \quad (24)$$
where the message from node $x_m$ to factor $f_s$ is defined as
$$\mu_{x_m \to f_s}(x_m) := \sum_{\mathbf{x}_{s_m}} G_m(x_m, \mathbf{x}_{s_m}). \quad (25)$$
We can set up a recursion by writing the $G_m$'s in terms of the $F_l$'s:
$$G_m(x_m, \mathbf{x}_{s_m}) = \prod_{l \in \mathrm{ne}(x_m) \setminus s} F_l(x_m, \mathbf{x}_{m_l}). \quad (26)$$
The $F_l(x_m, \mathbf{x}_{m_l})$'s again represent subtrees of the original tree graph; see Fig. 15. Now let us express the message sent from variable node $x_m$ to factor node $f_s$ more explicitly. Combining (25) and (26), we have
$$\mu_{x_m \to f_s}(x_m) = \sum_{\mathbf{x}_{s_m}} \prod_{l \in \mathrm{ne}(x_m) \setminus s} F_l(x_m, \mathbf{x}_{m_l}) = \prod_{l \in \mathrm{ne}(x_m) \setminus s} \left[\sum_{\mathbf{x}_{m_l}} F_l(x_m, \mathbf{x}_{m_l})\right] = \prod_{l \in \mathrm{ne}(x_m) \setminus s} \mu_{f_l \to x_m}(x_m). \quad (27)$$
See the message passing in Fig. 15.

[Figure 15: Illustration of (26) and (27).]

Putting (24) and (27) together, we see that the sum-product algorithm proceeds by recursively applying the following message-passing rules:
$$\mu_{x \to f_s}(x) = \prod_{l \in \mathrm{ne}(x) \setminus s} \mu_{f_l \to x}(x) \quad (28)$$
$$\mu_{f_s \to x}(x) = \sum_{x_1, \ldots, x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m). \quad (29)$$
The messages are passed from the leaves to the root and then back from the root to all the leaves. The marginal of x can then be obtained by taking the product of all the incoming factor messages at node x, as in (21), i.e.,
$$p(x) \propto \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x). \quad (30)$$
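Before walking through the worked example below, it may help to see (28) and (29) rendered directly in code. The following is a sketch for the small factor graph of Fig. 16 (the one used in the example that follows), assuming binary variables and random positive factor tables; the schedule is leaves-to-root-and-back with root x3, and the result is checked against brute-force marginalization.

```python
import numpy as np
from itertools import product

# Sum-product updates (28)-(29) on the factor graph of Fig. 16.
rng = np.random.default_rng(4)
fa, fb, fc = (rng.random((2, 2)) + 0.1 for _ in range(3))  # fa[x1,x2], fb[x2,x3], fc[x2,x4]

msg = {}
# Leaves to the root x3.
msg[("x1", "fa")] = np.ones(2)
msg[("fa", "x2")] = fa.T @ msg[("x1", "fa")]                # (29): sum_{x1} fa(x1,x2)
msg[("x4", "fc")] = np.ones(2)
msg[("fc", "x2")] = fc @ msg[("x4", "fc")]                  # (29): sum_{x4} fc(x2,x4)
msg[("x2", "fb")] = msg[("fa", "x2")] * msg[("fc", "x2")]   # (28)
msg[("fb", "x3")] = fb.T @ msg[("x2", "fb")]                # (29): sum_{x2} fb(x2,x3) mu
# Root back to the leaves.
msg[("x3", "fb")] = np.ones(2)
msg[("fb", "x2")] = fb @ msg[("x3", "fb")]
msg[("x2", "fa")] = msg[("fb", "x2")] * msg[("fc", "x2")]
msg[("fa", "x1")] = fa @ msg[("x2", "fa")]
msg[("x2", "fc")] = msg[("fa", "x2")] * msg[("fb", "x2")]
msg[("fc", "x4")] = fc.T @ msg[("x2", "fc")]

m2 = msg[("fa", "x2")] * msg[("fb", "x2")] * msg[("fc", "x2")]
p2 = m2 / m2.sum()                                          # (30), normalized

# Brute-force check of p(x2).
joint = np.zeros((2, 2, 2, 2))
for x1, x2, x3, x4 in product(range(2), repeat=4):
    joint[x1, x2, x3, x4] = fa[x1, x2] * fb[x2, x3] * fc[x2, x4]
joint /= joint.sum()
assert np.allclose(p2, joint.sum(axis=(0, 2, 3)))
```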
Phew... It seems apt to now provide an example to see how all these equations are used in practice. Consider the factor graph shown in Fig. 16, a 4-node graph whose joint distribution factorizes as
$$p(\mathbf{x}) = f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4).$$

[Figure 16: A simple factor graph used to illustrate the sum-product algorithm.]

In order to apply the sum-product algorithm, let us designate x3 as the root of the tree. Relative to node x3, the tree has two leaf nodes, x1 and x4. Starting with the leaf nodes, we have the sequence of messages
$$\mu_{x_1 \to f_a}(x_1) = 1,$$
$$\mu_{f_a \to x_2}(x_2) = \sum_{x_1} f_a(x_1, x_2),$$
$$\mu_{x_4 \to f_c}(x_4) = 1,$$
$$\mu_{f_c \to x_2}(x_2) = \sum_{x_4} f_c(x_2, x_4),$$
$$\mu_{x_2 \to f_b}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2),$$
$$\mu_{f_b \to x_3}(x_3) = \sum_{x_2} f_b(x_2, x_3)\, \mu_{x_2 \to f_b}(x_2).$$
Note that these equations are specializations of (28) and (29). Once these messages have been passed from the leaves to the root, we send messages from the root back to the leaves to complete the inference:
$$\mu_{x_3 \to f_b}(x_3) = 1,$$
$$\mu_{f_b \to x_2}(x_2) = \sum_{x_3} f_b(x_2, x_3),$$
$$\mu_{x_2 \to f_a}(x_2) = \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2),$$
$$\mu_{f_a \to x_1}(x_1) = \sum_{x_2} f_a(x_1, x_2)\, \mu_{x_2 \to f_a}(x_2),$$
$$\mu_{x_2 \to f_c}(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2),$$
$$\mu_{f_c \to x_4}(x_4) = \sum_{x_2} f_c(x_2, x_4)\, \mu_{x_2 \to f_c}(x_2).$$
Let us verify that the marginal p(x2) is evaluated correctly using the above equations:
$$p(x_2) = \mu_{f_a \to x_2}(x_2)\, \mu_{f_b \to x_2}(x_2)\, \mu_{f_c \to x_2}(x_2) = \left[\sum_{x_1} f_a(x_1, x_2)\right] \left[\sum_{x_3} f_b(x_2, x_3)\right] \left[\sum_{x_4} f_c(x_2, x_4)\right] = \sum_{x_1, x_3, x_4} f_a(x_1, x_2)\, f_b(x_2, x_3)\, f_c(x_2, x_4) = \sum_{x_1, x_3, x_4} p(\mathbf{x}),$$
which is the sum rule, as required.

Exercise 5. Check that p(x1), p(x3) and p(x4) can also be computed correctly using the above procedure.

Exercise 6. If the undirected graph is a tree, what is the complexity of belief propagation in terms of the number of states |X| of each random variable and the number of nodes N?

It is a well-known and easily verifiable fact that if the original graph is a tree (as in the example above), the sum-product algorithm converges and gives the correct marginals. If the original graph is not a tree and we run the sum-product algorithm disregarding the presence of loops (a procedure called loopy belief propagation), then it may not converge, and even if it does, it may not give the correct marginals. This is a research area of intense interest even as we speak. The jointly Gaussian case is very well studied and, in fact, it is easy to check whether belief propagation "works" [MJW06]. There is also a huge body of literature on approximate inference using, for example, variational methods [WJ08] or sampling methods [GG84].

If some variables are observed, say $v = \hat{v}$, belief propagation can be modified accordingly. Write $\mathbf{x} = (h, v)$, where h are the hidden variables. Multiplying the joint distribution by $\prod_j 1\{v_j = \hat{v}_j\}$, we consider
$$\tilde{p}(\mathbf{x}) = p(\mathbf{x}) \prod_j 1\{v_j = \hat{v}_j\}.$$
This is simply $p(h, v = \hat{v})$, an unnormalized version of $p(h \mid v = \hat{v})$. Using the sum-product algorithm detailed previously, we can then easily compute any marginal $p(h_i \mid v = \hat{v})$ up to a normalization constant.
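Conditioning by multiplying in indicators amounts, in code, to zeroing out the factor entries that disagree with the evidence. Below is a sketch that clamps x4 = 1 in the Fig. 16 model (the clamped value and factor tables are illustrative).

```python
import numpy as np

# Clamping evidence: multiply the joint by 1{x4 = 1} by zeroing entries of fc.
rng = np.random.default_rng(5)
fa, fb, fc = (rng.random((2, 2)) + 0.1 for _ in range(3))

fc_clamped = fc.copy()
fc_clamped[:, 0] = 0.0        # kill x4 = 0; equivalent to multiplying by 1{x4 = 1}

# Running the same sum-product sweeps with fc_clamped yields the conditional
# marginals p(x_i | x4 = 1) after normalization; for x2, directly:
m = (fa.T @ np.ones(2)) * (fb @ np.ones(2)) * fc_clamped[:, 1]
print(m / m.sum())            # p(x2 | x4 = 1)
```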
3.4 Variants of the Sum-Product Algorithm

Observe that the only property used in the above simplification of the computation of p(x) is the distributivity of multiplication over addition. More precisely, we used the fact that for three real numbers a, b, c,
$$a \cdot (b + c) = a \cdot b + a \cdot c.$$
It turns out that any pair of operations satisfying this distributivity property admits an algorithm analogous to the sum-product algorithm. For example, the max-product (or min-sum) algorithm solves the related problem of maximization, or most probable explanation. Instead of computing a marginal, the goal here is to find the configuration x that maximizes the global function (i.e., the most probable values in a probabilistic setting), defined via the arg max:
$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \log p(\mathbf{x}).$$
An algorithm that solves this problem is nearly identical to belief propagation, with the sums in the definitions replaced by maximizations. This works because max{ab, ac} = a max{b, c} for a ≥ 0.

4 Learning of Graphical Models

In this section, we address another important question in the study of graphical models. Suppose we have a very complex joint distribution q(x) that we believe is "close" to a simpler distribution p(x), perhaps one that has few edges (i.e., a sparse graphical model). How do we approximate q(x) by such a p(x) in a principled manner? This brings us to the realm of learning graphical models. We can learn a simpler, sparser model either from a complex model or from data; the latter is more relevant in practice. We focus on the former, but toward the end we detail how to fit a simple tree-structured model to data using the Chow-Liu algorithm [CL68].

4.1 A Crash Course in Information Theory

It was alluded to previously that we would like to learn a simple model that is, in some sense, close to a complex model. To measure closeness of distributions, we need some tools from information theory [CT06]. The exposition here cannot do justice to the beauty of information theory; for a flavor of the results, the reader is strongly encouraged to skim through Cover and Thomas [CT06].

For a random variable X taking values in a finite set X with distribution p(x), the entropy is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \quad (31)$$
where the log is to the base 2, so the unit of entropy is bits. The entropy measures the amount of uncertainty in X. For two random variables X and Y with joint distribution $p_{X,Y}$, the joint entropy is
$$H(X, Y) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log p_{X,Y}(x, y). \quad (32)$$
The conditional entropy of X given Y is the remaining uncertainty in X once Y is known. It is defined as
$$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log p_{X|Y}(x \mid y). \quad (33)$$
The mutual information between X and Y with joint distribution $p_{X,Y}$ is denoted as
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)}. \quad (34)$$
The mutual information measures the reduction in uncertainty about X once Y is known. When we want to make the dependence of these information-theoretic quantities on the underlying distribution explicit, we append a subscript to the notation. For example, $I_{p_{X,Y}}(X; Y)$ denotes the mutual information in (34) computed from the joint distribution $p_{X,Y}$.

Lemma 2. The entropies defined in (31)–(33) are non-negative.

Lemma 3. The mutual information defined in (34) is non-negative.

Exercise 7. Determine when the entropies in (31)–(33) are equal to zero. Determine when the mutual information in (34) equals zero.

Given two distributions p and q with common support X, the Kullback-Leibler divergence or relative entropy is defined as
$$D(p \,\|\, q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \quad (35)$$
The relative entropy is a fundamental measure of the distance between the two distributions p and q. In fact, we have:

Lemma 4. The relative entropy in (35) is non-negative. It is zero if and only if p(x) = q(x) for all x ∈ X.

It is not too hard to see that the mutual information in (34) is equal to the relative entropy between the joint distribution $p_{X,Y}$ and the product of the marginals $p_X \circ p_Y$, i.e., $I(X; Y) = D(p_{X,Y} \,\|\, p_X \circ p_Y)$. This shows that mutual information measures the degree of dependence between the random variables X and Y: if they are independent, I(X; Y) = 0; otherwise, I(X; Y) is strictly positive.
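The quantities (31)–(35) reduce to a few array operations for discrete pmfs. The following sketch uses base-2 logs (bits) and treats 0 log 0 as 0; the joint table is illustrative.

```python
import numpy as np

# Entropy, conditional entropy, mutual information and KL for a discrete joint.
def xlogy(p, q):
    """Elementwise p * log2(q), with the convention 0 * log2(0) = 0."""
    out = np.zeros_like(p, dtype=float)
    nz = p > 0
    out[nz] = p[nz] * np.log2(q[nz])
    return out

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

H_X  = -xlogy(p_x, p_x).sum()                       # (31)
H_Y  = -xlogy(p_y, p_y).sum()
H_XY = -xlogy(p_xy, p_xy).sum()                     # (32)
H_X_given_Y = H_XY - H_Y                            # (33)
I_XY = H_X - H_X_given_Y                            # (34)

kl = xlogy(p_xy, p_xy / np.outer(p_x, p_y)).sum()   # D(p_XY || p_X o p_Y), cf. (35)
assert np.isclose(I_XY, kl)                          # (34) as a KL divergence
```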
4.2 The Chow-Liu Algorithm

Suppose we have a very complicated multivariate undirected graphical model q(x), perhaps fully connected, where x = (x1, ..., xN). For the sake of tractability, we would like to approximate it with a simpler undirected graphical model. As we have seen, the family of tree-structured distributions admits tractable exact inference. So it is natural to approximate q(x) with a tree-structured distribution by minimizing the relative entropy between q(x) and the approximating distribution:
$$\hat{p} = \operatorname*{arg\,min}_{p \in \mathcal{T}_N} D(q \,\|\, p), \quad (36)$$
where $\mathcal{T}_N$ is the set of all tree-structured undirected graphical models on N nodes. See Fig. 17.

[Figure 17: Approximating a complex distribution, say a fully connected one (left), with a simpler tree-structured one (right).]

The problem in (36) is potentially very computationally intensive, since the number of undirected trees on N nodes is $N^{N-2}$. This number is super-exponential in the number of nodes N, so a naïve search for the tree-structured distribution that minimizes the relative entropy in (36) is clearly infeasible.

Enter Chow and Liu [CL68] in 1968. Through some fairly rudimentary manipulations of information-theoretic quantities (detailed in the next subsection), they showed that the tree structure of $\hat{p}$ in (36), denoted $T_{\hat{p}}$, is given by
$$T_{\hat{p}} = \operatorname*{arg\,max}_{T \in \mathcal{T}_N} \sum_{(i,j) \in T} I_q(X_i; X_j), \quad (37)$$
where $\mathcal{T}_N$ here denotes the set of all undirected trees on N nodes and $I_q(X_i; X_j)$ is the mutual information between $X_i$ and $X_j$ computed from the dense graphical model q. The optimization problem in (37) is a simple maximum-weight spanning tree problem, which can be solved efficiently using, for example, Kruskal's algorithm [Kru56]. So instead of having to search through all $N^{N-2}$ tree structures in (36), we can simply solve a maximum-weight spanning tree problem to find the optimal tree structure in time polynomial in N.

What if we instead have data $\mathcal{D} := \{x^1, \ldots, x^n\}$, drawn independently and identically from some distribution q(x), and we would like to learn (an approximate version of) q(x)? Suppose the random variables take values in some finite alphabet X. Then a natural first step is to compute the empirical distribution (or the type) of the data,
$$\hat{q}(x; \mathcal{D}) = \frac{1}{n} \sum_{k=1}^{n} 1\{x = x^k\},$$
i.e., the normalized frequencies of the occurrences of x in the data. Second, we perform the optimization in (37) with $\hat{q}$ in place of q to obtain a sparser, tree-structured distribution that approximates the generating distribution q(x). It turns out that computing the empirical distribution and then running the maximum-weight spanning tree procedure yields the maximum-likelihood tree distribution, but we will not derive this here. Recent work by the author in his Ph.D. thesis answered the question of how many samples n one needs to obtain a "good" fit of the model to the data [TATW11]. The Chow-Liu algorithm can also be applied to continuous data, and its performance can be analyzed as in [TAW10].

We note that (37) only provides us with the approximating structure. To complete the process of finding an approximating distribution, we have to fit the parameters as well; that is, we must find $\hat{p}(x_i, x_j)$ for all (i, j) in the edge set. This is an easy process and can be done using maximum-likelihood procedures. See [TATW11] for more details.
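The data-driven version of the procedure fits in a single short function: estimate pairwise empirical distributions, compute empirical mutual informations, and run a maximum-weight spanning tree. Below is a sketch on binary data; the toy data-generating process at the bottom is an assumption for illustration, and the spanning tree is found with Kruskal's algorithm over edges sorted by decreasing weight (natural logs are used, since the base does not affect the arg max).

```python
import numpy as np
from itertools import combinations

def mutual_information(joint):
    """Empirical I(X_i; X_j) from a 2x2 joint table, in nats."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return (joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum()

def chow_liu(data):
    """Chow-Liu tree (37) for binary data of shape (n_samples, N)."""
    n, N = data.shape
    weights = {}
    for i, j in combinations(range(N), 2):
        joint = np.zeros((2, 2))
        for a in (0, 1):
            for b in (0, 1):
                joint[a, b] = np.mean((data[:, i] == a) & (data[:, j] == b))
        weights[(i, j)] = mutual_information(joint)

    parent = list(range(N))                      # union-find for Kruskal
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                             # adding (i, j) keeps the graph a tree
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy data: X0 drives X1, which drives X2; X3 is independent noise.
rng = np.random.default_rng(6)
x0 = rng.integers(0, 2, 2000)
x1 = np.where(rng.random(2000) < 0.9, x0, 1 - x0)
x2 = np.where(rng.random(2000) < 0.9, x1, 1 - x1)
x3 = rng.integers(0, 2, 2000)
print(chow_liu(np.column_stack([x0, x1, x2, x3])))  # expect (0,1) and (1,2) chosen first
```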
4.3 Derivation of the Chow-Liu Algorithm

From Exercise 4, we know that if $p(x) \in \mathcal{T}_N$, it can be expressed as
$$p(x) = \prod_{i \in V} p(x_i) \prod_{(i,j) \in E} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}. \quad (38)$$
Also note that the minimization over p in (36) is equivalent to the maximization
$$\hat{p} = \operatorname*{arg\,max}_{p \in \mathcal{T}_N} \sum_{x \in \mathcal{X}^N} q(x) \log p(x). \quad (39)$$
This is due to the definition of the relative entropy in (35): the remaining term $\sum_x q(x) \log q(x)$ does not depend on p. Now, we substitute the tree factorization in (38) into the objective function in (39), giving
$$\sum_{x \in \mathcal{X}^N} q(x) \log p(x) = \sum_{x \in \mathcal{X}^N} q(x) \left[\sum_{i \in V} \log p(x_i) + \sum_{(i,j) \in E} \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}\right] = \sum_{i \in V} \sum_{x_i \in \mathcal{X}} q(x_i) \log p(x_i) + \sum_{(i,j) \in E} \sum_{(x_i, x_j) \in \mathcal{X}^2} q(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)}. \quad (40)$$
Now it can easily be seen that, for a given structure E, the p that maximizes the above expression is the one that matches the marginals, i.e., $p(x_i) = q(x_i)$ for all i ∈ V and $p(x_i, x_j) = q(x_i, x_j)$ for all (i, j) ∈ E. With this choice, the expression in (40) can be written in terms of information-theoretic quantities as
$$-\sum_{i \in V} H_q(X_i) + \sum_{(i,j) \in E} I_q(X_i; X_j),$$
where we used the definitions of entropy and mutual information. The first term above is constant over all possible tree structures. Thus, the tree (V, E) that maximizes the sum of mutual informations $\sum_{(i,j) \in E} I_q(X_i; X_j)$ is precisely the one that minimizes the Kullback-Leibler divergence in (36), which is the claim in (37).

5 Other Textbooks

This set of notes has hopefully given the reader an insight into the modeling power of graphical models. We largely followed the exposition in Chapter 8 of Bishop [Bis08], which is the most basic text on graphical modeling. The reader who is interested in learning more can consult the monograph by Wainwright and Jordan [WJ08] or the recently published book by Koller and Friedman [KF09]. These textbooks require significantly more mathematical maturity, but they expose the reader to the state of the art in research on graphical modeling.

References

[Bis08] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2008.
[BT02] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. Athena Scientific, 1st edition, 2002.
[CL68] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Inf. Th., 14(3):462–467, May 1968.
[CT06] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[GG84] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 5:721–741, Jun 1984.
[HC70] J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. Unpublished, 1970.
[KF09] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques (Adaptive Computation and Machine Learning). The MIT Press, 2009.
[KFL01] F. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Trans. on Inf. Th., 47(2):498–519, Feb 2001.
[Kru56] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), Feb 1956.
[MJW06] D. M. Malioutov, J. K. Johnson, and A. S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, pages 2031–2064, Jul 2006.
[Pea88] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 2nd edition, 1988.
[TATW11] V. Y. F. Tan, A. Anandkumar, L. Tong, and A. S. Willsky. A large-deviation analysis for the maximum-likelihood learning of Markov tree structures. IEEE Trans. on Inf. Th., 57(3):1714–1735, Mar 2011.
[TAW10] V. Y. F. Tan, A. Anandkumar, and A. S. Willsky. Learning Gaussian tree models: Analysis of error exponents and extremal structures. IEEE Trans. on Sig. Proc., 58(5):2701–2714, May 2010.
[WJ08] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference, volume 1 of Foundations and Trends in Machine Learning. Now Publishers Inc, 2008.