Exact Inference on Graphical Models
Samson Cheung
Outline
- What is inference?
- Overview
- Preliminaries
- Three general algorithms for inference
  - Elimination Algorithm
  - Belief Propagation
  - Junction Tree
What is inference?
Given a fully specified joint distribution (the "database"), inference is querying information about some random variables, given knowledge about other random variables: we have evidence x_E and ask about the query nodes X_F.
- Conditional/Marginal probability: compute the conditional of X_F given the evidence x_E. Example: visual tracking — you want to compute the conditional to quantify the uncertainty in your tracking.
- Maximum a posteriori (MAP) estimate: find the most likely value of X_F given the evidence x_E. Example: error control — we care about the decoded symbol; computing the error probability directly is difficult in practice due to high bandwidth.
Inference is not easy
Consider a grid model over pixels with pairwise potential ψ(p,q) = exp(−|p−q|) and some evidence nodes. The marginal of a pair of pixels is
  P(p,q) = Σ_{G\{p,q}} p(G),
a sum over the states of every other variable in the graph. Computing marginals or MAP requires global communication!
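To make the cost concrete, here is a minimal brute-force sketch (a hypothetical toy chain with the slide's potential, not a model from the slides): every term in the marginal touches every variable in the graph, and the loop runs over N^K joint configurations.

```python
import itertools
import numpy as np

# Toy chain MRF with K variables, N symbols each, and pairwise
# potential psi(p, q) = exp(-|p - q|) as on the slide. (Hypothetical.)
K, N = 10, 2

def psi(p, q):
    return np.exp(-abs(p - q))

def joint(x):
    # Unnormalized joint: product of pairwise potentials along the chain.
    return np.prod([psi(x[i], x[i + 1]) for i in range(K - 1)])

# Brute-force marginal P(x0, x1): sum the joint over ALL other variables.
marg = np.zeros((N, N))
for x in itertools.product(range(N), repeat=K):   # N**K terms!
    marg[x[0], x[1]] += joint(x)
marg /= marg.sum()
print(marg)   # every variable in the graph contributed to this sum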
Outline
- What is inference?
- Overview
- Preliminaries
- Three general algorithms for inference
  - Elimination Algorithm
  - Belief Propagation
  - Junction Tree
Inference Algorithms
- EXACT inference on a general graph is NP-hard. Exact algorithms:
  - ELIMINATION ALGORITHM — general graphs
  - BELIEF PROPAGATION — polytrees
  - JUNCTION TREE — general graphs
- APPROXIMATE inference algorithms:
  1. Iterative Conditional Modes
  2. EM
  3. Mean field
  4. Variational techniques
  5. Structural variational techniques
  6. Monte-Carlo
  7. Expectation Propagation
  8. Loopy belief propagation
- Typical problem sizes: expert systems, diagnostics, and simulation involve 10-100 nodes; image processing, vision, and physics involve >1000 nodes.
Outline
- What is inference?
- Overview
- Preliminaries
- Three general algorithms for inference
  - Elimination Algorithm
  - Belief Propagation
  - Junction Tree
Introducing evidence
- Inference means summing or maxing over "part" of the joint distribution.
- In order not to be sidetracked by the evidence nodes, we roll them into the joint by considering p(x)·E(x_E), where E(x_E) = δ(x_E = x̄_E) clamps the evidence nodes at their observed values.
- Hence we will always be summing or maxing over the entire joint distribution.
Calculating Marginal: Moralization
Example: a directed graph on X1,...,X6 with joint
  P(X1) P(X2|X1) P(X3|X1) P(X4|X1) P(X5|X2,X3) P(X6|X3,X4)
becomes, after moralization, an undirected graph with potentials
  ψ(X1,X2,X3) ψ(X1,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6).
- Every directed graph can be represented as an undirected one by linking up parents who have the same child.
- From here on we deal only with undirected graphs.
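A minimal sketch of moralization on adjacency sets (an assumed dict-of-parents representation, not code from the slides): connect co-parents, then drop edge directions.

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping node -> list of its parents (a DAG).
    Returns the undirected (moral) graph as a dict of neighbor sets."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        for p in ps:                      # drop edge directions
            adj[child].add(p); adj[p].add(child)
        for a, b in combinations(ps, 2):  # "marry" co-parents
            adj[a].add(b); adj[b].add(a)
    return adj

# The slide's example: X5 has parents X2, X3; X6 has parents X3, X4.
g = moralize({"X2": ["X1"], "X3": ["X1"], "X4": ["X1"],
              "X5": ["X2", "X3"], "X6": ["X3", "X4"]})
print(sorted(g["X2"]))   # ['X1', 'X3', 'X5'] -- the X2-X3 edge was added
```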
Adding edges is "okay"
Example: the parametrization ψ(X1,X2,X3) ψ(X1,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6) can be rewritten on the same graph with an extra edge (X2,X4) as
  ψ(X1,X2,X3,X4) ψ(X2,X3,X5) ψ(X3,X4,X6).
- The pdf of an undirected graph can ALWAYS be expressed by the same graph with extra edges added.
- A graph with more edges:
  - loses important conditional independence information (okay for inference, not good for parameter estimation);
  - uses more storage (why? a potential over a k-node clique is a table with N^k entries).
Undirected graph and Clique graph
Example: an undirected graph on X1,...,X9 with maximal cliques
  C1(X1,X2,X3), C2(X1,X3,X4), C3(X2,X3,X5), C4(X3,X4,X6), C5(X7,X8,X9), C6(X1,X7).
In the clique graph:
- Each node is a clique from the parametrization.
- There is an edge between two nodes (cliques) if the two cliques share common variables; the shared set is the separator, e.g. C1 ∩ C3 = {X2,X3}.
Outline
- What is inference?
- Overview
- Preliminaries
- Three general algorithms for inference
  - Elimination Algorithm
  - Belief Propagation
  - Junction Tree
Computing Marginal
Example: a five-node graph on X1,...,X5; we want the marginal of X1.
- Need to marginalize over x2, x3, x4, x5.
- Naively, we need to sum N^5 terms (N is the number of symbols for each r.v.).
- Can we do better?
Elimination (Marginalization) Order
Try to marginalize in this order: x5, x4, x3, x2. Each step sums one variable out of the product of the active potentials that mention it, producing an intermediate factor. The costliest steps touch three variables at once (compute C: O(N^3), storage S: O(N^2)); the remaining steps cost C: O(N^2), S: O(N).
Overall complexity: O(K·N^3), storage: O(N^2), where K = # of r.v.s — versus O(N^5) for the naive sum.
MAP is the same
- Just replace summation with max.
- Note:
  - All the m's are different from the marginal case.
  - Need to remember the best configuration as you go.
Graphical Interpretation
Example graph: X1—X2, X1—X3, X2—X4, X2—X5, X3—X5. Track the list of active potential functions as nodes are eliminated ("killed") in the order X5, X4, X3, X2:
- Initial: ψ1(X1,X2), ψ2(X1,X3), ψ3(X2,X5), ψ4(X3,X5), ψ5(X2,X4)
- Kill X5: ψ1(X1,X2), ψ2(X1,X3), ψ5(X2,X4), m5(X2,X3)
- Kill X4: ψ1(X1,X2), ψ2(X1,X3), m4(X2), m5(X2,X3)
- Kill X3: ψ1(X1,X2), m4(X2), m3(X1,X2)
- Kill X2: m2(X1)
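A minimal sketch of this elimination run on tabular potentials (the tables are random stand-ins; the variable names mirror the slide, with einsum letters a=X1, ..., e=X5):

```python
import numpy as np

N = 3                                   # symbols per variable (assumption)
rng = np.random.default_rng(0)

# Pairwise potentials for the slide's graph: X1-X2, X1-X3, X2-X5, X3-X5, X2-X4.
factors = [("ab", rng.random((N, N))), ("ac", rng.random((N, N))),
           ("be", rng.random((N, N))), ("ce", rng.random((N, N))),
           ("bd", rng.random((N, N)))]

def eliminate(factors, var):
    """Sum `var` out of the product of all active factors that mention it."""
    touch = [(s, t) for s, t in factors if var in s]
    rest  = [(s, t) for s, t in factors if var not in s]
    out = "".join(sorted(set("".join(s for s, _ in touch)) - {var}))
    msg = np.einsum(",".join(s for s, _ in touch) + "->" + out,
                    *[t for _, t in touch])          # the message m_var
    return rest + [(out, msg)]

for v in "edcb":                 # kill X5, X4, X3, X2, as on the slide
    factors = eliminate(factors, v)

(_, m2), = factors               # only m2(X1) is left
print(m2 / m2.sum())             # the marginal P(X1)
```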
First real link to graph theory
- Reconstituted Graph = the graph that contains all the extra edges added during elimination.
- It depends on the elimination order!
- The complexity of graph elimination is O(N^W), where W is the size of the largest clique in the reconstituted graph. (Proof: exercise.)
Finding the optimal order
- Minimizing the largest clique size turns out to be NP-hard¹.
- Greedy algorithm² (see the sketch after the references below):
  1. Find the node v in G that connects to the fewest neighbors.
  2. Eliminate v and connect all its neighbors.
  3. Go back to 1 until G becomes a clique.
- Current best techniques use simulated annealing³ or approximation algorithms⁴.
¹ S. Arnborg, D.G. Corneil, A. Proskurowski, "Complexity of finding embeddings in a k-tree," SIAM J. Algebraic and Discrete Methods 8 (1987) 277–284.
² D. Rose, "Triangulated graphs and the elimination process," J. Math. Anal. Appl. 32 (1974) 597–609.
³ U. Kjærulff, "Triangulation of graphs — algorithms giving small total state space," Technical Report R 90-09, Department of Mathematics and Computer Science, Aalborg University, Denmark, 1990.
⁴ A. Becker, D. Geiger, "A sufficiently fast algorithm for finding close to optimal clique trees," Artificial Intelligence 125 (2001) 3–17.
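A minimal sketch of the greedy min-degree ordering on adjacency sets (assumed representation; run to exhaustion here rather than stopping at a clique):

```python
def min_degree_order(adj):
    """adj: dict node -> set of neighbors. Returns an elimination order
    chosen greedily by fewest neighbors (min-degree heuristic)."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))   # fewest neighbors
        nbrs = adj.pop(v)
        for a in nbrs:                            # connect v's neighbors
            adj[a] |= nbrs - {a}
            adj[a].discard(v)
        order.append(v)
    return order

# The slide's 5-node example: X1-X2, X1-X3, X2-X4, X2-X5, X3-X5.
g = {1: {2, 3}, 2: {1, 4, 5}, 3: {1, 5}, 4: {2}, 5: {2, 3}}
print(min_degree_order(g))   # [4, 1, 2, 3, 5]; ties broken arbitrarily
```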
This is serious
- One of the most commonly used graphical models in vision is the Markov Random Field over pixels I(x,y), with pairwise potential ψ(p,q) = exp(−|p−q|).
- Try to find an elimination order for this model: every order fills in edges, and the largest clique in the reconstituted graph grows linearly with the grid dimension, so exact elimination is intractable for images.
Outline
- What is inference?
- Overview
- Preliminaries
- Three general algorithms for inference
  - Elimination Algorithm
  - Belief Propagation
  - Junction Tree
What about other marginals?
- We have just computed P(X1) on the five-node example.
- What if I need to compute P(X2) or P(X5)?
- Definitely, some part of the calculation can be reused! E.g., m5(X2,X3) is the same for both!
Focus on trees
- Focus on tree-like structures: undirected trees and directed trees.
- A directed tree = an undirected tree after moralization (each node has at most one parent, so no parents need to be married).
- Why trees?
Why trees?
- No moralization is necessary.
- There is a natural elimination ordering with the query node as root:
  - depth-first search: all children before their parent;
  - all sub-trees with no evidence nodes can be ignored (why? exercise for the undirected graph).
Elimination on trees
When we eliminate node j, the new potential function must be a function of x_i. Any other nodes?
- Nothing in the sub-tree below j (already eliminated).
- Nothing from other sub-trees, since the graph is a tree.
- Only i, from ψ_ij(x_i, x_j), which relates i and j.
Think of the new potential function as a message m_ji(x_i) from node j to node i.
What is in the message?
The message is created by summing over x_j the product of all earlier messages m_kj(x_j) sent to j, as well as E(x_j):
  m_ji(x_i) = Σ_{x_j} ψ_ij(x_i, x_j) E(x_j) Π_{k ∈ c(j)} m_kj(x_j)
- c(j) = children of node j
- E(x_j) = δ(x_j = x̄_j) if j is an evidence node; 1 otherwise
Elimination = Passing messages upward
After passing the messages up to the query (root) node f, we compute the conditional:
  P(x_f | x̄_E) ∝ E(x_f) Π_{k ∈ c(f)} m_kf(x_f)
What about answering other queries?
A different query node needs its own set of messages (e.g., 3 messages for an interior node of the example tree), but messages are reused! Even though the naive approach (rerun Elimination once per query) needs to compute N(N−1) messages to find marginals for all N query nodes, there are only 2(N−1) possible messages — one per direction per edge.
- We can compute all possible messages in only double the amount of work it takes to do one query.
- Then we take the product of the relevant messages to get each marginal.
Computing all possible messages
- Idea: respect the following Message-Passing Protocol: a node can send a message to a neighbor only when it has received messages from all its other neighbors.
- The protocol is realizable: designate one node (arbitrarily) as the root.
- Collect messages inward to the root, then distribute back out to the leaves.
Belief Propagation
[Figure: node j with neighbors i, k, l; each edge carries a message in both directions: m_ij and m_ji, m_jk and m_kj, m_jl and m_lj.]
Belief Propagation (sum-product)
1. Choose a root node (arbitrarily, or as the first query node).
2. If j is an evidence node, E(x_j) = δ(x_j = x̄_j); else E(x_j) = 1.
3. Pass messages from leaves up to the root and then back down using:
   m_ji(x_i) = Σ_{x_j} ψ_ij(x_i, x_j) E(x_j) Π_{k ∈ N(j)\i} m_kj(x_j)
4. Given the messages, compute marginals using:
   p(x_i | x̄_E) ∝ E(x_i) Π_{k ∈ N(i)} m_ki(x_i)
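A minimal sum-product sketch on a tree (assumed tabular pairwise potentials and a recursive collect/distribute schedule; the chain example at the bottom is hypothetical):

```python
import numpy as np

N = 2                                    # symbols per variable (assumption)

def bp_marginals(adj, psi, E, root=0):
    """Sum-product on a tree. adj: node -> neighbor list; psi[(i,j)]: NxN
    table indexed [x_i, x_j]; E[j]: evidence vector (all ones if hidden)."""
    msg = {}                             # msg[(j, i)] = m_ji(x_i), memoized

    def send(j, i):                      # message from j to neighbor i
        if (j, i) in msg:
            return msg[(j, i)]
        prod = E[j].copy()
        for k in adj[j]:
            if k != i:
                prod = prod * send(k, j)
        table = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
        msg[(j, i)] = table @ prod       # sum over x_j
        return msg[(j, i)]

    def collect_distribute(i, parent):
        for k in adj[i]:
            if k != parent:
                send(k, i)               # collect: leaves -> root
        for k in adj[i]:
            if k != parent:
                send(i, k)               # distribute: root -> leaves
                collect_distribute(k, i)

    collect_distribute(root, None)       # all 2(N-1) messages, two passes
    marg = {}
    for i in adj:
        b = E[i].copy()
        for k in adj[i]:
            b = b * msg[(k, i)]
        marg[i] = b / b.sum()
    return marg

# Chain 0-1-2 with attractive potentials; node 2 observed in state 1.
adj = {0: [1], 1: [0, 2], 2: [1]}
psi = {(0, 1): np.array([[2., 1.], [1., 2.]]),
       (1, 2): np.array([[2., 1.], [1., 2.]])}
E = {0: np.ones(N), 1: np.ones(N), 2: np.array([0., 1.])}
print(bp_marginals(adj, psi, E)[0])      # P(x0 | x2=1), skewed toward 1
```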
MAP is the same (max-product)
1. Choose a root node arbitrarily.
2. If j is an evidence node, E(x_j) = δ(x_j = x̄_j); else E(x_j) = 1.
3. Pass messages from leaves up to the root using:
   m_ji(x_i) = max_{x_j} ψ_ij(x_i, x_j) E(x_j) Π_{k ∈ N(j)\i} m_kj(x_j)
4. Remember which choice x_j = x_j* yielded the maximum.
5. Given the messages, compute the max value using any node i:
   max_x p(x | x̄_E) ∝ max_{x_i} E(x_i) Π_{k ∈ N(i)} m_ki(x_i)
6. Retrace steps from the root back to the leaves, recalling the best x_j*, to get the maximizing argument (configuration) x*.
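The previous sketch turns into max-product by swapping the sum for a max and keeping argmax pointers; a minimal chain-only version (same hypothetical chain as above):

```python
import numpy as np

# Max-product on the chain 0-1-2, with x2 observed in state 1.
psi01 = np.array([[2., 1.], [1., 2.]])
psi12 = np.array([[2., 1.], [1., 2.]])
E2 = np.array([0., 1.])                  # evidence indicator for x2

# Upward pass: messages take a max instead of a sum, plus argmax pointers.
t1 = psi12 * E2                          # t1[x1, x2]
m21, b21 = t1.max(axis=1), t1.argmax(axis=1)   # m_{2->1}(x1), best x2
t0 = psi01 * m21                         # t0[x0, x1]
m10, b10 = t0.max(axis=1), t0.argmax(axis=1)   # m_{1->0}(x0), best x1

# Backtrack from the root to recover the maximizing configuration.
x0 = int(m10.argmax()); x1 = int(b10[x0]); x2 = int(b21[x1])
print((x0, x1, x2))                      # (1, 1, 1): all agree with evidence
```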
"Tree"-like graphs work too
- Some graphs are not directed trees, yet after moralization the corresponding factor graph IS A TREE.
- Pearl (1988) showed that BP works on factor trees.
- See Jordan, Chapter 4, for more details.
Outline
- What is inference?
- Overview
- Preliminaries
- Three general algorithms for inference
  - Elimination Algorithm
  - Belief Propagation
  - Junction Tree
What about arbitrary graphs?
- BP only works on tree-like graphs.
- Question: is there an algorithm for general graphs?
- Also, after BP we get the marginal for each INDIVIDUAL random variable, but the graph is characterized by cliques.
- Question: can we get the marginal for every clique?
Mini-outline
- Back to the Reconstituted Graph
- Three equivalent concepts:
  - Triangulated graph — easy to validate
  - Decomposable graph — link to probability
  - Junction Tree — computational inference
- Junction Tree Algorithm
- Example
Back to the Reconstituted graph
The reconstituted graph is a very important type of graph: the triangulated (chordal) graph.
- Definition: a graph is triangulated if any loop with 4 or more nodes has a chord.
- All trees are triangulated; all complete graphs (cliques) are triangulated; a 4-cycle with a chord is triangulated, but without one it is not.
Proof
Claim: for any N-node graph, the reconstituted graph after elimination is triangulated.
Proof by induction:
1. N = 1: trivial.
2. Assume the claim holds for N = k.
3. N = k+1: let v be the first node eliminated. The edges added during elimination make v and its neighbors a clique, so v cannot lie on a chordless cycle. The reconstituted graph on the remaining k nodes is triangulated by the induction assumption, hence the whole graph is chordal.
Lessons from graph theory
- Graph coloring problem: find the smallest number of vertex colors so that adjacent vertices get different colors = the chromatic number χ.
- Sample application 1: Scheduling
  - Node = task
  - Edge = two tasks are not compatible
  - Coloring = grouping into sets of tasks that can run in parallel
- Sample application 2: Communication
  - Node = symbol
  - Edge = two symbols may produce the same output due to transmission error
  - Largest set of vertices with the same color = number of symbols that can be reliably sent
Lesson from graph theory
- Determining the chromatic number χ is NP-hard.
- Not so for a general type of graph called a Perfect Graph: a graph where χ = ω (ω = the size of the largest clique) on every induced subgraph.
- Triangulated graphs are an important type of perfect graph.
- The Strong Perfect Graph Conjecture was proved in 2002 (a 148-page proof!).
- Bottom line: triangulated graphs are "algorithmically friendly" — it is very easy to check whether a graph is triangulated and to compute properties from such a graph.
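One concrete sense in which triangulated graphs are friendly: chordality can be checked quickly via Maximum Cardinality Search followed by a perfect-elimination test (the Tarjan-Yannakakis approach). A minimal sketch on an assumed dict-of-sets representation:

```python
def is_chordal(adj):
    """MCS, then check the visit order is a perfect elimination ordering
    when read in reverse (true iff the graph is chordal)."""
    weight = {v: 0 for v in adj}
    order, numbered = [], set()
    for _ in range(len(adj)):            # MCS: pick max numbered-neighbors
        v = max((u for u in adj if u not in numbered),
                key=lambda u: weight[u])
        order.append(v); numbered.add(v)
        for w in adj[v]:
            if w not in numbered:
                weight[w] += 1
    elim = list(reversed(order))         # candidate elimination order
    pos = {v: i for i, v in enumerate(elim)}
    for v in elim:
        later = {u for u in adj[v] if pos[u] > pos[v]}
        if later:                        # later neighbors must be a clique:
            u = min(later, key=pos.get)  # check via earliest later neighbor
            if not (later - {u}).issubset(adj[u]):
                return False
    return True

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}   # chordless 4-cycle
print(is_chordal(square))                               # False
square[1].add(3); square[3].add(1)                      # add a chord
print(is_chordal(square))                               # True
```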
Link to Probability: Graph Decomposition
- Definition: given a graph G, a triple (A,B,S) with Vertex(G) = A∪B∪S is a decomposition of G if:
  1. S separates A and B (i.e., every path from a∈A to b∈B must pass through S), and
  2. S is a clique.
- Definition: G is decomposable if
  1. G is complete, or
  2. there exists a decomposition (A,B,S) of G such that A∪S and B∪S are decomposable.
What's the big deal?
Decomposable graphs can be parametrized by marginals! If G is decomposable, then
  p(x) = Π_{i=1..N} p(x_{Ci}) / Π_{j=1..N−1} p(x_{Sj})
where C1, C2, ..., CN are the cliques in G, and S1, S2, ..., S_{N−1} are (special) separators between cliques. Notice there is one less separator than cliques.
Equivalently, we can say that G can be parametrized by marginals p(x_C) and ratios of marginals p(x_C)/p(x_S).
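A minimal worked instance (the three-node chain, which is trivially decomposable with separator {B}):

```latex
% Chain A --- B --- C: cliques {A,B} and {B,C}, separator {B}.
% The decomposable factorization reads
p(a,b,c) \;=\; \frac{p(a,b)\,p(b,c)}{p(b)},
% which is exactly the Markov property p(c \mid a,b) = p(c \mid b).
```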
This is not true in general
Counterexample: the chordless 4-cycle on A, B, C, D.
- If the graph's joint could be expressed as a product of marginals or ratios of marginals, at least one of the potentials would be a marginal.
- However, the leftover factor f(X_AB) is not a constant, so no such parametrization exists for the 4-cycle.
Proof:
Proof by induction: G can be decomposed into A, B, and S, where A∪S and B∪S are decomposable; S separates A and B and is complete.
- All cliques are subsets of either A∪S or B∪S.
- Since S separates A and B, p(x) = p(x_{A∪S}) p(x_{B∪S}) / p(x_S).
- Recursively apply the argument on A∪S and B∪S based on the induction assumption.
So what?
- Triangulated Graph — nice algorithmically.
- Decomposable Graph — parametrized by marginals.
- It turns out that Triangulated Graph ⇔ Decomposable Graph.
Decomposable ⇒ Triangulated
Prove by induction:
- If G is complete, it is triangulated. Otherwise, take a decomposition (A, B, S).
- By the induction assumption, G_{A∪S} and G_{B∪S} are triangulated, and thus all cycles within them have a chord.
- The case we need to consider is a cycle that spans A, B, and S. Such a cycle must pass through S at least twice, and S is complete, so the cycle must have a chord! QED
TriangulationDecomposable
A
a
S
B
b
Prove by induction. Let G be a triangulated graph with N
nodes. Show is G can be decomposed into (A,B,S).
If G’s complete, done. If not, choose non-adjacent a and b.
S = smallest set that intersects with all paths between a and b.
A = all nodes in G\S reached by a
B = all nodes in G\S reached by b
Cleary A and B are separated by S.
TriangulationDecomposable
a1
A
c
b1
S
a
a2
d b2
B
b
Need to prove S is complete. Consider arbitrary c,dS.
There is a path acb such that c is the only node in S. If not, then S is
not minimum as c can be put into either A or B.
Similarly, there is a path adb. Now we a cycle.
Since G is triangulated, this cycle must have a chord. Since S separates A
and B, the chord must be entirely in AS or BS.
Keep shrinking the cycle and eventually there must be a chord between c
and d, hence S must be complete.
Recap
- The reconstituted graph is triangulated.
- Triangulated graph = decomposable graph.
- The joint probability on a decomposable graph can be factorized into marginals and ratios of marginals.
- Not very constructive so far: how can we get from LOCAL POTENTIALS to the GLOBAL MARGINAL parametrization?
How to get from a local description to a global description?
Consider a decomposable graph (V\S, W\S, S), i.e. cliques V and W with separator S.
- At the beginning, we have a local representation:
  p(x) ∝ ψ(X_V) ψ(X_W) / φ(X_S)
- We want the global marginal representation:
  p(x) = p(X_V) p(X_W) / p(X_S)

Message passing
- Initialization: φ(X_S) = 1.
- Phase 1: Collect (from V to W):
  φ*(X_S) = Σ_{V\S} ψ(X_V)
  ψ*(X_W) = ψ(X_W) φ*(X_S) / φ(X_S)
- Why? The update yields the marginal of W:
  P(X_W) ∝ Σ_{V\S} ψ(X_W) ψ(X_V) / φ(X_S)
        = ψ(X_W) Σ_{V\S} ψ(X_V) / φ(X_S)
        = ψ(X_W) φ*(X_S) / φ(X_S)
        = ψ*(X_W)
- And the joint is unchanged:
  ψ*(X_W) ψ(X_V) / φ*(X_S)
  = [ψ(X_W) φ*(X_S) / φ(X_S)] ψ(X_V) / φ*(X_S)
  = ψ(X_W) ψ(X_V) / φ(X_S)
  = joint distribution
Message Passing
- Phase 2: Distribute (from W back to V):
  φ**(X_S) = Σ_{W\S} ψ*(X_W) ∝ P(X_S)
  ψ*(X_V) = ψ(X_V) φ**(X_S) / φ*(X_S)
- Why?
  P(X_V) ∝ Σ_{W\S} ψ(X_W) ψ(X_V) / φ(X_S)
        = Σ_{W\S} ψ*(X_W) ψ(X_V) / φ*(X_S)
        = ψ(X_V) φ**(X_S) / φ*(X_S)
        = ψ*(X_V)
- And the joint is still unchanged:
  ψ*(X_W) ψ*(X_V) / φ**(X_S)
  = ψ*(X_W) ψ(X_V) / φ*(X_S)
  = ψ(X_W) ψ(X_V) / φ(X_S)
  = joint distribution
Relating Local Description to Message Passing
- How do we extend this message passing to a general graph (in terms of its cliques)?
- We need a recursive decomposition in terms of cliques.
- Answer: the Junction Tree.
Decomposable graph induces a tree on the clique graph
Let C1, C2, ..., CN be all the maximal cliques in G, and let (V, W, S) be a decomposition.
- Every Ci must lie entirely in V or in W.
- Since all Ci are maximal, there is a Cj ⊆ V and a Ck ⊆ W such that S ⊆ Cj and S ⊆ Ck.
- Put an edge between Cj and Ck.
- Recursively decompose V and W — no loop will form, because of the separation property.
- The final clique graph is a tree, called a Junction Tree.
Properties of a Junction Tree
For any two cliques Ci and Cj, every clique on the unique path between them in the junction tree must contain Ci ∩ Cj:
- Each branch along the path decomposes the graph.
- So the separator S on the branch must contain Ci ∩ Cj, and so must the clique nodes on either side of the branch.
- Equivalently, for any variable X, the clique nodes containing X induce a sub-tree of the junction tree.
Junction Tree ⇒ Decomposable Graph
Definition: a Junction Tree is a sub-tree of the clique graph such that all the nodes along the path between any two cliques C, D contain C ∩ D.
If a graph has a junction tree, it must be decomposable. Prove by induction (the base case is simple):
- For any separator S, the right and left sub-trees of S, call them R and L, are junction trees, so they must be decomposable by the induction assumption.
- S is complete, so it remains to show that S separates R and L.
- If not, there exists an edge (X,Y) with X∈R and Y∈L but X,Y∉S. However, (X,Y) must belong to some clique, so Y∈R or X∈L. Thus, by the junction tree property, Y∈S or X∈S. Contradiction.
How to find a junction tree?
- Not easy from either the definition or the decomposition.
- Define the edge weight between adjacent clique nodes C1 and C2 as w(S) = |S|, the number of variables in their separator S = C1 ∩ C2.
- Total weight of a junction tree (each variable X induces a sub-tree of the junction tree, with one fewer edge than the number of cliques containing X):
  Σ_S |S| = Σ_X [Σ_C 1{X∈C} − 1]
        = Σ_X Σ_C 1{X∈C} − N
        = Σ_C Σ_X 1{X∈C} − N
        = Σ_C |C| − N
  (N = number of variables)
Junction Tree is a maximal spanning clique tree
Consider any clique tree; its total weight is
  Σ_S |S| = Σ_S Σ_X 1{X∈S}
        = Σ_X Σ_S 1{X∈S}
        ≤ Σ_X [Σ_C 1{X∈C} − 1]
        = Σ_C |C| − N
        = weight of a Junction Tree
The inequality holds because all separators containing a variable X lie on edges between cliques containing X, and any tree can include at most Σ_C 1{X∈C} − 1 edges from this sub-graph without forming a loop. Hence a junction tree is exactly a maximum-weight spanning tree of the clique graph.
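A minimal sketch: build the clique graph with separator-size weights and take a maximum-weight spanning tree, Kruskal-style with union-find (clique list taken from the earlier 9-node example):

```python
from itertools import combinations

def junction_tree(cliques):
    """Maximum-weight spanning tree of the clique graph,
    edge weight = separator size |Ci & Cj|."""
    edges = sorted(((len(a & b), i, j)
                    for (i, a), (j, b) in combinations(enumerate(cliques), 2)
                    if a & b), reverse=True)
    parent = list(range(len(cliques)))
    def find(x):                           # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]; x = parent[x]
        return x
    tree = []
    for w, i, j in edges:                  # heaviest separators first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Cliques C1..C6 from the earlier 9-node example (0-indexed here).
cliques = [{"X1","X2","X3"}, {"X1","X3","X4"}, {"X2","X3","X5"},
           {"X3","X4","X6"}, {"X7","X8","X9"}, {"X1","X7"}]
print(junction_tree(cliques))   # edges (i, j, |separator|)
```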
Example
[Figure: the 9-node graph from before with cliques C1..C6, its clique graph with separators (e.g. X3 is shared among C1, C2, C3, C4), and the resulting junction tree, in which the cliques containing any given variable form a connected sub-tree.]
So what? The Junction Tree Algorithm
1. Moralize if needed.
2. Triangulate using any triangulation algorithm.
3. Form the clique graph (clique nodes and separator nodes).
4. Compute the junction tree (maximum-weight spanning clique tree).
5. Initialize all separator potentials to 1.
6. Phase 1: Collect from children.
   Message from child C: φ*(X_S) = Σ_{C\S} ψ(X_C)
   Update at parent P: ψ*(X_P) = ψ(X_P) Π_S φ*(X_S) / φ(X_S)
7. Phase 2: Distribute to children.
   Message from parent P: φ**(X_S) = Σ_{P\S} ψ*(X_P)
   Update at child C: ψ*(X_C) = ψ(X_C) φ**(X_S) / φ*(X_S)
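A minimal two-clique sketch of these collect/distribute updates (an assumed toy model: cliques V = {X1,X2} and W = {X2,X3} with separator S = {X2}, and random tables standing in for real potentials):

```python
import numpy as np

N = 2
rng = np.random.default_rng(1)

# Clique potentials psi_V(x1,x2), psi_W(x2,x3); separator phi_S(x2) = 1.
psi_V = rng.random((N, N))
psi_W = rng.random((N, N))
phi_S = np.ones(N)

# Phase 1 (collect): V -> W.
phi_S_star = psi_V.sum(axis=0)                     # sum out x1
psi_W_star = psi_W * (phi_S_star / phi_S)[:, None]

# Phase 2 (distribute): W -> V.
phi_S_2star = psi_W_star.sum(axis=1)               # sum out x3; this is p(x2)
psi_V_star = psi_V * (phi_S_2star / phi_S_star)[None, :]

# Check against the brute-force joint p(x1,x2,x3) = psi_V psi_W / phi_S.
joint = psi_V[:, :, None] * psi_W[None, :, :]
joint /= joint.sum()
print(np.allclose(psi_V_star / psi_V_star.sum(), joint.sum(axis=2)))  # True
print(np.allclose(psi_W_star / psi_W_star.sum(), joint.sum(axis=0)))  # True
```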
CHILD Network
A worked example on the CHILD network:
- Step 1: Moralization
- Step 2: Triangulation
- Step 3: Form the Junction Tree
- Step 5: Two-phase propagation, with evidence: LVH report = Yes
Conclusions
- Inference: marginals and MAP.
- Elimination — one node at a time:
  - Complexity is a function of the size of the largest clique in the reconstituted graph.
  - Finding a triangulation that results in small cliques is NP-hard.
- Belief Propagation — all nodes; exact on trees.
- Junction Tree:
  - Decomposable graph ⇔ Triangulated graph.