Exact Inference Algorithms: Conditioning, Clique Trees

Readings: K&F 9.5, 10.1, 10.2, 10.3 (10.4)

Lecture 7 – Apr 18, 2011
CSE 515, Statistical Methods, Spring 2011
Instructor: Su-In Lee
University of Washington, Seattle
Announcement

- Problem Set #2 is ready.
  - Check the course website or pick it up.
  - 7 questions. Hard. Please start working on it today.
  - Discussion OK! Check the collaboration policy.
Variable Elimination Algorithm

- Goal: P(J) — the query variable(s) can be anything

  P(J) = Σ_{L,S,G,H,I,D,C} P(C, D, I, H, G, S, L, J)
       = Σ_{L,S,G,H,I,D,C} φ_J(J,L,S) φ_L(L,G) φ_S(S,I) φ_G(G,I,D) φ_H(H,G,J) φ_I(I) φ_D(C,D) φ_C(C)

- Elimination ordering: C, D, I, H, G, S, L
- Compute:

  P(J) = Σ_{L,S} φ_J(J,L,S) Σ_G φ_L(L,G) Σ_H φ_H(H,G,J) Σ_I φ_I(I) φ_S(S,I) Σ_D φ_G(G,I,D) Σ_C φ_D(C,D) φ_C(C)

- Computational complexity: O(n maxᵢ |Val(Xᵢ)|), where n is the number of variables

[Figure: the student network over the variables C, D, I, G, S, L, J, H.]
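The elimination above can be written as a small sum-product program. Below is a minimal, illustrative sketch (not part of the slides): a factor is a (scope, table) pair with made-up numbers, and `variable_elimination` repeats the multiply-then-sum-out step for a given ordering.

```python
from itertools import product

# A factor is (scope, table): a tuple of variable names plus a dict from value tuples to numbers.
def multiply(f, g, domains):
    scope = f[0] + tuple(v for v in g[0] if v not in f[0])
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        a = dict(zip(scope, vals))
        table[vals] = f[1][tuple(a[v] for v in f[0])] * g[1][tuple(a[v] for v in g[0])]
    return (scope, table)

def sum_out(f, var):
    scope = tuple(v for v in f[0] if v != var)
    table = {}
    for vals, p in f[1].items():
        key = tuple(x for x, name in zip(vals, f[0]) if name != var)
        table[key] = table.get(key, 0.0) + p
    return (scope, table)

def variable_elimination(factors, order, domains):
    """Sum-product VE: for each variable, multiply the factors that mention it (pi_i),
    sum the variable out (tau_i), and put tau_i back into the pool of factors."""
    factors = list(factors)
    for var in order:
        used = [f for f in factors if var in f[0]]
        kept = [f for f in factors if var not in f[0]]
        pi = used[0]
        for f in used[1:]:
            pi = multiply(pi, f, domains)
        factors = kept + [sum_out(pi, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    return result

# Toy usage on a two-variable chain A -> B (invented numbers, not from the slides):
domains = {"A": (0, 1), "B": (0, 1)}
pA  = (("A",), {(0,): 0.6, (1,): 0.4})
pBA = (("B", "A"), {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8})
print(variable_elimination([pA, pBA], order=["A"], domains=domains))   # marginal P(B)
```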
Part I
EXACT INFERENCE:
CONDITIONING
Inference By Conditioning

- Goal: compute  P(J) = Σ_{i∈Val(I)} P(J, I = i)
- General idea
  - Enumerate the possible values i of a variable I
  - Apply variable elimination in a simplified network to compute P(J, I = i)
  - Aggregate the results:  P(J) = Σ_{i∈Val(I)} P(J, I = i)

[Figure: the student network before and after observing I = i. Since (G⊥S | I),
observing I = i lets us transform the CPDs of G and S to eliminate I as a parent.]
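A tiny numeric illustration of the idea, using assumed binary CPDs for a made-up fragment I → G, I → S (the numbers are invented, not the slides'): once I is observed, (G⊥S | I) means the S-side factor sums out trivially, and the results for the different values of I are aggregated.

```python
# Invented CPDs for a toy fragment I -> G, I -> S.
p_I = {0: 0.7, 1: 0.3}
p_G_given_I = {(0, 0): 0.3, (1, 0): 0.7, (0, 1): 0.9, (1, 1): 0.1}    # key: (g, i)
p_S_given_I = {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.2, (1, 1): 0.8}  # key: (s, i); never needed for P(G)

def p_G_and_I(i):
    """P(G, I=i): with I observed, (G perp S | I) means the S factor sums to 1
    and drops out, so only the G-side factors are multiplied."""
    return {g: p_I[i] * p_G_given_I[(g, i)] for g in (0, 1)}

# Aggregate over the enumerated values of I:  P(G) = sum_i P(G, I=i)
p_G = {g: sum(p_G_and_I(i)[g] for i in (0, 1)) for g in (0, 1)}
print(p_G)   # {0: 0.48, 1: 0.52}
```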
Cutset Conditioning

- Select a subset of nodes X ⊂ U
  - X is a cutset in G if G_{X=x} is a polytree (no loops)

[Figure: in the student network, I is not a cutset (a loop remains after observing I),
but G is a cutset (observing G leaves a polytree).]
Cutset Conditioning

- Select a subset of nodes X ⊂ U
  - X is a cutset in G if G_{X=x} is a polytree
- Define the conditional Bayesian network G_{X=x}
  - G_{X=x} has the same variables as G
  - G_{X=x} has the same structure as G except that all outgoing edges of nodes in X are
    deleted, and the CPDs of nodes whose incoming edges were deleted are updated to
      P_{G_{X=x}}(Y | Pa(Y) − X) = P_G(Y | Pa(Y), X = x)
- Compute the original P(Y) query by
      P_G(Y) = Σ_{x∈Val(X)} P_{G_{X=x}}(X = x, Y)
  - Exponential in the size of the cutset: one polytree run per assignment x (see the sketch below)
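A hedged sketch of the outer loop of cutset conditioning. The routine `query_fn` is a hypothetical placeholder for polytree inference in G_{X=x}; the point is only the enumerate-and-aggregate structure and its exponential dependence on |X|.

```python
from itertools import product

def cutset_conditioning(query_fn, cutset_domains):
    """Sketch of cutset conditioning (illustrative API, not from the slides).
    query_fn(assignment) should run polytree inference in G_{X=x} and return the
    un-normalized factor P_{G_{X=x}}(X = x, Y) as a dict over values of Y.
    Cost is exponential in |X|: one polytree run per joint assignment x."""
    vars_ = list(cutset_domains)
    total = {}
    for values in product(*(cutset_domains[v] for v in vars_)):
        x = dict(zip(vars_, values))
        partial = query_fn(x)                  # polytree inference with X clamped to x
        for y, p in partial.items():
            total[y] = total.get(y, 0.0) + p   # aggregate: P(Y) = sum_x P(X=x, Y)
    return total
```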
Computational Complexity

- Variable elimination
    P(J) = Σ_C Σ_D Σ_I Σ_S Σ_G Σ_L Σ_H P(C, D, I, S, G, L, H, J)   (*)
         = Σ_L Σ_S P(J|L,S) Σ_G P(L|G) Σ_H P(H|G,J) Σ_I P(I) P(S|I) Σ_D P(G|D,I) Σ_C P(C) P(D|C)
- Conditioning (here on G = g)
  - Reordering the expression (*) slightly, we have that:
    P(J) = Σ_g [ Σ_C Σ_D Σ_I Σ_S Σ_L Σ_H P(C, D, I, S, G = g, L, H, J) ]
- In general, both algorithms perform the same set of basic operations (sums and products).
- Any advantages?
  - Memory gain
  - Forms the basis for useful approximate inference algorithms (later)
Part II
EXACT INFERENCE:
CLIQUE TREES
Inference with Clique Trees

- Exploits factorization of the distribution for efficient inference, similar to variable elimination
- Uses global data structures (cluster graphs)
- Deals with a distribution given by a (possibly unnormalized) measure
      P_F(U) = ∏_{φ'∈F} φ'
  - For Bayesian networks, the factors are CPDs
  - For Markov networks, the factors are clique potentials
Variable Elimination & Clique Trees

- Variable elimination
  - Each step creates a factor π_i through multiplication
  - A variable is then eliminated in π_i to generate a new factor τ_i
  - The process is repeated until the product contains only the query variables
    P(J) = Σ_L Σ_S P(J|L,S) Σ_G P(L|G) Σ_H P(H|G,J) Σ_I P(I) P(S|I) Σ_D P(G|D,I) Σ_C P(C) P(D|C)
- Clique tree inference
  - Another view of the above computation
  - General idea: π_j is a computational data structure which takes "messages" τ_i generated
    by other factors π_i and generates a message τ_j which is used by another factor π_k
Cluster Graph

- Data structure providing a flowchart of the factor manipulation process
- A cluster graph K for factors F is an undirected graph
  - Nodes are associated with subsets of variables C_i ⊆ U
  - The graph is family preserving: each factor φ ∈ F is associated with one node C_i
    such that Scope[φ] ⊆ C_i
  - Each edge C_i – C_j is associated with a sepset S_{i,j} = C_i ∩ C_j
- Key: variable elimination defines a cluster graph
  - A cluster C_i for each factor π_i used in the computation
  - Draw an edge C_i – C_j if the factor generated from π_i is used in the computation of π_j
Simple Example

[Figure: chain network X1 → X2 → X3]

Variable elimination:
    P(X3) = Σ_{X1} Σ_{X2} P(X1, X2, X3)
          = Σ_{X1} Σ_{X2} P(X1) P(X2|X1) P(X3|X2)
          = Σ_{X2} P(X3|X2) Σ_{X1} P(X1) P(X2|X1)
          = Σ_{X2} P(X3|X2) τ_1(X2)
          = τ_2(X3)

Cluster graph:
    C1 = {X1, X2} with factors P(X1), P(X2|X1)
    C2 = {X2, X3} with factor P(X3|X2)
    Sepset S_{1,2} = {X2}

Family preservation? Yes: Scope[P(X1)], Scope[P(X2|X1)] ⊆ C1 and Scope[P(X3|X2)] ⊆ C2.
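The two messages of this chain example can be computed directly; the binary CPD numbers below are invented for illustration.

```python
# Invented binary CPDs for the chain X1 -> X2 -> X3.
p_X1 = {0: 0.6, 1: 0.4}
p_X2_given_X1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}    # key: (x2, x1)
p_X3_given_X2 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.25, (1, 1): 0.75}  # key: (x3, x2)

# Cluster C1 = {X1, X2} holds P(X1) and P(X2|X1); it eliminates X1 and sends
# tau_1(X2) across the sepset {X2} to cluster C2 = {X2, X3}.
tau1 = {x2: sum(p_X1[x1] * p_X2_given_X1[(x2, x1)] for x1 in (0, 1)) for x2 in (0, 1)}

# C2 multiplies its own factor P(X3|X2) by the incoming message and eliminates X2.
tau2 = {x3: sum(p_X3_given_X2[(x3, x2)] * tau1[x2] for x2 in (0, 1)) for x3 in (0, 1)}
print(tau2)   # P(X3), here {0: 0.613, 1: 0.387}
```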
A More Complex Example

P(J) = Σ_{L,S} φ_J(J,L,S) Σ_G φ_L(L,G) Σ_H φ_H(H,G,J) Σ_I φ_I(I) φ_S(S,I) Σ_D φ_G(G,I,D) Σ_C φ_D(C,D) φ_C(C)

- Goal: P(J). Eliminate: C, D, I, H, G, S, L
  - C:  τ_1(D) = Σ_C φ_C(C) φ_D(C, D)
  - D:  τ_2(G, I) = Σ_D φ_G(G, I, D) τ_1(D)
  - I:  τ_3(G, S) = Σ_I φ_I(I) φ_S(S, I) τ_2(G, I)
  - H:  τ_4(G, J) = Σ_H φ_H(H, G, J)
  - G:  τ_5(J, L, S) = Σ_G φ_L(L, G) τ_3(G, S) τ_4(G, J)
  - S:  τ_6(J, L) = Σ_S φ_J(J, L, S) τ_5(J, L, S)
  - L:  τ_7(J) = Σ_L τ_6(J, L)

[Figure: the induced cluster graph —
 C1: C,D —D— C2: G,I,D —G,I— C3: G,S,I —G,S— C5: G,J,S,L —G,J— C4: H,G,J;
 and C5 —J,S,L— C6: J,S,L —J,L— C7: J,L.]
Properties of Cluster Graphs

- Cluster graphs induced by VE are trees
  - In VE, each intermediate factor π_i is used only once
  - Hence, each cluster "passes" a message τ_i along an edge to exactly one other cluster
- Cluster graphs induced by VE obey the running intersection property
  - If X ∈ C_i and X ∈ C_j, then X is in each cluster on the (unique) path between C_i and C_j

Verify on the cluster graph above:
- Tree and family preserving
- Running intersection property (a small check is sketched below)

[Figure: the cluster tree C1: C,D —D— C2: G,I,D —G,I— C3: G,S,I —G,S— C5: G,J,S,L,
 with C4: H,G,J —G,J— C5, and C5 —J,S,L— C6: J,S,L —J,L— C7: J,L.]
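A small sketch that checks the running intersection property on the VE-induced cluster tree above (clusters and edges as in the figure). Illustrative code only, not part of the course materials.

```python
# Clusters and edges of the VE-induced cluster tree from the example.
cliques = {
    1: {"C", "D"}, 2: {"G", "I", "D"}, 3: {"G", "S", "I"},
    4: {"H", "G", "J"}, 5: {"G", "J", "S", "L"}, 6: {"J", "S", "L"}, 7: {"J", "L"},
}
edges = [(1, 2), (2, 3), (3, 5), (4, 5), (5, 6), (6, 7)]

def path(a, b):
    """Unique path between cliques a and b in the tree (simple DFS)."""
    adj = {i: [] for i in cliques}
    for u, v in edges:
        adj[u].append(v); adj[v].append(u)
    stack = [(a, [a])]
    while stack:
        node, p = stack.pop()
        if node == b:
            return p
        stack.extend((n, p + [n]) for n in adj[node] if n not in p)

def running_intersection_holds():
    """For every pair of cliques, their shared variables must appear in every clique on the path."""
    for i in cliques:
        for j in cliques:
            shared = cliques[i] & cliques[j]
            if not shared:
                continue
            for k in path(i, j):
                if not shared <= cliques[k]:
                    return False
    return True

print(running_intersection_holds())   # True for this tree
```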
Running Intersection Property

- Theorem: if T is a cluster tree induced by VE over factors F, then T obeys the
  running intersection property
- Proof:
  - Let C and C' be two clusters that contain X
  - Let C_X be the cluster where X is eliminated
  - Claim: X must be present in each cluster on the path from C to C_X
    - The computation at C_X must happen after the computation at C
    - X is in C by assumption, and since X is not eliminated at C, X is in the factor generated by C
    - By definition, C's neighbor multiplies in the factor generated by C and thus has X in its scope
    - By induction, the same holds for all other clusters on the path
  - → X appears in all clusters between C and C_X; likewise it appears in all clusters between
    C' and C_X, and the path from C to C' lies within these two paths, so the property follows
Clique Tree

- A cluster graph over factors F that satisfies the running intersection property is called a clique tree
- Clusters C_i in a clique tree are also called cliques
- We saw: variable elimination → clique tree
- Now we will see: clique tree → variable elimination
- Clique tree advantage: a data structure for caching computations, allowing multiple VE runs
  to be performed more efficiently than separate VE runs
We begin with an example and then describe the general algorithm …
Clique Tree Inference

- Goal: compute P(J)

[Figure: the student network and its clique tree —
 C1: C,D (P(C), P(D|C)) —D— C2: G,I,D (P(G|I,D)) —G,I— C3: G,S,I (P(I), P(S|I))
 —G,S— C5: G,J,S,L (P(L|G), P(J|L,S)) —G,J— C4: H,G,J (P(H|G,J)).]

Verify (see also the sketch below):
- Tree and family preserving
- Running intersection property
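A small check of family preservation for this clique tree under the factor assignment shown above, together with the sepset on each edge (the running intersection property can be checked with the earlier sketch). Illustrative code only.

```python
# The five cliques of the example and the edges of the clique tree.
clusters = {1: {"C", "D"}, 2: {"G", "I", "D"}, 3: {"G", "S", "I"},
            4: {"H", "G", "J"}, 5: {"G", "J", "S", "L"}}
edges = [(1, 2), (2, 3), (3, 5), (4, 5)]

# alpha: which clique each CPD (identified by its scope) is assigned to, as on the slide.
factor_scopes = {("C",): 1, ("C", "D"): 1, ("G", "I", "D"): 2, ("I",): 3,
                 ("S", "I"): 3, ("H", "G", "J"): 4, ("L", "G"): 5, ("J", "L", "S"): 5}

# Family preservation: every factor's scope fits inside its assigned clique.
family_preserving = all(set(scope) <= clusters[c] for scope, c in factor_scopes.items())

# Sepsets: the intersection of the two cliques on each edge.
sepsets = {(i, j): clusters[i] & clusters[j] for i, j in edges}

print(family_preserving)   # True
print(sepsets)             # (1, 2) -> {'D'}, (2, 3) -> {'G', 'I'}, (3, 5) -> {'G', 'S'}, (4, 5) -> {'G', 'J'}
```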
Clique Tree Inference

- Goal: compute P(J) — define root clique C_r = C5
  - Set the initial factor at each clique to the product of its assigned CPDs, π_i^0
  - C1: eliminate C, sending a message δ_{1→2}(D) to C2
  - C2: eliminate D, sending δ_{2→3}(G, I) to C3
  - C3: eliminate I, sending δ_{3→5}(G, S) to C5
  - C4: eliminate H, sending δ_{4→5}(G, J) to C5
  - C5: obtain P(J) by summing out G, S, L from π^0(C5) · δ_{3→5} · δ_{4→5}

[Figure: the same clique tree with its CPD assignments, messages flowing toward C5.]
Clique Tree Inference

- Goal: compute P(J) — define root clique C_r = C4
  - Set the initial factor at each clique to the product of its assigned CPDs, π_i^0
  - C1: eliminate C, sending a message δ_{1→2}(D) to C2
  - C2: eliminate D, sending δ_{2→3}(G, I) to C3
  - C3: eliminate I, sending δ_{3→5}(G, S) to C5
  - C5: eliminate S, L, sending δ_{5→4}(G, J) to C4
  - C4: obtain P(J) by summing out H, G from π^0(C4) · δ_{5→4}

  δ_{3→5}(G, S) = Σ_I π^0(C3) · δ_{2→3}(G, I)
  δ_{5→4}(G, J) = Σ_{S,L} π^0(C5) · δ_{3→5}(G, S)

[Figure: the same clique tree with its CPD assignments, messages flowing toward C4.]
Clique Tree Inference

- C5 as the root:  δ_{3→5}(G, S) = Σ_I π^0(C3) · δ_{2→3}(G, I)
- C4 as the root:  δ_{5→4}(G, J) = Σ_{S,L} π^0(C5) · δ_{3→5}(G, S)

[Figure: the student network and the clique tree with its CPD assignments, as before.]
Legal ordering

- The only constraint is that a clique gets all of its incoming messages from its downstream
  neighbors before it sends its outgoing message toward its upstream neighbor.
  - We say that C_i is ready to transmit to a neighbor C_j when C_i has received messages
    from all of its neighbors except C_j.
- Example (root C6)
  - Legal ordering I: 1, 2, 3, 4, 5, 6
  - Legal ordering II: 2, 5, 1, 3, 4, 6
  - Illegal ordering: 3, 4, 1, 2, 5, 6

[Figure: an example clique tree over C1, …, C6 with root C6.]
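A sketch of the "ready to transmit" condition as a check on a proposed send order toward a fixed root. The tree used in the usage lines is a made-up three-clique chain, not the six-clique example from the slide.

```python
def is_legal_collect_ordering(tree, root, order):
    """tree: adjacency dict over cliques; order: non-root cliques in the order they send."""
    # Upstream neighbor of each clique = next hop on its path to the root.
    parent = {root: None}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in tree[u]:
            if v not in parent:
                parent[v] = u
                stack.append(v)
    received = {i: set() for i in tree}          # messages each clique has received so far
    for i in order:
        needed = set(tree[i]) - {parent[i]}      # all neighbors except the upstream one
        if not needed <= received[i]:
            return False                          # clique i is not ready to transmit
        received[parent[i]].add(i)
    return True

# Tiny usage on a made-up chain 1 - 2 - 3 with root 3:
tree = {1: [2], 2: [1, 3], 3: [2]}
print(is_legal_collect_ordering(tree, 3, [1, 2]))   # True
print(is_legal_collect_ordering(tree, 3, [2, 1]))   # False: C2 sends before hearing from C1
```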
Here is the general algorithm …
Clique Tree Message Passing

- Let T be a clique tree and C1, …, Ck its cliques
- Multiply the factors (CPDs) assigned to each clique (each factor φ is assigned to some
  clique α(φ)), resulting in the initial potentials:
      π_j^0[C_j] = ∏_{φ: α(φ)=j} φ    and    ∏_φ φ = ∏_{j=1..k} π_j^0[C_j]
- Define C_r as the root clique
  - If our goal is to compute P(J), any clique containing J can be C_r
- Use the clique-tree data structure to pass messages between neighboring cliques,
  sending all messages toward C_r
  - Start from the tree leaves and move inward
  - Let pr(i) be the upstream neighbor of i (on the path to C_r)
  - Each C_i performs a computation that sends a message δ_i to C_pr(i):
    - Multiply all incoming messages from downstream neighbors with the initial clique
      potential, resulting in a factor whose scope is the clique
    - Sum out all variables except those in the sepset C_i – C_pr(i)
      δ_{i→j}(S_{i,j}) = Σ_{C_i − S_{i,j}} π_i^0[C_i] ∏_{k ∈ neighbors(i)\{j}} δ_{k→i}
Clique Tree Message Passing (cont.)

- When the root clique C_r has received all messages, it multiplies them with its own
  initial potential, resulting in a factor called the belief:
      π_r[C_r] = π_r^0[C_r] ∏_{i ∈ neighbors(r)} δ_{i→r}
  representing
      P(C_r) = Σ_{U − C_r} ∏_φ φ
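A minimal sketch of the collect-to-root pass, using the same illustrative (scope, table) factor representation as the earlier VE sketch; the usage lines rerun the two-clique chain example with invented numbers.

```python
from itertools import product

def multiply(f, g, domains):
    """Product of two factors; a factor is (scope tuple, table dict keyed by value tuples)."""
    scope = f[0] + tuple(v for v in g[0] if v not in f[0])
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        a = dict(zip(scope, vals))
        table[vals] = f[1][tuple(a[v] for v in f[0])] * g[1][tuple(a[v] for v in g[0])]
    return (scope, table)

def marginalize_to(f, keep):
    """Sum out every variable of f not in `keep` (e.g. the sepset)."""
    scope = tuple(v for v in f[0] if v in keep)
    table = {}
    for vals, p in f[1].items():
        key = tuple(x for x, name in zip(vals, f[0]) if name in keep)
        table[key] = table.get(key, 0.0) + p
    return (scope, table)

def collect(clique, parent, tree, potentials, sepsets, domains):
    """Recursively pull messages from downstream neighbors, then send one upstream.
    Returns delta_{clique -> parent}; for the root (parent=None) returns the belief."""
    psi = potentials[clique]                        # initial potential pi^0 of this clique
    for nb in tree[clique]:
        if nb != parent:
            delta = collect(nb, clique, tree, potentials, sepsets, domains)
            psi = multiply(psi, delta, domains)     # multiply in the incoming message
    if parent is None:
        return psi                                  # root belief pi_r[C_r]
    return marginalize_to(psi, sepsets[frozenset((clique, parent))])

# Usage on the two-clique chain example: C1 = {X1, X2} -- {X2} -- C2 = {X2, X3}, root C2.
domains = {"X1": (0, 1), "X2": (0, 1), "X3": (0, 1)}
tree = {1: [2], 2: [1]}
sepsets = {frozenset((1, 2)): {"X2"}}
p1 = (("X1", "X2"), {(0, 0): 0.54, (0, 1): 0.06, (1, 0): 0.12, (1, 1): 0.28})  # P(X1) P(X2|X1)
p2 = (("X2", "X3"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.25, (1, 1): 0.75})    # P(X3|X2)
belief = collect(2, None, tree, {1: p1, 2: p2}, sepsets, domains)
print(marginalize_to(belief, {"X3"}))   # P(X3), here {(0,): 0.613, (1,): 0.387}
```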
Clique Tree Inference Correctness

- Theorem
  - Let C_r be the root clique in a clique tree
  - If π_r is computed as above, then  π_r[C_r] = Σ_{U − C_r} P_F(U)
- The algorithm applies to both Bayesian and Markov networks
  - For a Bayesian network G, if F consists of the CPDs reduced with some evidence e,
    then π_r[C_r] = P_G(C_r, e)
    - The probability is obtained by normalizing the factor over C_r to sum to 1
  - For a Markov network H, if F consists of a set of clique potentials, then π_r[C_r] = P_H(C_r)
    - The probability is obtained by normalizing the factor over C_r to sum to 1
    - The partition function is obtained by summing all entries of π_r[C_r]
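A one-function sketch of the normalization step described above, assuming the belief's table is a dict from assignments to (unnormalized) values.

```python
def normalize(belief_table):
    """Turn the root belief's table into a distribution over C_r.
    For a Markov network, Z is the partition function (sum of all belief entries)."""
    Z = sum(belief_table.values())
    return {assignment: value / Z for assignment, value in belief_table.items()}, Z
```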
Clique Tree Calibration

- Assume we want to compute marginal distributions over n variables: P(X1), …, P(Xn)
  - With variable elimination, we would perform n separate VE runs
  - With clique trees, we can do this much more efficiently
    - Idea 1: since the marginal over a variable can be computed from any root clique that
      includes it, perform k clique-tree runs (k = number of cliques)
    - Idea 2: we can do much better! How?
Clique Tree Calibration

- Observation: the message from C_i to C_j is unique
  - Consider two neighboring cliques C_i and C_j
  - If the root C_r is on the C_j side, C_i sends C_j a message
  - The message does not depend on the specific C_r (we only need C_r to be on the C_j side
    for C_i to send a message to C_j)
  - → The message from C_i to C_j is always the same, regardless of what the query variables are.

    C5 as the root:  δ_{3→5}(G, S) = Σ_I π^0(C3) · δ_{2→3}(G, I)
    C4 as the root:  δ_{5→4}(G, J) = Σ_{S,L} π^0(C5) · δ_{3→5}(G, S)
Clique Tree Calibration

- Observation (as above): the message from C_i to C_j is unique, regardless of the query variables
- Each edge has two messages associated with it
  - One message for each direction of the edge
  - There are only 2(k−1) messages to compute
  - We can then readily compute the marginal probability over each variable
- Compute the 2(k−1) messages by (see the sketch below):
  - Picking any clique as the root
  - Upward pass: send messages toward the root
    - Terminate when the root has received all messages
  - Downward pass: send messages from the root toward the leaves
    - Terminate when all leaves have received messages
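A minimal sketch of full calibration with the same illustrative factor representation (repeated here so the snippet stands alone): an upward pass to a chosen root, a downward pass back out, 2(k−1) messages in total, and a belief per clique.

```python
from itertools import product

def multiply(f, g, domains):
    """Product of two factors; a factor is (scope tuple, table dict keyed by value tuples)."""
    scope = f[0] + tuple(v for v in g[0] if v not in f[0])
    table = {}
    for vals in product(*(domains[v] for v in scope)):
        a = dict(zip(scope, vals))
        table[vals] = f[1][tuple(a[v] for v in f[0])] * g[1][tuple(a[v] for v in g[0])]
    return (scope, table)

def marginalize_to(f, keep):
    """Sum out every variable of f not in `keep`."""
    scope = tuple(v for v in f[0] if v in keep)
    table = {}
    for vals, p in f[1].items():
        key = tuple(x for x, name in zip(vals, f[0]) if name in keep)
        table[key] = table.get(key, 0.0) + p
    return (scope, table)

def calibrate(tree, potentials, sepsets, domains, root):
    """Compute both messages on every edge (2(k-1) in total), then every clique's belief."""
    messages = {}   # (i, j) -> delta_{i -> j}

    def send(i, j):
        # delta_{i->j}: initial potential pi^0_i times all incoming messages except the one from j.
        psi = potentials[i]
        for nb in tree[i]:
            if nb != j:
                psi = multiply(psi, messages[(nb, i)], domains)
        messages[(i, j)] = marginalize_to(psi, sepsets[frozenset((i, j))])

    def upward(i, parent):          # pass 1: leaves toward the root
        for nb in tree[i]:
            if nb != parent:
                upward(nb, i)
        if parent is not None:
            send(i, parent)

    def downward(i, parent):        # pass 2: root back toward the leaves
        for nb in tree[i]:
            if nb != parent:
                send(i, nb)
                downward(nb, i)

    upward(root, None)
    downward(root, None)

    # Belief of each clique: its *initial* potential times all incoming messages
    # (never its updated belief, which would double-count information).
    beliefs = {}
    for i in tree:
        b = potentials[i]
        for nb in tree[i]:
            b = multiply(b, messages[(nb, i)], domains)
        beliefs[i] = b
    return beliefs
```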
Clique Tree Calibration: Example

- Root: C5 (first step of the downward pass)
      π_3[C_3] = π_3^0[C_3] ∏_{i ∈ neighbors(3)} δ_{i→3}

[Figure: the student network and the clique tree
 C1: C,D (P(C), P(D|C)) —D— C2: G,I,D (P(G|I,D)) —G,I— C3: G,S,I (P(I), P(S|I))
 —G,S— C5: G,J,S,L (P(L|G), P(J|L,S)) —G,J— C4: H,G,J (P(H|G,J)).]
Clique Tree Calibration: Example

- Root: C5 (second step of the downward pass)
      π_2[C_2] = π_2^0[C_2] ∏_{i ∈ neighbors(2)} δ_{i→2}

[Figure: the same network and clique tree as above.]
Clique Tree Calibration

- Theorem
  - The "belief" π_i computed for each clique i as above,
        π_i[C_i] = π_i^0[C_i] ∏_{j ∈ neighbors(i)} δ_{j→i},
    satisfies  π_i[C_i] = Σ_{U − C_i} P_F(U)
- Important: avoid double-counting!
  - Each clique i computes the message to its neighbor j using its initial potential π_i^0
    and not its updated potential ("belief") π_i, since π_i integrates information from C_j,
    which would otherwise be counted twice:
        π_3[C_3] = π_3^0[C_3] ∏_{i ∈ neighbors(3)} δ_{i→3}
Acknowledgement

- These lecture notes were generated based on the slides from Prof. Eran Segal.