On the Relationship between Sum-Product
Networks and Bayesian Networks
Han Zhao, Mazen Melibari and Pascal Poupart
Presented by: Han Zhao
May 13, 2015
1 / 90
Outline
Background
Bayesian Network
Algebraic Decision Diagram
Sum-Product Network
Main Result
Main Theorems
Sum-Product Network to Bayesian Network
Bayesian Network to Sum-Product Network
Discussion
2 / 90
Bayesian Network
Definition
A graphical representation of a set of random variables X1:N and
their conditional dependencies.
▶ Nodes correspond to random variables (observable or latent) and
  edges represent conditional dependencies between pairs of variables.
▶ Each node is associated with a local conditional probability
  distribution (CPD).
▶ The graph is a directed acyclic graph (DAG).

[Figure: a BN with edges H1 → X and H2 → X, with the following CPTs]

Pr(H1):  Pr(H1 = 0) = 0.35,  Pr(H1 = 1) = 0.65
Pr(H2):  Pr(H2 = 0) = 0.2,   Pr(H2 = 1) = 0.8

Pr(X | H1, H2):
H1  H2 | Pr(X = 0)  Pr(X = 1)
0   0  |   0.3        0.7
0   1  |   0.3        0.7
1   0  |   0.3        0.7
1   1  |   0.4        0.6

5 / 90
Bayesian Network
Definition
Local Markov property: each variable is conditionally independent
of its non-descendants given its parents.

Pr(X1:N) = Π_{i=1}^{N} Pr(Xi | X1:i−1) = Π_{i=1}^{N} Pr(Xi | Pa(Xi))

Independence is induced by the structure of the BN.
7 / 90
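As a concrete illustration of this factorization, here is a minimal Python sketch (not part of the original slides) that multiplies the CPDs of the H1, H2, X example above; the dictionary names are illustrative.

# BN factorization Pr(H1, H2, X) = Pr(H1) Pr(H2) Pr(X | H1, H2),
# using the CPT values from the example slide above.
pr_h1 = {0: 0.35, 1: 0.65}
pr_h2 = {0: 0.2, 1: 0.8}
pr_x_given_h = {  # (h1, h2) -> (Pr(X = 0), Pr(X = 1))
    (0, 0): (0.3, 0.7), (0, 1): (0.3, 0.7),
    (1, 0): (0.3, 0.7), (1, 1): (0.4, 0.6),
}

def joint(h1, h2, x):
    # Local Markov property: each factor conditions only on the node's parents.
    return pr_h1[h1] * pr_h2[h2] * pr_x_given_h[(h1, h2)][x]

# The eight joint entries sum to 1.
assert abs(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)) - 1.0) < 1e-9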
Bayesian Network
Example
A Bayesian Network over 5 random variables:

[Figure: a BN with edges X1 → X3, X2 → X3, X3 → X4 and X3 → X5]

Pr(X1:5) = Pr(X5 | X1:4) Pr(X4 | X1:3) Pr(X3 | X1:2) Pr(X2 | X1) Pr(X1)
         = Pr(X5 | X3) Pr(X4 | X3) Pr(X3 | X1, X2) Pr(X2) Pr(X1)

9 / 90
Bayesian Network
Inference
Joint, marginal and conditional probabilistic queries in a Bayesian
Network. Consider the marginal query Pr(XN = True):

Pr(XN = True) = Σ_{X1} ··· Σ_{XN−1} Pr(X1:N−1, XN = True)

Naive enumeration is exponential in the number of variables.
Factorization helps to reduce the inference complexity by taking
advantage of the distributive law of × over +:

Pr(X5 = True) = Σ_{X1:4} Pr(X1:4, X5 = True)
 = Σ_{X1:4} Pr(X5 = True | X3) Pr(X4 | X3) Pr(X3 | X1, X2) Pr(X2) Pr(X1)
 = Σ_{X3} Pr(X5 = True | X3) Σ_{X4} Pr(X4 | X3) Σ_{X2} Pr(X2) Σ_{X1} Pr(X1) Pr(X3 | X1, X2)

15 / 90
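To make the savings concrete, here is a hedged Python sketch (not from the slides) for the five-variable example; the CPT values are made up since the slide only gives the structure, and the code only checks that naive enumeration and the rearranged, pushed-in sums agree.

import itertools, random

random.seed(0)

def rand_dist():
    p = random.random()
    return (p, 1.0 - p)          # (Pr(·=0), Pr(·=1))

# Hypothetical CPTs for the structure X1, X2 -> X3, X3 -> X4, X3 -> X5.
p1, p2 = rand_dist(), rand_dist()
p3 = {(a, b): rand_dist() for a in (0, 1) for b in (0, 1)}   # Pr(X3 | X1, X2)
p4 = {c: rand_dist() for c in (0, 1)}                        # Pr(X4 | X3)
p5 = {c: rand_dist() for c in (0, 1)}                        # Pr(X5 | X3)

# Naive enumeration: sum over all 2^4 assignments of X1..X4.
naive = sum(p5[x3][1] * p4[x3][x4] * p3[(x1, x2)][x3] * p2[x2] * p1[x1]
            for x1, x2, x3, x4 in itertools.product((0, 1), repeat=4))

# Distributive-law ordering: push each sum inside as far as possible.
factored = sum(p5[x3][1]
               * sum(p4[x3][x4] for x4 in (0, 1))
               * sum(p2[x2] * sum(p1[x1] * p3[(x1, x2)][x3] for x1 in (0, 1))
                     for x2 in (0, 1))
               for x3 in (0, 1))

assert abs(naive - factored) < 1e-12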
Bayesian Network
Inference
Exact inference algorithms for Bayesian Networks:
▶ Variable Elimination / Sum-Product algorithm
▶ Belief Propagation / Message Passing algorithm

General question: compute Σ_{XH ⊆ X} Π_{n=1}^{N} Pr(Xn | Pa(Xn))

All take advantage of the distributivity of × over + (and can be
extended to any semiring).

1: π ← an ordering of the hidden variables to be eliminated
2: Φ ← {TH | H is a hidden variable}
3: for each hidden variable H in π do
4:     P ← {TX | TX includes H}
5:     Φ ← (Φ \ P) ∪ { Σ_H Π_{T∈P} T }
6: end for

18 / 90
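Below is a compact Python sketch (illustrative, not the authors' code) of the elimination loop above, with each factor stored as a pair of a variable list and a table over 0/1 assignments.

from itertools import product

def multiply(f, g):
    # Pointwise product of two factors; a factor is (variables, {assignment: value}).
    fv, ft = f
    gv, gt = g
    out_vars = fv + [v for v in gv if v not in fv]
    table = {}
    for asg in product((0, 1), repeat=len(out_vars)):
        a = dict(zip(out_vars, asg))
        table[asg] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out_vars, table

def sum_out(f, var):
    # Sum a variable out of a factor (the inner sum in line 5).
    fv, ft = f
    keep = [v for v in fv if v != var]
    table = {}
    for asg, val in ft.items():
        key = tuple(x for v, x in zip(fv, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return keep, table

def variable_elimination(factors, order):
    # factors: list of factors; order: the ordering pi of hidden variables.
    phi = list(factors)
    for h in order:                                   # line 3
        p = [f for f in phi if h in f[0]]             # line 4: factors that include H
        if not p:
            continue
        rest = [f for f in phi if h not in f[0]]
        prod = p[0]
        for f in p[1:]:
            prod = multiply(prod, f)
        phi = rest + [sum_out(prod, h)]               # line 5
    result = phi[0]
    for f in phi[1:]:
        result = multiply(result, f)
    return result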
Algebraic Decision Diagram
Motivation
How to represent the conditional probability distribution (CPD)
associated with each variable in a Bayesian Network?

Tabular representation
A real function over 4 Boolean variables:

X1 X2 X3 X4   f(·)        X1 X2 X3 X4   f(·)
0  0  0  0    0.4         1  0  0  0    0.4
0  0  0  1    0.6         1  0  0  1    0.6
0  0  1  0    0.3         1  0  1  0    0.3
0  0  1  1    0.3         1  0  1  1    0.3
0  1  0  0    0.4         1  1  0  0    0.1
0  1  0  1    0.6         1  1  0  1    0.1
0  1  1  0    0.3         1  1  1  0    0.1
0  1  1  1    0.3         1  1  1  1    0.1

Observation: Once X1 = 0, the value of the function is
independent of the value taken by X2.

21 / 90
Algebraic Decision Diagram
Motivation
Let X, Y and Z be three random variables.

Independence
X ⊥⊥ Y ⇒ ∀x, y: Pr(x, y) = Pr(x) Pr(y)

Conditional Independence
X ⊥⊥ Y | Z ⇒ ∀x, y, z: Pr(x, y | z) = Pr(x | z) Pr(y | z)

Context-Specific Independence (CSI)
X ⊥⊥ Y | Z = z ⇒ ∃z ∀x, y: Pr(x, y | z) = Pr(x | z) Pr(y | z)

Both independence and conditional independence can be encoded
in the structure of a Bayesian Network, but CSI cannot.

26 / 90
Algebraic Decision Diagram
Motivation
Decision Tree representation
Use a decision tree to capture the context-specific dependencies.

[Figure: a decision tree for the function f from the previous table. The root
tests X1; the X1 = 0 branch tests X3 and, when X3 = 0, X4, with leaves 0.4,
0.6 and 0.3; the X1 = 1 branch tests X2, leading either to the leaf 0.1 or to
an identical sub-tree over X3 and X4.]

X2 does not appear in the left branch of X1, and X4 does not
appear in the branch where X3 takes value 1.

27 / 90
Algebraic Decision Diagram
Motivation
Algebraic Decision Diagram
Decision Trees cannot reuse isomorphic sub-graphs.

[Figure: the decision tree from the previous slide contains two identical
sub-trees over X3 and X4 (one under X1 = 0, one under X1 = 1, X2 = 0); the
corresponding Algebraic Decision Diagram merges them into a single shared
sub-graph.]

Using directed acyclic graphs instead of trees!

31 / 90
Algebraic Decision Diagram
Discussion
▶ An Algebraic Decision Diagram (ADD) is a data structure that compactly
  encodes any discrete function with finite support.
▶ Context-Specific Independence (CSI) can be encoded using an
  Algebraic Decision Diagram (better than the tabular representation).
▶ ADDs efficiently avoid the replication problem by reusing isomorphic
  sub-graphs (better than the decision tree representation).
▶ We use Algebraic Decision Diagrams to encode the local CPDs in a
  Bayesian Network.

32 / 90
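As a rough illustration (an assumption-laden sketch, not the paper's ADD package), an ADD can be built with a "unique table" so that isomorphic sub-graphs are constructed only once; the running table example then shares its X3/X4 sub-diagram between the contexts X1 = 0 and (X1 = 1, X2 = 0).

class ADD:
    # A reduced ADD: identical (var, low, high) triples are shared via a unique table.
    _unique = {}

    def __init__(self, var=None, low=None, high=None, value=None):
        self.var, self.low, self.high, self.value = var, low, high, value

    @classmethod
    def leaf(cls, value):
        key = ('leaf', value)
        if key not in cls._unique:
            cls._unique[key] = cls(value=value)
        return cls._unique[key]

    @classmethod
    def node(cls, var, low, high):
        if low is high:                      # redundant test: both branches identical
            return low
        key = (var, id(low), id(high))
        if key not in cls._unique:
            cls._unique[key] = cls(var, low, high)
        return cls._unique[key]

# The function from the table: the sub-diagram over X3, X4 is built once and
# reused; 'shared' appears twice in f but is a single object, unlike in a tree.
shared = ADD.node('X3',
                  ADD.node('X4', ADD.leaf(0.4), ADD.leaf(0.6)),
                  ADD.leaf(0.3))
f = ADD.node('X1', shared, ADD.node('X2', shared, ADD.leaf(0.1)))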
Sum-Product Network
Definition
A Sum-Product Network (SPN) is a
▶ directed acyclic graph of indicator variables, sum nodes and
  product nodes;
▶ each edge emanating from a sum node is associated with a
  non-negative weight;
▶ the value of a product node is the product of its children;
▶ the value of a sum node is the weighted sum of its children.

33 / 90
Sum-Product Network
Example

[Figure: an SPN over the indicators Ix1, Ix̄1, Ix2, Ix̄2. The root is a sum
node with weights 10, 6 and 9 over three product nodes, whose children are the
sum nodes 6Ix1 + 4Ix̄1, 9Ix1 + Ix̄1, 6Ix2 + 14Ix̄2 and 2Ix2 + 8Ix̄2.]

f(Ix1, Ix̄1, Ix2, Ix̄2) = 10(6Ix1 + 4Ix̄1)(6Ix2 + 14Ix̄2) + 6(6Ix1 +
4Ix̄1)(2Ix2 + 8Ix̄2) + 9(9Ix1 + Ix̄1)(2Ix2 + 8Ix̄2) =
594Ix1Ix2 + 1776Ix1Ix̄2 + 306Ix̄1Ix2 + 824Ix̄1Ix̄2

34 / 90
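A small Python sketch (not part of the slides) that evaluates this example SPN bottom-up and checks the expanded coefficients of the network polynomial:

def spn_value(ix1, inx1, ix2, inx2):
    # Bottom-up evaluation of the example SPN; inx1, inx2 are the negated indicators.
    s_x1_a = 6 * ix1 + 4 * inx1     # sum nodes over the X1 indicators
    s_x1_b = 9 * ix1 + 1 * inx1
    s_x2_a = 6 * ix2 + 14 * inx2    # sum nodes over the X2 indicators
    s_x2_b = 2 * ix2 + 8 * inx2
    return 10 * s_x1_a * s_x2_a + 6 * s_x1_a * s_x2_b + 9 * s_x1_b * s_x2_b

# The four coefficients of the network polynomial, as on the slide.
assert spn_value(1, 0, 1, 0) == 594     # Ix1 Ix2
assert spn_value(1, 0, 0, 1) == 1776    # Ix1 Ix̄2
assert spn_value(0, 1, 1, 0) == 306     # Ix̄1 Ix2
assert spn_value(0, 1, 0, 1) == 824     # Ix̄1 Ix̄2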
Sum-Product Network
Inference
Joint/Marginal/Conditional queries can be answered in time linear in
the size of the Sum-Product Network.

Joint Inference
Pr(X1 = 1, X2 = 0)? Setting Ix1 = 1, Ix̄1 = 0, Ix2 = 0, Ix̄2 = 1 and
evaluating the network bottom-up:

[Figure: the example SPN under this setting; the leaf sum nodes evaluate to
6, 9, 14 and 8, the product nodes to 84, 48 and 72, and the root to 1776.]

Pr(X1 = 1, X2 = 0) = 1776 / (594 + 1776 + 306 + 824) = 1776 / 3500

40 / 90
Sum-Product Network
Inference
Joint/Marginal/Conditional queries can be answered in time linear in
the size of the Sum-Product Network.

Marginal Inference
Pr(X1 = 1)? Setting Ix1 = 1, Ix̄1 = 0, Ix2 = 1, Ix̄2 = 1 and
evaluating the network bottom-up:

[Figure: the example SPN under this setting; the leaf sum nodes evaluate to
6, 9, 20 and 10, the product nodes to 120, 60 and 90, and the root to 2370.]

Pr(X1 = 1) = 2370 / (594 + 1776 + 306 + 824) = 2370 / 3500

46 / 90
Sum-Product Network
Inference
Joint/Marginal/Conditional queries can be answered in time linear in
the size of the Sum-Product Network.

Conditional Inference
Pr(X2 = 0 | X1 = 1)?

Pr(X2 = 0 | X1 = 1) = Pr(X1 = 1, X2 = 0) / Pr(X1 = 1)

Two passes through the Sum-Product Network: one to compute
Pr(X1 = 1, X2 = 0), the other to compute Pr(X1 = 1).

49 / 90
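Putting the three query types together, a hedged Python sketch (illustrative only) using the evaluation function for the example SPN:

def spn_value(ix1, inx1, ix2, inx2):
    # Bottom-up evaluation of the example SPN; inx1, inx2 are the negated indicators.
    s1a, s1b = 6 * ix1 + 4 * inx1, 9 * ix1 + 1 * inx1
    s2a, s2b = 6 * ix2 + 14 * inx2, 2 * ix2 + 8 * inx2
    return 10 * s1a * s2a + 6 * s1a * s2b + 9 * s1b * s2b

Z = spn_value(1, 1, 1, 1)              # normalization constant: 3500
joint = spn_value(1, 0, 0, 1) / Z      # Pr(X1 = 1, X2 = 0) = 1776 / 3500
marginal = spn_value(1, 0, 1, 1) / Z   # Pr(X1 = 1) = 2370 / 3500 (both X2 indicators set to 1)
conditional = joint / marginal         # Pr(X2 = 0 | X1 = 1): two passes through the network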
Sum-Product Network
Deep Learning Perspective
Deep structure
▶ Sum node ⇔ weighted linear activation function
▶ Product node ⇔ component-wise nonlinear activation function

50 / 90
Sum-Product Network
Definition
Definition (Scope)
The scope of a node in an SPN is defined as the set of variables
that have indicators among the node's descendants: for any node
v in an SPN, if v is a terminal node, say, an indicator variable over
X, then scope(v) = {X}; otherwise scope(v) = ∪_{ṽ∈Ch(v)} scope(ṽ).

Definition (Complete)
An SPN is complete iff each sum node has children with the same
scope.

Definition (Consistent)
An SPN is consistent iff no variable appears negated in one child of
a product node and non-negated in another.

51 / 90
Sum-Product Network
Definition
Definition (Decomposable)
An SPN is decomposable iff for every product node v,
scope(vi) ∩ scope(vj) = ∅ for all vi, vj ∈ Ch(v), i ≠ j.

Definition (Valid)
An SPN is said to be valid iff it defines an (unnormalized)
probability distribution.

Theorem (Poon and Domingos)
If an SPN S is complete and consistent, then it is valid.

A valid SPN induces an (unnormalized) probability distribution via the
network polynomial defined by the root of the SPN.

53 / 90
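The scope recursion and the two structural conditions translate directly into code; here is a minimal sketch, assuming a nested-tuple SPN representation of my own choosing.

# Illustrative node representation: ('leaf', X), ('sum', children, weights), ('prod', children).
def scope(node):
    if node[0] == 'leaf':
        return {node[1]}
    out = set()
    for c in node[1]:
        out |= scope(c)
    return out

def complete(node):
    # Every sum node has children with identical scopes (checked recursively).
    if node[0] == 'leaf':
        return True
    ok = all(complete(c) for c in node[1])
    if node[0] == 'sum':
        scopes = [scope(c) for c in node[1]]
        ok = ok and all(s == scopes[0] for s in scopes)
    return ok

def decomposable(node):
    # Every product node has children with pairwise disjoint scopes (checked recursively).
    if node[0] == 'leaf':
        return True
    ok = all(decomposable(c) for c in node[1])
    if node[0] == 'prod':
        seen = set()
        for c in node[1]:
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return ok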
Main Theorems
SPN-BN
Let |S| be the size of the SPN, i.e., the number of nodes plus the
number of edges in the graph. For a BN B, the size of B, |B|, is
defined as the size of the graph plus the size of all the CPDs in B.

Theorem (SPN-BN)
There exists an algorithm that converts any complete and
decomposable SPN S over Boolean variables X1:N into a BN B
with CPDs represented by ADDs in time O(N|S|). Furthermore, S
and B represent the same distribution and |B| = O(N|S|).

Corollary (SPN-BN)
There exists an algorithm that converts any complete and
consistent SPN S over Boolean variables X1:N into a BN B with
CPDs represented by ADDs in time O(N|S|²). Furthermore, S and
B represent the same distribution and |B| = O(N|S|²).

56 / 90
Main Theorems
SPN-BN
Remark
The BN B generated from S has a simple bipartite DAG structure,
where all the source nodes are hidden variables and the terminal
nodes are the Boolean variables X1:N.

Remark
Assuming sum nodes alternate with product nodes in the SPN S, the
depth of S is proportional to the maximum in-degree of the nodes
in B, which, as a result, is proportional to a lower bound on the
tree-width of B.

59 / 90
Main Theorems
BN-SPN
Theorem (BN-SPN)
Given the BN B with ADD representation of CPDs generated from
a complete and decomposable SPN S over Boolean variables X1:N,
the original SPN S can be recovered by applying the Variable
Elimination algorithm to B in time O(N|S|).

Remark
Together, the two theorems show that for distributions for which
SPNs allow a compact representation and efficient inference, BNs
with ADDs also allow a compact representation and efficient
inference (i.e., there is no exponential blow-up).

61 / 90
Road Map
1. Define normal SPNs, a sub-class of SPNs.
2. Taking advantage of normal SPNs, show a linear transformation
   from SPN to BN.
3. Taking advantage of the Variable Elimination algorithm on ADDs,
   show a linear transformation from BN to SPN.

64 / 90
Normal Sum-Product Network
Definition
Definition (Normal Sum-Product Network)
An SPN is said to be normal if
1. it is complete and decomposable;
2. for each sum node in the SPN, the weights of the edges
   emanating from the sum node are nonnegative and sum to 1;
3. every terminal node is a univariate distribution over a Boolean
   variable, and the size of the scope of a sum node is at least 2
   (sum nodes whose scope is of size 1 are reduced into terminal
   nodes).

Theorem (Normal Transformation)
For any complete and consistent SPN S, there exists a normal
SPN S′ such that PrS(·) = PrS′(·) and |S′| = O(|S|²).

66 / 90
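Conditions 2 and 3 can be obtained by local renormalization when the SPN is already complete and decomposable; here is a hedged Python sketch (a simplification of the paper's full normal transformation, with an assumed nested-tuple representation).

# Assumed representation:
#   ('leaf', X, (a, b))          meaning a*I[X = 1] + b*I[X = 0]
#   ('sum', children, weights), ('prod', children)
# z(node) is the node's value with all indicators set to 1 (its local normalizer).
def z(node):
    if node[0] == 'leaf':
        return node[2][0] + node[2][1]
    if node[0] == 'prod':
        out = 1.0
        for c in node[1]:
            out *= z(c)
        return out
    return sum(w * z(c) for c, w in zip(node[1], node[2]))

def normalize(node):
    # Locally renormalize so leaves become univariate distributions and sum-node
    # weights sum to 1; the result defines the same normalized distribution.
    if node[0] == 'leaf':
        a, b = node[2]
        return ('leaf', node[1], (a / (a + b), b / (a + b)))
    if node[0] == 'prod':
        return ('prod', [normalize(c) for c in node[1]])
    children, weights = node[1], node[2]
    new_w = [w * z(c) for c, w in zip(children, weights)]
    total = sum(new_w)
    return ('sum', [normalize(c) for c in children], [w / total for w in new_w])

# On the running example this yields root weights 4/7, 6/35 and 9/35 and terminal
# distributions (0.6, 0.4), (0.9, 0.1), (0.3, 0.7), (0.2, 0.8), matching the next slide.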
Normal Sum-Product Network
Example

[Figure: the example SPN and its normal form. The normal form has a root sum
node with weights 4/7, 6/35 and 9/35 over three product nodes, which pair the
terminal distributions X1: (0.6, 0.4) with X2: (0.3, 0.7), X1: (0.6, 0.4) with
X2: (0.2, 0.8), and X1: (0.9, 0.1) with X2: (0.2, 0.8), respectively.]

▶ Each terminal node is a univariate distribution.
▶ Each internal sum node corresponds to a hidden variable with a
  multinomial distribution, which defines a mixture model.
▶ Each internal product node encodes a rule of context-specific
  independence over its children.

69 / 90
SPN-BN
Structure Construction
Given a normal SPN S over X1:N, construct:
▶ a hidden node Hv for each sum node v in S;
▶ an observable node Xn for each variable Xn in S;
▶ a directed edge from Hv to Xn iff Xn appears in the sub-SPN
  rooted at v in S.

The structure of B is a directed bipartite graph, with a layer of
hidden nodes pointing to a layer of observable nodes.

[Figure: for the normal SPN above, the constructed BN has a single hidden
node H (for the root sum node) pointing to the observable nodes X1 and X2.]

73 / 90
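The three bullet points above map to a short traversal; a minimal Python sketch, assuming the same nested-tuple SPN representation as in the earlier sketches:

def build_bipartite_structure(root):
    # One hidden node H_v per sum node v, one observable node per variable,
    # and an edge H_v -> X_n iff X_n appears in the sub-SPN rooted at v.
    edges, hidden, seen = [], [], {}

    def visit(node):
        if id(node) in seen:          # shared sub-SPNs are visited only once
            return seen[id(node)]
        if node[0] == 'leaf':
            sc = {node[1]}
        else:
            sc = set()
            for c in node[1]:
                sc |= visit(c)
            if node[0] == 'sum':
                h = 'H%d' % len(hidden)
                hidden.append(h)
                edges.extend((h, x) for x in sorted(sc))
        seen[id(node)] = sc
        return sc

    observables = sorted(visit(root))
    return hidden, observables, edges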
SPN-BN
CPD Construction
Given a normal SPN S over X1:N, construct:
▶ a decision stump from the sum node v for each hidden node Hv;
▶ an induced sub-SPN SXn by node set {Xn} from S, and then
  contract all the product nodes in SXn.

[Figure: for the running example, AH is a decision stump over H with leaves
4/7, 6/35 and 9/35 for the three values of H. AX1 maps the hidden value of the
9/35 component to the distribution (0.9, 0.1) over (x1, x̄1) and the other two
hidden values to a shared (0.6, 0.4); AX2 maps the hidden value of the 4/7
component to (0.3, 0.7) over (x2, x̄2) and the other two to a shared (0.2, 0.8).]

76 / 90
SPN-BN
Theorems
Theorem
For any normal SPN S over X1:N, the constructed BN B encodes
the same probability distribution, i.e., PrS(x) = PrB(x), ∀x.

Theorem
There exists an algorithm that, for any normal SPN S over X1:N,
constructs an equivalent BN in time O(N|S|).

Theorem
|B| = O(N|S|), where the BN B is constructed from the normal SPN
S over X1:N.

79 / 90
BN-SPN
Algorithm
Extend Algebraic Decision Diagrams to Symbolic Algebraic Decision
Diagrams, where +, −, ×, / are allowed to be internal nodes.

Example
Given symbolic ADDs AX1 over X1 and AX2 over X2, a symbolic
ADD AX1,X2 over X1, X2 encodes a function over X1 and X2 such
that AX1,X2(x1, x2) ≜ (AX1 ⊗ AX2)(x1, x2) = AX1(x1) × AX2(x2).

Define two operations on symbolic ADDs:
▶ multiplication between pairs of symbolic ADDs;
▶ summing out one internal variable in a symbolic ADD.

82 / 90
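To convey the flavour of the two operations (this is a loose sketch of the idea, not the paper's symbolic ADD implementation), the factor operations from the earlier Variable Elimination sketch can be made symbolic: instead of multiplying and adding numbers, they create × and + expression nodes, so eliminating all hidden variables leaves behind a sum-product graph.

from itertools import product

class Expr:
    # A node of the arithmetic expression graph produced by the symbolic operations.
    def __init__(self, op, args):
        self.op, self.args = op, args   # op in {'const', '+', '×'}

def times(a, b):
    return Expr('×', [a, b])

def plus(terms):
    return Expr('+', terms)

def multiply(f, g):
    # Symbolic product of two factors (vars, {assignment: Expr}); binary variables assumed.
    fv, ft = f
    gv, gt = g
    out_vars = fv + [v for v in gv if v not in fv]
    table = {}
    for asg in product((0, 1), repeat=len(out_vars)):
        a = dict(zip(out_vars, asg))
        table[asg] = times(ft[tuple(a[v] for v in fv)], gt[tuple(a[v] for v in gv)])
    return out_vars, table

def sum_out(f, var):
    # Summing out a variable creates '+' expression nodes instead of adding numbers.
    fv, ft = f
    keep = [v for v in fv if v != var]
    groups = {}
    for asg, expr in ft.items():
        key = tuple(x for v, x in zip(fv, asg) if v != var)
        groups.setdefault(key, []).append(expr)
    return keep, {k: plus(v) for k, v in groups.items()}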
BN-SPN
Algorithm
Theorem (BN-SPN)
There exists a variable ordering such that applying Variable
Elimination with that ordering to the BN with ADDs builds the
original SPN S in O(N|S|).

[Figure: on the running example, multiplying the symbolic ADDs AX1 and AX2
and then summing out H recovers the normal SPN: a root sum node with weights
4/7, 6/35 and 9/35 over products of the terminal distributions (0.6, 0.4),
(0.9, 0.1), (0.3, 0.7) and (0.2, 0.8).]

83 / 90
Discussion
▶ SPNs and BNs with ADDs share the same representational power.
▶ SPNs of any depth ⇔ directed bipartite BNs.
▶ SPNs are a history record, or cache, of the inference process on
  the BN.
▶ The depth of the SPN is linearly proportional to a lower bound on
  the tree-width of the BN.
▶ SPNs can be viewed as hierarchical mixture models with
  reusability.
▶ CSI is the key to allowing linear exact inference on BNs with high
  tree-width.

89 / 90
Thanks
Questions and Answers
short version: ICML 2015
full version: arXiv:1501.01239
90 / 90