Efficient Principled Learning of Junction Trees
Anton Chechetka and Carlos Guestrin
Carnegie Mellon University
Motivation
Probabilistic graphical models are
everywhere
- Medical diagnosis, datacenter performance
monitoring, sensor nets, …
Main advantages
- Compact representation of probability
distributions
- Exploit structure to speed up inference
Junction trees
Trees where each node is a set of variables
- Running intersection property: every clique on the path between Ci and Cj contains Ci∩Cj
- If Ci and Cj are neighbors, Sij = Ci∩Cj is called a separator
- Example: [Figure: a small junction tree with cliques (e.g. AB, B) and their separators marked]
But also problems
- Compact representation ≠ tractable inference
- Exact inference is #P-complete in general
- Often still need exponential time even for compact models
- Example: ≤4 neighbors per variable (a constant!), but inference still hard

Constraint-based learning
[Figure: example junction tree with cliques AB, BC, CD, BE, EG, EF joined by numbered edges 1–6]
- Notation: Vij is the set of all variables on the same side of edge i–j as clique Cj:
  - V34={G,F}, V31={A}, V43={A,D}
- Encoded independencies: (Vij ⊥ Vji | Sij)
We address both of these issues! We provide
- an efficient structure learning algorithm
- guaranteed to learn tractable models
- with global guarantees on the quality of the results
This work: contributions
- The first polynomial-time algorithm with PAC guarantees for learning low-treewidth graphical models
- Guaranteed tractable inference!
- Key theoretical insight: a polynomial-time upper bound on conditional mutual information for arbitrarily large sets of variables
- Demonstration of empirical viability

Question: if S is a separator of an ε-JT, which variables are on the same side of S?
- More than one correct answer is possible:
  [Figure: cliques AB, AC, AD meeting at the shared variable A]
  S={A}: {B}, {C,D} OR {B,C}, {D}
- We will settle for finding one
- This drops the complexity from exponential to polynomial
Key theoretical result
Efficient upper bound for I(A, B | S)
Intuition: Suppose a distribution P(V) can be well approximated by a junction tree with clique size k. Then for every set S⊆V of size k and A, B⊆V of arbitrary size, to check that I(A, B | S) is small, it is enough to check, for all subsets X⊆A, Y⊆B of size at most k, that I(X, Y | S) is small.
Intuition: Consider the set of variables Q={B,C,D}. Suppose an ε-JT (e.g. above) with separator S={A} exists s.t. some of the variables in Q ({B}) are on the left of S and the remaining ones ({C,D}) are on the right. Then a partitioning of Q into X and Y exists s.t. I(X, Y | S) < ε; if no such split exists, all variables of Q must be on the same side of S.
Tractability guarantees:
- Inference is exponential in the clique size k
- Small cliques ⇒ tractable inference

Only need to compute I(X, Y | S) for small X and Y!
JTs as approximations
Often exact conditional independence is too
strict a requirement
- generalization: conditional mutual information
  I(A, B | C) = H(A | C) − H(A | B,C)
- H(· | ·) is conditional entropy
- I(A, B | C) ≥ 0 always
- I(A, B | C) = 0 ⇔ (A ⊥ B | C)
- intuitively: if C is already known, how much new information about A is contained in B?
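For small discrete distributions these quantities can be computed directly. A minimal sketch (not the authors' code): evaluating I(A, B | C) from a joint probability table via the identity I(A,B | C) = H(A | C) − H(A | B,C); the tuple-keyed distribution format and function names are illustrative choices.

```python
from collections import defaultdict
from math import log2

def cond_entropy(p, tgt, cond):
    """H(tgt | cond) in bits.  p maps full assignments (tuples) to
    probabilities; tgt and cond are lists of variable indices."""
    joint = defaultdict(float)   # P(tgt, cond)
    marg = defaultdict(float)    # P(cond)
    for x, px in p.items():
        joint[tuple(x[i] for i in tgt + cond)] += px
        marg[tuple(x[i] for i in cond)] += px
    h = 0.0
    for key, pj in joint.items():
        if pj > 0:
            h -= pj * log2(pj / marg[key[len(tgt):]])
    return h

def cond_mutual_info(p, a, b, c):
    """I(A, B | C) = H(A | C) - H(A | B,C); always >= 0."""
    return cond_entropy(p, a, c) - cond_entropy(p, a, b + c)
```

For a distribution where X0 ⊥ X1 | X2 by construction, `cond_mutual_info(p, [0], [1], [2])` comes out as 0 (up to float error), while the unconditional I(X0, X1) is strictly positive.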
Computation time is reduced from exponential in |V| to polynomial!
The set S does not have to relate to the separators of the "true" JT in any way!
Approximation quality guarantee:
Theorem [Narasimhan and Bilmes, UAI05]: If for every separator Sij in the junction tree it holds that the conditional mutual information satisfies
  I(Vij, Vji | Sij) < ε   (call such a tree an ε-junction tree)
then
  KL(P || Ptree) < |V|·ε
Goal: find an ε-junction tree with fixed clique size k in time polynomial in |V|
Alg. 1 (given candidate separator S and threshold δ):
- each variable of V\S starts out as a separate partition
- for every Q⊆V\S of size at most k+2
  - if min over splits X⊂Q of I(X, Q\X | S) > δ   (sets of fixed size, regardless of |Q|)
  - merge all partitions that have variables in Q
Example (δ=0.25): [Figure: candidate merges with I(·,· | S) values 0.3, 0.4, 0.2]
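Alg. 1 can be sketched as follows. `cmi` stands in for a conditional-mutual-information estimator (a placeholder, not the paper's estimator), and the merge-by-overwrite partition bookkeeping is one possible implementation, not the authors' code.

```python
from itertools import combinations

def alg1(variables, S, k, delta, cmi):
    """Sketch of Alg. 1: partition V\\S into groups that are (almost)
    conditionally independent given S.  cmi(X, Y, S) is assumed to
    return an estimate of I(X, Y | S).  If no split of Q into X, Q\\X
    has small conditional MI, all of Q's variables are merged."""
    rest = [v for v in variables if v not in S]
    part = {v: {v} for v in rest}          # each variable starts alone
    for size in range(2, k + 3):           # |Q| <= k+2
        for Q in combinations(rest, size):
            # smallest I over all ways to split Q into X and Q\X
            best = min(
                cmi(set(X), set(Q) - set(X), S)
                for r in range(1, size)
                for X in combinations(Q, r)
            )
            if best > delta:               # Q cannot be split: merge
                merged = set().union(*(part[v] for v in Q))
                for v in merged:
                    part[v] = merged
    seen, out = set(), []                  # deduplicate the partitions
    for v in rest:
        key = frozenset(part[v])
        if key not in seen:
            seen.add(key)
            out.append(part[v])
    return out
```

With a toy oracle in which {0,1} and {2,3} are internally dependent but mutually independent given S, the sketch recovers exactly those two groups.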
Theorem 1: Suppose an ε-JT of treewidth k exists for P(V). Suppose the sets S⊆V of size k and A⊆V\S of arbitrary size are s.t. for every X⊆V\S of size k+1 it holds that
  I(X∩A, X∩(V\(S∪A)) | S) < δ
then
  I(A, V\(S∪A) | S) < |V|(ε+δ)
Complexity: O(n^{k+1}). Polynomial in n, instead of O(exp(n)) for the straightforward computation
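The premise of Theorem 1 is checkable by enumerating the O(n^{k+1}) subsets of size k+1. A hedged sketch, with `cmi` again a placeholder estimator rather than anything from the paper:

```python
from itertools import combinations

def small_set_bound_holds(variables, S, A, k, delta, cmi):
    """Check Theorem 1's premise: for every X in V\\S with |X| = k+1,
    the split of X induced by A has conditional MI below delta.
    If so, I(A, V\\(S u A) | S) < |V|*(eps + delta) is guaranteed.
    Costs O(n^(k+1)) cmi evaluations instead of one exponential one."""
    rest = [v for v in variables if v not in S]
    for X in combinations(rest, k + 1):
        XA = set(X) & set(A)
        XB = set(X) - set(A)
        if XA and XB and cmi(XA, XB, S) >= delta:
            return False
    return True
```

Only sets of size k+1 are ever passed to `cmi`, which is what makes the bound polynomial-time.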
Constructing a junction tree
[Figure: running Alg. 1 — test an edge and merge variables; where I(·) is too low, do not merge; merging continues to the end result]
Theorem (result quality): If, after invoking Alg.1(S, δ=ε), a set U is a connected component, then
- for every Z s.t. I(Z, V\(Z∪S) | S) < ε it holds that U⊆Z   (never mistakenly puts variables together)
- I(U, V\(U∪S) | S) < nk·ε   (incorrect splits are not too bad)
Complexity: O(n^{k+3}). Polynomial in n.
Theoretical guarantees
Using Alg.1 for every S⊆V, obtain a list L of pairs (S,Q) s.t. I(Q, V\(S∪Q) | S) < |V|(ε+δ)
Example:
[Figure: junction tree with cliques AB, BC, CD, BE, EF and the resulting list L of (S,Q) pairs — separators such as {B}, {C}, {E} paired with components such as {A}, {C,D}, {E,F}, {A,B,E,F}, {A,B,C,D}]
Key insight [Arnborg+al:SIAM-JADM-1987, Narasimhan+Bilmes:UAI-05]:
In a junction tree, components (S,Q) have a recursive decomposition:
[Figure: a clique in the junction tree decomposing into smaller components from L]
Intuition: if the intra-clique dependencies are strong enough, we are guaranteed to find a well-approximating JT in polynomial time.
Problem: From L, reconstruct a junction tree. This is non-trivial. Complications:
- L may encode more independencies than a single JT encodes
- Several different JTs may be consistent with the independencies in L
[Figure: possible pairwise partitionings of A and B for computing I(A, B | S)]
But structure learning is hard:
- Often we do not even have the structure, only data
- The best structure is NP-complete to find
- Most structure learning algorithms return complex models, where inference is intractable
- Very few structure learning algorithms have global quality guarantees
Complexity:                                          Naïve      Our work
- for every candidate sep. S of size k:              n^k        n^k
- for every X⊆V\S:                                   O(2^n)     O(n^{k+3})
- if I(X, V\(S∪X) | S) < ε, add (S,X) to the
  "list of useful components" L:                     O(2^n)     O(2^{4k+4})
- find a JT consistent with L:                       O(2^n)     O(n^{k+2})
Finding almost independent subsets
Theorem: Suppose a maximal ε-JT of treewidth k exists for P(V) s.t. for every clique C and separator S of the tree it holds that
  min over X⊆(C\S) of I(X, (C\S)\X | S) > (k+3)(ε+δ)
then our algorithm will find a k|V|(ε+δ)-JT for P(V) with probability at least (1−γ) using
  O( (2^{4k+4}/ε²) · log²(1/ε) · log(n/γ) )
Experimental results
Model quality (log-likelihood on test set). Compare this work with:
- ordering-based search (OBS) [Teyssier+Koller:UAI-05]
- Chow-Liu alg. [Chow+Liu:IEEE-1968]
- Karger-Srebro alg. [Karger+Srebro:SODA-01]
- local search
- this work + local search combination (using our algorithm to initialize local search)
samples and
  O( (n^{2k+3} · 2^{4k+4}/ε²) · log²(1/ε) · log(n/γ) )
time
Corollary: Maximal JTs of fixed treewidth s.t. for every clique C and separator S it holds that
  min over X⊆(C\S) of I(X, (C\S)\X | S) > α
for fixed α>0 are efficiently PAC learnable
Data: Beinlich+al:ECAIM-1988 — 37 variables, treewidth 4, learned treewidth 3
Look for such recursive decompositions in L!
DP algorithm (input: list L of pairs (S,Q)):
- sort L in order of increasing |Q|
- mark (S,Q)∈L with |Q|=1 as positive
- for (S,Q)∈L with |Q|≥2, in the sorted order:
  - if ∃ x∈Q and (S1,Q1), …, (Sm,Qm) ∈ L s.t.
    - Si ⊆ S∪{x}, and each (Si,Qi) is positive
    - Qi∩Qj = ∅
    - ∪i=1:m Qi = Q\{x}   (NP-complete to decide; we use a greedy heuristic)
  - then mark (S,Q) positive
    - decomposition(S,Q) = (S1,Q1), ..., (Sm,Qm)
- if ∃S s.t. all (S,Qi)∈L are positive
  - return the corresponding junction tree
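The marking pass of the DP can be sketched as follows. Since the decomposition step is NP-complete in general, this toy version uses exhaustive subset search (fine only for tiny lists L), whereas the poster's algorithm uses a greedy heuristic; the representation of L as (frozenset, frozenset) pairs is an illustrative choice.

```python
from itertools import combinations

def mark_positive(L):
    """Sketch of the DP pass: mark (S, Q) pairs that decompose
    recursively.  L is a list of (frozenset S, frozenset Q) pairs,
    processed in order of increasing |Q|."""
    L = sorted(L, key=lambda sq: len(sq[1]))
    positive = set()
    for S, Q in L:
        if len(Q) == 1:                       # base case: singletons
            positive.add((S, Q))
            continue
        done = False
        for x in Q:
            rest = Q - {x}
            # positive sub-components usable under separator S | {x}
            cands = [(Si, Qi) for Si, Qi in positive
                     if Si <= S | {x} and Qi <= rest]
            # exhaustive search for an exact disjoint cover of rest
            for m in range(1, len(cands) + 1):
                for combo in combinations(cands, m):
                    qs = [q for _, q in combo]
                    if (sum(map(len, qs)) == len(rest)
                            and frozenset().union(*qs) == rest):
                        positive.add((S, Q))
                        done = True
                        break
                if done:
                    break
            if done:
                break
    return positive
```

For a small L containing ({1},{0}), ({1},{2}), and ({1},{0,2}), the pair ({1},{0,2}) is marked positive because removing x=0 leaves {2}, exactly covered by the positive pair ({1},{2}).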
Greedy heuristic for decomposition search:
- initialize the decomposition to empty
- iteratively add pairs (Si,Qi) that do not conflict with those already in the decomposition
- if all variables of Q are covered, success
- may fail even if a decomposition exists
- but we prove that for certain distributions it is guaranteed to work
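The greedy heuristic might look like the sketch below; the largest-component-first ordering is an assumption for illustration, not something stated on the poster.

```python
def greedy_decomposition(S, Q, x, candidates):
    """Greedy search for a decomposition of (S, Q) after removing x:
    pick positive pairs (Si, Qi) with Si inside S | {x} whose Qi do not
    overlap pairs already chosen; succeed iff Q \\ {x} is exactly
    covered.  May fail even when a decomposition exists (greedy, not
    exhaustive).  candidates = the positive pairs from the list L."""
    target = set(Q) - {x}
    allowed_sep = set(S) | {x}
    chosen, covered = [], set()
    if not target:
        return chosen
    # try larger components first (an illustrative ordering choice)
    for Si, Qi in sorted(candidates, key=lambda sq: -len(sq[1])):
        Qi = set(Qi)
        if set(Si) <= allowed_sep and Qi <= target and not (Qi & covered):
            chosen.append((Si, Qi))
            covered |= Qi
        if covered == target:
            return chosen
    return None                               # greedy failure
```

Non-conflicting pairs accumulate until either the target set is exactly covered (success) or the candidates run out (failure).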
Related work

Ref.        Model       Guarantees     Time
[1,2]       tractable   local          poly(n)
[3]         tree        global         O(n² log n)
[4]         tree mix    local          O(n² log n)
[5]         compact     local          poly(n)
[6]         all         global         exp(n)
[7]         tractable   const-factor   poly(n)
[8]         compact     PAC            poly(n)
[9]         tractable   PAC            exp(n)
this work   tractable   PAC            poly(n)
[1] Bach+Jordan:NIPS-02
[2] Choi+al:UAI-05
[3] Chow+Liu:IEEE-1968
[4] Meila+Jordan:JMLR-01
[5] Teyssier+Koller:UAI-05
[6] Singh+Moore:CMU-CALD-05
[7] Karger+Srebro:SODA-01
[8] Abbeel+al:JMLR-06
[9] Narasimhan+Bilmes:UAI-04
Data: Deshpande+al:VLDB-04 — 54 variables, treewidth 2
Data: Krause+Guestrin:UAI-05 — 32 variables, treewidth 3
Future work
- Extend to non-maximal junction trees
- Heuristics to speed up performance
- Use information about edge likelihoods (e.g. from L1-regularized logistic regression) to cut down on computation