Learning Bayesian Networks via Edge Walks on DAG Associahedra
Liam Solus
Based on work with Lenka Matejovicova, Adityanarayanan
Radhakrishnan, Caroline Uhler, and Yuhao Wang
KTH Royal Institute of Technology
[email protected]
18 January 2017
Workshop on Convex Polytopes, Osaka University
Liam Solus (KTH)
Learning Bayesian Networks
18 January 2017
1 / 34
Bayesian Basics
Definitions
Directed Acyclic Graph (DAG) Models
G = ([n], A) a directed acyclic graph (DAG)
[Figure: a DAG on nodes 1, 2, 3, 4]
Each node i is associated to a random variable X_i.
Markov Assumption (MA)
The nonedges of G encode conditional independence (CI) relations capturing cause-effect relationships:
X_i ⊥⊥ X_{nondes(i) \ pa(i)} | X_{pa(i)}
pa(i) := the collection of parents of i.
nondes(i) := the nondescendants of i = [n] \ (des(i) ∪ {i}).
The CI relations implied by the MA are captured by d-separation in G.
Bayesian Basics
Definitions
Directed Acyclic Graph (DAG) Models
Let A, B, C ⊂ [n] be disjoint with A, B ≠ ∅. We say that C d-connects A and B in G if there is an undirected path U from A to B such that
1. every collider on U is in C or has a descendant in C, and
2. no other node on U is in C.
[Figure: a path U in a DAG, illustrating d-connection through colliders]
If there is no such path U, we say A and B are d-separated by C.
The Global Markov Property
A probability distribution P obeys the MA for a DAG G if and only if X_A ⊥⊥ X_B | X_C for all disjoint A, B, C for which C d-separates A and B in G.
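The d-separation test above can be checked mechanically by ancestral moralization: restrict to the ancestors of A ∪ B ∪ C, "marry" parents of common children, drop directions, delete C, and test undirected reachability. A minimal Python sketch (helper names are illustrative, not from the talk):

```python
from itertools import combinations

def ancestors(parents, nodes):
    """All nodes with a directed path into `nodes`, together with `nodes`."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(parents, A, B, C):
    """True iff C d-separates A and B in the DAG given by `parents`.

    `parents` maps each node to its parent list; A, B, C are disjoint sets.
    """
    anc = ancestors(parents, set(A) | set(B) | set(C))
    # Moralize the induced ancestral subgraph: undirected skeleton plus
    # edges between any two parents of a common child.
    adj = {v: set() for v in anc}
    for v in anc:
        ps = [p for p in parents.get(v, []) if p in anc]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for p, q in combinations(ps, 2):
            adj[p].add(q); adj[q].add(p)
    # Delete C and test undirected reachability from A to B.
    blocked = set(C)
    stack = [a for a in A if a not in blocked]
    seen = set(stack)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for w in adj[v] - blocked:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return True
```

On the collider 1 → 2 ← 3, conditioning on the collider 2 d-connects 1 and 3, as the slides illustrate.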
Bayesian Basics
Definitions
Directed Acyclic Graph (DAG) Models
[Figure: DAG with Bio-Mom → Bio-Child ← Bio-Dad and Bio-Child → Bio-Grandchild]
Bio-Mom ⊥⊥ Bio-Grandchild | Bio-Child
Bio-Mom ⊥̸⊥ Bio-Dad | Bio-Child
General Goal
Suppose we obtain data from an unknown DAG G, from which we infer a
collection of CI relations C. Can we learn the DAG G from C?
Algorithms? Consistency guarantees?
Bayesian Basics
Algorithms
The PC and SP Algorithms
The PC-Algorithm (Spirtes, Glymour, and Scheines, 2001):
1. Identify the undirected skeleton.
2. Then orient the edges.
The SP-Algorithm (Uhler and Raskutti, 2014):
1. To each permutation π = π_1 π_2 ⋯ π_n associate a permutation DAG G_π with arrows (π_i, π_j) ∈ E(G_π) if and only if
   i < j and π_i ⊥̸⊥ π_j | {π_1, …, π_{j−1}} \ {π_i}.
2. Choose a sparsest permutation π*; i.e. a permutation for which G_{π*} has the fewest edges.
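For intuition, the SP-algorithm can be run exhaustively for tiny n, given an oracle answering CI queries. A sketch under that assumption (the oracle interface is hypothetical):

```python
from itertools import permutations

def permutation_dag(pi, ci_oracle):
    """Build G_pi: arrow pi_i -> pi_j for i < j unless the oracle reports
    pi_i independent of pi_j given {pi_1, ..., pi_{j-1}} \ {pi_i}."""
    edges = set()
    for j in range(len(pi)):
        for i in range(j):
            cond = frozenset(pi[:j]) - {pi[i]}
            if not ci_oracle(pi[i], pi[j], cond):
                edges.add((pi[i], pi[j]))
    return edges

def sparsest_permutation(n, ci_oracle):
    """SP-algorithm by exhaustive search: O(n!) permutations."""
    return min((permutation_dag(p, ci_oracle)
                for p in permutations(range(1, n + 1))),
               key=len)
```

With n = 3 and C = {1 ⊥⊥ 3} (the deck's running example), the sparsest permutation DAG is the collider 1 → 2 ← 3 with two edges.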
Bayesian Basics
Algorithms
The SP-Algorithm
n = 3 and C = {1 ⊥⊥ 3}:
[Figure: the six permutation DAGs on nodes 1, 2, 3, one for each permutation 123, 132, 213, 231, 312, 321, built from the criterion
i < j and π_i ⊥̸⊥ π_j | {π_1, …, π_{j−1}} \ {π_i}.]
Bayesian Basics
Algorithms
Consistency Guarantees
Faithfulness Assumption
A probability distribution P is faithful to a DAG G if the only CI relations
satisfied by P are those entailed by the MA.
Restricted Faithfulness Assumption
P satisfies the restricted faithfulness assumption with respect to a DAG
G = ([n], A) if it satisfies the following two conditions:
1. Adjacency Faithfulness – faithfulness for all arrows i → j ∈ A.
2. Orientation Faithfulness – faithfulness for all triples (i, j, k).
[Figure: the two orientations of a triple (i, j, k)]
Bayesian Basics
Algorithms
Consistency Guarantees
SMR Assumption
P satisfies the SMR assumption with respect to a DAG G if it satisfies
the MA with respect to G and |G| < |H| for every DAG H such that P
satisfies the MA for H and H is not Markov equivalent to G.
Theorem (Uhler and Raskutti, 2014)
The SP-algorithm is consistent under the SMR assumption which is
strictly weaker than restricted faithfulness.
Downside: SP-algorithm search space is factorial in size!
Can we more efficiently search through the permutations Sn ?
Can we shrink the search space of the SP-algorithm while maintaining consistency guarantees?
Some Geometry
Permutohedra
Edge Walks on a Permutohedron P_n
Each vertex corresponds to a permutation DAG G_π.
Edges correspond to flipping adjacent transpositions: 1342 – 1432.
Can we take a greedy walk along the edges of P_n? I.e. walk from G_π to G_τ whenever |E(G_π)| > |E(G_τ)|?
(One) Problem: Two permutations can have the same permutation DAG.
Some Geometry
DAG Associahedra
DAG Associahedra (Mohammadi, Uhler, Wang, Yu, 2016)
[Figure: a DAG on nodes 1, 2, 3, 4 and its DAG associahedron]
CI relations:
1 ⊥⊥ 2,  1 ⊥⊥ 4 | 3,  1 ⊥⊥ 4 | {2, 3},  2 ⊥⊥ 4 | 3,  2 ⊥⊥ 4 | {1, 3}
Associate CI relations to edges of P_n with respect to the dependence relation i < j and π_i ⊥̸⊥ π_j | {π_1, …, π_{j−1}} \ {π_i}:
1 ⊥⊥ 2 – no elements in the conditioning set: edges with nothing before {1, 2}: 1234 – 2134 and 1243 – 2143.
2 ⊥⊥ 4 | {3} – conditioning set is {3}: edges with 3 before 2 and 4: 3241 – 3421.
Some Geometry
DAG Associahedra
DAG Associahedra (Mohammadi, Uhler, Wang, Yu, 2016)
DAG associahedron
Theorem (Mohammadi, Uhler, C. Wang, Yu, 2016)
P_n(C) is a convex polytope whose vertices are labeled by the different possible permutation DAGs for C.
[Figure: the DAG on nodes 1, 2, 3, 4 from the previous slide, with CI relations 1 ⊥⊥ 2, 1 ⊥⊥ 4 | 3, 1 ⊥⊥ 4 | {2, 3}, 2 ⊥⊥ 4 | 3, 2 ⊥⊥ 4 | {1, 3}]
Greedy SP Algorithms
Greedy SP Algorithms
The vertices of the DAG Associahedron, Pn (C), serve as a reduced search
space for the SP-algorithm.
Edge SP Algorithm
Input: C and a permutation π ∈ Sn .
Take a “nonincreasing edge walk” along the edges of the DAG
associahedron Pn (C).
When is this algorithm consistent? Under the faithfulness assumption?
Greedy SP Algorithms
Geometric Aspects of Faithfulness
Geometric Aspects of Faithfulness
A covered edge in a DAG G is any edge i → j such
that
pa(j) = pa(i) ∪ {i}.
Revisiting Our SP-algorithm Example
n = 3 and C = {1 ⊥⊥ 3}:
[Figure: the permutation DAGs G_π for π = 123 and G_τ for τ = 132, built from the criterion i < j and π_i ⊥̸⊥ π_j | {π_1, …, π_{j−1}} \ {π_i}]
Greedy SP Algorithms
Geometric Aspects of Faithfulness
Geometric Aspects of Faithfulness
Theorem (Matejovicova, LS, Uhler, Y. Wang, 2017)
Under the faithfulness assumption, each edge e corresponds to flipping a
covered edge in one of the DAGs associated to the endpoints of e.
Triangle SP Algorithm
Input: C and a permutation π ∈ Sn .
Take a “nonincreasing edge walk” along the edges of the DAG
associahedron Pn (C) that correspond to covered edge reversals.
Greedy SP Algorithms
Geometric Aspects of Faithfulness
Edge Walks and Independence Maps
A DAG H is an independence map of another DAG G, written G ≤ H, if
any CI relation entailed by H is also entailed by G.
Theorem (Matejovicova, LS, Uhler, Y. Wang, 2017)
A positive probability distribution P is faithful to a sparsest DAG Gπ∗ if
and only if Gπ∗ ≤ Gπ for all permutations π.
A result of Chickering (2002) implies that under faithfulness we can always find a sequence of independence maps
G_{π*} =: G^0 ≤ G^1 ≤ G^2 ≤ ⋯ ≤ G^N := G_π.
Can always find a sequence that coincides with a nonincreasing edge walk!
Theorem (Matejovicova, LS, Uhler, Y. Wang, 2017)
The Triangle SP Algorithm is consistent under the faithfulness assumption.
Greedy SP Algorithms
Parsing out the Assumptions
ESP Assumption: The assumption guaranteeing consistency of Edge SP.
TSP Assumption: The assumption guaranteeing consistency of Triangle SP.
A ≺ B = “A is strictly weaker than B”
Theorem (Matejovicova, LS, Uhler, Y. Wang, 2017)
SMR ≺ ESP ≺ TSP ≺ faithfulness.
What about restricted faithfulness?
Theorem (Matejovicova, LS, Uhler, Y. Wang, 2017)
Consistency under the TSP assumption implies adjacency faithfulness, but
not orientation faithfulness.
Greedy SP Algorithms
The Good News
The vertices of the DAG Associahedron serve as a reduced search space for
the SP Algorithm.
In this way, we can execute the SP algorithm in a reduced search space
(less than n! elements).
The ESP and TSP Algorithms can greedily search over the vertices of a
DAG associahedron.
The TSP Algorithm is consistent under faithfulness.
We understand the relationships amongst the SMR, ESP, TSP,
faithfulness, and restricted faithfulness assumptions.
Greedy SP Algorithms
Markov Equivalence
The Bad News: Markov Equivalence of DAGs
Our search along the edges is not truly greedy...
At times we are required to move between DAGs Gπ and Gτ where
|Gπ | = |Gτ |.
[Figure: four DAGs on nodes 1, 2, 3; the three chain-like DAGs entail 1 ⊥⊥ 3 | 2, while the collider 1 → 2 ← 3 entails 1 ⊥⊥ 3.]
Two DAGs that differ only by a covered edge reversal entail the same set
of CI relations. We call any two such DAGs Markov equivalent.
Greedy SP Algorithms
Markov Equivalence
The Bad News: The Problem of Markov Equivalence
Problem: The algorithm may search through large portions of a Markov
Equivalence Class (MEC) before finding a neighboring DAG with fewer
edges.
To terminate it must search the ENTIRE MEC of the sparsest DAGs!
This motivates two enumerative questions:
1. For a fixed set of graph parameters, how many MECs are there?
2. What are the sizes of these MECs?
Greedy SP Algorithms
Markov Equivalence
Markov Equivalence of DAGs
A collider that is not in a triangle is called an immorality.
[Figure: an immorality vs. a shielded collider (not an immorality)]
Theorem (Verma and Pearl, 1992)
Two DAGs are Markov equivalent if and only if they have the same
skeleton and the same set of immoralities.
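The Verma–Pearl criterion is directly checkable: compare skeletons and immorality sets. A short Python sketch (function names are illustrative):

```python
def skeleton(edges):
    """Undirected version of the arrow set."""
    return {frozenset(e) for e in edges}

def immoralities(edges):
    """Colliders i -> k <- j with i, j nonadjacent (nodes assumed orderable)."""
    skel = skeleton(edges)
    parents = {}
    for i, j in edges:
        parents.setdefault(j, set()).add(i)
    out = set()
    for k, ps in parents.items():
        for i in ps:
            for j in ps:
                if i < j and frozenset((i, j)) not in skel:
                    out.add((frozenset((i, j)), k))
    return out

def markov_equivalent(e1, e2):
    """Verma-Pearl: same skeleton and same immoralities."""
    return skeleton(e1) == skeleton(e2) and immoralities(e1) == immoralities(e2)
```

On the three-element MEC from the slides, the three chain-like orientations of 1 – 2 – 3 are pairwise equivalent, while the collider is not.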
Greedy SP Algorithms
Markov Equivalence
Markov Equivalence of DAGs
[Figure: an MEC with three elements]
[G] = the MEC containing the DAG G.
[G] has an associated partially directed graph Ĝ called the essential graph of [G].
chain components of [G] = undirected connected components of Ĝ.
essential components of [G] = directed connected components of Ĝ.
Greedy SP Algorithms
Markov Equivalence
The Two Enumerative Questions
Previous work:
1. (Gillespie and Perlman, 2001) Computer enumeration of all MECs on p ≤ 10 nodes.
2. (Gillespie '06, Steinsky '03, Wagner '13) Enumeration of MECs of a given size:
   - formulas only for small class sizes (one, two, and three) or restricted chordal components;
   - inclusion-exclusion arguments on essential graphs.
A new approach:
1. The greedy SP algorithm must search the ENTIRE true MEC to terminate.
2. When can we solve the enumeration problem for a fixed skeleton?
The Problem of Markov Equivalence
Combinatorial Enumeration by Skeleton
A New Instance of an Old Combinatorial Approach
Restrict to a type of skeleton and solve the enumeration problem there.
[Figure: decomposing a path at its last edge]
I_p := the path on p vertices.
M(G) := number of MECs with skeleton G.
M(I_p) = M(I_{p−1}) + M(I_{p−2}),  with  M(I_1) = 1, M(I_2) = 1,
so M(I_p) = F_{p−1}, the (p − 1)st Fibonacci number.
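The recursion is easy to validate against brute force: orient each edge of the path both ways (every orientation of a tree is acyclic) and, by the Verma–Pearl theorem, count distinct immorality sets. A sketch:

```python
from itertools import product

def M_path(p):
    """M(I_p) via the recursion M(I_p) = M(I_{p-1}) + M(I_{p-2})."""
    a, b = 1, 1  # M(I_1), M(I_2)
    for _ in range(p - 2):
        a, b = b, a + b
    return a if p == 1 else b

def mecs_on_path(p):
    """Brute force: DAGs on a fixed skeleton are Markov equivalent iff they
    have the same immoralities, so count distinct immorality sets."""
    classes = set()
    for orient in product((0, 1), repeat=p - 1):
        edges = {(i, i + 1) if d == 0 else (i + 1, i)
                 for i, d in zip(range(1, p), orient)}
        # on a path, an immorality is a node k with (k-1) -> k <- (k+1)
        imm = frozenset(k for k in range(2, p)
                        if (k - 1, k) in edges and (k + 1, k) in edges)
        classes.add(imm)
    return len(classes)
```

For p up to 8 both agree with the Fibonacci values 1, 1, 2, 3, 5, 8, 13, 21.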
The Problem of Markov Equivalence
Combinatorial Enumeration by Skeleton
A Second Proof: More Information
An independent set in G is a subset of mutually non-adjacent nodes:
α_k(G) := number of independent sets in G of size k;
I(G; x) := Σ_{k≥0} α_k(G) x^k, the independence polynomial of G.
m_k(G) := number of MECs on G with k immoralities.
Σ_{k≥0} m_k(I_p) x^k = I(I_{p−2}; x) = F_{p−1}(x), the (p − 1)st Fibonacci polynomial.
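The independence polynomial of a path satisfies I(P_n; x) = I(P_{n−1}; x) + x·I(P_{n−2}; x) (condition on whether the last vertex is in the set), which gives the coefficient lists directly. A sketch using that recursion:

```python
def indep_poly_path(n):
    """Coefficient list [α_0, α_1, ...] of I(P_n; x) for the path on n
    vertices, via I(P_n) = I(P_{n-1}) + x * I(P_{n-2})."""
    prev, cur = [1], [1, 1]  # I(P_0) = 1, I(P_1) = 1 + x
    if n == 0:
        return prev
    for _ in range(n - 1):
        nxt = list(cur)
        for k, c in enumerate(prev):  # add the shifted x * I(P_{n-2}) term
            if k + 1 == len(nxt):
                nxt.append(0)
            nxt[k + 1] += c
        prev, cur = cur, nxt
    return cur
```

Setting x = 1 recovers the total count: the coefficient sum of I(I_{p−2}; x) is M(I_p), a Fibonacci number, matching the previous slide.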
[Figure: Pascal's triangle; the coefficients of the Fibonacci polynomials appear along its diagonals]
The Problem of Markov Equivalence
Combinatorial Enumeration by Skeleton
A Second Proof: More Information
s_ℓ(G) := the number of MECs on G of size ℓ.
A composition of p into k parts is an ordered sum of k positive integers with value p:
c_1 + c_2 + ⋯ + c_k = p,  e.g. 1 + 3 + 2.
s_ℓ(I_p) is the number of compositions of p − k into k + 1 parts, over all k, for which
∏_{j=1}^{k+1} c_j = ℓ.
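The composition formula above can be evaluated by enumerating compositions and filtering by their product; a small sketch (function names are illustrative):

```python
from math import prod

def compositions(total, parts):
    """All ordered tuples of `parts` positive integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def mec_size_count(p, ell):
    """s_ell(I_p): number of MECs on the path I_p of size ell, summing over
    the number of immoralities k the compositions of p - k into k + 1
    parts whose product is ell."""
    count = 0
    for k in range(p):
        if p - k >= k + 1:
            count += sum(1 for c in compositions(p - k, k + 1)
                         if prod(c) == ell)
    return count
```

For p = 3 this recovers one MEC of size 3 (the chain class) and one of size 1 (the collider), and summing s_ℓ over ℓ recovers M(I_p).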
The Problem of Markov Equivalence
Combinatorial Enumeration by Skeleton
Combinatorial Statistics: Refining the Problem
M(G ) = the total number of MECs on G .
mk (G ) = the number of MECs on G with precisely k immoralities.
m(G ) = the maximum number of immoralities within an MEC on G .
sk (G ) = the number of MECs on G of size k.
The first three statistics combine into the polynomial generating function:
M(G; x) := Σ_{k=0}^{m(G)} m_k(G) x^k.
The Problem of Markov Equivalence
Combinatorial Enumeration by Skeleton
Some Further Examples
The complete set of these statistics is recoverable for some other important graphs, including:
the cycle C_p, the star S_p ≅ K_{1,p}, and the complete bipartite graph K_{2,p}.
The Problem of Markov Equivalence
Combinatorial Enumeration by Skeleton
Sparse Examples: Bounds for Trees
A classical bound on the number of independent sets a tree also holds for
the number of MECs on a tree:
Theorem (Radhakrishnan, LS, Uhler, 2016)
Let Tp be an undirected tree on p nodes. Then
Fp−1 = M(Ip ) ≤ M(Tp ) ≤ M(Sp−1 ) = 2p−1 − p + 1.
We can also bound the size of an MEC:
Theorem (Radhakrishnan, LS, Uhler, 2016)
Let T_p be a directed tree on p nodes whose essential graph has ℓ > 0 chain components and m ≥ 0 essential components. Then
2^ℓ ≤ #[T_p] ≤ ((p − m)/ℓ)^ℓ.
Greedy SP-Algorithms
An Implementable Algorithm
Overcoming Exponentiality: An Implementable Algorithm
We need a version of the TSP-Algorithm that avoids the problem of
exponentially-sized Markov equivalence classes of permutation DAGs.
Solution: Introduce a search-depth bound d and a fixed number of runs r .
Triangle SP Algorithm with depth and run bounds
Input: C and two positive integers d and r.
1. Pick a permutation DAG G_π and do a depth-first search along the edges of P_n(C) with depth bound d, searching for a sparser permutation DAG. Repeat the search until no sparser DAG is found, and return the last DAG visited.
2. Do step 1 r times, and then select the sparsest of the r permutation DAGs recovered.
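The bounded search can be sketched abstractly: states stand in for permutation DAGs, `neighbors` for edge walks on P_n(C) (covered-edge flips within an MEC are the score-preserving moves), and `score` for edge counts. All names here are illustrative, not the authors' implementation:

```python
def bounded_greedy_search(neighbors, score, start, depth):
    """Move to strictly better-scoring states, exploring up to `depth`
    score-preserving moves in between (a depth-bounded plateau walk)."""
    current = start
    improved = True
    while improved:
        improved = False
        stack = [(current, 0)]
        seen = {current}
        while stack:
            state, d = stack.pop()
            for nb in neighbors(state):
                if score(nb) < score(current):
                    current, improved = nb, True  # strictly sparser: jump
                    stack = []
                    break
                if d < depth and score(nb) == score(state) and nb not in seen:
                    seen.add(nb)                  # same score: walk deeper
                    stack.append((nb, d + 1))
    return current

def restarted_search(neighbors, score, starts, depth):
    """Run the bounded search from several starts; keep the best result."""
    results = [bounded_greedy_search(neighbors, score, s, depth)
               for s in starts]
    return min(results, key=score)
```

On a toy plateau the depth bound matters: a shallow search can get stuck, while restarts from other states can still escape, which is exactly the trade-off the two bounds d and r control.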
Greedy SP-Algorithms
An Implementable Algorithm
A Sample of Some Simulations
[Figures: simulation results (plots not recoverable from the extraction)]
Moral of the Story
Combinatorial convex geometry of Generalized Permutohedra can
provide DAG model learning algorithms useful in causal inference!
The combinatorics of DAG associahedra provides a graphical version of these algorithms that is implementable and consistent under common identifiability assumptions!
The number and size of Markov equivalence classes are important for understanding the efficiency of algorithms searching over a space of DAGs.
Connections to classic combinatorial optimization problems yield FUN
problems and reveal that MECs can be large even for sparse graphs!
Adjusting algorithms with fixed search-depth and run bounds results in
algorithms that are efficient and more reliable than the PC-algorithm!
Thank You!
(Preprints available on the arXiv!)