
Quality Technology & Quantitative Management (QTQM), Vol. 11, No. 1, pp. 99-110, 2014. © ICAQM 2014
Integer Programming for Bayesian
Network Structure Learning
James Cussens*
Department of Computer Science and York Centre for Complex Systems Analysis
University of York, York, UK
(Received July 2013, accepted December 2013)
* Corresponding author. E-mail: [email protected]
______________________________________________________________________
Abstract: Bayesian networks provide an attractive representation of structured probabilistic information.
There is thus much interest in 'learning' BNs from data. In this paper the problem of learning a Bayesian
network using integer programming is presented. The SCIP (Solving Constraint Integer Programming)
framework is used to do this. Although cutting planes are a key ingredient in our approach, primal
heuristics and efficient propagation are also important.
Keywords: Bayesian networks, integer programming, machine learning.
______________________________________________________________________
1. Introduction
A Bayesian network (BN) represents a probability distribution over a finite number of
random variables. In this paper, unless specified otherwise, it will be assumed that all
random variables are discrete. A BN has two components: an acyclic directed graph (DAG)
representing qualitative aspects of the distribution, and a set of parameters. Figure 1
presents the structure of the famous 'Asia' BN which was introduced by Lauritzen and
Spiegelhalter [16]. This BN has 8 random variables A, T, X, E, L, D, S and B. The BN
represents an imagined probabilistic medical 'expert system' where A = visit to Asia, T =
Tuberculosis, X = Normal X-Ray result, E = Either tuberculosis or lung cancer, L = Lung
cancer, D = Dyspnea (shortness of breath), S = Smoker and B = Bronchitis. Each of these
random variables has two values: TRUE (t) and FALSE (f). A joint probability distribution
for these 8 random variables must specify a probability for each of the 2^8 = 256 joint
instantiations of these random variables.
To specify these 2^8 probabilities some parameters are needed. To explain what these
parameters are some terminology is now introduced. In a BN if there is an arrow from
node X to node Y we say that X is a parent of Y (and that Y is a child of node X). The
parameters of a BN are defined in terms of the set of parents each node has. They are
conditional probability tables (CPTs), one for each random variable, which specify a
distribution for the random variable for each possible joint instantiation of its parents. So,
for example, the CPT for D in Figure 1 could be:
P(D=t | B=f, E=f) = 0.3        P(D=f | B=f, E=f) = 0.7
P(D=t | B=f, E=t) = 0.4        P(D=f | B=f, E=t) = 0.6
P(D=t | B=t, E=f) = 0.5        P(D=f | B=t, E=f) = 0.5
P(D=t | B=t, E=t) = 1.0        P(D=f | B=t, E=t) = 0.0
Note that deterministic relations can be represented using 0 and 1 values for
probabilities. If a random variable has no parents (like A and S in Figure 1) an
unconditional probability distribution is defined for its values. For example the CPT for A
might be P(A=t) = 0.1, P(A=f) = 0.9.
Figure 1. An 8-node DAG which is the structure of a BN (the
'Asia' BN [16]) with random variables A, T, X, E, L, D, S, and B.
The probability of any joint instantiation of the random variables is given by
multiplying the relevant conditional probabilities found in the CPTs. Although there are 2^8
such joint instantiations for the BN in Figure 1 the number of parameters of the BN is far
fewer so that BNs provide a compact representation. They can do this since the BN
structure encodes conditional independence assumptions about the random variables. A full
account of this will not be given here: the interested reader should consult Koller and
Friedman's excellent book on probabilistic graphical models [15]. However the basic idea is
that if a node (or collection of nodes) V3 'blocks' a path in the graph between two other
nodes V1 and V2 , then V1 is independent of V2 given V3 (Koller and Friedman [15]
provide a proper definition of what it means to 'block' a path). So for example, in Figure 1
A is dependent on E, D and X, but it is independent of these random variables given T. To
put it informally: knowing about A tells you something about E, D and X, but once you
know the value of T, A provides no information about E, D or X: it is only via T that A
provides information about E, D or X. The graph allows one to 'read off' these
relationships between the variables. For example, recall that in the 'Asia' BN in Figure 1, S
= Smoking, L = Lung cancer, B = Bronchitis, and D = Dyspnea (shortness of breath). The
structure of the graph tells us that smoking influences dyspnea, but it only does so as a result
of lung cancer or bronchitis.
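To make the factorisation concrete, the following short Python sketch computes a joint probability for a toy three-node network consisting only of B, E and D, using the CPT for D given above; the unconditional distributions assumed for B and E are made-up values used purely for illustration.

    # Toy factorisation example: P(B, E, D) = P(B) * P(E) * P(D | B, E).
    # The CPT for D is the one shown above; P(B) and P(E) are hypothetical values.
    cpt_B = {'t': 0.45, 'f': 0.55}             # hypothetical P(B)
    cpt_E = {'t': 0.05, 'f': 0.95}             # hypothetical P(E)
    cpt_D = {                                  # P(D | B, E), keyed by (B, E)
        ('f', 'f'): {'t': 0.3, 'f': 0.7},
        ('f', 't'): {'t': 0.4, 'f': 0.6},
        ('t', 'f'): {'t': 0.5, 'f': 0.5},
        ('t', 't'): {'t': 1.0, 'f': 0.0},
    }

    def joint(b, e, d):
        """Probability of the joint instantiation B=b, E=e, D=d."""
        return cpt_B[b] * cpt_E[e] * cpt_D[(b, e)][d]

    print(joint('f', 't', 't'))   # 0.55 * 0.05 * 0.4 = 0.011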
Such structural information can provide considerable insight, but this raises the
question of how it can be reliably obtained. Two main approaches are taken. In the first a
domain expert is asked to provide the structure. There are, of course, many problems with
such a 'manual' approach: experts' time is expensive, experts may disagree and make
mistakes, and any expert used would first have to understand the semantics of BNs.
An appealing alternative is to infer BN structure directly from data. Any data which
can be viewed as having been sampled from some unknown joint probability distribution is
appropriate. The goal is to learn a BN structure for this unknown distribution. For example,
supposing again that the BN in Figure 1 is a medical expert system, it could be inferred
from a database (single table) of patient records, where for each patient there is a field
recording whether they smoke, have lung cancer, have bronchitis, suffer from dyspnea, etc.
There are currently two main approaches taken to learning BN structure from data. In
the first, statistical tests are performed with a view to determining conditional
independence relations between the variables. An algorithm then searches for a DAG
which represents the conditional independence relations thus found [6, 8, 19, 21]. In the
second approach, often called search and score, each candidate DAG has a score which
reflects how well it fits the data. The goal is then simply to find the DAG with the highest
score [4, 5, 7, 9, 18, 24]. It is also possible to combine elements of both these main
approaches [20].
The difficulty with search and score is that the number of candidate DAGs grows
super-exponentially with the number of variables, so a simple enumerative approach is out
of the question for all but the smallest problems. A number of search techniques have been
applied, including greedy hill-climbing [19], dynamic programming [18], branch-and-bound
[5] and A*[24]. In many cases the search is not complete in the sense that there is no
guarantee that the BN structure returned has an optimal score. However recently there has
been much interest in complete (also known as exact) BN structure learning where a search
is conducted until a guaranteed optimal structure is returned.
2. Bayesian Network Structure Learning with IP
In the rest of this paper an integer programming approach to exact BN learning is
described. The basic ideas of integer programming (IP) are first briefly presented. This is
then followed by an account of how BN structure learning can be encoded and efficiently
solved using IP. Two important extensions are then described: adding in structural prior
information and finding multiple solutions. The article ends with a summary of how well
this approach performs.
2.1. Integer Programming
In an integer programming problem the goal is to maximise a linear objective function
subject to linear constraints with the added constraint that all variables must take integer
values. (Any minimisation problem can be easily converted into a maximisation problem,
so here only maximisation problems are considered.) Let x = (x_1, x_2, ..., x_n) be the problem
variables where each x_i ∈ ℤ (i.e. can only take integer values). Assume that finite upper
and lower bounds on each x_i are given. Let c = (c_1, c_2, ..., c_n) be the real-valued vector of
objective coefficients for each problem variable. Viewing x as a column vector and c as
a row vector, the problem of maximising cx with no constraints is easy: just set each x_i
to its lower bound if c_i < 0 and set it to its upper bound otherwise.
The problem becomes significantly harder once linear constraints on acceptable
solutions are added. Each such constraint is of the form ax ≤ b where a is a real-valued
row vector and b is a real number. Many important industrial and business problems can
be encoded as an IP and there are many powerful solvers (such as CPLEX) which can
provide optimal solutions even when there are thousands of problem variables and
constraints.
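As a concrete illustration, the following sketch solves a tiny IP of this form, assuming the PySCIPOpt package (a Python interface to the SCIP framework discussed later); the particular objective and constraints are arbitrary example values, not taken from the paper.

    # A tiny integer program: maximise 3x + 2y subject to 2x + y <= 10 and x + 3y <= 15,
    # with x, y integer in [0, 10]. Requires the PySCIPOpt package.
    from pyscipopt import Model

    m = Model("toy-ip")
    x = m.addVar("x", vtype="I", lb=0, ub=10)
    y = m.addVar("y", vtype="I", lb=0, ub=10)
    m.addCons(2 * x + y <= 10)
    m.addCons(x + 3 * y <= 15)
    m.setObjective(3 * x + 2 * y, sense="maximize")
    m.optimize()
    print(m.getVal(x), m.getVal(y), m.getObjVal())   # optimal: x = 3, y = 4, objective 17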
A proper account of the many techniques of integer programming will not be provided
here (for that see Wolsey's book [22]) but some basics are now given. Although solving an
IP may be very hard, solving the linear relaxation of an IP is much easier (the simplex
algorithm is often used). The linear relaxation is the same as the original IP except that the
variables are now permitted to take non-integer values. Note that if we are 'lucky' and the
solution to the linear relaxation 'happens' to be integer valued then the original IP is also
solved. In general, this is not the case but the solution to the linear relaxation does provide
a useful upper bound on the objective value of an optimal integer solution.
Two important parts of the IP solving process (in addition to solving the linear
relaxation) are the addition of cutting planes and branching. A cutting plane is a linear
inequality not present in the original problem whose validity is implied by (1) those linear
inequalities that are initially present and (2) the integrality restriction on the problem
variables. Typically an IP solver will search for cutting planes which the solution to the
linear relaxation (call it x*) does not satisfy. IP solvers typically contain a number of
generic cutting plane algorithms (e.g. Gomory, Strong Chvátal-Gomory, zero-half [23])
which can be applied to any IP. In addition, users can create problem-specific cutting plane
algorithms. Adding cutting planes will not rule out the yet-to-be-found optimal integer
solution but will rule out x*. It follows that adding cutting planes in this way will produce
a new linear relaxation whose solution will provide a tighter upper bound.
In some problems it is possible to add sufficiently many cutting planes of the right sort
so that a linear relaxation is produced whose solution is entirely integer-valued. In such a
case the original IP problem is solved. Typically this is not the case so another approach is
required, the most common of which is branching. In branching a problem variable x i is
selected together with some appropriate integer value l. Two new subproblems are then
created: one where x_i ≥ l + 1 and one where x_i ≤ l. Usually a variable is selected with a
non-integer value in the linear relaxation solution x*. Since there are only finitely many
variables each with finitely many values it is not difficult to see that one can search for all
possible solutions by repeated branching. In practice this search is made efficient by pruning.
Pruning takes advantage of the upper bound provided by the linear relaxation. It also uses
the incumbent: the best (not necessarily optimal) solution found so far. If the upper bound
on the best solution for some subproblem is below that of the objective value of the
incumbent then the optimal solution for the subproblem is worse than the incumbent and
no further work on the subproblem is necessary.
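The interplay between relaxation bounds, branching and pruning against an incumbent can be seen in miniature in the following self-contained Python sketch. It solves a toy 0/1 knapsack problem, not a BN learning problem, and uses the greedy fractional relaxation as a stand-in for the LP bound; it is an illustration of the general idea only.

    # Toy branch-and-bound: maximise sum(values[i]*x_i) subject to sum(weights[i]*x_i) <= capacity,
    # with x_i in {0, 1}. A relaxation bound is used to prune subproblems against the incumbent.
    def knapsack_bb(values, weights, capacity):
        order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
        incumbent = [0.0]                      # best integer objective found so far

        def bound(idx, cap, val):
            # Relaxation bound: fill greedily, taking a fraction of the first item that no longer fits.
            b = val
            for j in order[idx:]:
                if weights[j] <= cap:
                    cap -= weights[j]
                    b += values[j]
                else:
                    b += values[j] * cap / weights[j]
                    break
            return b

        def recurse(idx, cap, val):
            incumbent[0] = max(incumbent[0], val)           # update the incumbent
            if idx == len(order) or bound(idx, cap, val) <= incumbent[0]:
                return                                      # leaf reached, or subproblem pruned
            j = order[idx]
            if weights[j] <= cap:                           # branch: x_j = 1
                recurse(idx + 1, cap - weights[j], val + values[j])
            recurse(idx + 1, cap, val)                      # branch: x_j = 0

        recurse(0, capacity, 0.0)
        return incumbent[0]

    print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))   # 220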
2.2. Bayesian Network Learning as an IP Problem
In this section it is shown how to represent the BN structure learning problem as an IP.
This question has been considered in a number of papers [2, 10, 11, 12, 14]. Firstly, we
need to create IP problem variables to represent the structure of DAGs. This is done by
creating binary 'family' variables I(W → v) for each node v and candidate parent set
W, where I(W → v) = 1 iff W is the parent set for v. In this encoding, the DAG in
Figure 1 would be represented by a solution where I(∅ → A) = 1, I(∅ → S) = 1,
I({A} → T) = 1, I({S} → L) = 1, I({S} → B) = 1, I({L,T} → E) = 1, I({E} → X) = 1,
I({B,E} → D) = 1, and all other IP variables have the value 0.
The next issue to consider is how to score candidate BNs: how do we measure how
'good' a given BN is for the data from which we are learning? A number of scores are used
but here only one is considered: log marginal likelihood or the BDeu score. The BDeu score
comes from looking at the problem from the perspective of Bayesian statistics. In that
approach the problem is to find the 'most probable' BN given the data, i.e. to find a BN G
which maximises P(G | Data). Using Bayes' theorem we have that P(G | Data)
∝ P(G) P(Data | G), where P(G) is the prior probability of BN G and P(Data | G) is
the marginal likelihood. If we have no prior bias between the candidate BNs it is reasonable
for P(G) to have the same value for all G. In this case maximising marginal likelihood
or indeed log marginal likelihood will maximise P(G | Data). (Note that the word
'Bayesian' in 'Bayesian networks' is misleading, since BNs are no more Bayesian than
other probabilistic models which do not have the word 'Bayesian' in their name.)
Crucially, given certain restrictions, the BDeu score can be expressed as a linear
function of the family variables I(W → v) (hence the decision to encode the graph using
them). So-called 'local scores' c(v, W) are computed from the data for each I(W → v)
variable. The BN structure learning problem then becomes the problem of maximising

Σ_{v,W} c(v, W) I(W → v),                                                          (1)

subject to the condition that the values assigned to the I(W → v) represent a DAG.
Linear inequalities are now required to restrict instantiations of the I(W → v)
variables so that only DAGs are represented. Firstly, it is easy to ensure that each BN
variable (call BN variables 'nodes') has exactly one (possibly empty) parent set. Letting V
be the set of BN nodes, the following linear constraints are added to the IP:

∀v ∈ V:  Σ_W I(W → v) = 1.                                                         (2)
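A minimal sketch of this encoding, again assuming the PySCIPOpt package, is given below. It creates the binary family variables, the objective (1) and the convexity constraints (2), but deliberately omits acyclicity, which is discussed next; it is not GOBNILP itself, and the dictionary of local scores is a hypothetical input (for the Asia network it would contain an entry such as ('D', frozenset({'B', 'E'}))).

    # Encode objective (1) and constraints (2), assuming PySCIPOpt. Acyclicity is NOT
    # enforced here; the cluster constraints (3) are added as cutting planes during solving.
    from pyscipopt import Model, quicksum

    def build_bn_ip(local_scores):
        """local_scores: dict mapping (v, frozenset_of_parents) -> local score c(v, W)."""
        m = Model("bn-structure")
        I = {}                                          # family variables I(W -> v)
        for k, (v, W) in enumerate(local_scores):
            I[(v, W)] = m.addVar("fam%d" % k, vtype="B")
        for v in {v for (v, _) in local_scores}:
            # constraint (2): node v gets exactly one (possibly empty) parent set
            m.addCons(quicksum(var for (u, _), var in I.items() if u == v) == 1)
        # objective (1): total local score of the selected families
        m.setObjective(quicksum(c * I[key] for key, c in local_scores.items()),
                       sense="maximize")
        return m, I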
Ensuring that the graph is acyclic is more tricky. The most successful approach has
been to use 'cluster' constraints:
∀C ⊆ V:  Σ_{v ∈ C} Σ_{W : W ∩ C = ∅} I(W → v) ≥ 1,                                 (3)

introduced by Jaakkola et al. [14]. A cluster is a subset of BN nodes. For each cluster C
the associated constraint declares that at least one v ∈ C has no parents in C. Since there
are exponentially many cluster constraints these are added as cutting planes in the course
of solving: each time the linear relaxation of the IP is solved there is a search for a cluster
constraint which is not satisfied by the linear relaxation solution. If no cluster constraint
can be found there are two possibilities depending on whether the linear relaxation solution
(call it x*) has variables with fractional values or not. If there are no fractional variables
then x* must represent a DAG, and moreover this DAG is optimal since x* is a solution
to the linear relaxation and thus an upper bound. Alternatively, x* may include variables
with fractional values. If so, generic cutting plane algorithms are run in the hope of finding
cutting planes which are not 'cluster' constraints (3).
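The separation step can be illustrated with a brute-force Python sketch: given the LP solution x*, it looks for a small cluster whose constraint (3) is violated. (GOBNILP itself searches for violated cluster constraints more cleverly, by solving a subsidiary optimisation problem rather than by enumeration; the version below is purely illustrative.)

    # Brute-force separation of cluster constraints (3): look for a cluster C whose
    # constraint is violated by the current LP solution.
    from itertools import combinations

    def find_violated_cluster(x_star, nodes, max_size=4, tol=1e-6):
        """x_star: dict mapping (v, frozenset_of_parents) -> LP value of I(W -> v)."""
        for size in range(2, max_size + 1):
            for C in combinations(nodes, size):
                C = frozenset(C)
                # left-hand side of (3): LP weight on families whose parent set avoids C
                lhs = sum(val for (v, W), val in x_star.items()
                          if v in C and not (W & C))
                if lhs < 1.0 - tol:
                    return C          # the cluster constraint for C is a cutting plane
        return None                   # no violated cluster constraint of size <= max_size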
Figure 2. Branch-and-cut approach to solving an IP.
A standard 'branch-and-cut' approach, as summarised in Figure 2, is taken to solving
the IP. Cutting planes are added (if possible) each time the linear relaxation is solved. If no
suitable cutting planes can be found progress is made by branching on a variable.
Eventually this algorithm will return an optimal solution. In addition to cutting and
branching two further ingredients are used to improve performance.
The first of these is a 'sink-finding' primal heuristic algorithm which searches for a
feasible integer solution (i.e. a DAG) 'near' the solution to the current LP relaxation. The
point of this is to find a good (probably suboptimal) solution early in the solving process
since this allows earlier and more frequent pruning of the search if and when branching
begins.
To understand the sink-finding algorithm recall that each family variable I(W → v)
has an associated objective function coefficient. It follows that the potential parent sets for
each BN node can be ordered from 'best' (highest coefficient) to 'worst' (lowest coefficient).
Suppose, without loss of generality, that the BN nodes are labelled {1, 2, ..., p} and let
W_{v,1}, ..., W_{v,k_v} be the parent sets for BN node v ordered from best to worst, as illustrated in
Table 1. (In this table the rows are shown as being of equal length for neatness, but this is
typically not the case, since different BN nodes may have differing numbers of candidate
parent sets.)
Table 1. Example initial state of the sink-finding heuristic for
|V| = p. Rows need not be of the same length.
I(W_{1,1} → 1)    I(W_{1,2} → 1)    ...    I(W_{1,k_1} → 1)
I(W_{2,1} → 2)    I(W_{2,2} → 2)    ...    I(W_{2,k_2} → 2)
I(W_{3,1} → 3)    I(W_{3,2} → 3)    ...    I(W_{3,k_3} → 3)
...
I(W_{p,1} → p)    I(W_{p,2} → p)    ...    I(W_{p,k_p} → p)
Table 2. Example intermediate state of the sink-finding heuristic.
I(W_{1,1} → 1)    I(W_{1,2} → 1)    ...    I(W_{1,k_1} → 1)
I(W_{2,1} → 2)    I(W_{2,2} → 2)    ...    I(W_{2,k_2} → 2)
I(W_{3,1} → 3)    I(W_{3,2} → 3)    ...    I(W_{3,k_3} → 3)
...
I(W_{p,1} → p)    I(W_{p,2} → p)    ...    I(W_{p,k_p} → p)
Each DAG must have at least one sink node, that is a node which has no children. So
any optimal DAG has a sink node for which one can choose its best parent set without fear
of creating a cycle. It follows that at least one of the parent sets in the leftmost column in
Table 1 must be selected in any optimal BN.
The sink-finding algorithm works by selecting parent sets for each BN node. It starts
by finding a BN node v such that the value of the family variable I(W_{v,1} → v) is as close to 1 as
possible in the solution to the current LP relaxation. The parent set W_{v,1} is chosen for v
and then parent sets for other variables containing v are 'ruled out', ensuring that v will
be a sink node of the DAG eventually created (hence the name of the algorithm). Table 2
illustrates the state of the algorithm with v = 2 and where v ∈ W_{1,1}, v ∈ W_{3,2}, v ∈ W_{p,1}
and v ∈ W_{p,2}.
In its second iteration the sink-finding algorithm looks for a sink node for a DAG with
nodes V \{v} in the same way—selecting a best allowable parent set with a value closest
to 1 in the solution to the linear relaxation. In subsequent iterations the algorithm proceeds
analogously until a DAG is fully constructed. Since best allowable parent sets are chosen in
each iteration, the hope is that a high scoring (if not optimal) DAG will be returned.
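The following Python sketch captures the idea (it is not GOBNILP's implementation). It assumes that parent_sets maps each node to its candidate parent sets (frozensets) ordered best-first by objective coefficient and that the empty parent set is always among the candidates, and that x_star maps (node, parent set) pairs to the LP-relaxation values of the corresponding family variables.

    # Sketch of the sink-finding primal heuristic.
    def sink_finding(nodes, parent_sets, x_star):
        remaining = set(nodes)
        allowed = {v: list(parent_sets[v]) for v in nodes}   # best-first order
        chosen = {}
        while remaining:
            # pick the node whose best allowable family variable is closest to 1 in the LP solution
            v = max(remaining, key=lambda u: x_star.get((u, allowed[u][0]), 0.0))
            W = allowed[v][0]
            chosen[v] = W
            remaining.remove(v)
            # v becomes a sink of the DAG under construction: rule out parent sets of
            # the other nodes that contain v
            for u in remaining:
                allowed[u] = [P for P in allowed[u] if v not in P]
        return chosen      # dict: node -> selected parent set (always an acyclic structure)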
The second extra ingredient for improving efficiency — propagation — can be more
briefly described. Suppose that due to branching decisions the IP variables I({S} → L)
and I({L,T} → E) have both been set to 1 in some subproblem. In this case it is
immediate that, for example, the variable I({E} → S) should be set to 0 in this
subproblem, since having all three set to 1 would result in a cyclic subgraph. Propagation
allows 'non-linear' reasoning within an IP approach and can bring important performance
benefits.
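A sketch of this kind of propagation rule is given below: given the families already fixed to 1, a candidate family variable can be fixed to 0 whenever setting it to 1 would close a directed cycle. The function name and data structures are hypothetical.

    # Propagation sketch: can I(W -> v) be fixed to 0 given the families already fixed to 1?
    def creates_cycle(fixed_families, v, W):
        """fixed_families: dict u -> parent set already fixed for u (i.e. I(parents -> u) = 1)."""
        children = {}                                  # arcs parent -> child already implied
        for child, parents in fixed_families.items():
            for p in parents:
                children.setdefault(p, set()).add(child)
        # adding the arcs W -> v closes a cycle iff v already reaches some w in W
        stack, seen = [v], {v}
        while stack:
            u = stack.pop()
            if u in W:
                return True                            # so I(W -> v) can be propagated to 0
            for c in children.get(u, ()):
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return False

    fixed = {'L': frozenset({'S'}), 'E': frozenset({'L', 'T'})}
    print(creates_cycle(fixed, 'S', frozenset({'E'})))   # True: S -> L -> E -> S would be a cycle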
Although the BN structure learning problem has now been cast as an IP there are
some problems with this approach. Firstly there are exponentially many I(W → v) IP
variables. To deal with this a restriction on candidate parent sets is typically made, usually
by limiting the number of parents any node can have to some small number (e.g. 2, 3 or 4).
It follows that the IP approach to BN learning is most appropriate to applications where
such a restriction is reasonable. Secondly, it is necessary to precompute the local scores
c(v, W) which can be a slow business.
2.3. Adding Structural Constraints
This approach to BN structure learning has been implemented in the GOBNILP
system which is available for download from http://www.cs.york.ac.uk/aig/sw/gobnilp.
GOBNILP uses the SCIP 'constraint integer programming' framework [1] (scip.zib.de).
As well as implementing 'vanilla' BN learning using IP, GOBNILP allows the user to
add additional constraints on the structure of BNs. This facility is very important in solving
real problems since domain experts typically have some knowledge on how the variables in their data
are related. Failing to incorporate such knowledge (usually called 'prior knowledge') into the
learning process will produce inferior results. We may end up with a BN expressing
conditional independence relations between the random variables, which we know to be
untrue. The user constraints available in GOBNILP 1.4 are now given.
Conditional independence relations. It may be that the user knows some conditional
independence relations that hold between the random variables. These can be declared and
GOBNILP will only return BNs respecting them.
(Non-)existence of particular arrows. If the user knows that particular arrows must occur
in the BN this can be stated. In addition, if certain arrows must not occur this too can be
declared.
(Non-)existence of particular undirected edges. If the user knows that there must be an
arrow between two particular nodes but does not wish to specify the direction this can be
declared. Similarly the non-existence of an arrow in either direction may be stated.
Immoralities. If two parents of some node do not have an arrow connecting them, this is
known as an 'immorality' (or v-structure). It is sometimes useful to state the existence or
non-existence of immoralities. This is possible in GOBNILP.
Number of founders. A founder is a BN node with no parents. Nodes A and S are the only
founders in Figure 1. GOBNILP allows the user to put upper and lower bounds on the
number of founders.
Number of parents. Each node in a BN is either a parent of some other node or not. In
Figure 1 all nodes are parents apart from the 'sink' nodes D and X. GOBNILP allows the
user to put upper and lower bounds on the number of parents. Such constraints were used
by Pe'er et al. [17] (not using GOBNILP!).
Number of arrows. The BN in Figure 1 has 8 arrows. GOBNILP allows the user to put
upper and lower bounds on the number of arrows.
In many cases adding the functionality to allow such user-defined constraints is very
easy—because an IP approach has been taken. Integer programming allows what might be
called 'declarative machine learning' where the user can inject knowledge into the learning
algorithm without having to worry about how that algorithm will use it to solve the
problem.
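For instance, the requirement that a particular arrow u → v must be present, or that a particular node must be a founder, reduces to a single linear constraint on the family variables. A sketch, again assuming the PySCIPOpt model and family-variable dictionary from the earlier sketch (and that the empty parent set is among each node's candidates), is:

    # Two examples of user constraints expressed on the family variables.
    from pyscipopt import quicksum

    def require_arrow(model, family_vars, u, v):
        # arrow u -> v present  <=>  the parent set chosen for v contains u
        model.addCons(quicksum(var for (w, W), var in family_vars.items()
                               if w == v and u in W) == 1)

    def require_founder(model, family_vars, v):
        # v has no parents  <=>  the empty parent set is chosen for v
        model.addCons(family_vars[(v, frozenset())] == 1)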
One final feature which the IP approach makes simple is the learning of multiple BNs.
It is very important to acknowledge that the output of any BN learning algorithm (even an
'exact' one) can only be a guess as to what the 'true' BN should be. Although one can have
greater confidence in the accuracy of this guess as the amount of data increases, the
impossibility of deducing the correct BN remains. Given this, it is useful to consider a
range of possible BNs. GOBNILP does this by returning the top k best scoring BNs,
where k is set by the user. This is simply done: once a highest scoring BN is found a
linear constraint is added ruling out just that BN and the problem is re-solved.
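One natural way to express such a constraint, continuing the PySCIPOpt sketches above, is to require that the families making up the BN just found are not all selected again:

    # Rule out exactly one previously found DAG: its |V| selected family
    # variables must not all equal 1 at the same time.
    from pyscipopt import quicksum

    def exclude_dag(model, family_vars, dag):
        """dag: dict v -> parent set chosen for v in the solution to be excluded."""
        model.freeTransform()   # allow the already-solved model to accept a new constraint
        model.addCons(quicksum(family_vars[(v, W)] for v, W in dag.items())
                      <= len(dag) - 1)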
2.4. Results
The IP approach to BN structure learning as implemented in GOBNILP 1.3 (not the
current version) has been evaluated by Bartlett and Cussens [2]. The main results from that
paper are reproduced here in Table 3 as a convenience. Synthetic datasets were generated
by sampling from the joint probability distributions defined by various Bayesian networks
(Column 'Network' in Table 3). p is the number of variables in the data set. m is the
limit on the number of parents of each variable. N is the number of observations in the
data set. Families is the number of family variables in the data set after pruning. All times
are given in seconds (rounded). A bracketed percentage (e.g. '[3%]') indicates that the problem
had not been solved within 2 hours; the value given is the gap, rounded to the nearest percent,
between the score of the best found BN and the upper bound on the score of the best potential
BN, as a percentage of the score of the best found BN. A limit on the size of parent sets was then set (Column
m ) and local BDeu scores for 'family' variables were then computed. An IP problem was
then created and solved as described in the preceding sections.
The goal of these empirical investigations was to measure the effect of different
strategies and particularly to check whether the sink-finding algorithm and propagation did
indeed lead to faster solving. Comparing the column GOBNILP 1.3 to the columns SPH
and VP showed that typically (not always) both the sink-finding algorithm and propagation
(respectively) were helpful.
However what is most striking is how sensitive solving time is to the choice of cutting
plane strategy. Table 3 shows that using three of SCIP's builtin generic cutting plane
algorithms (Gomory, Strong CG and Zero-half) has a big, usually positive, effect. Entries in
italics are at least 10% worse than GOBNILP 1.3, while those in bold are at least 10%
better. Turning these off and just using cluster constraint cutting planes typically led to
much slower solving.
It is also evident that adding set packing constraints leads to big improvements. By a set
packing constraint we mean an inequality of the form given in (4).
∀C ⊆ V:  Σ_{v ∈ C} Σ_{W : C\{v} ⊆ W} I(W → v) ≤ 1.                                 (4)
These inequalities state that for any subset C of nodes at most one v ∈ C may have
all other members of C as its parents. The effect of adding in all such inequalities for all
C such that |C| ≤ 4 is what is recorded in column SPC of Table 3. Doing so typically
leads to faster solving since it leads to tighter linear relaxations.
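A sketch of this step, reusing the hypothetical PySCIPOpt model and family-variable dictionary from the earlier sketches, adds the set packing constraints (4) for all clusters up to a given size:

    # Add set packing constraints (4) for all clusters C with |C| <= max_size.
    from itertools import combinations
    from pyscipopt import quicksum

    def add_set_packing(model, family_vars, nodes, max_size=4):
        for size in range(2, max_size + 1):
            for C in combinations(nodes, size):
                Cset = frozenset(C)
                # families I(W -> v) with v in C and C \ {v} a subset of W
                relevant = [var for (v, W), var in family_vars.items()
                            if v in Cset and (Cset - {v}) <= W]
                if len(relevant) > 1:
                    model.addCons(quicksum(relevant) <= 1)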
To understand these results it is useful to dip into the theory of integer programming
and consider the convex hull of DAGs represented using family variables. Each DAG so
represented can be seen as a point in ℝ^n where n is the total number of family variables
in some BN learning problem instance. The convex hull of all such points is an
n -dimensional polyhedron (or more properly polytope) whose vertices correspond to DAGs.
If it were possible to compactly define this shape using a modest number of inequalities
one could construct a linear program (LP) (not an IP) with just these inequalities and there
would be a guarantee that any solution to the LP would be an optimal DAG. Unfortunately,
any such convex hull would require very many inequalities to define it, so it is necessary to
resort to approximating the convex hull by a much smaller number of inequalities. What the
results in this section show is that constructing a good approximation is vital, because a
tighter approximation means that solutions to the linear relaxations provide better bounds.
Using the set packing constraints (4) and SCIP's generic cutting planes provides a much
better approximation than the cluster constraints (3) alone, which leads to the improved
solving times shown in Table 3.
Table 3. Comparison of GOBNILP 1.3 with older systems and impact of various features.
[Table 3 (not reproduced in full here) reports results for the networks hailfinder (p = 56; m = 3 and m = 4), alarm (p = 37; m = 3 and m = 4), carpo (p = 60; m = 3 and m = 4), diabetes (p = 413; m = 2) and pigs (p = 441; m = 2), each with N = 100, 1000 and 10000. For every combination it gives the number of family variables after pruning and the solving times (in seconds) for GOBNILP 1.3, GOBNILP 1.0 and Cussens 2011, for GOBNILP 1.3 with each solver feature (SPC, SPH, VP) disabled in turn, and for GOBNILP 1.3 with each type of generic cut (G, SCG, ZH) disabled in turn. Bracketed percentages are the optimality gaps remaining at the 2-hour limit.]
Key: SPC - Set Packing Constraints, SPH - Sink Primal Heuristic, VP - Value Propagator, G - Gomory cuts, SCG - Strong CG cuts, ZH - Zero-half cuts.
3. Conclusions and Future Work
It is instructive to examine which benefits of Bayesian networks are stressed by
commercial vendors of BN software such as Hugin Expert A/S (www.hugin.com), Norsys
Software Corp (www.norsys.com) and Bayesia (www.bayesia.com). A number of themes
stand out:
1. The graphical structure of a BN allows one to 'read off' relationships between
variables of interest.
2. It is possible to 'learn' BNs from data (using, perhaps, expert knowledge also).
3. Since a BN represents a probability distribution the strength of a (probabilistic)
relation is properly quantified.
4. BNs can be used for making predictions.
5. By adding nodes representing actions and costs, Bayesian networks can be
extended into decision networks to help users make optimal decisions in
conditions of uncertainty.
The following extract from Bayesia's website stresses the first two of these benefits:
You can use the power of non-supervised learning to extract the set of
significant probabilistic relations contained in your databases (base
conceptualisation). Apart from significant time savings made by revealing direct
probabilistic relations compared with a standard analysis of the table of
correlations, this type of analysis is a real knowledge finding tool helping one
understand phenomena. [3]
So BN learning is an important task, but it is also known to be NP-hard (which means
that one cannot expect to have an algorithm which performs learning in time polynomial in
the size of the input). Nonetheless, it has been shown that integer programming is an
effective approach to 'exact' learning of Bayesian networks in certain circumstances.
However, current approaches have severe limitations. In particular, in order to prevent too
many IP variables from being created, restrictions, often artificial, are imposed on the
number of these variables. Yet in any solution only one IP variable for each BN node has
a non-zero value (indicating the selected set of parents for that node). This suggests seeking
to avoid creating IP variables unless there is some prospect that they will have a non-zero
value in the optimal solution. Fortunately, there is a well-known IP technique which does
exactly this: delayed column generation [13] where variables are created 'on the fly'. A 'pricing'
algorithm is used to search for new variables which might be needed in an optimal solution.
This technique has yet to be applied to Bayesian network learning but it holds out the
possibility, at least, of allowing exact approaches to be applied to substantially bigger
problems.
Acknowledgements
Thanks to an anonymous referee for useful criticisms. This work has been supported
by the UK Medical Research Council (Project Grant G1002312).
References
1. Achterberg, T. (2007). Constraint Integer Programming. Ph.D. thesis, TU Berlin.
2. Bartlett, M. and Cussens, J. (2013). Advances in Bayesian network learning using integer programming. In Nicholson, A. and Smyth, P., editors, Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), 182-191, Bellevue. AUAI Press.
3. Bayesia. (2013). The strengths of Bayesia's technology for marketing in 18 points. http://www.bayesia.com/en/applications/marketing/advantages-marketing.php.
4. Bøttcher, S. G. and Dethlefsen, C. (2003). DEAL: A package for learning Bayesian networks. Technical report, Department of Mathematical Sciences, Aalborg University.
5. de Campos, C. P., Zeng, Z. and Ji, Q. (2009). Structure learning of Bayesian networks using constraints. Proceedings of the 26th International Conference on Machine Learning, 113-120, Canada.
6. Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1-2), 43-90.
7. Chickering, D. M., Geiger, D. and Heckerman, D. (1995). Learning Bayesian networks: Search methods and experimental results. Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, 112-128, USA.
8. Claassen, T., Mooij, J. and Heskes, T. (2013). Learning sparse causal models is not NP-hard. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13), 172-181, USA.
9. Cooper, G. F. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.
10. Cussens, J. (2011). Bayesian network learning with cutting planes. In Cozman, F. G. and Pfeffer, A., editors, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), 153-160, Barcelona. AUAI Press.
11. Cussens, J. (2010). Maximum likelihood pedigree reconstruction using integer programming. Proceedings of the Workshop on Constraint Based Methods for Bioinformatics (WCB-10), Edinburgh.
12. Cussens, J., Bartlett, M., Jones, E. M. and Sheehan, N. A. (2013). Maximum likelihood pedigree reconstruction using integer linear programming. Genetic Epidemiology, 37(1), 69-83.
13. Desaulniers, G., Desrosiers, J. and Solomon, M. M. (2005). Column Generation. Springer, USA.
14. Jaakkola, T., Sontag, D., Globerson, A. and Meila, M. (2010). Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Italy, 9, 358-365. Journal of Machine Learning Research Workshop and Conference Proceedings.
15. Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
16. Lauritzen, S. L. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society (Series B), 50(2), 157-224.
17. Pe'er, D., Tanay, A. and Regev, A. (2006). MinReg: A scalable algorithm for learning parsimonious regulatory networks in yeast and mammals. Journal of Machine Learning Research, 7, 167-189.
18. Silander, T. and Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 445-452, AUAI Press, USA.
19. Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, Prediction and Search. Springer-Verlag, New York.
20. Tsamardinos, I., Brown, L. E. and Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31-78.
21. Verma, T. and Pearl, J. (1992). An algorithm for deciding if a set of observed independencies has a causal explanation. Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence (UAI-92), 323-330.
22. Wolsey, L. A. (1998). Integer Programming. John Wiley.
23. Wolter, K. (2006). Implementation of Cutting Plane Separators for Mixed Integer Programs. Master's thesis, Technische Universität Berlin.
24. Yuan, C. and Malone, B. (2012). An improved admissible heuristic for learning optimal Bayesian networks. Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI-12), Catalina Island, CA.
Author’s Biography:
James Cussens received his Ph.D. in the philosophy of probability from King's College,
London, UK. After spells working at the University of Oxford (twice), Glasgow
Caledonian University and King's College, London, he joined the University of York as a
Lecturer in 1997. He is currently a Senior Lecturer in the Artificial Intelligence Group,
Dept of Computer Science and also a member of the York Centre for Complex Systems
Analysis. He works on machine learning, probabilistic graphical models, discrete
optimisation and combinations thereof.