Quality Technology & Quantitative Management, Vol. 11, No. 1, pp. 99-110, 2014. © ICAQM 2014

Integer Programming for Bayesian Network Structure Learning

James Cussens*
Department of Computer Science and York Centre for Complex Systems Analysis, University of York, York, UK
(Received July 2013, accepted December 2013)

* Corresponding author. E-mail: [email protected]

Abstract: Bayesian networks provide an attractive representation of structured probabilistic information, so there is much interest in 'learning' BNs from data. This paper presents the problem of learning a Bayesian network using integer programming. The SCIP (Solving Constraint Integer Programs) framework is used to do this. Although cutting planes are a key ingredient of the approach, primal heuristics and efficient propagation are also important.

Keywords: Bayesian networks, integer programming, machine learning.

1. Introduction

A Bayesian network (BN) represents a probability distribution over a finite number of random variables. In this paper, unless specified otherwise, it is assumed that all random variables are discrete. A BN has two components: an acyclic directed graph (DAG) representing qualitative aspects of the distribution, and a set of parameters. Figure 1 presents the structure of the famous 'Asia' BN, which was introduced by Lauritzen and Spiegelhalter [16]. This BN has 8 random variables A, T, X, E, L, D, S and B. It represents an imagined probabilistic medical 'expert system' where A = visit to Asia, T = Tuberculosis, X = Normal X-ray result, E = Either tuberculosis or lung cancer, L = Lung cancer, D = Dyspnea (shortness of breath), S = Smoker and B = Bronchitis. Each of these random variables has two values: TRUE (t) and FALSE (f).

A joint probability distribution for these 8 random variables must specify a probability for each of the 2^8 joint instantiations of the random variables. To specify these 2^8 probabilities some parameters are needed, and to explain what these parameters are some terminology is now introduced. In a BN, if there is an arrow from node X to node Y we say that X is a parent of Y (and that Y is a child of node X). The parameters of a BN are defined in terms of the set of parents each node has. They are conditional probability tables (CPTs), one for each random variable, which specify a distribution for the random variable for each possible joint instantiation of its parents. So, for example, the CPT for D in Figure 1 could be:

    P(D=t | B=f, E=f) = 0.3        P(D=f | B=f, E=f) = 0.7
    P(D=t | B=f, E=t) = 0.4        P(D=f | B=f, E=t) = 0.6
    P(D=t | B=t, E=f) = 0.5        P(D=f | B=t, E=f) = 0.5
    P(D=t | B=t, E=t) = 1.0        P(D=f | B=t, E=t) = 0.0

Note that deterministic relations can be represented using 0 and 1 values for probabilities. If a random variable has no parents (like A and S in Figure 1) an unconditional probability distribution is defined for its values. For example, the CPT for A might be P(A=t) = 0.1, P(A=f) = 0.9.

Figure 1. An 8-node DAG which is the structure of a BN (the 'Asia' BN [16]) with random variables A, T, X, E, L, D, S, and B.

The probability of any joint instantiation of the random variables is given by multiplying the relevant conditional probabilities found in the CPTs.
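To make this factorisation concrete, here is a small Python sketch (illustrative only: apart from the CPTs for D and A given above, and the deterministic definition of E as 'tuberculosis or lung cancer', every probability below is invented for the example). It computes the probability of one joint instantiation of the 'Asia' network by multiplying the relevant CPT entries.

```python
# The joint distribution of a BN factorises as
#   P(x_1, ..., x_n) = prod_v P(x_v | parents of x_v),
# with each factor read off the CPT of node v. Graph: the 'Asia' DAG of Figure 1.

PARENTS = {
    "A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
    "E": ["T", "L"], "X": ["E"], "D": ["B", "E"],
}

# CPT[v][parent values] = P(v = True | parents); P(v = False | ...) is 1 minus that.
CPT = {
    "A": {(): 0.1},                                    # from the text
    "S": {(): 0.5},                                    # invented
    "T": {(True,): 0.05, (False,): 0.01},              # invented
    "L": {(True,): 0.10, (False,): 0.01},              # invented
    "B": {(True,): 0.60, (False,): 0.30},              # invented
    "E": {(t, l): 1.0 if (t or l) else 0.0             # E = T or L (deterministic)
          for t in (True, False) for l in (True, False)},
    "X": {(True,): 0.02, (False,): 0.95},              # invented
    "D": {(False, False): 0.3, (False, True): 0.4,     # the CPT for D given above
          (True, False): 0.5, (True, True): 1.0},
}

def joint_probability(assignment):
    """Multiply, over all nodes, the CPT entry selected by the assignment."""
    p = 1.0
    for v, parents in PARENTS.items():
        p_true = CPT[v][tuple(assignment[u] for u in parents)]
        p *= p_true if assignment[v] else 1.0 - p_true
    return p

x = {"A": False, "S": True, "T": False, "L": False,
     "B": True, "E": False, "X": False, "D": True}
print(joint_probability(x))   # one of the 2**8 joint probabilities
```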
Although there are 2^8 such joint instantiations for the BN in Figure 1, the number of parameters of the BN is far smaller, so BNs provide a compact representation. They can do this because the BN structure encodes conditional independence assumptions about the random variables. A full account of this will not be given here: the interested reader should consult Koller and Friedman's excellent book on probabilistic graphical models [15]. The basic idea, however, is that if a node (or collection of nodes) V3 'blocks' a path in the graph between two other nodes V1 and V2, then V1 is independent of V2 given V3 (Koller and Friedman [15] provide a proper definition of what it means to 'block' a path). So, for example, in Figure 1 A is dependent on E, D and X, but it is independent of these random variables given T. To put it informally: knowing about A tells you something about E, D and X, but once you know the value of T, A provides no further information about E, D or X; it is only via T that A provides information about them.

The graph allows one to 'read off' these relationships between the variables. For example, recall that in the 'Asia' BN in Figure 1, S = Smoker, L = Lung cancer, B = Bronchitis, and D = Dyspnea (shortness of breath). The structure of the graph tells us that smoking influences dyspnea, but only as a result of lung cancer or bronchitis. Such structural information can provide considerable insight, but this raises the question of how it can be reliably obtained. Two main approaches are taken. In the first, a domain expert is asked to provide the structure. There are, of course, many problems with such a 'manual' approach: experts' time is expensive, experts may disagree and make mistakes, and any expert used would first have to understand the semantics of BNs. An appealing alternative is to infer BN structure directly from data. Any data which can be viewed as having been sampled from some unknown joint probability distribution is appropriate; the goal is to learn a BN structure for this unknown distribution. For example, supposing again that the BN in Figure 1 is a medical expert system, it could be inferred from a database (a single table) of patient records, where for each patient there is a field recording whether they smoke, have lung cancer, have bronchitis, suffer from dyspnea, and so on.

There are currently two main approaches to learning BN structure from data. In the first, statistical tests are performed with a view to determining conditional independence relations between the variables; an algorithm then searches for a DAG which represents the conditional independence relations thus found [6, 8, 19, 21]. In the second approach, often called search and score, each candidate DAG has a score which reflects how well it fits the data, and the goal is then simply to find the DAG with the highest score [4, 5, 7, 9, 18, 24]. It is also possible to combine elements of both these main approaches [20]. The difficulty with search and score is that the number of candidate DAGs grows super-exponentially with the number of variables, so a simple enumerative approach is out of the question for all but the smallest problems (a short computation below shows just how quickly this count grows). A number of search techniques have been applied, including greedy hill-climbing [19], dynamic programming [18], branch-and-bound [5] and A* [24]. In many cases the search is not complete, in the sense that there is no guarantee that the BN structure returned has an optimal score.
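The super-exponential growth can be made precise: Robinson's recurrence counts the labelled DAGs on n nodes. The short sketch below (added here for illustration; it is not part of the original paper) evaluates that recurrence and shows that the 8 nodes of the 'Asia' network already admit 783,702,329,343 candidate DAGs.

```python
# Robinson's recurrence for the number a(n) of labelled DAGs on n nodes:
#   a(n) = sum_{k=1..n} (-1)^(k-1) * C(n, k) * 2^(k(n-k)) * a(n-k),  a(0) = 1.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labelled DAGs on n nodes."""
    if n == 0:
        return 1
    return sum((-1) ** (k - 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 9):
    print(n, num_dags(n))
# 1, 3, 25, 543, 29281, 3781503, 1138779265, 783702329343
```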
In contrast to such incomplete methods, there has recently been much interest in complete (also known as exact) BN structure learning, where a search is conducted until a guaranteed optimal structure is returned.

2. Bayesian Network Structure Learning with IP

In the rest of this paper an integer programming approach to exact BN learning is described. The basic ideas of integer programming (IP) are first briefly presented. This is followed by an account of how BN structure learning can be encoded and efficiently solved using IP. Two important extensions are then described: adding structural prior information and finding multiple solutions. The article ends with a summary of how well this approach performs.

2.1. Integer Programming

In an integer programming problem the goal is to maximise a linear objective function subject to linear constraints, with the added restriction that all variables must take integer values. (Any minimisation problem can easily be converted into a maximisation problem, so here only maximisation problems are considered.) Let x = (x_1, x_2, ..., x_n) be the problem variables, where each x_i is integer-valued (i.e. can only take integer values), and assume that finite upper and lower bounds on each x_i are given. Let c = (c_1, c_2, ..., c_n) be the real-valued vector of objective coefficients for the problem variables. Viewing x as a column vector and c as a row vector, the problem of maximising cx with no constraints is easy: just set each x_i to its lower bound if c_i < 0 and to its upper bound otherwise. The problem becomes significantly harder once linear constraints on acceptable solutions are added. Each such constraint is of the form ax ≤ b, where a is a real-valued row vector and b is a real number. Many important industrial and business problems can be encoded as an IP, and there are many powerful solvers (such as CPLEX) which can provide optimal solutions even when there are thousands of problem variables and constraints.

A proper account of the many techniques of integer programming will not be provided here (for that see Wolsey's book [22]), but some basics are now given. Although solving an IP may be very hard, solving the linear relaxation of an IP is much easier (the simplex algorithm is often used). The linear relaxation is the same as the original IP except that the variables are now permitted to take non-integer values. Note that if we are 'lucky' and the solution to the linear relaxation 'happens' to be integer-valued, then the original IP is also solved. In general this is not the case, but the solution to the linear relaxation does provide a useful upper bound on the objective value of an optimal integer solution.

Two important parts of the IP solving process (in addition to solving the linear relaxation) are the addition of cutting planes and branching. A cutting plane is a linear inequality not present in the original problem whose validity is implied by (1) the linear inequalities that are initially present and (2) the integrality restriction on the problem variables. Typically an IP solver will search for cutting planes which the solution to the linear relaxation (call it x*) does not satisfy. IP solvers typically contain a number of generic cutting plane algorithms (e.g. Gomory, Strong Chvátal-Gomory, zero-half [23]) which can be applied to any IP. In addition, users can create problem-specific cutting plane algorithms. Adding cutting planes will not rule out the yet-to-be-found optimal integer solution but will rule out x*.
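Before continuing, a tiny numerical example may help to make the relationship between an IP and its linear relaxation concrete. The sketch below is illustrative only: it uses SciPy's general-purpose linprog routine rather than the SCIP framework used in this paper, and the integrality argument assumes SciPy 1.9 or later. The relaxation's optimum (21, attained at a fractional point) is an upper bound on the integer optimum (20).

```python
# Maximise 5*x1 + 4*x2 subject to 6*x1 + 4*x2 <= 24, x1 + 2*x2 <= 6, with
# x1, x2 >= 0 and integer. linprog minimises, so the objective is negated.
from scipy.optimize import linprog

c = [-5, -4]
A_ub = [[6, 4], [1, 2]]
b_ub = [24, 6]
bounds = [(0, None), (0, None)]

# Linear relaxation: optimum 21 at the fractional point (3, 1.5) -- an upper bound.
relax = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("relaxation:", relax.x, -relax.fun)

# The IP itself: integrality=[1, 1] marks both variables as integer (SciPy >= 1.9).
ip = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, integrality=[1, 1])
print("integer optimum:", ip.x, -ip.fun)      # optimum 20 at (4, 0)
```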
It follows that adding cutting planes in this way will produce a new linear relaxation whose solution provides a tighter upper bound. In some problems it is possible to add sufficiently many cutting planes of the right sort that a linear relaxation is produced whose solution is entirely integer-valued; in such a case the original IP problem is solved. Typically this is not the case, so another approach is required, the most common of which is branching. In branching, a problem variable x_i is selected together with some appropriate integer value l. Two new subproblems are then created: one where x_i ≤ l and one where x_i ≥ l + 1. Usually a variable is selected which has a non-integer value in the linear relaxation solution x*. Since there are only finitely many variables, each with finitely many values, it is not difficult to see that one can search over all possible solutions by repeated branching. In practice this search is made efficient by pruning. Pruning takes advantage of the upper bound provided by the linear relaxation. It also uses the incumbent: the best (not necessarily optimal) solution found so far. If the upper bound for some subproblem is below the objective value of the incumbent, then the optimal solution of that subproblem is worse than the incumbent and no further work on the subproblem is necessary.

2.2. Bayesian Network Learning as an IP Problem

In this section it is shown how to represent the BN structure learning problem as an IP. This question has been considered in a number of papers [2, 10, 11, 12, 14]. Firstly, we need to create IP problem variables to represent the structure of DAGs. This is done by creating binary 'family' variables I(W→v) for each node v and candidate parent set W, where I(W→v) = 1 iff W is the parent set of v. In this encoding, the DAG in Figure 1 would be represented by a solution where I(∅→A) = 1, I(∅→S) = 1, I({A}→T) = 1, I({S}→L) = 1, I({S}→B) = 1, I({L,T}→E) = 1, I({E}→X) = 1, I({B,E}→D) = 1, and all other family variables have the value 0.

The next issue to consider is how to score candidate BNs: how do we measure how 'good' a given BN is for the data from which we are learning? A number of scores are used, but here only one is considered: log marginal likelihood, or the BDeu score. The BDeu score comes from looking at the problem from the perspective of Bayesian statistics. In that approach the problem is to find the 'most probable' BN given the data, i.e. to find a BN G which maximises P(G | Data). Using Bayes' theorem we have that P(G | Data) ∝ P(G) P(Data | G), where P(G) is the prior probability of BN G and P(Data | G) is the marginal likelihood. If we have no prior bias between the candidate BNs it is reasonable for P(G) to have the same value for all G; in this case maximising marginal likelihood, or indeed log marginal likelihood, will maximise P(G | Data). (Note that the word 'Bayesian' in 'Bayesian networks' is misleading, since BNs are no more Bayesian than other probabilistic models which do not have the word 'Bayesian' in their name.) Crucially, given certain restrictions, the BDeu score can be expressed as a linear function of the family variables I(W→v) (hence the decision to encode the graph using them). So-called 'local scores' c(v,W) are computed from the data for each I(W→v) variable.
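To illustrate the encoding, the sketch below represents the DAG of Figure 1 by the family variables it sets to 1 and evaluates its score as a sum of local scores. The local scores used here are invented for the example (they are not values computed from data); the quantity computed is exactly the linear objective introduced next.

```python
# A DAG is encoded by the family variables I(W -> v) set to 1: one chosen parent
# set W per node v. Its score is the sum of the corresponding local scores c(v, W).

# The 'Asia' DAG of Figure 1, as node -> chosen parent set.
asia_dag = {
    "A": frozenset(), "S": frozenset(), "T": frozenset({"A"}),
    "L": frozenset({"S"}), "B": frozenset({"S"}), "E": frozenset({"L", "T"}),
    "X": frozenset({"E"}), "D": frozenset({"B", "E"}),
}

# Invented local scores c(v, W) for the chosen families only (log scores, hence negative).
local_score = {
    ("A", frozenset()): -1.2, ("S", frozenset()): -0.9,
    ("T", frozenset({"A"})): -2.1, ("L", frozenset({"S"})): -1.7,
    ("B", frozenset({"S"})): -1.5, ("E", frozenset({"L", "T"})): -0.4,
    ("X", frozenset({"E"})): -0.8, ("D", frozenset({"B", "E"})): -1.1,
}

def dag_score(dag):
    """Sum of c(v, W) over the family variables that the DAG sets to 1."""
    return sum(local_score[(v, w)] for v, w in dag.items())

print(dag_score(asia_dag))   # the linear objective evaluated at this DAG
```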
The BN structure learning problem then becomes the problem of maximising

    Σ_{v,W} c(v,W) I(W→v),    (1)

subject to the condition that the values assigned to the I(W→v) variables represent a DAG. Linear inequalities are therefore required to restrict instantiations of the I(W→v) variables so that only DAGs are represented. Firstly, it is easy to ensure that each BN variable (call BN variables 'nodes') has exactly one (possibly empty) parent set. Letting V be the set of BN nodes, the following linear constraints are added to the IP:

    ∀v ∈ V:  Σ_W I(W→v) = 1.    (2)

Ensuring that the graph is acyclic is trickier. The most successful approach has been to use the 'cluster' constraints

    ∀C ⊆ V:  Σ_{v∈C} Σ_{W: W∩C=∅} I(W→v) ≥ 1,    (3)

introduced by Jaakkola et al. [14]. A cluster is a subset of BN nodes; for each cluster C the associated constraint declares that at least one v ∈ C has no parents in C. Since there are exponentially many cluster constraints they are added as cutting planes in the course of solving: each time the linear relaxation of the IP is solved there is a search for a cluster constraint which is not satisfied by the linear relaxation solution. If no violated cluster constraint can be found there are two possibilities, depending on whether the linear relaxation solution (call it x*) has variables with fractional values or not. If there are no fractional variables then x* must represent a DAG, and moreover this DAG is optimal, since x* solves the linear relaxation and so attains an upper bound on the objective. Alternatively, x* may include variables with fractional values; if so, generic cutting plane algorithms are run in the hope of finding cutting planes which are not 'cluster' constraints (3). A deliberately simple illustration of the search for a violated cluster constraint is sketched below.
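The sketch below illustrates this separation step in a naive way (it is not GOBNILP's separation routine, which is considerably more sophisticated): given the values that the family variables take in the current linear relaxation solution, it looks for a small cluster whose constraint (3) is violated.

```python
# Naive separation for the cluster constraints (3): given the (possibly fractional)
# values x*[(v, W)] of the family variables, find a cluster C with
#     sum over v in C and W disjoint from C of I(W -> v)  <  1,
# i.e. a violated constraint to add as a cutting plane. Brute force over clusters
# of size <= max_size; purely illustrative.
from itertools import combinations

def find_violated_cluster(relax_sol, nodes, max_size=4, tol=1e-6):
    """relax_sol maps (node, frozenset(parent set)) -> value in [0, 1]."""
    for size in range(2, max_size + 1):
        for cluster in combinations(nodes, size):
            c = set(cluster)
            lhs = sum(val for (v, w), val in relax_sol.items()
                      if v in c and not (w & c))
            if lhs < 1.0 - tol:
                return c, lhs          # violated: add this cluster constraint
    return None, None

# A classic fractional solution: three nodes, each 'half-choosing' two parents.
# It satisfies constraints (2); no 2-node cluster constraint is violated,
# but the 3-node one is.
x_star = {
    ("A", frozenset({"B"})): 0.5, ("A", frozenset({"C"})): 0.5,
    ("B", frozenset({"A"})): 0.5, ("B", frozenset({"C"})): 0.5,
    ("C", frozenset({"A"})): 0.5, ("C", frozenset({"B"})): 0.5,
}
print(find_violated_cluster(x_star, ["A", "B", "C"]))
# -> ({'A', 'B', 'C'}, 0): no weight lies on parent sets outside the cluster,
#    so the cluster constraint for {A, B, C} is returned as a cutting plane.
```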
Figure 2. Branch-and-cut approach to solving an IP.

A standard 'branch-and-cut' approach, as summarised in Figure 2, is taken to solving the IP. Cutting planes are added (if possible) each time the linear relaxation is solved. If no suitable cutting planes can be found, progress is made by branching on a variable. Eventually this algorithm returns an optimal solution.

In addition to cutting and branching, two further ingredients are used to improve performance. The first is a 'sink-finding' primal heuristic which searches for a feasible integer solution (i.e. a DAG) 'near' the solution to the current LP relaxation. The point of this is to find a good (probably suboptimal) solution early in the solving process, since this allows earlier and more frequent pruning of the search if and when branching begins. To understand the sink-finding algorithm, recall that each family variable I(W→v) has an associated objective coefficient. It follows that the potential parent sets for each BN node can be ordered from 'best' (highest coefficient) to 'worst' (lowest coefficient). Suppose, without loss of generality, that the BN nodes are labelled {1, 2, ..., p} and let W_{v,1}, ..., W_{v,k_v} be the parent sets for BN node v ordered from best to worst, as illustrated in Table 1. (In this table the rows are shown as being of equal length for neatness, but this is typically not the case, since different BN nodes may have differing numbers of candidate parent sets.)

Table 1. Example initial state of the sink-finding heuristic for |V| = p. Rows need not be of the same length.

    I(W_{1,1}→1)   I(W_{1,2}→1)   ...   I(W_{1,k_1}→1)
    I(W_{2,1}→2)   I(W_{2,2}→2)   ...   I(W_{2,k_2}→2)
    I(W_{3,1}→3)   I(W_{3,2}→3)   ...   I(W_{3,k_3}→3)
    ...            ...            ...   ...
    I(W_{p,1}→p)   I(W_{p,2}→p)   ...   I(W_{p,k_p}→p)

Table 2. Example intermediate state of the sink-finding heuristic; entries marked with an asterisk have been ruled out.

    I(W_{1,1}→1)*  I(W_{1,2}→1)   ...   I(W_{1,k_1}→1)
    I(W_{2,1}→2)   I(W_{2,2}→2)   ...   I(W_{2,k_2}→2)
    I(W_{3,1}→3)   I(W_{3,2}→3)*  ...   I(W_{3,k_3}→3)
    ...            ...            ...   ...
    I(W_{p,1}→p)*  I(W_{p,2}→p)*  ...   I(W_{p,k_p}→p)

Each DAG must have at least one sink node, that is, a node which has no children. So any optimal DAG has a sink node for which one can choose its best parent set without fear of creating a cycle. It follows that at least one of the parent sets in the leftmost column of Table 1 must be selected in any optimal BN. The sink-finding algorithm works by selecting parent sets for each BN node. It starts by finding a BN node v such that the value of the family variable I(W_{v,1}→v) is as close to 1 as possible in the solution to the current LP relaxation. The parent set W_{v,1} is chosen for v, and then parent sets for other nodes containing v are 'ruled out', ensuring that v will be a sink node of the DAG eventually created (hence the name of the algorithm). Table 2 illustrates the state of the algorithm with v = 2, where 2 ∈ W_{1,1}, 2 ∈ W_{3,2}, 2 ∈ W_{p,1} and 2 ∈ W_{p,2}.

In its second iteration the sink-finding algorithm looks for a sink node of a DAG on the nodes V \ {v} in the same way: for each remaining node it considers the best allowable parent set, and chooses the one whose family variable has value closest to 1 in the solution to the linear relaxation. Subsequent iterations proceed analogously until a DAG is fully constructed. Since best allowable parent sets are chosen in each iteration, the hope is that a high-scoring (if not optimal) DAG will be returned. (A sketch of such a heuristic is given at the end of this subsection.)

The second extra ingredient for improving efficiency, propagation, can be described more briefly. Suppose that, due to branching decisions, the IP variables I({S}→L) and I({L,T}→E) have both been set to 1 in some subproblem. In this case it is immediate that, for example, the variable I({E}→S) should be set to 0 in this subproblem, since having all three set to 1 would result in a cyclic subgraph (S→L→E→S). Propagation allows such 'non-linear' reasoning within an IP approach and can bring important performance benefits.

Although the BN structure learning problem has now been cast as an IP, there are some problems with this approach. Firstly, there are exponentially many I(W→v) variables. To deal with this a restriction on candidate parent sets is typically made, usually by limiting the number of parents any node can have to some small number (e.g. 2, 3 or 4). It follows that the IP approach to BN learning is most appropriate for applications where such a restriction is reasonable. Secondly, it is necessary to precompute the local scores c(v,W), which can be a slow business.
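Returning to the sink-finding primal heuristic, the sketch below gives one possible reading of it (illustrative only, and not GOBNILP's implementation): repeatedly pick the node whose best still-allowable family variable has LP-relaxation value closest to 1, fix that family, and rule the node out as a parent of the nodes that remain, so that it becomes a sink. It assumes the empty parent set is always among a node's candidates, so an allowable choice always exists.

```python
# Sketch of a sink-finding-style primal heuristic (illustrative only).
#   best_order[v] : candidate parent sets for node v, ordered best-first by local score
#   relax_value   : LP-relaxation value of each family variable (v, W); missing -> 0
def sink_finding(best_order, relax_value):
    remaining = set(best_order)
    fixed = set()                          # nodes already chosen; banned as parents
    dag = {}
    while remaining:
        best_v, best_w, best_val = None, None, -1.0
        for v in remaining:
            for w in best_order[v]:
                if w & fixed:
                    continue               # not allowable: contains a fixed node
                val = relax_value.get((v, w), 0.0)
                if val > best_val:
                    best_v, best_w, best_val = v, w, val
                break                      # only the best allowable set is considered
        dag[best_v] = best_w               # fix this family; best_v becomes a sink
        remaining.remove(best_v)
        fixed.add(best_v)
    return dag

# Example with three nodes; the empty parent set is always a candidate.
best_order = {
    "A": [frozenset({"B"}), frozenset()],
    "B": [frozenset({"C"}), frozenset()],
    "C": [frozenset()],
}
relax = {("A", frozenset({"B"})): 0.9, ("B", frozenset({"C"})): 0.7}
print(sink_finding(best_order, relax))
# -> {'A': frozenset({'B'}), 'B': frozenset({'C'}), 'C': frozenset()}
```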
2.3. Adding Structural Constraints

This approach to BN structure learning has been implemented in the GOBNILP system, which is available for download from http://www.cs.york.ac.uk/aig/sw/gobnilp. GOBNILP uses the SCIP 'constraint integer programming' framework [1] (scip.zib.de). As well as implementing 'vanilla' BN learning using IP, GOBNILP allows the user to add additional constraints on the structure of BNs. This facility is very important in solving real problems, since domain experts typically have some knowledge of how the variables in their data are related. Failing to incorporate such knowledge (usually called 'prior knowledge') into the learning process will produce inferior results: we may end up with a BN expressing conditional independence relations between the random variables which we know to be untrue. The user constraints available in GOBNILP 1.4 are as follows.

Conditional independence relations. It may be that the user knows some conditional independence relations that hold between the random variables. These can be declared, and GOBNILP will only return BNs respecting them.

(Non-)existence of particular arrows. If the user knows that particular arrows must occur in the BN, this can be stated. In addition, if certain arrows must not occur, this too can be declared.

(Non-)existence of particular undirected edges. If the user knows that there must be an arrow between two particular nodes but does not wish to specify its direction, this can be declared. Similarly, the non-existence of an arrow in either direction may be stated.

Immoralities. If two parents of some node do not have an arrow connecting them, this is known as an 'immorality' (or v-structure). It is sometimes useful to state the existence or non-existence of immoralities; this is possible in GOBNILP.

Number of founders. A founder is a BN node with no parents; nodes A and S are the only founders in Figure 1. GOBNILP allows the user to put upper and lower bounds on the number of founders.

Number of parents. Each node in a BN is either a parent of some other node or not. In Figure 1 all nodes are parents apart from the 'sink' nodes D and X. GOBNILP allows the user to put upper and lower bounds on the number of nodes which are parents. Constraints of this kind were used by Pe'er et al. [17] (not using GOBNILP).

Number of arrows. The BN in Figure 1 has 8 arrows. GOBNILP allows the user to put upper and lower bounds on the number of arrows.

In many cases adding the functionality to allow such user-defined constraints is very easy, precisely because an IP approach has been taken. Integer programming allows what might be called 'declarative machine learning', where the user can inject knowledge into the learning algorithm without having to worry about how that algorithm will use it to solve the problem.

One final feature which the IP approach makes simple is the learning of multiple BNs. It is important to acknowledge that the output of any BN learning algorithm (even an 'exact' one) can only be a guess at what the 'true' BN should be. Although one can have greater confidence in the accuracy of this guess as the amount of data increases, it remains impossible to deduce the correct BN with certainty. Given this, it is useful to consider a range of possible BNs. GOBNILP does this by returning the k best-scoring BNs, where k is set by the user. This is simply done: once a highest-scoring BN is found, a linear constraint is added ruling out just that BN and the problem is re-solved (a sketch of such a constraint is given below).
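One natural linear constraint which rules out exactly the incumbent DAG is obtained from its selected family variables: if the incumbent sets the variables in a set F to 1, then requiring their sum to be at most |F| - 1 excludes that DAG and no other, since any other DAG differs from it in at least one family. The sketch below (illustrative, not GOBNILP code) builds such a constraint.

```python
# 'Rule out the incumbent' constraint for k-best learning: if the incumbent DAG
# sets the family variables in F to 1, add
#     sum over (v, W) in F of I(W -> v)  <=  |F| - 1.
def rule_out_constraint(incumbent_dag):
    """incumbent_dag maps each node v to its selected parent set W."""
    families = [(v, frozenset(w)) for v, w in incumbent_dag.items()]
    terms = [(1, fam) for fam in families]     # coefficient 1 on each selected variable
    rhs = len(families) - 1                    # constraint: sum(terms) <= rhs
    return terms, rhs

asia_dag = {"A": set(), "S": set(), "T": {"A"}, "L": {"S"}, "B": {"S"},
            "E": {"L", "T"}, "X": {"E"}, "D": {"B", "E"}}
terms, rhs = rule_out_constraint(asia_dag)
print(len(terms), "variables, right-hand side", rhs)   # 8 variables, right-hand side 7
```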
2.4. Results

The IP approach to BN structure learning as implemented in GOBNILP 1.3 (not the current version) has been evaluated by Bartlett and Cussens [2]. The main results from that paper are reproduced here in Table 3 for convenience. Synthetic datasets were generated by sampling from the joint probability distributions defined by various Bayesian networks (column 'Network' in Table 3). In the table, p is the number of variables in the data set, m is the limit on the number of parents of each variable, N is the number of observations in the data set, and 'Families' is the number of family variables remaining after pruning. All times are given in seconds (rounded). A bracketed percentage indicates that an optimal solution had not been found after 2 hours; the value given is the gap, rounded to the nearest percent, between the score of the best BN found and the upper bound on the score of the best possible BN, expressed as a percentage of the score of the best BN found.

For each dataset a limit on the size of parent sets was set (column m) and local BDeu scores for the family variables were computed. An IP problem was then created and solved as described in the preceding sections. The goal of these empirical investigations was to measure the effect of different strategies, and in particular to check whether the sink-finding algorithm and propagation did indeed lead to faster solving. Comparing the column GOBNILP 1.3 with the columns SPH and VP shows that typically (though not always) the sink-finding algorithm and propagation, respectively, were helpful. What is most striking, however, is how sensitive solving time is to the choice of cutting plane strategy. Table 3 shows that using three of SCIP's built-in generic cutting plane algorithms (Gomory, Strong CG and Zero-half) has a big, usually positive, effect; turning these off and using only the cluster-constraint cutting planes typically led to much slower solving. (In Table 3, entries in italics are at least 10% worse than GOBNILP 1.3, while those in bold are at least 10% better.) It is also evident that adding set packing constraints leads to big improvements. By a set packing constraint we mean an inequality of the form

    ∀C ⊆ V:  Σ_{v∈C} Σ_{W: C\{v}⊆W} I(W→v) ≤ 1.    (4)

These inequalities state that, for any subset C of nodes, at most one v ∈ C may have all the other members of C as its parents. The effect of adding all such inequalities for all C with |C| ≤ 4 is what is recorded in column SPC of Table 3. Doing so typically leads to faster solving since it gives tighter linear relaxations.

To understand these results it is useful to dip into the theory of integer programming and consider the convex hull of DAGs represented using family variables. Each DAG so represented can be seen as a point in R^n, where n is the total number of family variables in some BN learning problem instance. The convex hull of all such points is an n-dimensional polyhedron (or, more properly, polytope) whose vertices correspond to DAGs. If it were possible to define this shape compactly using a modest number of inequalities, one could construct a linear program (LP), not an IP, with just these inequalities, and any solution to that LP would be guaranteed to be an optimal DAG. Unfortunately, the convex hull requires very many inequalities to define it, so it is necessary to approximate it with a much smaller number of inequalities. What the results in this section show is that constructing a good approximation is vital, because solutions to the linear relaxations then provide better bounds. Using the set packing constraints (4) together with SCIP's generic cutting planes provides a much better approximation than the cluster constraints (3) alone, which leads to the improved solving times shown in Table 3. (A sketch enumerating the constraints (4) for |C| ≤ 4 is given below.)
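To make constraint (4) concrete, the sketch below (illustrative only) enumerates the set packing constraints for all clusters C with |C| ≤ 4 over a given node set. Each term (v, C\{v}) stands for the group of family variables I(W→v) with C\{v} ⊆ W; the constraint says that their total is at most 1.

```python
# Enumerate the set packing constraints (4) for |C| <= 4: for each cluster C,
# at most one node v in C may take a parent set containing all of C \ {v}.
from itertools import combinations

def set_packing_constraints(nodes, max_size=4):
    constraints = []
    for size in range(2, max_size + 1):
        for cluster in combinations(nodes, size):
            c = set(cluster)
            # One term per v in C; each term covers every I(W -> v) with
            # C \ {v} a subset of W.
            terms = [(v, frozenset(c - {v})) for v in cluster]
            constraints.append(terms)        # constraint: sum of covered vars <= 1
    return constraints

nodes = ["A", "T", "X", "E", "L", "D", "S", "B"]
print(len(set_packing_constraints(nodes)))   # C(8,2)+C(8,3)+C(8,4) = 28+56+70 = 154
```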
Table 3. Comparison of GOBNILP 1.3 with older systems, and the impact of various features. For each network (hailfinder, alarm, carpo, diabetes, pigs) the columns give the parent-set limit m, the number of variables p, the sample size N (100, 1000 or 10000) and the number of family variables after pruning, followed by solving times (in seconds) for GOBNILP 1.3, GOBNILP 1.0 and Cussens (2011), for GOBNILP 1.3 with one solver feature removed (SPC, SPH, VP), and for GOBNILP 1.3 with one class of generic cuts disabled (G, SCG, ZH). [Individual table entries are not legible in this copy; the full table appears in Bartlett and Cussens [2].] Key: SPC = Set Packing Constraints, SPH = Sink Primal Heuristic, VP = Value Propagator, G = Gomory cuts, SCG = Strong CG cuts, ZH = Zero-half cuts.

3. Conclusions and Future Work

It is instructive to examine which benefits of Bayesian networks are stressed by commercial vendors of BN software such as Hugin Expert A/S (www.hugin.com), Norsys Software Corp (www.norsys.com) and Bayesia (www.bayesia.com). A number of themes stand out:

1. The graphical structure of a BN allows one to 'read off' relationships between variables of interest.
2. It is possible to 'learn' BNs from data (using, perhaps, expert knowledge also).
3. Since a BN represents a probability distribution, the strength of a (probabilistic) relation is properly quantified.
4. BNs can be used for making predictions.
5. By adding nodes representing actions and costs, Bayesian networks can be extended into decision networks to help users make optimal decisions in conditions of uncertainty.

The following extract from Bayesia's website stresses the first two of these benefits:

    You can use the power of non-supervised learning to extract the set of significant probabilistic relations contained in your databases (base conceptualisation). Apart from significant time savings made by revealing direct probabilistic relations compared with a standard analysis of the table of correlations, this type of analysis is a real knowledge finding tool helping one understand phenomena. [3]

So BN learning is an important task, but it is also known to be NP-hard (which means that one cannot expect to have an algorithm which performs learning in time polynomial in the size of the input). Nonetheless, it has been shown that integer programming is an effective approach to 'exact' learning of Bayesian networks in certain circumstances. However, current approaches have severe limitations. In particular, in order to prevent too many IP variables being created, restrictions, often artificial, are imposed on the number of these variables. Yet in any solution only one IP variable for each BN node has a non-zero value (indicating the selected parent set for that node). This suggests seeking to avoid creating IP variables unless there is some prospect that they will have a non-zero value in the optimal solution.
Fortunately, there is a well-known IP technique which does exactly this: delayed column generation [13], where variables are created 'on the fly'. A 'pricing' algorithm is used to search for new variables which might be needed in an optimal solution. This technique has yet to be applied to Bayesian network learning, but it holds out the possibility, at least, of allowing exact approaches to be applied to substantially bigger problems.

Acknowledgements

Thanks to an anonymous referee for useful criticisms. This work has been supported by the UK Medical Research Council (Project Grant G1002312).

References

1. Achterberg, T. (2007). Constraint Integer Programming. Ph.D. thesis, TU Berlin.
2. Bartlett, M. and Cussens, J. (2013). Advances in Bayesian network learning using integer programming. In Nicholson, A. and Smyth, P., editors, Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), 182–191, Bellevue. AUAI Press.
3. Bayesia (2013). The strengths of Bayesia's technology for marketing in 18 points. http://www.bayesia.com/en/applications/marketing/advantages-marketing.php.
4. Bøttcher, S. G. and Dethlefsen, C. (2003). DEAL: A package for learning Bayesian networks. Technical report, Department of Mathematical Sciences, Aalborg University.
5. de Campos, C., Zeng, Z. and Ji, Q. (2009). Structure learning of Bayesian networks using constraints. Proceedings of the 26th International Conference on Machine Learning, 113–120, Canada.
6. Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002). Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence, 137(1-2), 43–90.
7. Chickering, D. M., Geiger, D. and Heckerman, D. (1995). Learning Bayesian networks: Search methods and experimental results. Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, 112–128, USA.
8. Claassen, T., Mooij, J. and Heskes, T. (2013). Learning sparse causal models is not NP-hard. Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI-13), 172–181, USA.
9. Cooper, G. F. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
10. Cussens, J. (2011). Bayesian network learning with cutting planes. In Cozman, F. G. and Pfeffer, A., editors, Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), 153–160, Barcelona. AUAI Press.
11. Cussens, J. (2010). Maximum likelihood pedigree reconstruction using integer programming. Proceedings of the Workshop on Constraint Based Methods for Bioinformatics (WCB-10), Edinburgh.
12. Cussens, J., Bartlett, M., Jones, E. M. and Sheehan, N. A. (2013). Maximum likelihood pedigree reconstruction using integer linear programming. Genetic Epidemiology, 37(1), 69–83.
13. Desaulniers, G., Desrosiers, J. and Solomon, M. M. (2005). Column Generation. Springer, USA.
14. Jaakkola, T., Sontag, D., Globerson, A. and Meila, M. (2010). Learning Bayesian network structure using LP relaxations. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Italy, 9, 358–365. Journal of Machine Learning Research Workshop and Conference Proceedings.
15. Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
16. Lauritzen, S. L. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society (Series B), 50(2), 157–224.
17. Pe'er, D., Tanay, A. and Regev, A. (2006). MinReg: A scalable algorithm for learning parsimonious regulatory networks in yeast and mammals. Journal of Machine Learning Research, 7, 167–189.
18. Silander, T. and Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 445–452, AUAI Press, USA.
19. Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, Prediction and Search. Springer-Verlag, New York.
20. Tsamardinos, I., Brown, L. E. and Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), 31–78.
21. Verma, T. and Pearl, J. (1992). An algorithm for deciding if a set of observed independencies has a causal explanation. Proceedings of the 8th Conference on Uncertainty in Artificial Intelligence (UAI-92), 323–330.
22. Wolsey, L. A. (1998). Integer Programming. John Wiley.
23. Wolter, K. (2006). Implementation of Cutting Plane Separators for Mixed Integer Programs. Master's thesis, Technische Universität Berlin.
24. Yuan, C. and Malone, B. (2012). An improved admissible heuristic for learning optimal Bayesian networks. Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI-12), Catalina Island, CA.

Author's Biography:

James Cussens received his Ph.D. in the philosophy of probability from King's College, London, UK. After spells working at the University of Oxford (twice), Glasgow Caledonian University and King's College, London, he joined the University of York as a Lecturer in 1997. He is currently a Senior Lecturer in the Artificial Intelligence Group, Department of Computer Science, and also a member of the York Centre for Complex Systems Analysis. He works on machine learning, probabilistic graphical models, discrete optimisation and combinations thereof.