Discovering Branching Rules for
Mixed Integer Programming by
Computational Intelligence

Kjartan Brjánn Pétursson

Faculty of Industrial Engineering,
Mechanical Engineering and Computer Science
University of Iceland
2015
DISCOVERING BRANCHING RULES FOR MIXED
INTEGER PROGRAMMING BY COMPUTATIONAL
INTELLIGENCE
Kjartan Brjánn Pétursson
60 ECTS thesis submitted in partial fulfillment of a
Magister Scientiarum degree in Computational Engineering
Advisors
Tómas Philip Rúnarsson
Páll Melsted
Faculty Representative
Eyjólfur Ingi Ásgeirsson
Faculty of Industrial Engineering,
Mechanical Engineering and Computer Science
School of Engineering and Natural Sciences
University of Iceland
Reykjavik, October 2015
Discovering Branching Rules for Mixed Integer Programming by Computational Intelligence
Discovering Branching Rules for Mixed Integer Programming
60 ECTS thesis submitted in partial fulfilment of an M.Sc. degree in Computational Engineering
© 2015 Kjartan Brjánn Pétursson
All rights reserved

Faculty of Industrial Engineering,
Mechanical Engineering and Computer Science
School of Engineering and Natural Sciences
University of Iceland
Hjarðarhagi 2
107 Reykjavik
Iceland
Telephone: 525 4000
Bibliographic information:
Kjartan Brjánn Pétursson, 2015, Discovering Branching Rules for Mixed Integer Programming
by Computational Intelligence, M.Sc. thesis, Faculty of Industrial Engineering,
Mechanical Engineering and Computer Science, University of Iceland.
Printing: Háskólaprent, Fálkagata 2, 107 Reykjavík
Reykjavik, Iceland, October 2015
Abstract
This thesis introduces two novel search methods for the automated design and discovery of heuristic branching rules employed by the branch-and-bound algorithm for Mixed Integer Programming (MIP). For the experiments, SCIP, an open-source MIP solver, is used. The rules are specifically designed and trained to be effective at solving the 0-1 Multidimensional Knapsack Problem. A supervised approach is developed, utilizing data collected at the root node of the solution tree, where each instance is solved once for each branching candidate available at the root. Rules are then designed via feature selection, with weights inferred through evolutionary search. The resulting branching rules are designed for use at the root node only, where the most important branching decisions are made. A second, non-supervised method assigns weights to heuristic components selected beforehand. Weights are assigned using (1+1)CMA-ES, which evaluates performance by solving every training instance with SCIP. For both methods, the branching rules are trained on small, randomly generated knapsack instances and, after implementation in SCIP, tested on randomly generated knapsack instances of various sizes and structures. Both methods are shown to generate rules that significantly outperform the state-of-the-art default branching rules of the SCIP solver on most groups of test instances.
Útdráttur (Icelandic abstract)
This thesis introduces two new methods for the design and discovery of branching rules for integer programming in SCIP, an open-source integer programming solver. The rules are specifically designed and trained for solving the 0-1 multidimensional knapsack problem with the branch-and-bound method. One method uses supervised learning on data collected by solving each instance once for each branching candidate at the root of the solution tree. After automatic feature selection, the feature weights are assigned with an evolutionary algorithm. Rules created with this method are intended for use at the root of the solution tree. The other method uses unsupervised learning, where the features have been selected beforehand and the weights are then found with the (1+1)CMA-ES evolutionary algorithm. Rules found with both methods are trained on small, randomly generated knapsack instances. The rules are then implemented in SCIP and tested on knapsack instances of various sizes and types. Both methods are shown to produce rules that speed up the solution of most classes of the test instances.
Contents

List of Figures ... ix
List of Tables ... xi
Glossary ... xiii
Acknowledgments ... xv

1. Introduction ... 1
   1.1. Motivation ... 1
   1.2. Contribution ... 2
   1.3. Related Work ... 3
   1.4. Overview ... 3

2. MIP and the Multidimensional Knapsack Problem ... 5
   2.1. Mixed Integer Programming - MIP ... 5
   2.2. The Branch-and-Bound Algorithm ... 6
   2.3. Other Solving Tools of a Branch-and-Bound Based Solver ... 6
   2.4. SCIP ... 7
   2.5. The Multidimensional Knapsack Problem ... 8
   2.6. Summary ... 8

3. Branching Rules ... 11
   3.1. The Score Function ... 11
   3.2. Pseudocost Branching ... 12
   3.3. Strong Branching ... 12
   3.4. Other Branching Rules ... 13
   3.5. Hybrid Branching Rules ... 14
        3.5.1. Reliability Branching ... 14
        3.5.2. Hybrid Reliability/Inference Branching ... 15
   3.6. Summary ... 15

4. Methods for Learning of Branching Rules Using Computational Intelligence ... 17
   4.1. (1+1)CMA-ES ... 17
        4.1.1. Implementation of (1+1)CMA for Learning of Branching Rules ... 20
   4.2. Performance Measures and Rule Components ... 21
        4.2.1. Types of Features Used ... 21
        4.2.2. Data Normalization ... 22
   4.3. Supervised Approach for Learning a Root Node Branching Rule ... 23
        4.3.1. Collection of Training Data ... 23
        4.3.2. Inference of Weights ... 24
        4.3.3. An Algorithm for Feature Selection and Training ... 26
   4.4. Non-Supervised Black Box Approach for Learning of Branching Rules ... 29
   4.5. Generation of Training Instances ... 33
   4.6. Summary ... 34

5. Experimental Study: Supervised Method ... 35
   5.1. Experimental Setup ... 35
   5.2. Results ... 38
   5.3. Summary and Discussion ... 39

6. Experimental Study: Non-Supervised Method ... 41
   6.1. Experimental Setup ... 41
   6.2. Analysis of Validation Results ... 42
   6.3. Test Results ... 46
        6.3.1. Results for Test Sets 1 through 4 ... 47
        6.3.2. Results for Test Set 5 ... 47
   6.4. Summary and Discussion ... 49

7. Conclusion and Future Work ... 51

Bibliography ... 53

A. Full List of Features ... 57
   A.1. List of Features with Names and Explanations ... 57
   A.2. Full Feature Listing Along With Relative Performances ... 60

B. Model results ... 63

C. Full Test Results ... 67
   C.1. Supervised Method ... 67
   C.2. Non-Supervised Method ... 71
        C.2.1. Average Number of Nodes ... 72
        C.2.2. Average Number of Simplex Iterations ... 73
        C.2.3. Average Runtime ... 74
List of Figures

4.1. Distribution of evolved weight values returned for each feature by type of rule ... 33

6.1. Distribution of average performances of all rules by problem set as well as the geometric mean ("Mean") over all sets. Averages for the HIB rule are shown for comparison (dashed lines). ... 44

6.2. Confidence intervals of estimated parameters β̂_l of model 6.10 ... 45
List of Tables

4.1. Feature components for each type of evolved branching rule ... 31

4.2. Number of rules evolved for each type of rule, fitness measure used and number of problems used for training ... 32

5.1. Combinations of parameters used for generation of testing instances for the supervised method ... 36

5.2. Combinations of parameters used for generation of testing instances for the supervised method ... 36

5.3. List of rules, generated with time as performance measure, used for testing, and their feature make-up ... 36

5.4. List of rules, generated with nodes as performance measure, used for testing, and their feature make-up ... 37

5.5. List of rules, generated with iterations as performance measure, used for testing, and their feature make-up ... 38

5.6. Test results for test sets 1-4, for all tested rules with d̂ set at the optimal level in each case ... 39

6.1. Combinations of parameters used for generation of testing instances for the non-supervised method ... 42

6.2. Combinations of parameters used for generation of testing instances for the non-supervised method ... 42

6.3. Test results for test sets 1-4, for the three highest-ranking rules (O_y) with respect to each performance measure ... 47

6.4. Test results for test set 5, for the three highest-ranking rules (O_y) with respect to each performance measure ... 48

A.1. Full list of features (excluding con, conl, cut, inf) along with performance relative to the HIB rule ... 61

B.1. Fitted coefficients β̂_l of model 6.10 ... 64

B.2. Fitted coefficients α̂_l of model 6.10 ... 65

B.3. Fitted coefficients γ̂_l of model 6.10 ... 66

C.1. Test results for test set 1 for all tested rules and all tested maximum depths d̂ ... 67

C.2. Test results for test set 2 for all tested rules and all tested maximum depths d̂ ... 68

C.3. Test results for test set 3 for all tested rules and all tested maximum depths d̂ ... 69

C.4. Test results for test set 4 for all tested rules and all tested maximum depths d̂ ... 70

C.5. Average number of generated nodes η̄ in thousands (time in seconds)/[proportional gap in percentages] ... 72

C.6. Average number of performed simplex iterations ς̄ in thousands (time in seconds)/[proportional gap in percentages] ... 73

C.7. Average computation times t̄ in seconds/[proportional gap %̄_LP in percentages] ... 74
Glossary

SCIP  Solving Constraint Integer Programs, an integer programming solver.
MIP   Mixed integer programming.
SB    Strong branching.
RB    Reliability branching.
HIB   Hybrid inference/reliability branching.
Q     A specific node generated while solving a MIP instance.
I     The set of all integer/binary variables of a MIP instance.
Q̌     The current LP-relaxed node being solved.
F     The set of all fractional variables in I at Q̌.
F_0   The set of all fractional variables in I at the root node Q_0.
F_Q   The set of all fractional variables in I at a specific node Q.
P     Any set of multiple MIP instances.
p     A single mixed integer programming instance.
y     A general performance value associated with the solving of MIPs.
υ     Fitness value as used in an evolutionary algorithm.
ς     Total number of simplex iterations performed during the solving of MIPs.
η     Total number of nodes generated during the solving of MIPs.
t     Total runtime spent during the solving of MIPs.
ς_SB  Total number of simplex iterations performed during strong branching calculations of MIPs.
q_j±  Upwards/downwards (+/−) general measures of quality of variable j as a branching candidate.
Φ_j±  Upwards/downwards (+/−) average inference value of variable j.
x̂_j   Solution for variable x_j of the node LP relaxation.
f_j±  Upwards/downwards (+/−) fractional value of variable j.
Ψ_j±  Upwards/downwards (+/−) average pseudocost of variable j.
Acknowledgments

I am forever grateful to Sven Þ. Sigurðsson, Arnar Ingólfsson and my family, as well as to all of those who supported me during this study. I am especially grateful to my wife Hiroe, for her love and support throughout this work.

Warm thanks go to Sven Þ. Sigurðsson, Arnar Ingólfsson and my family, along with all the others who supported me in one way or another in the making of this thesis; of these, Hiroe deserves the lion's share.
1. Introduction
1.1. Motivation
The term heuristic describes an often simple methodology for making a decision during a step of a search algorithm. A wealth of search algorithms are based on one or more heuristics (Nilsson, 2014) for various applications, such as solving hard optimization problems. In most cases these heuristics are known to have given good results when employed on previous, related problems. More often than not, new heuristic algorithms are designed using the designer's insight and knowledge of the problem at hand. When more than one heuristic is employed in a search algorithm, some sort of algorithmic procedure or balance has to be maintained in order for the search algorithm to solve the task at hand. The algorithm should be able to use the full potential of all its heuristic components and resolve any conflicts between them. Such an algorithm could be said to incorporate a hybrid heuristic (Resende and Werneck, 2004). A simple example would be assigning a vector of real-valued weights to multiple heuristic outcomes represented by numbers, thus making a real-valued decision function, or score function, as a linear combination of heuristic decision values. Often, and up until recently, the values assigned to such weights, or to other numeric control parameters of an algorithm, have been the product of a combination of the designer's knowledge and trial-and-error. Another aspect of algorithm design arises when the designer is faced with the choice of one or more out of multiple possible heuristic candidates for incorporation into an algorithm. Such decisions have also, at least until recently, mainly been made through the designer's knowledge and testing.
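The weighted-combination idea can be sketched as follows. The candidate names, feature values and weights below are purely hypothetical, chosen only to illustrate how a score function turns several heuristic outcomes into one decision:

```python
def hybrid_score(heuristic_values, weights):
    """Combine several heuristic outcomes for one candidate into a single
    real-valued score via a linear combination (a simple hybrid heuristic)."""
    return sum(w * v for w, v in zip(weights, heuristic_values))

# Hypothetical example: three heuristic measures for two candidates.
candidates = {
    "x1": [0.8, 0.2, 0.5],
    "x2": [0.4, 0.9, 0.1],
}
# Weights that a designer (or an automated search method) must choose.
weights = [0.6, 0.3, 0.1]

# The candidate with the highest combined score wins the decision.
best = max(candidates, key=lambda name: hybrid_score(candidates[name], weights))
```

Both methods developed in this thesis can be viewed as ways of choosing such weights automatically, rather than by trial-and-error.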
The main motivation for this work is to use computational intelligence, rather than specific domain knowledge or trial-and-error, to design and discover hybrid heuristic search algorithms for optimization. The field of search algorithm design chosen for this research is the branch-and-bound algorithm implemented in mixed integer programming (MIP). The branch-and-bound algorithm is an effective approach for solving MIPs, dating from the 1960s and incorporated into most, if not all, modern dedicated MIP solvers. The focus of this work is on the variable branching decision rule (or branching rule for short) of the branch-and-bound algorithm, applied to a specific type of integer programming problem, namely the 0-1 Multidimensional Knapsack Problem. An attempt is made, using a state-of-the-art open-source MIP solver and methods of computational intelligence, to design and discover effective branching rules for solving multidimensional knapsack problems. The rules are inferred using small, randomly generated knapsack instances, created with sets of parameters that have previously been used to generate random instances in the literature. These training instances typically take only a fraction of a second to solve. The rules are then applied to separate test instances of similar size, generated with the same set of parameters as the training instances, as well as to instances much larger in size or generated using different sets of parameters. The tests should then reveal whether the methods used are effective on the same types of problems as the training problems and, if so, whether they remain effective when applied to instances that differ in size or structure. To this end, hybrid branching rules, which are linear combinations of one or more branching heuristics, are designed by computational intelligence and then tested and compared against current state-of-the-art branching rules. Should these methods prove successful on some of the types of test problems that the new branching rules are applied to, a possible modification of the methods might enhance the domain of their successful application.
1.2. Contribution
This thesis presents two novel methods for the design of branching rules as described
above. The new rules are implemented in a state-of-the-art open source solver and then
thoroughly tested and compared against the solver’s default branching rules.
The first method is a supervised method for selecting heuristic components, which then become the features of the supervised learning approach, as well as for inferring a set of corresponding weights. This is accomplished using data consisting of features and associated performance values, logged while solving the MIP training instances. The data is collected after making a set of branching decisions at the root node of the branch-and-bound tree generated during the solution of each training instance. After the branching decision is made at the root, the rest of the problem is solved using the default branching rule of the solver. The resulting performance of each branching decision is logged as well. The (1+1)CMA-ES evolutionary algorithm is then embedded in a feature selection wrapper for the selection of heuristics and the inference of weights. The intended domain of application of the rules resulting from this supervised method is limited, since the rules are designed for use at the root node only, where the most important branching decisions of the whole solving process are made (Forrest et al., 1974). The rest of the instance the rule is applied to is then solved using the default branching rule of the solver. Since the method makes use of labelled data, i.e. explanatory variables, or features, as well as performance measures, the method is considered supervised.
The second approach is non-supervised and involves selecting heuristic features manually and then applying the (1+1)CMA-ES algorithm, which infers the corresponding weights by directly and repeatedly solving all of the training instances. Although this method requires much more computational resources, since the training instances are solved repeatedly, the resulting branching rules have a much wider intended domain of application than the rules resulting from the supervised method. The rules designed with the non-supervised method should be effective throughout the branch-and-bound tree, and therefore have more potential to significantly outperform the rules they are compared against.
1.3. Related Work

In recent years, an increasing amount of work has been carried out on the use of artificial intelligence for the enhancement and selection of optimization algorithms. Kotthoff (2012) and Hutter (2009) are recent overviews of algorithm selection and tuning methods, and Smith-Miles (2008) is an application to the quadratic assignment problem (QAP). Xu et al. (2011) presents an automatic tuning tool for mixed integer programming solvers. The focus of this thesis, however, is more in line with the automated design (or learning) of heuristics by constructing them from basic components, rather than the more limiting, but not necessarily less successful, tuning and selection of currently known heuristic algorithms. Burke et al. (2013) is an extensive overview of this line of research, using the term hyper-heuristics for methods that automate the design of heuristic algorithms for solving hard computational search problems. As is explained in chapter 4, Daumé and Eisner (2014) and Alvarez et al. (2014) are the two bodies of work that are probably most closely related to the current study.
1.4. Overview
The thesis is organized as follows. In chapter 2, the basic theory of MIP and MIP solving is presented. The multidimensional knapsack problem as well as the branch-and-bound procedure are explained in the same chapter. In chapter 3, the branching rules most commonly implemented in MIP solving are explained in detail, some of which are the current state of the art in modern commercial MIP solvers. Then, in chapter 4, the methods developed for this study are presented and described in detail, along with the applied (1+1)CMA-ES algorithm. In chapters 5 and 6 the experimental setup is presented for each method, as well as the results and analysis, followed by discussion. A summary of the main findings as well as possible directions of future work are then presented in chapter 7. For reference, full results of the tests and analysis performed are available in the appendix. Notation for MIP-related theory roughly follows that of Achterberg (2007).
2. MIP and the Multidimensional
Knapsack Problem
Mixed Integer Programming problems, or MIPs, are linear (unless otherwise stated) programming problems in which some or all of the decision variables are restricted to take on integral values. The solving of MIPs is important in a variety of different fields, such as telecommunications (Borndörfer and Grötschel, 2012) and process scheduling (Floudas and Lin, 2005).

In general, MIP solvers are based on the branch-and-bound algorithm while employing many other algorithms and heuristics (Lodi, 2010). The heuristics chosen and the order in which they are used greatly affect the time required to solve a particular MIP instance. One of the fastest, if not the fastest, non-commercial MIP solvers is SCIP (Koch et al., 2011).

The Multidimensional Knapsack Problem is one of the best studied classes of MIP (Puchinger et al., 2010). It has many applications in resource allocation (Kellerer et al., 2004). Due to its simplicity, it is a good candidate for being at the center of this study.
2.1. Mixed Integer Programming - MIP
MIP is defined in standard form as
z ∗ = min {cT x : Ax ≥ b, x ≥ 0, xk ∈ Z ∀k ∈ I, xj ∈ R ∀j ∈
/ I}
(2.1)
The optimal solution value z* is equal to cᵀx*, where x* is the vector of optimal values taken by the variables x, such that cᵀx is minimal. The set I denotes the set of integer variables.
The MIP is often solved by a divide-and-conquer approach, iteratively dividing the original MIP into smaller and smaller subproblems. Each subproblem is then initially solved
by relaxing the requirements of integrality. The Linear Programming (LP) relaxation of
the MIP is defined as
ẑ* = min {cᵀx : Ax ≥ b, x ≥ 0, x_j ∈ ℝ ∀j}    (2.2)
Here the integrality constraints of the MIP have been dropped. MIP is in general NP-hard, while LP is solvable in polynomial time, and the solution methods used in practice are efficient on all but pathological worst-case instances (Spielman and Teng, 2004).
2.2. The Branch-and-Bound Algorithm
The branch-and-bound algorithm iteratively divides each MIP problem into subproblems, or sub-MIPs, each having a smaller solution space than the parent problem. This is most
commonly implemented by splitting the domain of a fractional variable into two disjoint
parts, i.e. by branching on a variable. SCIP allows splitting problems into more than two
subproblems as well as branching on constraints. By default the problems are divided into
two subproblems. In this way, two so-called child sub-MIPs are created by bounding the variable in question. The bounds are determined by using the upwards- and downwards-rounded LP relaxation solution values, i.e. for an integer variable x_j with an LP-relaxed solution value of x̂_j, the new bounds for that variable in the two child subproblems will be x_j ≤ ⌊x̂_j⌋ and x_j ≥ ⌈x̂_j⌉ respectively.
While iteratively splitting the current sub-MIP into two new subproblems, the algorithm creates a tree structure in which each node represents a child subproblem. The root of the tree corresponds to the initial problem instance. The leaf nodes of the tree are subproblems that either have been solved or have yet to be solved. Unsolved leaf nodes are stored in a priority queue to be processed later. Leaf subproblems are considered solved if one of three criteria applies:
1. The subproblem LP-relaxation is infeasible
2. An integer feasible optimal solution has been found for the subproblem
3. It has been proven that an integer feasible optimal solution of the subproblem cannot be any better than the best known feasible solution (the global upper bound) of the original problem.
After a leaf node has been solved, that particular subproblem is not divided any further and the node is said to have been pruned.
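The branching and pruning steps described above can be illustrated with a minimal branch-and-bound sketch. For brevity, this toy example uses a one-dimensional 0-1 knapsack and a greedy LP-relaxation bound; it is not SCIP's actual implementation, and the function names and structure are illustrative only:

```python
def knapsack_branch_and_bound(profits, weights, capacity):
    """Minimal branch-and-bound sketch for the 0-1 knapsack problem.

    Each node branches on one item, creating two child subproblems
    (take the item, x_j = 1, or skip it, x_j = 0). The dual bound is the
    LP relaxation of the residual subproblem, which for a knapsack can be
    solved greedily by profit-to-weight ratio. A node is pruned when its
    bound cannot improve on the incumbent (the best solution found so far).
    """
    n = len(profits)
    # Sort items by ratio so the greedy LP relaxation is easy to compute.
    items = sorted(range(n), key=lambda j: profits[j] / weights[j], reverse=True)

    def lp_bound(level, value, room):
        # Fill the remaining capacity fractionally: this is the optimum of
        # the LP relaxation of the residual subproblem.
        for j in items[level:]:
            if weights[j] <= room:
                room -= weights[j]
                value += profits[j]
            else:
                return value + profits[j] * room / weights[j]
        return value

    best = 0

    def branch(level, value, room):
        nonlocal best
        if level == n or room == 0:          # leaf: integral solution
            best = max(best, value)
            return
        if lp_bound(level, value, room) <= best:
            return                           # prune: bound no better than incumbent
        j = items[level]
        if weights[j] <= room:               # child 1: take item j (x_j = 1)
            branch(level + 1, value + profits[j], room - weights[j])
        branch(level + 1, value, room)       # child 2: skip item j (x_j = 0)

    branch(0, 0, capacity)
    return best
```

Note that the knapsack is a maximization problem, so here the LP relaxation provides an upper bound and the incumbent a lower bound, mirroring (with roles reversed) the primal and dual bounds discussed above for the minimization form (2.1).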
2.3. Other Solving Tools of a Branch-and-Bound
Based Solver
The modern branch-and-bound based solver typically incorporates solving tools that work alongside the branch-and-bound algorithm (Lodi, 2010) to speed up the solving process, the most important of which are:
Node Selection involves selecting the subproblem that should be processed next after processing each node in the branch-and-bound tree. This decision making is driven by two often conflicting goals of the search. One is to find good feasible solutions quickly, to improve the primal (upper) bound and thus exclude large parts of the solution space early on in the solution process. This is achieved by diving deep into the tree to find feasible solutions quickly. The other goal of the node selection algorithm is to improve the global dual (lower) bound by choosing nodes that can possibly harbour good feasible solutions, possibly the optimal one (Wolsey, 2008).
Presolving transforms the problem instance into an equivalent instance that should be easier to solve (Van Roy and Wolsey, 1987). This can be done by removing irrelevant information such as redundant constraints or fixed variables, but also by strengthening the LP relaxation through tightening the bounds of variables or improving constraint coefficients. Presolving techniques can also extract information about variables and constraints that can be used later in the solving process. In the SCIP solver, restarts are sometimes applied after solving the root node (Achterberg, 2007). This is done if processing of the root node has led to the fixing of at least 5% of the integer variables. The solver can then start the solving process again with a simpler MIP problem.
Domain Propagation (Achterberg, 2007), also called node preprocessing, can be seen as
restricted local presolving. Domain propagation tightens the domains of variables by inspecting the constraints and domains (derived or original) of other variables (Savelsbergh,
1994).
Cut Separation techniques (Balas et al., 1996; Gomory, 1958) are effective when combined with branch-and-bound methods, and the resulting algorithm is sometimes referred to as branch-and-cut. The algorithm finds linear inequalities that separate the LP optimum from the convex hull of the true feasible set. Each such inequality forms a so-called cut. When such a cut is added to the relaxed LP problem, the current non-integer LP solution is no longer feasible. By applying this technique iteratively, an optimal integer solution can be found.
Primal Heuristics try to find good feasible solutions ad hoc without spending too much computing time. There is no guarantee that a primal heuristic will find such a solution, let alone an optimal one. Primal heuristics are nevertheless very important for finding good feasible solutions early on in the solving process (Lodi, 2010; Wolsey, 2008).
2.4. SCIP
SCIP, or Solving Constraint Integer Programs, is a software framework for solving constraint programming (CP), mixed integer programming (MIP), and satisfiability testing (SAT) problems by integrating the solving techniques for each into constraint integer programming (CIP) (Achterberg, 2009). SCIP has been developed by researchers at Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB) since 2002. The source code is available freely for academic and non-commercial use at the SCIP website: http://scip.zib.de. SCIP is currently one of the fastest, if not the fastest, non-commercial mixed integer programming (MIP) solvers (Koch et al., 2011; Meindl and Templ, 2012).
For solving MIP, CP or SAT problems, SCIP uses a branch-and-bound approach. The branch-and-bound process is complemented by the LP relaxations and cutting plane separators used for solving MIPs, as well as by the constraint domain propagation and conflict analysis used for solving CP and SAT problems (Achterberg, 2009). For this type of study, the availability of open source code for a state-of-the-art solver is a prerequisite.
2.5. The Multidimensional Knapsack Problem
A special case of MIP, and one of the most extensively studied combinatorial optimization problems, is the Multidimensional Knapsack Problem (MKP) (Puchinger et al., 2010). Due to its simple structure, and to the fact that it is well studied, it makes an ideal candidate for this study. The problem can be loosely described as follows: given a set of n items, with each item j having an associated profit c_j > 0, and m available resources, with each resource i having a capacity b_i > 0, each item, if chosen, consumes an amount a_ij ≥ 0 of resource i. The goal is then to choose a subset of items that gives the maximum sum of profits while not exceeding the capacity of any of the resources. The problem can be formalized as the mixed integer program:
z* = maximize cᵀx          (2.3a)
     s.t.     Ax ≤ b       (2.3b)
              x ∈ {0, 1}ⁿ  (2.3c)
where A is an m × n non-negative matrix, x is the variable vector of dimension n, and b and c are non-negative vectors of dimension m and n respectively. Each element x_j of the vector x is a 0-1 decision variable indicating whether item j should be selected.
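To make formulation (2.3) concrete, a tiny MKP instance can be solved by direct enumeration of all 2ⁿ binary vectors. The instance data below is made up for illustration, and enumeration is of course only viable for very small n:

```python
from itertools import product

def solve_mkp_by_enumeration(c, A, b):
    """Solve a small 0-1 multidimensional knapsack instance, directly
    following formulation (2.3): maximize c^T x s.t. Ax <= b, x binary."""
    n, m = len(c), len(b)
    best_value, best_x = 0, [0] * n
    for x in product((0, 1), repeat=n):
        # Feasibility: every resource i must respect its capacity b_i.
        if all(sum(A[i][j] * x[j] for j in range(n)) <= b[i] for i in range(m)):
            value = sum(c[j] * x[j] for j in range(n))
            if value > best_value:
                best_value, best_x = value, list(x)
    return best_value, best_x

# Made-up instance: n = 3 items, m = 2 resources.
value, x = solve_mkp_by_enumeration(
    c=[6, 5, 4], A=[[2, 3, 1], [4, 1, 2]], b=[4, 5]
)
```

For the instances studied in this thesis, such enumeration is hopeless, which is precisely why branch-and-bound and good branching rules matter.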
2.6. Summary
This chapter briefly explained the concept of Mixed Integer Programming problems and how they are generally solved using the branch-and-bound algorithm at the core of dedicated MIP solvers. In particular, the efficient non-commercial solver SCIP was introduced, as well as the Multidimensional Knapsack Problem.

In the next chapter, a few heuristic methods for choosing the most promising variable to branch on in a branch-and-bound process, or branching rules, are explained, as well as their implementations in SCIP.
3. Branching Rules
For the branch-and-bound algorithm, various strategies for choosing the best candidate variable to branch on have been developed over the years. Branching rules are
still at the heart of any modern branch-and-bound based solver (Lodi, 2010). Recently, a
number of branching rules have been developed and presented in the literature, such as
cloud branching (Berthold and Salvagnin, 2013), branching on nonchimerical fractionalities (Fischetti and Monaci, 2012) and backdoor branching (Fischetti and Monaci, 2011).
3.1. The Score Function
For the standard branching method, the set of potential branching variables are those
integer variables that are non-integer in the LP relaxation. Choosing the best candidate
may significantly enhance the performance of the branch-and-bound procedure. This
requires some sort of scoring function, with variable xj having score sj(q) : R^u → R
based on a set q of u given measurements of quality. Let qj^+ and qj^- be some measures of
quality associated with choosing variable xj to branch on, with qj^+ being associated with
the "upwards" child node, which obtains the bound xj ≥ ⌈x̂j⌉ on variable xj. Analogously,
qj^- is associated with the "downwards" child node, which obtains the bound xj ≤ ⌊x̂j⌋. A
possible score function could be defined as
score(qj^+, qj^-) = (1 − µ) · min{qj^+, qj^-} + µ · max{qj^+, qj^-}    (3.1)
where the score factor µ is some number between 0 and 1. In SCIP (Achterberg, 2007)
this value is set to 1/6 by default. However, SCIP uses a different score function as its default:

score(qj^+, qj^-) = max{qj^-, ε} · max{qj^+, ε}    (3.2)

where ε = 10^-6. Bounding the values by ε allows for comparisons between pairs of quality
measures (qj^+, qj^-) and (qk^+, qk^-) where for each pair one of the measures is zero. For both of
these functions, the variable with the highest score function value is chosen as the variable
to branch on, i.e. if the outcome of the score function for variable xj is denoted by sj, then
the index of the next branching variable is j = arg max_{k∈F} sk. Here sk = score(qk^+, qk^-)
and F is the set of fractional variables available to branch on.
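The two score functions and the arg-max selection can be sketched as follows; this is an illustrative sketch, not SCIP's code, and the function names are my own:

```python
EPS = 1e-6  # epsilon used in the default product score of eq. (3.2)

def score_weighted(q_down, q_up, mu=1.0 / 6.0):
    """Convex min/max combination of eq. (3.1); mu = 1/6 is SCIP's default factor."""
    return (1 - mu) * min(q_down, q_up) + mu * max(q_down, q_up)

def score_product(q_down, q_up, eps=EPS):
    """Product score of eq. (3.2); bounding by eps keeps pairs with a zero
    quality measure comparable to each other."""
    return max(q_down, eps) * max(q_up, eps)

def select_branching_variable(candidates, score):
    """Pick j = argmax_{k in F} score(q_k^-, q_k^+); `candidates` maps each
    fractional variable index to its pair (q_down, q_up)."""
    return max(candidates, key=lambda j: score(*candidates[j]))

cands = {0: (0.0, 2.0), 1: (0.5, 0.6), 2: (1.5, 0.1)}
j = select_branching_variable(cands, score_product)   # j == 1
```

Note how the product score penalizes one-sided candidates such as variable 0, whose downward measure is zero, much more strongly than the weighted sum does.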
3.2. Pseudocost Branching
Pseudocost branching (Benichou et al., 1971) keeps a historical record of the objective function gains obtained by branching on each integer variable. In SCIP this rule is implemented as follows (Achterberg, 2007):

Let λj^- and λj^+ be the gains in objective value per unit change in variable xj at node Q
after branching in the corresponding direction, or

λj^- = Δj^- / fj^-,    λj^+ = Δj^+ / fj^+    (3.3)
where fj^- = x̂j − ⌊x̂j⌋ and fj^+ = ⌈x̂j⌉ − x̂j are the fractional parts of the LP solution
value at the node from which the subproblems were branched. The pseudocosts of variable xj are the arithmetic means

Ψj^- = σj^- / ηj^-,    Ψj^+ = σj^+ / ηj^+    (3.4)
Here σj^- is the sum of λj^- over all problems Q where the variable xj was branched on
downwards and where the LP relaxation of Q was solved and was feasible, and ηj^- is the
number of these subproblems. σj^+ and ηj^+ are analogously defined for cases where the
variable xj was branched on upwards. Then the score function value (see subsection 3.1)
for variable xj is

sj = score(fj^- Ψj^-, fj^+ Ψj^+)    (3.5)
Initially the values for ηj^- and ηj^+ are zero for all j ∈ I, where I is the set of all integer
variables. The pseudocost Ψj^- of a variable xj with ηj^- = 0 is said to be downwards
uninitialized. Downwards uninitialized pseudocosts are set to Ψj^- = Ψav^-, where Ψav^- is
the average of the initialized downward pseudocosts over all variables. If no downward
pseudocosts are initialized, this value is set to 1. Upwards uninitialized pseudocosts are
set analogously. The pseudocosts of a variable are said to be uninitialized if they are
uninitialized in either or both directions (Achterberg, 2007).
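The bookkeeping of eqs. (3.3)–(3.5) can be sketched as a small class; this is an illustration with names of my own choosing, not SCIP's implementation:

```python
import math

class Pseudocosts:
    """Illustrative bookkeeping for pseudocost branching (eqs. 3.3-3.5)."""

    def __init__(self):
        self.sigma = {}  # sigma[(j, d)]: sum of per-unit gains lambda_j in direction d
        self.eta = {}    # eta[(j, d)]:   number of recorded observations

    def update(self, j, direction, obj_gain, frac):
        """Record lambda = Delta / f (eq. 3.3) after branching on x_j;
        direction is '-' or '+', frac is f_j^- or f_j^+."""
        key = (j, direction)
        self.sigma[key] = self.sigma.get(key, 0.0) + obj_gain / frac
        self.eta[key] = self.eta.get(key, 0) + 1

    def psi(self, j, direction, psi_avg=1.0):
        """Arithmetic-mean pseudocost (eq. 3.4); an uninitialized value falls
        back to the average psi_avg (1.0 when nothing is initialized)."""
        key = (j, direction)
        if self.eta.get(key, 0) == 0:
            return psi_avg
        return self.sigma[key] / self.eta[key]

    def score(self, j, x_hat, score_fn):
        """Pseudocost score s_j = score(f^- Psi^-, f^+ Psi^+) of eq. (3.5)."""
        f_down = x_hat - math.floor(x_hat)
        f_up = math.ceil(x_hat) - x_hat
        return score_fn(f_down * self.psi(j, '-'), f_up * self.psi(j, '+'))
```

The `score_fn` argument would be one of the score functions of eqs. (3.1) or (3.2); keeping it as a parameter mirrors the fact that the quality measures and the score function are independent design choices.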
3.3. Strong Branching
With strong branching (Applegate et al., 1995; Linderoth and Savelsbergh, 1999) the
methodology involves testing for progress in the dual bound for each of the candidate
variables before making any decision on which variable to branch on. Each fractional
variable is tested by introducing first an upper bound xj ≤ ⌊x̂j⌋ and then a lower bound
xj ≥ ⌈x̂j⌉, and solving the resulting LP subproblems to obtain estimates for the corresponding
lower bounds. The resulting changes in the objective values are then used as the measures
of quality. In full strong branching all fractional variables are tested in this way and
optimal solutions are found for all the LPs. Computational times for full strong branching
are quite large, so usually some restrictions are set on either the number of candidate
variables tested or on the number of simplex iterations performed while solving the strong
branching LPs, thus finding non-optimal solutions.
In SCIP (Achterberg, 2007) the number of candidate variables to be tested is determined
by stopping the process if no new best candidate has been found for λ = 8 candidates.
The variables are tested in the order of their pseudocost scores, so the ones with the largest
values are tested first. A maximum of κ = 100 candidate variables are tested. The limit
on the number of simplex iterations performed during strong branching testing, ς_SB^max, is
set to ς_SB^max = 2 ς_LP^av, while no fewer than 10 and no more than 500 iterations are applied.
Here ς_LP^av is the average number of simplex iterations per LP needed so far.
3.4. Other Branching Rules
The most infeasible branching rule (Achterberg, 2007) chooses the branching variable with
the fractional value closest to 0.5, that is

sj = θ(x̂j) = min{x̂j − ⌊x̂j⌋, ⌈x̂j⌉ − x̂j}    (3.6)

Here θ(x̂j) is the so-called fractionality of a variable. Then j = arg max_{k∈F} θ(x̂k) is the
index of the next branching variable. The rationale behind this rule is to branch on the
variable that "could go either way", i.e. has an equal tendency to be rounded up or down
in the integer solution of the current subproblem.
In contrast to most infeasible branching, the least infeasible branching rule (Achterberg,
2007) chooses the branching variable closest to integrality, that is

sj = max{x̂j − ⌊x̂j⌋, ⌈x̂j⌉ − x̂j}    (3.7)

and the variable branched on has index j = arg max_{k∈F} max{x̂k − ⌊x̂k⌋, ⌈x̂k⌉ − x̂k}.
The most infeasible and the least infeasible branching rules generally show poor performance; in fact, they have been shown to do little better than choosing the variable at random (Achterberg, 2007).
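Both fractionality-based rules fit in a few lines; the following is an illustrative sketch (function names are my own):

```python
import math

def fractionality(x_hat):
    """theta(x) = min{x - floor(x), ceil(x) - x}, as in eq. (3.6)."""
    f = x_hat - math.floor(x_hat)
    return min(f, 1.0 - f)

def most_infeasible(frac_vars):
    """Branch on the variable whose LP value is closest to 0.5.
    `frac_vars` maps variable index -> LP solution value."""
    return max(frac_vars, key=lambda j: fractionality(frac_vars[j]))

def least_infeasible(frac_vars):
    """Branch on the variable closest to integrality; 1 - theta equals
    the max-form score of eq. (3.7) for fractional values."""
    return max(frac_vars, key=lambda j: 1.0 - fractionality(frac_vars[j]))

lp_values = {0: 2.5, 1: 3.9, 2: 0.2}
# most_infeasible picks variable 0 (theta = 0.5), least_infeasible picks 1 (theta = 0.1)
```

The simplicity of these rules is exactly why their weak empirical performance is instructive: the LP value alone carries little information about future bound improvement.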
As opposed to past gains in the bound due to variable selection, an alternative measure
of branching quality is the number of domain deductions on other variables made after
branching on the variable in question. Analogously to pseudocosts, the inference branching
value of a variable xj, j ∈ I, is defined as

Φj^- = ϕj^- / νj^-,    Φj^+ = ϕj^+ / νj^+    (3.8)

where ϕj^- and ϕj^+ are the total numbers of inferences deduced after branching on the variable xj in the corresponding direction. νj^- and νj^+ are weighted counts of the corresponding
subproblems where domain propagation has been applied. The idea is that a variable with
a large historic inference value will be likely to produce more domain propagations and
thus smaller subproblems in the future. In SCIP, uninitialized inference values for non-binary integer variables are set to zero (Achterberg, 2007). For binary variables, inference
values are initialized using information obtained from presolving rounds.
SAT solvers learn so-called conflict clauses from the analysis of infeasible subproblems.
A branching rule for MIP based on this approach is conflict branching, which takes into
account whether the corresponding candidate variables have been used in recent conflict graph
analysis to produce conflict constraints. The appearance of each variable in a clause is
counted, while periodically the counted sum is divided by a constant. This rule is based on
the SAT assignment rule developed by Moskewicz et al. (2001). A similar rule, conflict
length branching, uses the average lengths of the conflict clauses a variable appears in
(Achterberg, 2007).
Lastly, cutoff branching favours branching variables according to the average number of
times that branching on the variable in question has led to either infeasible subproblems or to
nodes pruned by bound, that is, the number of cutoffs (Achterberg, 2007).
3.5. Hybrid Branching Rules
3.5.1. Reliability Branching
The weakness of strong branching is its computational cost, while the weakness of pseudocost branching is that in the beginning of the solution process not much information about
past branching is available. A way to avoid the drawbacks of each method is to combine
them and use strong branching estimates for uninitialized pseudocosts (Achterberg, 2007).
Reliability Branching (RB) (Achterberg, 2007) does this while also using strong branching
estimates for unreliable pseudocost values. The pseudocost of a variable xj is said to be
unreliable if min{ηj^+, ηj^-} < η_rel, where η_rel is the so-called reliability parameter.
In SCIP the value of the reliability parameter changes according to how the total number
of performed strong branching simplex iterations, ς_SB, compares to the total number of
regular node-solving simplex iterations, ς_LP (Achterberg, 2007). As a default, η_rel is set to
zero when ς_SB > ς_SB^max = (1/2) ς_LP + 10^5, resulting in no pseudocosts being deemed unreliable.
While ς_SB increases from (1/2) ς_SB^max to ς_SB^max, η_rel decreases linearly with ς_SB from 8 to 1. For
small values of ς_SB, that is when ς_SB < (1/2) ς_LP, η_rel is increased proportionally to the ratio
ς_LP / ς_SB, so that reliability branching resembles strong branching in the limit η_rel → ∞.
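The piecewise behaviour just described can be sketched as follows. This is an illustrative reading of the description above, not SCIP's actual implementation, and the thresholds are assumptions:

```python
def eta_rel(sb_iters, lp_iters, eta_high=8.0, eta_low=1.0):
    """Illustrative dynamic reliability parameter.
    sb_iters: total strong branching simplex iterations so far;
    lp_iters: total regular node-solving simplex iterations so far."""
    sb_max = 0.5 * lp_iters + 1e5
    if sb_iters > sb_max:
        return 0.0                                   # nothing is deemed unreliable
    if sb_iters >= 0.5 * sb_max:
        # linear decrease from eta_high to eta_low as sb_iters -> sb_max
        t = (sb_iters - 0.5 * sb_max) / (0.5 * sb_max)
        return eta_high - t * (eta_high - eta_low)
    if sb_iters < 0.5 * lp_iters:
        # grow proportionally to lp/sb: approaches strong branching in the limit
        return eta_high * (0.5 * lp_iters) / max(sb_iters, 1.0)
    return eta_high
```

The point of the sketch is the qualitative shape: when strong branching has consumed many simplex iterations relative to regular LP solving, the reliability threshold drops towards zero and pseudocosts take over.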
3.5.2. Hybrid Reliability/Inference Branching
As well as combining reliability branching and inference branching, SCIP's hybrid reliability/inference branching (HIB) rule takes into account whether the corresponding candidate
variables have been used in recent conflict graph analysis to produce conflict constraints,
as in conflict and conflict length branching (Achterberg, 2007). Also influencing the choice
of branching variables is the number of times that branching on the variable in question
has led to either infeasible subproblems or to nodes pruned by bound, i.e. the number of
cutoffs, as in cutoff branching. Let sj^rel, sj^inf, sj^con, sj^cut be the score function values for the
reliability branching, inference branching, conflict branching and cutoff branching rules
respectively. Using the scaling function

g : R≥0 → [0, 1),    g(x) = x / (x + 1)
the score function value of variable xj for hybrid reliability/inference branching is

sj = ω^rel g(sj^rel / s̄_Q^rel) + ω^inf g(sj^inf / s̄_Q^inf) + ω^con g(sj^con / s̄_Q^con) + ω^cut g(sj^cut / s̄_Q^cut)    (3.9)

where the s̄_Q values are historical values averaged over all integer variables of all solved
nodes Q of the original problem (see definition 4.2.1). The weights ω are set to ω^rel =
1, ω^inf = 10^-4, ω^con = 10^-4, and ω^cut = 10^-2 (Achterberg, 2007).
In general we can define a hybrid score function, s_hy(j), as a linear combination of k
component functions φi(j), i ∈ {1, . . . , k}, each of which is a function of one or more
quality measures. The component functions can be appropriately scaled if needed. A
score function output for variable j could then be:

s_hy(j) = ω1 φ1(x̂j) + ω2 φ2(x̂j) + . . . + ωk φk(x̂j) = ω^T φ(x̂j) ∈ R    (3.10)

where φi(x̂j) is the i-th appropriately scaled component, or feature, of the rule, and ωi
its assigned weight.
3.6. Summary
This chapter explained how, for each branching rule, the score function is used to select
the most promising branching candidate to branch on next. A few branching rules, in
particular those implemented in SCIP, were explained in some detail, and some of their
strengths and weaknesses were discussed.
In the next chapter, methods for learning and discovering new branching rules are introduced.
4. Methods for Learning of
Branching Rules Using
Computational Intelligence
Computational learning for the design and selection of heuristics for branch-and-bound in mixed integer programming has very recently received some attention. Liberto et al. (2013) introduce an approach that dynamically switches branching rules, and
does so by learning to cluster instances and subproblems based on instance-specific information and the current node depth. Daumé and Eisner (2014) use imitation learning to
train a general node selection rule on a collection of various MIP benchmark instances.
Perhaps most related to the current study is the work of Alvarez et al. (2014), where a
supervised learning method is applied in an attempt to replace the strong branching rule
with a faster approximation. The method learns from previous strong branching decisions over a set of training instances, and thus should make it possible to estimate the
preference of the strong branching rule based on the instance-, node- and variable-specific
information at hand.
In this chapter, two main approaches for learning of branching rules are proposed and
explained. The former is a more limited approach designed to demonstrate that it is
possible to use a supervised approach to learn a new efficient branching rule from scratch
by using already available information from the SCIP solver as well as variable specific
data. However this rule is designed to be used at the root node only while using the solver’s
default branching rule to solve the rest of the problem. The latter is a non-supervised
approach that is applicable throughout the whole branch-and-bound tree. By this method
branching rules are constructed beforehand from available components, while optimal
weights for the constructed rules are learned from the solution of training instances. Both
approaches use the (1+1)CMA evolutionary strategy for inferring weights.
4.1. (1+1)CMA-ES
Evolution strategies (ES) are stochastic search methods. They belong to the family of
evolutionary algorithms (EA) and evolutionary computation methods for traversing the search spaces
of numerical optimization problems or other complex search spaces (Yu and Gen, 2010).
EAs are inspired by the natural process of evolution. They are implemented so that in
each iteration, a new generation of offspring (or candidate solutions) is generated from
their parents by inducing small stochastic variations (Hansen et al., 2013). The parents
are candidate solutions whose fitness values are already evaluated. After evaluating the
fitness of the offspring the more fit offspring are selected to become the next generation’s
parents.
The (1+1) covariance matrix adaptation evolutionary strategy is a single parent search
strategy. This means that the fitness evaluation is done only once per generation which
can be beneficial if each evaluation of this function is expensive. The parent solution g is
replicated (imperfectly) to produce the next generation h ← g + σ N (0, C). Here σ is a
global step size and C is the covariance matrix of the d-dimensional zero mean Gaussian
distribution. The replication is implemented in three basic steps:
z ∼ N(0, I)    (4.1a)
s ← Dz    (4.1b)
h ← g + σs    (4.1c)
where the covariance matrix has been decomposed into Cholesky factors DD^T. The normally distributed random vector z is sampled from the standard normal distribution
N(0, I). The success probability of this replication is updated as

π_succ ← (1 − η)π_succ + η I_true(υ(h) ≤ υ(g))    (4.2)

where υ(·) is the objective function, or fitness, that needs to be minimized. Here I_true(·)
is the indicator function, which takes the value one if its argument is true and zero otherwise.
The parameter η is the learning rate (0 < η ≤ 1) and is set to η = 1/12. The initial value
of π_succ = 2/11 is also the target success probability π̌_succ. Following the evaluation of the
success probability, the global step size is updated by

σ ← σ exp( (π_succ − π̌_succ) / (δ(1 − π̌_succ)) )    (4.3)
where δ = 1 + d/2. The initial global step size will be problem dependent but should cover
the intended search space. The parameter settings used are those suggested by Suttorp
et al. (2009).
If the replication was successful, that is if υ(h) ≤ υ(g), then h will replace the parent
search point g. Furthermore, the Cholesky factors D will be updated. Initially D and D^-1
are set to the identity matrix and s is set to 0. For c_cov = 2/(d² + 6) and the threshold
probability π̄ = 0.44, the update of the covariance matrix is as follows (Suttorp
et al., 2009):
1. If π_succ < π̄ then set

   s ← (1 − δ^-1)s + √(δ^-1(2 − δ^-1)) Dz    (4.4a)
   α ← 1 − c_cov    (4.4b)

else set

   s ← (1 − δ^-1)s    (4.5a)
   α ← (1 − c_cov) + c_cov δ^-1(2 − δ^-1)    (4.5b)

2. Compute

   γ ← D^-1 s    (4.6a)
   β ← √(1 + c_cov ‖γ‖² / α)    (4.6b)

3. Compute

   D ← √α D + (√α (β − 1) / ‖γ‖²) s γ^T    (4.7)

4. Compute

   D^-1 ← (1/√α) D^-1 − ((1 − 1/β) / (√α ‖γ‖²)) γ (γ^T D^-1)    (4.8)

Computing D^-1 from D requires Θ(d²) time, whereas a factorization of the covariance
matrix requires Θ(d³) time (Suttorp et al., 2009). The Cholesky version of the (1+1)CMA
is therefore computationally more efficient.
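To make the update equations concrete, here is a minimal, illustrative numpy sketch of the (1+1)-Cholesky-CMA-ES described above; it is not the thesis code, and the function and parameter names are my own:

```python
import numpy as np

def one_plus_one_cma(fitness, x0, sigma0=1.0, k_max=500, seed=0):
    """Minimal (1+1)-Cholesky-CMA-ES sketch following eqs. (4.1)-(4.8).
    `fitness` is minimized; parameter settings follow Suttorp et al. (2009)."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    eta = 1.0 / 12.0                 # learning rate for the success probability
    p_target = 2.0 / 11.0            # target success probability
    delta = 1.0 + d / 2.0            # step size damping, delta = 1 + d/2
    c_cov = 2.0 / (d ** 2 + 6.0)     # covariance learning rate
    p_thresh = 0.44                  # threshold probability
    c = 1.0 / delta                  # path cumulation constant (delta^-1 in the text)

    g = np.asarray(x0, dtype=float)  # parent search point
    f_g = fitness(g)
    sigma, p_succ = sigma0, p_target
    D = np.eye(d)                    # Cholesky factor, C = D D^T
    D_inv = np.eye(d)
    s = np.zeros(d)                  # search path

    for _ in range(k_max):
        # Replication, eqs. (4.1a)-(4.1c)
        z = rng.standard_normal(d)
        step = D @ z
        h = g + sigma * step
        f_h = fitness(h)
        success = f_h <= f_g
        # Success probability and step size updates, eqs. (4.2)-(4.3)
        p_succ = (1 - eta) * p_succ + eta * float(success)
        sigma *= np.exp((p_succ - p_target) / (delta * (1 - p_target)))
        if not success:
            continue
        g, f_g = h, f_h
        # Cholesky factor update, eqs. (4.4a)-(4.8)
        if p_succ < p_thresh:
            s = (1 - c) * s + np.sqrt(c * (2 - c)) * step
            alpha = 1 - c_cov
        else:
            s = (1 - c) * s
            alpha = 1 - c_cov + c_cov * c * (2 - c)
        gamma = D_inv @ s
        norm2 = float(gamma @ gamma)
        if norm2 > 0:
            beta = np.sqrt(1 + c_cov * norm2 / alpha)
            D = np.sqrt(alpha) * D + np.sqrt(alpha) * (beta - 1) / norm2 * np.outer(s, gamma)
            D_inv = (D_inv / np.sqrt(alpha)
                     - (1 - 1 / beta) / (np.sqrt(alpha) * norm2) * np.outer(gamma, gamma @ D_inv))
    return g, f_g
```

Maintaining D and D^-1 directly, rather than refactorizing the covariance matrix, is exactly the Θ(d²)-per-update advantage noted above.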
4.1.1. Implementation of (1+1)CMA for Learning of Branching
Rules
The implementation of the (1+1)CMA algorithm for the purpose of this study is outlined
in the pseudocode for Algorithm 1 below.
Algorithm 1: LearnWeightsCMA
input : Set of generated problems P or data χ, a feature map φ : x̂j → R^u and a performance evaluation algorithm EvalFitness()
output: Optimal set of weights ω* found by the (1+1)CMA algorithm for the score function sj = ω^T φ(x̂j)
1   Initialize ω ← ω0, π_succ ← π̌_succ, σ ← σ0, s ← s0, C ← C0, k ← 1;
2   Set υ* ← EvalFitness(P, ω^T φ);   // see algorithms 4 and 5 of sections 4.3 and 4.4 respectively
3   Set ω* ← ω;
4   while k ≤ k_max and σ > σ_min do
5       ω ← Replicate(ω, s, C);   // see eq. 4.1a–4.1c
6       υ ← EvalFitness(P or χ, ω^T φ);
7       π_succ, σ ← Update(π_succ, σ);   // see eq. 4.2–4.3
8       if υ ≤ υ* then
9           s, C ← Update(s, C);   // see eq. 4.4a–4.8
10          υ* ← υ;
11          ω* ← ω;
12      end
13      k ← k + 1;
14  end
15  return ω*
In line 1 of Algorithm 1 all necessary parameters are initialized as described in section 4.1
as well as the iteration counter k and the weight vector ω. The weights are all initialized
to ω0 = 0. In line 2 the fitness value is initially evaluated for weights ω. In each iteration
a new weight vector ω is replicated from its parent in line 5 as described in equations 4.1a
- 4.1c. A corresponding fitness value is then obtained in line 6 followed by updates of πsucc
and σ in line 7 as described in equations 4.2 and 4.3. Should the fitness value υ obtained
in line 6 be less than the currently optimal one υ ∗ , the covariance matrix C is updated
in line 9 followed by updates of the currently optimal fitness value υ ∗ and weights ω ∗ in
lines 10 and 11. The process is repeated until σ is no longer greater than σmin or until
the number of iterations k has exceeded the maximum allowed number of iterations kmax .
The optimal weight vector ω ∗ is then returned.
4.2. Performance Measures and Rule Components
The methods introduced in this chapter are all in some way optimized to maximize the
performance of branching rules. There are three main performance measures available:
total runtime t spent solving an instance, total number of nodes η generated by the branch-and-bound algorithm, and total number of simplex iterations ς needed to solve the LPs
of these nodes. Simplex iterations for strong branching measures are not included in the
performance measurement. The performance of a branching rule over a set of instances
is then the sum of any of the three performance values over all of the instances in the
set. Since the problems used for training are relatively small, each typically taking only a
fraction of a second to solve, the runtime is considered less reliable for training purposes
than the other two measures. Therefore total number of nodes and simplex iterations
were the primary performance measures optimized while creating or training branching
rules. However, since for the supervised method all of the training instances are solved
only once for each branching candidate at the root, and therefore experimenting with time
as a measure of performance is inexpensive, rules were also trained using the runtime as
performance measure when applying the supervised method.
4.2.1. Types of Features Used
Each branching rule created by the methods described in this chapter has a score function
which is a linear combination of rule components, or features in machine learning
terminology. Here is a short description of the main features used or tested
during this study. For a full list of features and descriptions see appendix A. The features
fall into two main categories:
• SCIP branching rule features
These are the upwards and downwards components, qj^+ and qj^-, of branching rules
already implemented in SCIP and described in chapter 3, as well as the maxima
max{qj^+, qj^-}, minima min{qj^+, qj^-} and the product max{qj^-, ε} · max{qj^+, ε}, i.e. the
score according to equation 3.2. Some of the subcomponents that make up the
components qj^+ and qj^- were also used as separate features.
• Problem specific features
These features are instance-specific; they relate to the parameters associated with each variable, the cost coefficient cj and the constraint
coefficients aij and capacities bi, and how these relate to the parameters of other variables. Some of
these features are functions of only the local sub-MIP problem of the specific node.
A few examples of the problem specific features are:
eff: Efficiency measure, e.g. as suggested by Freville and Plateau (1994):

ef_j^p = cj / Σ_{i=1}^m ri Aij    (4.9)
where

ri = (Σ_{j=1}^n Aij − bi) / Σ_{j=1}^n Aij

is a relevance value with respect to the i-th constraint.
ncc: Fraction of all problem constraints a branching candidate appears in,

ζj = (1/m) Σ_{i=1}^m I_true(aij > 0)    (4.10)
fra: The fractionality θ(xj ) of a variable according to equation 3.6.
rcc: Relative constraint coefficient, e.g. the average relative weighted constraint
coefficient:

Āj^(w) = (1/m) Σ_{i=1}^m ri Aij    (4.11)

where ri is as in eq. 4.9.
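For an MKP instance (A, b, c) these example features are straightforward to compute; the following is an illustrative sketch, not the thesis code:

```python
import numpy as np

def mkp_features(A, b, c):
    """Compute the example problem-specific features of eqs. (4.9)-(4.11)
    for an MKP instance with m x n constraint matrix A."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    c = np.asarray(c, dtype=float)
    row_sums = A.sum(axis=1)
    r = (row_sums - b) / row_sums          # relevance r_i of constraint i (eq. 4.9)
    eff = c / (r @ A)                      # efficiency ef_j (eq. 4.9)
    ncc = (A > 0).mean(axis=0)             # zeta_j: fraction of constraints containing j (eq. 4.10)
    rcc = (r[:, None] * A).mean(axis=0)    # average relative weighted coefficient (eq. 4.11)
    return eff, ncc, rcc

# Two resources, two items
eff, ncc, rcc = mkp_features(A=[[2, 4], [1, 3]], b=[3, 2], c=[5, 7])
```

Each returned array has one entry per variable, so these features slot directly into the feature map φ(x̂j) of the linear score functions discussed earlier.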
4.2.2. Data Normalization
As is the case with the HIB branching rule of subsection 3.5.2, normalization of features
is used in this study to make the feature values more comparable across different problems.
The scaling function g(x) used is the same as defined for the HIB rule, or

g : R≥0 → [0, 1),    g(x) = x / (x + 1)    (4.12)
Three different types of normalization constants are employed for this study. The three
resulting normalizations are defined below.

Definition 4.2.1. Historic normalization of θj is defined as:

θ̃j^(Q) = g(θj / θ̄_Q)    (4.13)

where θ̄_Q is a historic (sometimes weighted) average of values of θ calculated for all
variables over the set of nodes Q processed from the start of solving the current problem
instance. The values used for this study are all available from the SCIP solver. If a feature
θ is treated with a historic normalization and θ̄_Q is not available for the feature in SCIP,
an available value of θ̄'_Q, averaged over a feature θ' related to θ, is used instead.
Definition 4.2.2. Given the set I of all (integer/binary) variables of problem instance p
and a given feature θj = θ(xj) evaluated at node Q̌, the full variable set normalization of
θj is defined as:

θ̃j^(I) = g(θj / θ̄_I)    (4.14)
with

θ̄_I = (1/n) Σ_{j∈I} θj    (4.15)
Definition 4.2.3. Given the set F of all fractional variables of problem instance p and a given
feature θj = θ(xj) evaluated at node Q̌, the fractional variable set normalization of θj is
defined as:

θ̃j^(F) = g(θj / θ̄_F)    (4.16)

with

θ̄_F = (1/|F|) Σ_{j∈F} θj    (4.17)
The full variable set normalization is not always applicable for branching-specific
features, and the historic normalization is mostly used for features whose historical data
is available during the SCIP solving process.
For purposes of comparison and analysis in this work, the weight values of a given weight
vector are normalized when presented. For feature i, the weight value of component ωi of
weight vector ω is normalized as ω̃i according to

ω̃i = (dim ω) ωi / ‖ω‖₁    (4.18)
4.3. Supervised Approach for Learning a Root Node
Branching Rule
4.3.1. Collection of Training Data
This approach involves learning a branching rule by collecting examples of branching
candidate variables available at the root node Q0 over a set of training instances P_tr.
The instances were then solved by using the SCIP default rule at the lower nodes after
choosing each of the branching variables in the set F0 of available branching variables at
the root. Then a set χ_tr of training example tuples (φ(x̂j), yj) is generated:

χ_tr = { (φ(x̂j), yj) : j ∈ F0(p), p ∈ P_tr }

where φ : x̂j → R^u is the feature map and φ(x̂j) is a u-dimensional vector of features.
The variable yj is the resulting performance value after branching on the variable x̂j. The
basic procedure is outlined in the pseudocode of Algorithm 2 below.

Algorithm 2: CollectTrainingData
input : A set of training problems P_tr
output: A collection of training examples χ_tr
result: All necessary training data has been collected
1   // Utilizes SCIP solver with parameters set at default values
2   Set χ_tr ← {};
3   for all problems p ∈ P_tr
4       for all variables j ∈ F0(p)
5           while SCIP.IsSolving(p) do
6               if d(Q̌) = 0 then
7                   // d(Q̌) is the depth of the current node Q̌ of the generated branch-and-bound tree
8                   SCIP.Branch(Q̌, j);
9                   // SCIP is forced to branch on candidate j
10              end
11          end
12          χ_tr ← χ_tr ∪ {(φ(x̂j), yj)};
13      end
14  end
15  return χ_tr
At line 6 of Algorithm 2, the LP relaxation of the root node has been solved. SCIP is then
forced to branch on the variable x̂j. In line 12, data specific to variable xj is collected.
Then, following branching on candidate j at the root node, the performance value yj
is collected after solving the remainder of the problem p. Each time the solver restarts
after solving the root node (see section 2.3), all collected data from the previous run is
discarded.
For this study a total of |P_tr| ≈ 12000 training problems were generated, each with
around 12 candidates in F0 on average. The dimension of the vector φ(x̂j), i.e. the number
of original features, is around 50. However, inclusion of derived features increases this
number three-fold.
4.3.2. Inference of Weights
After collecting the training data, the goal is to learn a score function sj = ω^T φ(x̂j) ∈ R.
This score function should then, for any unseen problem, rank the fractional variables
{x̂j : j ∈ F0} by giving potentially better branching candidates increasingly higher score
values. Here ω is a u-dimensional weight vector, so that the learned function sj assigns a
weight ωk to the k-th feature. Learning the score function can be stated as an optimization
problem
y* = minimize_ω Σ_{p∈P_tr} y(j*, p)    (4.19a)
subject to j* = arg max_{j∈F0(p)} ω^T φ(x̂j),    ∀p ∈ P_tr    (4.19b)
Here, y(j, p) is the performance value associated with choosing candidate j at the root
node of problem p. With |P_tr| ≈ 12000 this would indeed be an intractable problem to
solve optimally. Similar ranking problems are at the heart of the study of information
retrieval, a well established field of research (Cossock and Zhang, 2006). These problems
have been tackled using various statistical and machine learning methods (see Liu (2009)
for an overview). However, most of the applications for which these methods have been
developed are on an even larger scale and with unknown or approximated utility functions,
so that certain approximations are in order and often necessary. Also, the problem of eq.
4.19 is a very special case of ranking: it is sufficient to select the best branching candidate,
as the ranking among the other candidates has no impact on the utility. Furthermore, there
are on average only around 12 candidates to choose from per training instance, far fewer
in general than the numbers of objects being ranked in the literature (Liu, 2009). Last, but not
least, because each training instance can have quite different properties from
the next one, the utility function y(ω) being minimized is likely to be noisy compared
to many of the applications found in the literature. For all of these reasons, one might
be discouraged from using methods from the literature that are designed with a different
application in mind. Indeed, a preliminary comparison study, applying regression, pair-wise classification (Kamishima
et al., 2011), minimization of Bayesian risk based on the Plackett-Luce model (Kuo et al.,
2009) and ordinal regression, indicates that solving 4.19
directly by the method presented here gives better results than these other methods tested.
The data set used for this study is sufficiently small for the (1+1)CMA-ES algorithm
to be applied for finding a sufficiently good candidate solution. For practical reasons,
the number of features available is too large for all of them to be used. Also to be
considered is that the risk of overfitting would increase if such a large number of features
were included in the solution. Some sort of feature selection method therefore needs
to be employed to reduce the number of features to an acceptable number. Also, since
ES algorithms have a tendency to overfit (Yu and Gen, 2010), further safeguards should
be applied. A separate validation set, χ_va, can be used to that end. By evaluating
each weight vector returned from the ES algorithm using the validation set, the risk of
overfitting can be reduced. The number of instances in the validation set is around half
that of the training set, or |P_va| ≈ (1/2)|P_tr|. As a still further safeguard (Loughrey and
Cunningham, 2005), restrictions on the number of iterations for the ES algorithm were
set at k_max = max{40, 10(dim ω)} based on preliminary experiments.
4.3.3. An Algorithm for Feature Selection and Training
For the purpose of training and feature selection, an embedded feature wrapper can be
applied (Guyon and Elisseeff, 2003). A popular wrapper method (Ding and Peng, 2005)
for classification sequentially selects features based on maximum relevance and minimum
redundancy. Geng et al. (2007) is an example of a study modifying this principle for the
task of ranking. Since the problem of eq. 4.19 is a very special case of a ranking problem,
this study uses customized methods for the evaluation of candidate features. The HIB
branching rule of subsection 3.5.2 uses all but one of its features as tie-breakers, i.e. the
magnitudes of the corresponding weights of these features are much smaller than the magnitude
of the weight corresponding to its most important feature, the reliability pseudocost score.
For selecting effective features for the problem at hand, and to simultaneously encourage
selection of less important features that might be good tie-breakers, new measures of
feature relevance and redundancy are proposed below.
Definition 4.3.1. Given training data χ collected from training instances P_tr, an already selected set of features φ and an associated weight vector ω, the conditional relevance ψ_f(P_tr, φ, ω)
of feature f ∉ φ is measured as:

ψ_f(P_tr, φ, ω) = Σ_{p∈P_tr} max{ y(ω^T φ, p) − y(ω_f f, p), 0 }    (4.20)

Here y(ω^T φ, p) = y(j*, p), such that j* = arg max_{j∈F0(p)} ω^T φ(x̂j), is the performance value over
instance p, according to training data P_tr, using score function ω^T φ. Also, y(ω_f f, p) is the
corresponding performance value when using f as the score function, with ω_f being either
1 or −1, whichever gives better performance over all instances in P_tr. The conditional
relevance is then an approximate upper limit of how much the feature could potentially enhance
the performance.
Definition 4.3.2. Given training data χ collected from training instances Ptr, a current
selection of features φ with f ∈ φ, and associated weight vector ω, where ω_{f−} is the
weight vector ω with the weight corresponding to feature f set to zero, the redundancy
ξ_f(Ptr, φ, ω) of feature f is defined as:

    ξ_f(Ptr, φ, ω) = − Σ_{p∈Ptr} y(ω_{f−}^T φ, p)    (4.21)

The redundancy of feature f ∈ φ is then simply the negative of the performance value
using ω^T φ as the score function with the element of ω corresponding to f set to zero.
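Both measures only combine precomputed per-instance performance values, so they can be sketched as simple table scans. The following is a minimal sketch; the dictionary layout and function names are illustrative, not taken from the thesis implementation:

```python
def conditional_relevance(y_joint, y_single, instances):
    """Eq. 4.20: sum over instances of max{ y(w^T phi, p) - y(w_f f, p), 0 }.

    y_joint[p]  -- performance of the current score function w^T phi on p
    y_single[p] -- performance of feature f alone (with its better sign) on p
    Performance is minimized, so relevance accumulates only on instances
    where the lone feature outperforms the current score function.
    """
    return sum(max(y_joint[p] - y_single[p], 0.0) for p in instances)


def redundancy(y_without_f, instances):
    """Eq. 4.21: negative total performance with the weight of f zeroed out."""
    return -sum(y_without_f[p] for p in instances)
```

Note that, as the text points out, the single-feature performances y(ω_f f, p) can be computed once, before the wrapper runs, and reused throughout.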
Algorithm 3 below is the pseudocode for a wrapper loosely based on Dang and Croft
(2010) while incorporating the feature evaluation methods defined above. It should be
noted that the quantities y(ωf f, p) can be calculated beforehand and therefore need to be
calculated only once and not at runtime. Algorithm 3 starts with an empty feature subset
and a set of candidate features Φ0. The algorithm consists of a forward selection
phase as well as a backward elimination phase. In each step the algorithm sequentially
adds and subtracts a fixed number of features, one at a time, to and from the currently
best known subset of features φ*. The algorithm uses a separate validation data set for
the evaluation of each feature subset and the corresponding inferred weights, while it uses
the training data set and algorithms 1 and 4 for finding solutions to problem 4.19. For
this study, the maximum allowable number of performance comparisons Rmax and the
number of stepwise evaluations α are set to 24 and 6 respectively.
On line 4 of Algorithm 3, the initially most promising candidate feature f ∗ , according to
the performance resulting from using the candidate as a score function, is pushed onto
queue q and into the pool Υ of already explored solutions. q is a priority queue for storing
solution tuples (φ, ω) of feature sets φ and associated weights ω. In each iteration of the
main loop, the candidate solution that evaluates best is popped off the queue on line 7.
At line 16, in the forward selection phase, a new candidate feature is chosen at random,
where each candidate f is assigned a probability πf+ of being the next addition to the
current feature selection. This probability is calculated according to
    π_f+ = exp{ψ_f(Ptr, φ*, ω*)} / Σ_{f′∈Φ+\f} exp{ψ_{f′}(Ptr, φ*, ω*)}    (4.22)
Each feature is thus assigned a probability in relation to its conditional relevance. This
inherent randomness will keep the algorithm from consistently selecting the next feature
from a small subset of features. Then on line 17, the new feature subset is assigned
weights by using Algorithm 1, taking as input the fitness evaluation algorithm outlined
in the pseudocode of Algorithm 4 below. The tuple of feature subset and corresponding
weights (f ∪ φ∗ , ω+ ) is then pushed on to the queue and the choice of available features is
reduced by f. In an analogous fashion, on line 25, the backward elimination phase assigns
each feature f belonging to the current feature subset φ* a probability π_f− of being
selected next for elimination, according to the redundancy of f:
    π_f− = exp{ξ_f(Ptr, φ*, ω*)} / Σ_{f′∈φ*\f} exp{ξ_{f′}(Ptr, φ*, ω*)}    (4.23)
The resulting feature subset φ∗ \ f with inferred weights ω− are then pushed onto the
queue in the same way as in the forward selection phase.
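The exponentiated-score draws of eqs. 4.22 and 4.23 are a softmax over the relevance or redundancy values. A minimal sketch of such a weighted random draw (names illustrative; the max is subtracted only for numerical stability and does not change the distribution):

```python
import math
import random


def softmax_pick(scores, rng=random):
    """Draw one key with probability proportional to exp(scores[key])."""
    keys = list(scores)
    mx = max(scores[k] for k in keys)  # stabilize: exp of large values overflows
    weights = [math.exp(scores[k] - mx) for k in keys]
    threshold = rng.random() * sum(weights)
    acc = 0.0
    for key, w in zip(keys, weights):
        acc += w
        if threshold <= acc:
            return key
    return keys[-1]  # guard against floating point rounding
```

A feature with a much larger relevance score is drawn almost always, but every candidate retains a nonzero probability, which is exactly the randomness the wrapper relies on to avoid repeatedly picking from the same small subset.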
The pseudocode of Algorithm 3 is a slight simplification. Instead of selecting a single
feature for forward selection at line 16, a cluster of features is selected, all of which have
the same or very similar performance when each is used as a score function in 4.19.
After the cluster of features f_y is assigned a probability π_{f_y}+ on line 16, the most
promising feature f* of cluster f_y is subsequently chosen to be included in φ* according to
    f* = arg min_{f′∈f_y} min{ Σ_{p∈Ptr} y(s*_{−f′}, p), Σ_{p∈Ptr} y(s*_{+f′}, p) }    (4.24)
Algorithm 3: TrainingWrapper
input : Training data χtr, validation data χva, maximum allowable number of
        performance comparisons Rmax, a set of candidate features Φ, number of
        stepwise evaluations α
output: A set Υ of tuples (φ, ω) of feature sets φ and associated weights ω

 1  Set ω ← {}, q ← {}, Υ ← {};
 2  Set R ← 0, ybest ← ∞;
 3  Evaluate f* ← arg min_f Σ_{p∈Ptr} y(ω_f f, p);
 4  Set q ← (f*, ω_f*), Υ ← (f*, ω_f*);
 5  while R < Rmax do
 6      (φ*, ω*) ← arg min_{(φ,ω)∈q} Σ_{p∈Pva} y(ω^T φ, p),  y* ← Σ_{p∈Pva} y((ω*)^T φ*, p);
 7      q ← q \ (φ*, ω*);
 8      if y* ≥ ybest then
 9          R ← R + 1;
10      else
11          ybest ← y*;
12      end
13      Φ+ ← Φ0 \ {f : f ∪ φ* ∈ Υ};
14      Set counter ← 0;
15      while counter < α and Φ+ ≠ {} do
16          Choose at random a single feature f from Φ+ with probability
            π_f+ ∼ ψ_f(Ptr, φ*, ω*);
17          Set ω+ ← LearnWeightsCMA(χtr, f ∪ φ*, EvalFitnessSuper);
18          Add q ← q ∪ (f ∪ φ*, ω+), Υ ← Υ ∪ (f ∪ φ*, ω+);
19          Φ+ ← Φ+ \ f;
20          counter ← counter + 1;
21      end
22      Φ− ← {f : f ∈ φ* ∧ {φ* \ f} ∉ Υ};
23      Set counter ← 0;
24      while |φ*| > 2 and counter < α and Φ− ≠ {} do
25          Choose at random a single feature f from Φ− with probability
            π_f− ∼ ξ_f(Ptr, φ*, ω*);
26          Set ω− ← LearnWeightsCMA(χtr, φ* \ f, EvalFitnessSuper);
27          Add q ← q ∪ (φ* \ f, ω−), Υ ← Υ ∪ (φ* \ f, ω−);
28          Φ− ← Φ− \ f;
29          counter ← counter + 1;
30      end
31  end
32  return Υ
where

    s*_{±f′} = s* ± f′/|φ*|    (4.25)

is a weighted score function estimating the impact on performance of adding the feature
f′ to the current feature selection φ*.
Algorithm 4 below shows the fitness evaluation function of Algorithm 3, used as an input
to the call of the (1+1)CMA-ES algorithm, Algorithm 1, on lines 17 and 26.
Algorithm 4: EvalFitnessSuper
input : Data χ(P) collected from problem set P, a score function s = ω^T φ
output: Evaluated fitness value υ

1  Set υ ← 0;
2  for all problems p ∈ P
3      Performance ← y(j*, p);
4          // subject to j* = arg max_{j∈F0(p)} ω^T φ(x̂_j)
5      υ ← υ + Performance;
6  end
7  return υ
The time complexity of Algorithm 4 is O(Σ_{p∈P} F0(p)). Therefore, neglecting the
covariance matrix update cost, the total time complexity of Algorithm 3 per inferred
weight vector ω is O(kmax(ω) × Σ_{p∈P} F0(p)). The cost of Algorithm 2, the data
collection, is not included in this calculation. That cost is substantial, but common for
all the weight vectors of the returned set Υ, and is in runtime equal to
t2 = Σ_{p∈P′} Σ_{j∈F0(p)} t(j, p), where t(j, p) is the runtime associated with choosing
candidate j at the root node of instance p, and P′ = Ptr ∪ Pva. This cost is, however,
much less than the cost of the method introduced in the next section.
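Since Algorithm 4 only looks up labels collected beforehand, it reduces to a scan over the training data. A minimal sketch under an assumed data layout (names and layout are illustrative, not the thesis code):

```python
def eval_fitness_super(data, w):
    """Algorithm 4 as a table scan: sum y(j*, p) over problems, where
    j* = argmax_j w . phi(x_j) among the candidates of problem p.

    data: one list per problem; each entry is a pair (phi_j, y_j) with
    phi_j the feature vector of candidate j and y_j the performance
    recorded when branching on j at the root node of that problem.
    """
    fitness = 0.0
    for candidates in data:
        # pick the candidate the score function w^T phi ranks highest
        _, best_y = max(
            candidates,
            key=lambda cand: sum(wi * fi for wi, fi in zip(w, cand[0])))
        fitness += best_y
    return fitness
```

Each evaluation is linear in the total number of candidates, which is what makes the supervised search so much cheaper than solving the instances anew for every weight vector.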
4.4. Non-Supervised Black Box Approach for Learning
of Branching Rules
In this section a method for learning branching rules is presented, one that is to be used
not only at the root node but for the entire solving process. For this approach the
evolutionary algorithm is applied directly, i.e. instead of collecting labelled training data
beforehand, the fitness evaluation of lines 2 and 6 of Algorithm 1 is performed by directly
solving the training problems for each new instance of the weight vector ω. Another key
difference is that the features to be used as building blocks of the learned rules are chosen
in advance by the researcher. As can be seen from the pseudocode of Algorithm 5 below,
this method requires much larger computation time than the method presented in
section 4.3.
Algorithm 5: EvalFitnessNonSuper
input : Set of problems P, a score function s = ω^T φ
output: Evaluated fitness value υ

1  // Utilizes SCIP solver with parameters set at default values
2  Set υ ← 0;
3  for all problems P ∈ P
4      SCIP.SetScoreFunction(s);
5      Performance ← SCIP.Solve(P);
6      υ ← υ + Performance;
7  end
8  return υ
The pseudocode of Algorithm 5 above shows the fitness evaluation procedure used for
the approach. The algorithm returns the sum total of measured performances over all
generated problems P in problem set P. Neglecting the covariance matrix updates, the
computation time needed for creating (training) a branching rule is ttr = t̄ × |Ptr | × kfinal ,
where t̄ is the average runtime over all problems in Ptr over all iterations and kfinal is the
total number of iterations performed by the evolutionary algorithm.
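Because every fitness evaluation is now a full solver run, the procedure is naturally expressed around an injected solver call. A minimal sketch in which `solve` is a stand-in for the actual SCIP invocation (this is not the thesis implementation, which works through SCIP's C interface):

```python
def eval_fitness_nonsuper(problems, score_func, solve):
    """Sketch of Algorithm 5: solve every training problem with the
    candidate score function installed and accumulate the measured cost.

    solve(problem, score_func) stands in for the SCIP call: it should
    install score_func as the branching score function, solve the problem
    with all other parameters at their defaults, and return the chosen
    performance measure (e.g. node count or simplex iterations).
    """
    fitness = 0.0
    for p in problems:
        fitness += solve(p, score_func)
    return fitness
```

The evolutionary loop then minimizes this sum directly, so the cost per candidate weight vector is |Ptr| solver runs rather than a table scan.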
As is the case with the ideal measure of performance, the ideal number of training problems
|Ptr| is unknown. Also unknown are the ideal combination and number of features for the
score function of the evolved branching rules. Therefore, for the purpose of this experiment,
these components were varied and multiple rules were evolved for each combination. The
number of problems used for training was |Ptr| ∈ {1, 5, 10} × 10^3, while the feature
combinations were chosen relatively ad hoc, giving four types of rules. The feature
components of these types are shown in table 4.1 below, as well as the components of the
HIB and RB rules, which are used for comparison in the experimental study.
The shorthand names of table 4.1 stand for:
• rel - Reliability branching features
• psc - Pseudocost branching features
• con - Conflict branching features
• conl - Conflict length branching features
• inf - Inference branching features
• cut - Cutoff branching features
Table 4.1: Feature components for each type of evolved branching rule. For each rule
type, HIB, RB, EB1, EB2, EB3 and EB4 (with dim ω = 5, 5, 5, 8, 16 and 5 respectively),
the table marks which of the features rel, psc, con, conl, inf, cut, ncc, fra and eff enter
as up/down, min/max or score components, together with the default weight values of
the HIB and RB rules and the normalization applied to each feature.
• ncc - Fraction of problem constraints the variable xj appears in
• fra - Fractionality or θ(xj ) = min {fj− , fj+ }
• eff - Item efficiency measure as suggested by Freville and Plateau (1994)
A checkmark in a row of table 4.1 labeled "up/down" corresponds to the q_j+, q_j−
components of each branching rule being used, a checkmark in a row labeled "min/max"
corresponds to the minimum and maximum of these, and a checkmark in the "score" row
means the score according to equation 3.2.
The evolved rules of the type EB1 have the same features as SCIP's default hybrid
inference/reliability branching rule, as well as the conflict length branching score. While
pseudocost measures are hybridized with strong branching measures in reliability
branching, the pseudocost branching quantities used in rules of types EB2 and EB3 are
"pure" pseudocosts, as used in the pseudocost branching rule described in section 3.2. Also
used were the scores of the conflict, conflict length, inference and cutoff branching rules,
as is the case with the SCIP hybrid reliability/inference branching rule described in
subsection 3.5.2. Further features are the fractionality, the fraction of problem constraints
that a variable appears in, and the item efficiency measure of Freville and Plateau (1994).
The ncc and eff features were included as special knapsack-related features, on the theory
that they could have an impact on branching rules used to solve knapsack problems. Also
shown in table 4.1 are the default weight values for the HIB and RB rules. The low
weight values for three
of the features of the HIB rule, when compared to the reliability branching score weight
value, are indicative of these features being mainly used for tie-breaking.
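This tie-breaking mechanism can be illustrated with a toy score function: with a dominant weight of 1 and secondary weights several orders of magnitude smaller (the magnitudes below are illustrative of the HIB defaults, not the exact values), the secondary features only decide between candidates whose dominant scores tie:

```python
def hib_like_score(rel, con, inf, w=(1.0, 1e-2, 1e-4)):
    """Weighted sum in which the small-weight terms act as tie-breakers."""
    return w[0] * rel + w[1] * con + w[2] * inf


# two candidates tied on the dominant reliability score:
a = hib_like_score(rel=0.7, con=0.9, inf=0.1)  # higher conflict score
b = hib_like_score(rel=0.7, con=0.2, inf=0.9)
# a > b: the conflict score breaks the tie

# a clear difference in the dominant score overrides the tie-breakers:
c = hib_like_score(rel=0.8, con=0.0, inf=0.0)
# c > a
```

As long as the feature values are normalized to comparable ranges, the weight ratios guarantee that a secondary feature can never overturn a genuine difference in the dominant one.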
Since runtime was considered unreliable for training purposes on such small training
problems, the goal was to minimize the total number of nodes η and/or the total number
of simplex iterations ς needed, both of which are known to be highly correlated with the
runtime spent solving the problems. Both performance measures were used as fitness
measures for creating rules, and a number of rules were evolved for each case. In total
|R| = 146 rules were created with the evolutionary algorithm, where R is the set of all
rules evolved for this study. Table 4.2 shows the number of rules r ∈ R evolved by type
of rule and the number of problems |Ptr| used for training. The count for each group of
rules is split according to which type of fitness measure was used for evaluation.
Table 4.2: Number of rules evolved for each type of rule, fitness measure used and number
of problems used for training

rule   |Ptr|   #{r : υ = η}   #{r : υ = ς}
EB1    1000    9              9
       5000    6              7
       10000   3              -
EB2    1000    9              9
       5000    7              9
EB3    1000    9              8
       5000    6              6
EB4    1000    7              9
       5000    8              9
       10000   7              9
The distribution of weight values of the |R| rules returned by the ES algorithm is summarized by the box plot of fig. 4.1. Positive weight values are associated with features
whose increase in value is associated with decrease in fitness value (better performance).
The ncc feature, which is a part of rules EB2 and EB3 , has negative weight values for
most of the evolved rules of that type. This indicates that variables involved in fewer
constraints are, on average, better choices than others, at least while solving the instances
of the training set. Negative weight values are more prominent for the cutoff branching
score and the conflict length score, especially when used in the EB3 rule. The fractionality,
efficiency and conflict length score features seem to be the least effective when it comes
to minimizing the fitness value, having the corresponding weights distributed around
zero. The weights relating to the psc(·) and rel(·) features seem to have the largest
impact on the fitness value. The weights of these pseudocost and strong branching related
features are mostly positive, as would be expected. For both the pseudocost and the
reliability branching features, the upwards measure appears more effective than the
downwards measure, as do the maxima compared to the corresponding minima.
These differences are more pronounced for the reliability branching features. Except for
the reliability branching score, all of the score features used by the HIB rule have low
apparent effectiveness, which coincides with their tie-breaking status in the HIB rule.

Figure 4.1: Distribution of evolved weight values returned for each feature by type of rule
4.5. Generation of Training Instances
This study utilizes randomly generated multidimensional knapsack instances. Instances
were generated with methods similar to those used for the benchmark instances of
Puchinger et al. (2010). The main difference is that in this study instances with higher
sparsity of the constraint matrix are used. All generated problem instances have the form
    z* = maximize  c^T x                                   (4.26a)
         s.t.      Ax ≤ b                                  (4.26b)
                   x ∈ {0, 1}^n,                           (4.26c)
         c ∈ R^n_{≥0},  b ∈ N_0^m,  A ∈ N_0^{m×n}          (4.26d)
The coefficients were randomly generated as

    a_ij ∼ U{0, amax}                                      (4.27a)
    b_i = ⌊(τ/100) Σ_{j=1}^{n} a_ij⌋                        (4.27b)
    c_j = (1/m) Σ_{i=1}^{m} a_ij + (ρ/100) u_j amax         (4.27c)
    u_j ∼ U(0, 1)                                          (4.27d)
Here U is the uniform distribution and τ is the so-called tightness ratio. The parameter ρ
governs the degree of correlation between the objective coefficients c_j and the constraint
coefficients a_1j through a_mj. Instance coefficients and sizes were all combinations of:
n ∈ {30, 32, 34, ..., 48}
m ∈ {5, 6, 7, 8}
τsv ∈ {25, 35, 45, 55, 65, 75}
τnsv ∈ {65, 70, 75, 80, 85}
ρ ∈ {40, 45, 50, 55, 60}
The set of values of τ for the non-supervised method, denoted τnsv, differs from that of
the supervised method, denoted τsv. One training instance was generated for each
combination of parameter values, for a total of 1200 instances for the supervised method
and 1000 instances for the non-supervised method.
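The generation scheme of eqs. 4.26 and 4.27 can be sketched as follows (a pure-Python sketch; variable names mirror the equations, and the function name is illustrative):

```python
import random


def generate_mkp(n, m, tau, rho, amax, rng=None):
    """Random multidimensional knapsack instance following eq. 4.27."""
    rng = rng or random.Random()
    # a_ij ~ U{0, amax}  (discrete uniform constraint coefficients)
    A = [[rng.randint(0, amax) for _ in range(n)] for _ in range(m)]
    # b_i = floor((tau/100) * sum_j a_ij)  -- tau is the tightness ratio
    b = [int((tau / 100.0) * sum(row)) for row in A]
    # c_j = (1/m) sum_i a_ij + (rho/100) * u_j * amax,  u_j ~ U(0, 1)
    c = [sum(A[i][j] for i in range(m)) / m
         + (rho / 100.0) * rng.random() * amax
         for j in range(n)]
    return c, A, b
```

A larger τ loosens the constraints (more items fit), while a larger ρ adds more noise to the otherwise constraint-correlated objective coefficients.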
4.6. Summary
This chapter outlines in detail the methods developed for this thesis. The (1+1)CMAES algorithm is explained and its central role in the methods used to design branching
rules, both embedded in a feature selection wrapper in a supervised method and as a tool
for inferring weights for branching rules constructed from features selected beforehand.
The performance measures used for fitness evaluation during the evolutionary search
are explained as well as the features used and how they are normalized for increased
compatibility between different problem instances.
In the next two chapters, the developed methods are extensively tested and their performance compared against that of the current state-of-the-art branching rules.
5. Experimental Study: Supervised Method
In this chapter, the results of the experiments, as well as the experimental setup for the
supervised method, are introduced and analysed. For analysis of the performance of the
branching rules created during this study, a number of test instances were generated and
solved by implementing the rules in SCIP.
5.1. Experimental Setup
As a measure of comparison, two branching rules implemented in SCIP, reliability
branching (RB) and the default SCIP rule, hybrid reliability/inference branching (HIB),
were also applied to the solution of the instances, as was random branching (RDB). As
the name implies, the random branching rule chooses the next branching variable at
random. During testing, all SCIP parameters, except those which govern the choice of
branching rule, were set at their default values.
Some of the instances used for testing were generated using the same generation
parameters (see section 4.5) as the instances used for training. Further instances,
generated using a different choice of generation parameters, were used for evaluation of
rule performance over more general classes of instances.
The testing instances were split into problem sets according to which sets of combinations
(table 5.1) of the parameters n, m, τ, ρ and amax were used to generate them. The sets of
parameters n, m, τ and ρ used in this study can be seen in table 5.1 below.
Table 5.2 shows the parameters by which each problem set was generated. Npar is the
number of problems generated for each combination of parameters, for a total of |Pte|
problems generated per problem set. The third problem set consists of unbounded
(integer) knapsack problems, while the second set has less sparse constraint coefficient
matrices than do instances of the other sets. The first set is composed of instances
generated with exactly the same combinations of parameters as the training instances,
while the fourth set comprises similar instances of larger size.
For testing, five rules were chosen for each performance measure used for training. These
Table 5.1: Combinations of parameters used for generation of testing instances for
supervised method

Set    Parameters
n48    n ∈ {30, 32, 34, ..., 48}
n53    n ∈ {35, 37, 39, ..., 53}
n98    n ∈ {90, 92, 94, 96, 98}
m8     m ∈ {5, 6, 7, 8}
m10    m ∈ {7, 8, 9, 10}
τ75    τ ∈ {25, 35, 45, ..., 75}
ρ60    ρ ∈ {40, 45, 50, 55, 60}
Table 5.2: Combinations of parameters used for generation of testing instances for
supervised methods

Set nr.  |Pte|  Npar  n     m    τ    ρ    amax
1        1200   1    n48   m8   τ75  ρ60  9
2        1200   1    n48   m8   τ75  ρ60  999
3*       1200   1    n53   m8   τ75  ρ60  9
4        1200   1    n98   m10  τ75  ρ60  9
Tr†      1200   1    n48   m8   τ75  ρ60  9

* Unbounded integer knapsack instances
† Training instances
were rules that had the best validation results among rules trained using the same
performance measure; however, some variety in rule structure was maintained among the
chosen rules. To investigate whether the rules were effective in the branch-and-bound tree
at depths exceeding the root node, i.e. for d > 0, the maximum depth d̂ at which the
rules would be applied was varied as d̂ ∈ {0, 2, 4, ∞}, with d̂ = ∞ denoting usage of the
new rules throughout the whole tree.
Table 5.3: List of rules, generated with time as performance measure, used for testing
and their feature make-up. For rules t1 through t5, the columns give the weight values of
the ncc, che, psd, psc, rel, eff, rcc(Q̌), fra and d features used by each rule; the first line,
labeled RCC, shows the single-feature rcc(Q̌) rule.
Tables 5.3 through 5.5 show these rules and their feature component structure as well
as the corresponding weight values. A +/− subscript denotes an up-/downward feature
type, an av, min or max subscript denotes an average, minimum or maximum feature type
respectively, a subscript a stands for active constraint and a subscript w stands for a
weighted feature type. For further explanation of the various feature types, see appendix A.
Also denoted are the feature normalizations in each case. An overscore denotes a historic
normalization (see subsection 4.2.2), a tilde denotes a fractional variable set normalization,
while a circumflex denotes a full variable set normalization. For comparison, a rule made
up only of the single feature with the most prediction capability (see appendix A), rcc(Q̌),
was also tested. The first line of table 5.3, labeled RCC, shows the simple structure of that
rule.
Table 5.4: List of rules, generated with nodes as performance measure, used for testing
and their feature make-up. For rules η1 through η5, the columns give the weight values
of the ncc, che, psd, lpv, psc, rel, rcc and rcc(Q̌) features used by each rule.
All of the rules used for testing rely heavily upon the rcc features or the derived
efficiency. Among the rules, the maximum of the constraint values of the rcc is used
in most cases, but so are the minimum and the average. These features tend to be
associated with weight values that are large in magnitude. Another class of constraint
related features, the ncc features, is also a prominent rule component, with most associated
weight values being negative. All of the rules tested, except for the simple RCC rule, rely
upon pseudocost related features, i.e. either the pseudocost dive psd, the pseudocost psc,
the average pseudocost psh, the child estimate che, or the reliability features rel.
Interestingly, only one of the rules has either the rel or psc score type feature, which,
rather than the up/down or max/min types, is used by default in SCIP. Also of interest
is that six of the rules do not use any type of strong branching feature of the rel class,
so they do not rely on strong branching at all. The feature selection algorithm of
Algorithm 3 was encouraged to select pseudocost features over reliability features if the
usefulness or relevance (see definition 4.3.1) was measured as being similar. An important
detail here is that even though the training data was collected at the root node only, the
pseudocosts are in fact already initialized (Achterberg, 2007) through application of
primal heuristics as well as restarts (see section 2.3). For rules that do have strong
branching features, the pseudocosts are very similar to the corresponding rel features,
since the strong branching greatly influences the pseudocosts. For the implementation of
the inferred rules, special care was taken to keep the control of the number of strong
branching iterations performed and candidates initialized identical to that of the RB and
HIB rules (see section 3.3 and subsection 3.5.1).
Table 5.5: List of rules, generated with iterations as performance measure, used for testing
and their feature make-up. For rules ς1 through ς5, the columns give the weight values of
the ncc, psh, che, psc, rel, eff, rcc and rcc(Q̌) features used by each rule.
5.2. Results
In table 5.6 below, the results for test sets 1 through 4 are presented. Except for the three
comparison rules for each of the test sets, the results shown for the supervised rules are
those for which the setting of the maximum applicable depth, d̂, gave the best performance
for each rule. The full test results can be seen in appendix C. Also shown in table 5.6 are
the average training and validation performances, ȳ′tr and ȳ′va respectively, where for rule
r, ȳ′(r) = ȳ(r)/ȳ(HIB) is the ratio of the average performance values of rule r and the
HIB rule. Here ȳ1000 stands for average performance in units of thousands, with η denoting
number of nodes, ς number of simplex iterations and t runtime. Rule labels are of the
form y_k, where y stands for the performance measure optimized when creating the rule
and k is a rule index number. Shown in bold are results for rules that outperform both
the HIB and RB rules. Overscored numbers state a difference in median from the better
performing SCIP rule at the 95% significance level using the Wilcoxon signed-rank test,
with the resulting p-values not adjusted for multiple comparisons. All values for nodes
and iterations are shown in thousands, while the runtime is shown in seconds. As can be
seen in table 5.6, the rules generated by the supervised method outperform the comparison
rules in most cases over test set 1. What is more, they are more effective if applied below
their intended range at the root node. The same can be said for test sets 2 and 4, where
most of the rules generated without using time as a measure of performance outperform
the comparison rules, while the maximum depth of successful application is generally less
than that for test set 1. For test set 2, and especially test set 3, there is a tendency among
the rules to generate fewer nodes than the comparison rules while not performing fewer
simplex iterations or spending less time solving.
Table 5.6: Test results for test sets 1-4, for all tested rules with d̂ set at optimal level in
each case

                         Set 1                    Set 2                    Set 3                    Set 4
Rule   ȳ′tr   ȳ′va   d̂  η̄1000 ς̄1000 t̄      d̂  η̄1000 ς̄1000 t̄      d̂  η̄1000 ς̄1000 t̄      d̂  η̄1000 ς̄1000 t̄
RB     -      -      -  0.479  1.97  0.311  -  1.17   4.53  0.531  -  0.837  2.72  0.381  -  15.0   64.2  5.60
HIB    -      -      -  0.479  1.98  0.311  -  1.17   4.54  0.532  -  0.830  2.68  0.376  -  14.9   63.3  5.53
RDB    -      -      0  0.475  2.07  0.314  0  1.20   4.76  0.544  0  0.809  2.72  0.380  0  15.2   65.2  5.68
RCC    -      -      2  0.431  2.00  0.301  2  1.11   4.65  0.527  4  0.785  2.71  0.372  4  14.9   64.3  5.58
t1     0.982  0.985  2  0.451  1.91  0.310  2  1.17   4.50  0.534  2  0.818  2.68  0.382  2  14.8   63.1  5.54
t2     0.983  0.986  4  0.446  1.89  0.307  2  1.16   4.50  0.533  2  0.811  2.69  0.381  4  14.8   62.4  5.51
t3     0.989  0.986  0  0.459  1.94  0.309  0  1.18   4.59  0.538  0  0.809  2.69  0.381  0  15.0   63.8  5.59
t4     0.983  0.986  4  0.442  1.86  0.306  2  1.16   4.49  0.534  0  0.796  2.65  0.375  4  14.8   62.9  5.52
t5     0.989  0.987  2  0.428  1.97  0.300  0  1.18   4.69  0.537  0  0.791  2.65  0.373  0  15.0   64.7  5.64
η1     0.969  0.972  0  0.459  1.95  0.310  0  1.18   4.60  0.538  0  0.803  2.66  0.378  0  14.9   63.9  5.59
η2     0.971  0.972  4  0.445  1.91  0.310  2  1.18   4.54  0.537  0  0.803  2.66  0.376  2  14.6   62.6  5.48
η3     0.970  0.972  2  0.424  1.98  0.301  2  1.10   4.60  0.524  0  0.800  2.67  0.376  2  14.4   61.7  5.40
η4     0.972  0.973  2  0.448  1.91  0.310  0  1.18   4.60  0.538  0  0.796  2.64  0.376  2  14.8   63.2  5.54
η5     0.974  0.975  4  0.400  2.07  0.298  0  1.13   4.51  0.526  2  0.811  2.73  0.378  2  14.5   62.3  5.44
ς1     0.972  0.975  2  0.422  1.94  0.299  0  1.15   4.57  0.531  0  0.805  2.68  0.378  4  14.4   62.2  5.43
ς2     0.973  0.975  2  0.425  1.94  0.301  0  1.14   4.52  0.527  2  0.800  2.69  0.376  2  14.5   62.4  5.44
ς3     0.974  0.977  0  0.461  1.96  0.310  2  1.15   4.46  0.529  2  0.822  2.70  0.383  0  14.8   63.2  5.53
ς4     0.973  0.977  4  0.445  1.88  0.309  2  1.12   4.35  0.522  0  0.822  2.72  0.385  2  14.7   62.4  5.48
ς5     0.975  0.977  2  0.449  1.89  0.309  0  1.18   4.61  0.539  0  0.820  2.72  0.384  2  14.7   62.7  5.50
|Pte|                   1200                   1200                   1200                   600
5.3. Summary and Discussion
The results of the section above show that it is possible to apply a supervised method to
enhance branching rules, even though the training is performed on labelled data gathered
at the root node only. Perhaps surprisingly, the rules are effective at greater depths of
the branch-and-bound tree than at the root node. The rules that do not use strong
branching, t5, η3, η5, ς1, ς2, ς3 and even RCC, do not seem to perform any worse than
the rules that do. This is interesting, since strong branching is the current state of the
art in modern solvers (Lodi et al., 2011). The rules generated by optimizing 4.19 over
runtime t do seem to perform more poorly than the other rules when applied to the larger
instances of test set 4. This could be an indication of the latter rules being more effective
on instances where the trees can potentially become much larger, their tendency to reduce
the number of nodes or iterations thus becoming more important. In other words,
generating rules that minimize runtime on instances as small as the training instances
might not be an effective strategy for building rules that cut down the runtime of larger
instances.
6. Experimental Study: Non-Supervised Method
In this chapter, the results of the experiments, as well as the experimental setup for the
non-supervised method, are introduced and analysed. First, validation sets were generated
and used to filter out the most promising rules generated by the method. This final set of
rules was then tested on separate test sets and their performance compared against the
SCIP default branching rules.
6.1. Experimental Setup
As in section 5.1, the reliability branching rule (RB) and the hybrid reliability/inference
branching rule (HIB) were applied to the solution of the instances for comparison, as was
random branching (RDB). Again, during testing, all SCIP parameters, except those which
govern the choice of branching rule, were set at their default values.

The generation and choice of test instances was very similar to the case of the experimental
setup for the supervised method. The sets of the parameters n, m, τ, ρ and amax used were
only slightly different; the sets of parameters n, m, τ and ρ used can be seen in table 6.1
below. However, an additional fifth class of instances has been added. This class contains
literature-sized instances having similar values and combinations of parameters as the
instances used in Puchinger et al. (2010). The sets of instances used for this study are
listed in table 6.2 below.
Before solving the test sets, four independent and separate validation sets were generated
using the same parameters used for generating test sets 1 through 4. Each validation set
had a size of |Pva| = 1000. These were then solved by all of the evolved rules. The best 12
rules according to each performance measure fυ ∈ {η, ς, t} over the validation results
were then chosen for further testing. The best performing rules were chosen as the ones
with the highest rank of all rules, with respect to the performance value, averaged over
all the validation sets. For each performance measure, each rule r ∈ R was thus given
rank according to

    O_y(r) = rank_R( (1/4) Σ_{l=1}^{4} rank_R(y_l(r)) )    (6.1)
Table 6.1: Combinations of parameters used for generation of testing instances for
non-supervised methods

Set    Parameters
n48    n ∈ {30, 32, 34, ..., 48}
n58    n ∈ {40, 42, 44, ..., 58}
n108   n ∈ {90, 92, 94, ..., 108}
n500   n ∈ {100, 250, 500}
m8     m ∈ {5, 6, 7, 8}
m10    m ∈ {7, 8, 9, 10}
m30    m ∈ {5, 10, 15, 20, 25, 30}
τ85    τ ∈ {65, 70, 75, 80, 85}
τ75    τ ∈ {25, 50, 75}
ρ60    ρ ∈ {40, 45, 50, 55, 60}
Table 6.2: Combinations of parameters used for generation of testing instances for non-supervised methods

Set nr.   |Pte|   Npar   n      m     τ     ρ     amax
1         1000    1      n48    m8    τ85   ρ60   9
2         1000    1      n48    m8    τ85   ρ60   999
3∗        1000    1      n58    m8    τ85   ρ60   9
4         1000    1      n108   m10   τ85   ρ60   9
5         540     10     n500   m30   τ75   50    9
Tr†       1000    1      n48    m8    τ85   ρ60   9

∗ Unbounded integer knapsack instances
† Training instances
where yl(r) is the performance value of rule r over validation set l and with

$$\operatorname{rank}_R(\pi(r)) = 1 + \big|\{s \in R : \pi(s) < \pi(r)\}\big| + \tfrac{1}{2}\,\big|\{s \in R : \pi(s) = \pi(r) \wedge s \neq r\}\big| \tag{6.2}$$
being the fractional ranking of rule r over measure π. The lowest value of Oy (r) corresponds to the highest rank. Since some of the rules were among the highest ranking for
more than one performance measure, the subset Rte ⊂ R of rules used for testing was of
size |Rte | = 29 rules. However, for the purpose of performance analysis, all of the rules in
R were tested on test set 5, although these results were not used for rule validation.
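The selection procedure of eqs. 6.1 and 6.2 can be sketched as follows. This is a minimal illustration; the function names `fractional_rank` and `overall_rank` are ours:

```python
def fractional_rank(values):
    """Fractional ranking of eq. 6.2: one plus the number of strictly
    smaller values, plus half the number of other tied values."""
    ranks = []
    for i, x in enumerate(values):
        smaller = sum(1 for v in values if v < x)
        ties = sum(1 for j, v in enumerate(values) if v == x and j != i)
        ranks.append(1 + smaller + 0.5 * ties)
    return ranks

def overall_rank(per_set_values):
    """Eq. 6.1: rank the rules by their mean per-set rank.

    per_set_values[l][j] is the performance of rule j on validation set l
    (four sets in the experiment above)."""
    per_set_ranks = [fractional_rank(vs) for vs in per_set_values]
    n_rules = len(per_set_values[0])
    means = [sum(r[j] for r in per_set_ranks) / len(per_set_ranks)
             for j in range(n_rules)]
    return fractional_rank(means)
```

For example, `fractional_rank([3, 1, 1, 2])` yields `[4.0, 1.5, 1.5, 3.0]`: the two tied smallest values share the average of ranks 1 and 2.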
6.2. Analysis of Validation Results
Naturally, many of the evolved rules exhibit larger relative differences in performance from the SCIP default rules over the more computationally expensive problems of each set. For this reason, all of the validation sets as well as the test sets are divided into equally sized subsets of "easy" and "hard" problems according to a weighted average of problem-specific runtimes. For problem p of validation/test set l ∈ {1, 2, 3, 4} the weighted average t̂l(p) is thus

$$\hat{t}_l(p) = \tfrac{1}{4}\, t_l(\mathrm{HIB}, p) + \tfrac{1}{4}\, t_l(\mathrm{RB}, p) + \tfrac{1}{2}\,\frac{1}{|R|}\sum_{r \in R} t_l(r, p) \tag{6.3}$$
where tl(HIB, p) and tl(RB, p) are the runtimes for problem p for the HIB and RB rules respectively. A problem is then classified as "easy" if

$$\hat{t}_l(p) < \hat{t}_l^{\mathrm{Med}} \tag{6.4}$$

and otherwise as hard, where t̂l^Med is the median value over all problems p. The rationale behind eq. 6.3 for determining instance hardness is that the comparison rules should weigh equally against the evolved rules, so that a "hard" instance is hard, on average, for both the evolved rules and the comparison rules. For test set 5, the instances that could be solved by all rules within a time limit of tmax = 500 s were classified as "easy" while the others were classified as "hard".
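A minimal sketch of this classification, with eq. 6.3 as the weighted average and eq. 6.4 as the median split (function names are illustrative):

```python
import statistics

def weighted_avg_runtime(t_hib, t_rb, t_rules):
    """Eq. 6.3: a quarter weight on each SCIP comparison rule and half
    of the weight spread evenly over the evolved rules."""
    return 0.25 * t_hib + 0.25 * t_rb + 0.5 * statistics.mean(t_rules)

def split_easy_hard(t_hat):
    """Eq. 6.4: problems strictly below the median weighted runtime are
    'easy', the rest 'hard'; returns the two index lists."""
    med = statistics.median(t_hat)
    easy = [p for p, t in enumerate(t_hat) if t < med]
    hard = [p for p, t in enumerate(t_hat) if t >= med]
    return easy, hard
```

With an even number of problems the median is the midpoint of the two central values, so the split is exactly half and half unless there are ties at the median.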
Fig. 6.1 shows the distribution of results for all rules over each of the validation sets as well as test set 5. Shown separately are the distributions of the average number of nodes η̄ and iterations ς̄, as well as the runtime t̄. For test set 5, results are shown for both easy (e) and hard (h) instances. For the hard instances of set 5, the average relative gap %̄LP in percentages is used instead of the runtime. The relative gap of the obtained minimum integer feasible solution value z to the optimal objective value zLP of the LP relaxation is defined as (Puchinger et al., 2010):

$$\%_{LP} = (z_{LP} - z)/z_{LP} \tag{6.5}$$
Except for the hard instances of set 5, the majority of the evolved rules seem to perform better than the HIB rule over all sets when considering the number of simplex iterations performed. The opposite is true for the number of nodes generated: the evolved rules generally did worse than the HIB rule. When considering runtime, validation set 4 is the only set where the majority of the rules did better than the HIB rule, while about half the rules did better over validation sets 1 and 2. Looking at the distributions of the geometric means shown in the rightmost boxplots, it is clear that a majority of the rules perform better than the HIB rule when considering iterations. However, a majority of the evolved rules perform less well than the HIB rule when considering both the number of nodes and the runtime.
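The geometric mean used for the rightmost "Mean" boxplots can be computed as follows (a standard log-sum formulation, assuming strictly positive per-set averages):

```python
import math

def geometric_mean(xs):
    """Geometric mean of per-set average performances: the exponential
    of the mean of the logs, robust to wide value ranges."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))
```

Unlike the arithmetic mean, this aggregate is not dominated by the sets with the largest node counts or runtimes, which is why it is the customary summary in MIP benchmarking.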
Of interest for future study is to see if, and how, the different parameters of table 4.2 used for generation of rules affect the performance of the rules on average. Of interest as well is whether there is a change in performance of each rule associated with the number of iterations performed while evolving the rule, as well as with its final fitness value. To examine these effects, for each problem set and each performance measure in fυ ∈ {η, ς, t}, the proposed performance distribution for rule r and individual problem p is

$$\tilde{y}_d(r, p) \sim TF\big(\mu(r, p),\, \sigma(r, p),\, \nu(r, p)\big) \tag{6.6}$$
[Figure: boxplot panels for t̄, η̄ and ς̄ over Set 1, Set 2, Set 3, Set 4, Set 5e, Set 5h and the geometric mean.]

Figure 6.1: Distribution of average performances of all rules by problem set as well as the geometric mean ("Mean") over all sets. Averages for the HIB rule are shown for comparison (dashed lines).
where

$$y_d(r, p) = y(\mathrm{SCIP}, p) - \log y(r, p) \tag{6.7}$$
$$\tilde{y}_d(r, p) = y_d(r, p)/\sigma_y \tag{6.8}$$
$$y(\mathrm{SCIP}, p) = \min\{\log y(\mathrm{HIB}, p),\, \log y(\mathrm{RB}, p)\} \tag{6.9}$$

Here ỹd(r, p) is the scaled version of yd(r, p) after dividing by the standard deviation σy of yd(r, p), and y(HIB, p) and y(RB, p) are the performance values for the hybrid inference rule and the reliability rule respectively. The distribution TF of ỹd is the t location-scale distribution with parameters µ, σ and ν.¹ For the model of eq. 6.6 we have
$$\mu(r, p) = \beta_0 + \beta_p + \beta_r + \sum_{l \in \chi_\mu} \beta_l\, l(r, p) \tag{6.10a}$$
$$\log \sigma(r, p) = \alpha_0 + \alpha_p + \sum_{l \in \chi_\sigma} \alpha_l\, l(r, p) \tag{6.10b}$$
$$\log \nu(r, p) = \gamma_0 + \sum_{l \in \chi_\nu} \gamma_l\, l(r, p) \tag{6.10c}$$

where βp and αp are the problem-specific intercepts, βr is a rule-specific intercept and where

χµ = {n, m, τ, |P|5k, |P|10k, EB2, EB3, EB4, υς, k∗, υ̃∗}
χσ = {n, m, τ}
χν = {n, m, τ}
¹ Estimation of parameters was done according to Stasinopoulos and Rigby (2007).
are the sets of independent variables on which the distribution parameters are regressed. The variables |P|v and EBw are indicator functions for the number of training problems used and the rule type of rule r respectively, with |P|v = 1 if |P| = v and |P|v = 0 otherwise, and with EBw = 1 if r is of type EBw and EBw = 0 otherwise. υς is an indicator variable for the type of variable used for fitness evaluation during evolution of rule r, with υς = 1 for iterations and υς = 0 otherwise. For rule r, k∗ is the total number of iterations performed by the evolutionary algorithm and υ̃∗ is a normalization of the final fitness value. The variables taking values other than just 0 and 1 are scaled by dividing by their standard deviation. Some of the two-way interactions of the variables are also used in the model. Figure 6.2 shows the confidence intervals for the estimated parameters β̂l of
[Figure: 95% confidence-interval panels for β̂l, l ∈ {0, EB2, EB3, EB4, |P|5k, |P|10k, υς, k∗, υ̃∗}, by set (1–5) and measure y ∈ {η, ς, t}.]

Figure 6.2: Confidence intervals of estimated parameters β̂l of model 6.10
the model for µ of eq. 6.10 for all three performance measures. A positive coefficient β̂l indicates that an increase in variable l for rule r, at least over the range of this variable in the data, is associated with better performance for that rule on average. The negative values of the estimates β̂0 indicate that the average performance of the evolved rules is worse than that of the SCIP rules for all performance measures. There is a strong indication that rules trained on |P| = 5000 training instances perform better on average than rules trained on |P| = 1000 or |P| = 10000 instances. It should not be surprising that training on |P| = 5000 instances is preferable to training on |P| = 1000 instances, since this should result in a rule with better generalization capabilities. Why there is a lack in relative performance of rules trained with |P| = 10000 instances is not
as obvious; possibly, with so many training problems, the ES algorithm is more likely to end up in a relatively poor local minimum. The two types of rules EB1 and EB4 with dim ω = 5 do perform significantly better than the more complex rule EB4 with dim ω = 16, of which EB4 performs best of all the types. This would be in accordance with less complex rules generalizing better than more complex ones, as could be the case here. The significant positive values of the β̂τ parameters could have been expected, since the training set was generated with parameter values τ ∈ τ65 while for test set 5 the values are τ ∈ τ25 (see table 6.1). There is evidence that an increase in the number of iterations k of the ES algorithm is associated with better performance of the corresponding rule over test set 5. There is less evidence, but some, that an increase in the final fitness value υ is associated with worse performance of such a rule over the same set.
6.3. Test Results
Tables 6.3 and 6.4 of subsections 6.3.1 and 6.3.2, respectively, show the test results for three of the highest-ranking rules in Rte according to Oy of eq. 6.1 over the validation results. Shown are the three highest-ranking rules for each performance measure in fυ ∈ {η, ς, t}. Values shown are the average number of nodes η̄, the average number of simplex iterations ς̄ and the average runtime t̄ over each test set. For the harder instances of test set 5, only the average remaining gaps %̄LP are shown. For comparison, the performances of the hybrid inference/reliability branching rule (HIB), the reliability branching rule (RB) and, for test sets 1 through 4, the random branching rule (RDB) are presented. For each rule considered, the rule type, the number of training problems |P| iterated over and the type fυ of fitness measure optimized (denoted by η for nodes and ς for iterations) are shown, as well as the rank Oy. As for the results of section 5.2, ȳ1000 stands for average performance in units of thousands, with η denoting the number of nodes and ς the number of simplex iterations, while t denotes the runtime. The results for all of the rules used for testing can be seen in tables C.5 through C.7 in appendix C.
Shown in bold are results for rules that outperform both the HIB and RB rules. Overscored numbers indicate a difference in median from the better performing SCIP rule at the 95% significance level using the Wilcoxon signed-rank test. For test sets 1-4 the significance is estimated over the hard instances only. The resulting p-values from the significance tests are not adjusted for multiple comparisons. All values for nodes and iterations are shown in thousands, while the runtime is in seconds and the average gap in percentages.
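The test statistic underlying these significance marks can be sketched as follows. This pure-Python version computes only the signed-rank statistic W (zero differences are dropped and ties receive average ranks, as in the standard procedure), not the p-value a full implementation would derive from it:

```python
def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic for paired samples a, b:
    the smaller of the positive and negative signed-rank sums."""
    # paired differences, discarding exact zeros
    d = [x - y for x, y in zip(a, b) if x != y]
    abs_d = sorted(abs(x) for x in d)

    def avg_rank(v):
        # average rank of |v| among all absolute differences (handles ties)
        lo = sum(1 for u in abs_d if u < v)
        eq = sum(1 for u in abs_d if u == v)
        return lo + (eq + 1) / 2.0

    w_plus = sum(avg_rank(abs(x)) for x in d if x > 0)
    w_minus = sum(avg_rank(abs(x)) for x in d if x < 0)
    return min(w_plus, w_minus)
```

In practice one would use a library routine that also returns the p-value; the sketch only illustrates how the paired per-instance results enter the test.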
To examine the effect of problem size on performance, the instances of test set 5 were divided into six subsets according to instance size, while taking into account the relative hardness of each problem size. All instances of test set 5 were solved with a time limit of tmax = 500 seconds. Due to an unforeseen shortage in computing power, test results for some evolved rules on the larger problems of test set 5 are unavailable.
6.3.1. Results for Test Sets 1 through 4
Table 6.3: Test results for test sets 1-4, for the three highest ranking rules (Oy) with respect to each performance measure

Rule/   |Ptr|  fυ  Oy        Set 1                 Set 2                 Set 3                 Set 4
Type                      η̄1000  ς̄1000   t̄     η̄1000  ς̄1000   t̄     η̄1000  ς̄1000   t̄     η̄1000  ς̄1000   t̄
RB        -    -   -      0.280   1.16  0.253   0.567   2.17  0.352   1.24    3.84  0.552   8.16    34.4  3.37
HIB       -    -   -      0.283   1.18  0.255   0.567   2.17  0.351   1.24    3.84  0.557   8.19    34.5  3.38
RDB       -    -   -      0.636   2.64  0.325   1.153   4.53  0.467   2.57    7.00  0.967   18.78   64.6  6.22
EB4      5k    η   1η     0.270   1.14  0.252   0.555   2.13  0.349   1.23    3.76  0.550   8.06    33.3  3.27
EB3      5k    η   2η     0.265   1.12  0.307   0.557   2.12  0.416   1.25    3.72  0.624   8.08    32.6  3.47
EB1      5k    η   3η     0.278   1.17  0.255   0.567   2.18  0.353   1.27    3.95  0.572   8.34    35.7  3.46
EB4      10k   ς   1ς     0.283   1.12  0.252   0.572   2.03  0.345   1.29    3.59  0.551   8.79    30.9  3.28
EB3      5k    ς   2ς     0.282   1.12  0.308   0.579   2.04  0.415   1.35    3.74  0.652   8.82    31.1  3.59
EB2      5k    ς   3ς     0.281   1.13  0.307   0.571   2.04  0.411   1.32    3.62  0.633   8.80    31.2  3.52
EB4      1k    ς   1t     0.281   1.11  0.250   0.567   2.04  0.344   1.27    3.63  0.548   8.48    31.4  3.23
EB4      5k    ς   2t     0.277   1.12  0.251   0.559   2.03  0.344   1.28    3.65  0.551   8.35    30.9  3.21
EB4      10k   η   3t     0.271   1.13  0.250   0.561   2.13  0.351   1.25    3.75  0.549   8.08    32.9  3.26
|Pte|                     1000                  1000                  1000                  1000
For the results for test sets 1-4 of table 6.3, the three fastest rules on average over the validation sets (rules labeled 1t through 3t under Oy) are indeed generally faster than the rules with the fewest nodes (labeled 1η through 3η) and iterations (labeled 1ς through 3ς) over the validation sets. The fastest rule over the validation sets, 1t, significantly outperforms the HIB and RB rules over all test sets 1-4, and rules 2t and 3t are significantly faster over three out of the four sets. The same can not be said for rule 1ς, though. Except for rules 1η and 3η, all of the evolved rules perform fewer simplex iterations than the HIB and RB rules, and rules 1t through 3t perform only slightly more iterations than rules 1ς through 3ς on average. Rules 1η and 2η generate significantly fewer nodes than the HIB and RB rules over test sets 1, 2 and 4, while only rule 1η seems to be any faster. Analogously, rules 2ς and 3ς do not seem to be any faster than the HIB and RB rules even though they perform significantly fewer simplex iterations over all test sets 1-4.
6.3.2. Results for Test Set 5
The results for test set 5 of table 6.4 are perhaps more mixed than the results for test sets 1-4. Six of the rules generate significantly fewer simplex iterations than the HIB and RB rules over the instances corresponding to the first three columns of table 6.4. While there is hardly any reduction in the number of nodes generated compared to the HIB and RB rules, there is some decrease in runtime for all of the rules over the instances corresponding to the first two columns. Rules 1η and 2η show a significant reduction in runtime over the
Table 6.4: Test results for test set 5, for the three highest ranking rules (Oy) with respect to each performance measure

Set 5 (tmax = 500 sec)

Rule/   |Ptr|  fυ  Oy     100×{5,10,15}        100×{20,25,30}       250×{5,10} ∪ 500×5     %̄LP†   %̄LP‡   %̄LP∗
Type                      η̄1000  ς̄1000   t̄    η̄1000  ς̄1000   t̄    η̄1000  ς̄1000   t̄
RB        -    -   -      33.8    165   7.56   172    1057   49.3   285    1381   77.9     0.182  0.150  0.171
HIB       -    -   -      32.2    156   7.34   172    1050   48.1   295    1398   80.6     0.182  0.155  0.165
EB4      5k    η   1η     31.0    147   6.94   178    1062   50.0   304    1328   77.8     0.183  0.154  0.174
EB3      5k    η   2η     32.2    150   7.26   176    1016   47.7   304    1287   76.2     0.192  0.149  0.157
EB1      5k    η   3η     32.6    158   7.25   176    1081   50.3   290    1355   79.0     0.205  0.153  0.154
EB4      10k   ς   1ς     34.2    137   6.79   196     961   47.7   362    1360   84.1     0.231  0.153  0.175
EB3      5k    ς   2ς     35.1    140   7.21   196     962   46.4   369    1368   84.5     0.196  0.167  0.169
EB2      5k    ς   3ς     37.1    148   7.30   211    1016   50.3   350    1306   76.6     0.248  NA     NA
EB4      1k    ς   1t     32.4    138   6.69   189     990   47.1   334    1304   79.2     0.232  NA     NA
EB4      5k    ς   2t     32.5    138   6.93   186     970   45.5   362    1405   81.2     0.234  0.152  0.177
EB4      10k   η   3t     32.4    151   7.41   178    1026   47.3   311    1366   76.4     0.194  0.144  0.177
|Pte|                     90                   90                   90                     90     90     90

†{250×{15,20,25}}, ‡{250×30} ∪ {500×{10,15}}, ∗{500×{20,25,30}}
instances corresponding to the third column and, along with rule 1t, spend significantly less runtime solving the instances corresponding to two out of the first three columns. There is not much evidence of the evolved rules being better than the HIB and RB rules when it comes to solving instances of the types corresponding to the last three columns of table 6.4. Three of the evolved rules have smaller average gaps for the instances corresponding to the last two columns. However, since none of the rules of table 6.4 show any better performance over the smaller instances corresponding to the fourth column, this could easily be due to fewer nodes having been generated, on average, while solving the larger instances corresponding to the last two columns (see table C.5 of appendix C for the average number of nodes generated), resulting in fewer branching decisions being made within the 500 s time limit. With fewer branching decisions made, a potentially worse (or better) branching rule would not have had the opportunity to make as big an impact on the final performance as on smaller instances, where more nodes are generated and therefore more branching decisions are made.
6.4. Summary and Discussion
The results for runtime are especially encouraging for test set 4 and promising for set 5. Set 4 consists of instances that were generated with the same sets of parameters τ, ρ and amax as the training problems, while the values of n and m are larger. The runtimes for test set 1, the instances of which were generated with the same sets of parameters as the training set, are on average not as good when compared to the HIB rule. Although the rules did better on average on set 4 with respect to iterations, this does not by itself explain the overall reduction in runtime. However, since the rules are optimized to minimize the sum, or average, of the fitness values over the training instances, the rules might be better adapted to the hard instances among the training problems. Solving a set of larger instances, generated with otherwise the same set of parameters as the training instances, might therefore be a task well suited to these rules.
As can be seen in tables C.5 through C.7, for problem set 1 and for all measures of performance, the evolved rules seem to be more likely to outperform the SCIP rules on the instances determined to be hard. This should not come as a surprise, since the evolved rules have been optimized to minimize the sum of the fitness values over all instances, and for harder instances there is simply much more to be gained by improving the branching rules being used. It is therefore likely that the ES algorithm is more influenced by the potentially larger gains in performance obtainable when solving the harder instances. Somewhat surprising, however, is that this pattern seems to be repeated, at least partially, in the other test sets. This is the case even though the easy instances of set 4, for example, are on average considerably harder than the hard instances of set 1. This perhaps suggests that, regardless of the size of the instance, an easy instance's solution time can not be improved upon as much as the solution time of a hard instance of the same size. Another possibility is that the evolved rules are better suited to solving instances that are harder than other instances of similar size and type.
According to the results of figure 6.2, there is evidence that 1000 instances are not sufficient for effectively training a branching rule; it might thus be beneficial to train with, e.g., 5000 instances. From the same results one may deduce that less complex rules are likely to perform better than more complex ones.
7. Conclusion and Future Work
As was shown in fig. 6.1 for the validation results, the evolved rules of the non-supervised method were in general well suited to reducing the overall number of simplex iterations performed. However, the same can not be said for reducing the number of nodes generated. Furthermore, most of the evolved rules took longer to solve the validation instances than the comparison rules. When looking at the overall runtime performance of the evolved rules, they are not particularly good at solving the integer instances of test set 3 or the harder problems of test set 5. The geometric mean of average runtimes over all test sets might suggest that the evolved rules outperform the comparison rules only by chance. However, as can be seen from tables 6.3 and 6.4, some of the rules, as well as spending less runtime than the HIB and RB rules solving the easier instances of set 5, significantly outperform both the HIB and RB rules with respect to runtime over at least three out of four of the remaining test sets. When looking at the full results of tables C.5 through C.7, the evolved rules outperform the HIB/RB rules much more often than one would expect due to chance alone.
The evolved rules prove to be well suited to solving the instances of test sets 1 and 4, and the results for the runtime spent on solving the easy instances of set 5 are rather encouraging. However, the results for the hard instances of test set 5 are less so. This might be an indication that the method of applying (1+1)CMA-ES to multiple small instances in order to train branching rules, or other decision rules, to be effective at solving instances of large size might not be sound. On the other hand, it might be possible to train decision rules using the relative gap as a measure of fitness while solving small parts of larger instances, perhaps only the upper part of the tree. This could be investigated in a future study. A further study might also involve adapting (1+1)CMA-ES, or another algorithm, so that it can select which features to use as well as optimize the weights, thus eliminating the need for choosing the features beforehand. Another possible remedy would be to develop restrictions or scaling preventing the less promising features from having weights as large in magnitude as the more important ones, so that they serve as tie-breakers, as is the case for the HIB rule.
For the supervised method, this work has revealed that learning branching rules from data collected at the root node by solving multidimensional knapsack problems is plausible. The rules created this way are even more effective than the evolved rules of the non-supervised method, and they are effective at greater depths than they were designed for, i.e. they are not only effective at the root. Some of the supervised rules that do not use strong branching at the upper levels of the tree even outperform the HIB and RB rules that do so. This is surprising given the status of strong branching as a state-of-the-art
branching rule in non-commercial as well as commercial solvers. It seems that very simple branching rules, using only a very simple heuristic at the upper levels of the tree, can outperform rules that use strong branching at the same depths. This might be an indication that, for at least some types of MIP instances, it is better to pick the "obviously" strong candidates at the upper levels while "saving" strong branching iterations for the more ambiguous choices at the deeper levels. It could also be an indication that using more than one branching rule for solving an instance might be beneficial. For training branching rules to tackle larger instances and the benchmark instances of the literature, it might be plausible to collect data at the root node with the same method as is used in this work. Of course, the large instances can not be solved to optimality for each branching candidate; a time limit of a few seconds at most, perhaps even minutes depending on the available computing power, would probably have to be enforced. The relative gap could then possibly be used as a measure of performance during training.
This work has shown that it is possible to use the (1+1)CMA-ES algorithm, both in a supervised and in a non-supervised manner, for training efficient branching rules for a special class of mixed integer programs, i.e. multidimensional knapsack problems. Furthermore, it has been shown that it is possible to use the number of nodes generated, as well as the number of simplex iterations performed, as a fitness variable to create branching rules that are faster than the HIB and RB rules in solving the test instances. Finally, the evolved rules generalize to a wider class of instances as well as to larger ones. It remains to be seen whether this method, or a modification of it, is practical for the creation of effective branching rules for solving instances of even larger sizes and of more varied types.
Bibliography
Achterberg, T. (2007). Constraint Integer Programming. PhD thesis.
Achterberg, T. (2009). SCIP: solving constraint integer programs. Mathematical Programming Computation, 1(1):1–41.
Alvarez, A. M., Louveaux, Q., and Wehenkel, L. (2014). A Supervised Machine Learning
Approach to Variable Branching in Branch-And-Bound.
Applegate, D., Bixby, R., Chvatal, V., and Cook, W. (1995). Finding cuts in the TSP.
Balas, E., Ceria, S., Cornuéjols, G., and Natraj, N. (1996). Gomory cuts revisited. Operations Research Letters, 19(1):1–9.
Benichou, M., Gauthier, J. M., Girodet, P., Hentges, G., Ribiere, G., and Vincent, O.
(1971). Experiments in mixed-integer linear programming. Mathematical Programming,
1(1):76–94.
Berthold, T. and Salvagnin, D. (2013). Cloud Branching. Technical Report January.
Borndörfer, R. and Grötschel, M. (2012). Designing telecommunication networks by integer
programming.
Burke, E. K., Gendreau, M., Hyde, M., Ochoa, G., Kendall, G., Özcan, E., and Qu, R.
(2013). Hyper-heuristics: A survey of the state of the art. Journal of the Operational
Research Society.
Cossock, D. and Zhang, T. (2006). Subset ranking using regression. In Learning theory,
pages 605–619. Springer.
Dang, V. and Croft, B. (2010). Feature selection for document ranking using best first
search and coordinate ascent. In Sigir workshop on feature generation and selection for
information retrieval.
He, H., Daumé III, H., and Eisner, J. (2014). Learning to Search in Branch-and-Bound Algorithms.
Ding, C. and Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology,
03(02):185–205.
Fischetti, M. and Monaci, M. (2011). Backdoor branching. Lecture Notes in Computer
Science, 6655:183–191.
Fischetti, M. and Monaci, M. (2012). Branching on nonchimerical fractionalities. Operations Research Letters, 40(3):159–164.
Floudas, C. A. and Lin, X. (2005). Mixed Integer Linear Programming in Process
Scheduling: Modeling, Algorithms, and Applications. Annals of Operations Research,
139:131–162.
Forrest, J., Hirst, J., and Tomlin, J. A. (1974). Practical solution of large mixed integer
programming problems with umpire. Management Science, 20(5):736–773.
Freville, A. and Plateau, G. (1994). An efficient preprocessing procedure for the multidimensional 0–1 knapsack problem. Discrete Applied Mathematics, 49(1-3):189–212.
Geng, X., Liu, T.-Y., Qin, T., and Li, H. (2007). Feature selection for ranking. In
Proceedings of the 30th annual international ACM SIGIR conference on Research and
development in information retrieval, pages 407–414. ACM.
Gomory, R. E. (1958). Outline of an algorithm for integer solutions to linear programs.
Guyon, I. and Elisseeff, A. (2003). An Introduction to Variable and Feature Selection.
Journal of Machine Learning Research, 3:1157–1182.
Hansen, N., Arnold, D. V., and Auger, A. (2013). Evolution Strategies. In Handbook of
Computational Intelligence.
Hutter, F. (2009). Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis.
Kamishima, T., Kazawa, H., and Akaho, S. (2011). A survey and empirical comparison
of object ranking methods. In Preference learning, pages 181–201. Springer.
Kellerer, H., Pferschy, U., and Pisinger, D. (2004). Knapsack problems. Springer.
Koch, T., Achterberg, T., Andersen, E., Bastert, O., Berthold, T., Bixby, R. E., Danna,
E., Gamrath, G., Gleixner, A. M., Heinz, S., Lodi, A., Mittelmann, H., Ralphs, T.,
Salvagnin, D., Steffy, D. E., and Wolter, K. (2011). MIPLIB 2010. Mathematical Programming Computation, 3(2):103–163.
Kotthoff, L. (2012). Algorithm Selection for Combinatorial Search Problems: A survey.
Kuo, J.-W., Cheng, P.-J., and Wang, H.-M. (2009). Learning to rank from bayesian
decision inference. In Proceedings of the 18th ACM Conference on Information and
Knowledge Management, CIKM ’09, pages 827–836, New York, NY, USA. ACM.
Liberto, G. D., Kadioglu, S., Leo, K., and Malitsky, Y. (2013). DASH: Dynamic Approach
for Switching Heuristics.
Linderoth, J. T. and Savelsbergh, M. W. P. (1999). A Computational Study of
Search Strategies for Mixed Integer Programming. INFORMS Journal on Computing,
11(2):173–187.
Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in
Information Retrieval, 3(3):225–331.
Lodi, A. (2010). Mixed Integer Programming Computation. In Jünger, M., Liebling,
T. M., Naddef, D., Nemhauser, G. L., Pulleyblank, W. R., Reinelt, G., Rinaldi, G.,
and Wolsey, L. A., editors, 50 Years of Integer Programming 1958-2008, chapter 16,
pages 619–645. Springer Berlin Heidelberg, Berlin, Heidelberg.
Lodi, A., Ralphs, T. K., Rossi, F., and Smriglio, S. (2011). Interdiction Branching.
Loughrey, J. and Cunningham, P. (2005). Using early-stopping to avoid overfitting in
wrapper-based feature selection employing stochastic search. Technical report, Trinity
College Dublin, Department of Computer Science.
Meindl, B. and Templ, M. (2012). Analysis of commercial and free and open source
solvers for linear optimization problems. Technical report, Technische Universität Wien
(Vienna University of Technology).
Moskewicz, M. W., Madigan, C. F., Zhao, Y., Zhang, L., and Malik, S. (2001). Engineering
an Efficient SAT Solver. In DAC ’01 Proceedings of the 38th annual Design Automation
Conference, pages 530–535.
Nilsson, N. J. (2014). Principles of artificial intelligence. Morgan Kaufmann.
Puchinger, J., Raidl, G. R., and Pferschy, U. (2010). The multidimensional knapsack
problem: Structure and algorithms. INFORMS Journal on Computing, 22(2):250–265.
Resende, M. G. and Werneck, R. F. (2004). A hybrid heuristic for the p-median problem.
Journal of heuristics, 10(1):59–88.
Savelsbergh, M. W. (1994). Preprocessing and probing techniques for mixed integer programming problems. ORSA Journal on Computing, 6(4):445–454.
Smith-Miles, K. (2008). Towards Insightful Algorithm Selection For Optimisation Using
Meta-Learning Concepts. In WCCI 2008: IEEE World Congress on Computational
Intelligence, pages 4118–4124. IEEE.
Spielman, D. A. and Teng, S.-H. (2004). Smoothed analysis of algorithms: Why the
simplex algorithm usually takes polynomial time. J. ACM, 51(3):385–463.
Stasinopoulos, D. M. and Rigby, R. A. (2007). Generalized Additive Models for Location
Scale and Shape (GAMLSS) in R. Journal of Statistical Software, 23(7).
Suttorp, T., Hansen, N., and Igel, C. (2009). Efficient covariance matrix update for
variable metric evolution strategies. Machine Learning, 75(2):167–197.
Van Roy, T. J. and Wolsey, L. A. (1987). Solving mixed integer programming problems
using automatic reformulation. Operations Research, 35(1):45–57.
Wolsey, L. A. (2008). Mixed integer programming. Wiley Encyclopedia of Computer
Science and Engineering.
Xu, L., Hutter, F., Hoos, H. H., and Leyton-brown, K. (2011). Hydra-MIP : Automated
Algorithm Configuration and Selection for Mixed Integer Programming. In Proceedings
of the 18th RCRA Workshop on Experimental Evaluation of Algorithms for Solving
Problems with Combinatorial Explosion.
Yu, X. and Gen, M. (2010). Introduction to evolutionary algorithms. Springer Science &
Business Media.
A. Full List of Features
A.1. List of Features with Names and Explanations
Also used in this study were the maxima and minima of the upwards/downwards (+/−)
measures, see A.1.
che± Up-/downwards (+/−) estimate for the objective of the best feasible solution
of the resulting subtree after branching on the variable in question (Achterberg,
2007):

$$e_j^{\pm}(\check{Q}) = z_{LP} + \sum_{\substack{i \in F \\ i \neq j}} \min\{\Psi_i^{+} f_i^{+},\, \Psi_i^{-} f_i^{-}\} \pm \Psi_j^{\pm} f_j^{\pm} \tag{A.1}$$
i6=j
conls Conflict length score, see section 3.4.
cons Conflict score, see section 3.4.
cuts Cutoff score, see section 3.4.
eff w Weighted efficiency measure as suggested by Freville and Plateau (1994):

$$e_j^{(w)} = \frac{c_j}{\sum_{i=1}^{m} r_i A_{ij}} \tag{A.2}$$

where

$$r_i = \frac{\sum_{j=1}^{n} A_{ij} - b_i}{\sum_{j=1}^{n} A_{ij}} \tag{A.3}$$

is a relevance value with respect to the i-th constraint.

eff w(Q̌) Same as eff w except that the relevance value is calculated only over variables
that are not fixed to zero.
eff Relative efficiency measure:
      e_j = c_j (1/m) Σ^m_{i=1} ( A_ij / ((1/n) Σ^n_{j=1} A_ij) )^(−1)    (A.4)
fra Fractionality, or θ(x_j) = min{f^−_j, f^+_j}.
inf s Inference score, see section 3.4.
ncca Number of active constraints the variable is involved in:
      |A_j(Q̌)| = |{i : A_ij > 0 ∧ A_i x̂ᵀ − b_i = 0}|    (A.5)
      where A_i is the i-th row of the constraint coefficient matrix A and x̂ is the row
      vector of LP relaxed solutions at the current node Q̌.
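A direct way to evaluate eq. A.5 on a dense constraint matrix, assuming a small tolerance for LP round-off, is shown below; the helper name and the tolerance are ours.

```python
# Counting the active constraints a variable appears in (eq. A.5): a row i is
# active when its activity at the LP point x_hat equals b_i, and variable j
# participates when its coefficient in that row is positive.

def active_constraint_count(A, b, x_hat, j, tol=1e-9):
    count = 0
    for i, row in enumerate(A):
        activity = sum(a * x for a, x in zip(row, x_hat))
        if row[j] > 0 and abs(activity - b[i]) <= tol:
            count += 1
    return count
```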
lpv± Up-/downwards (+/−) cost function value for variable j, or c_j f^±_j.
obc Objective coefficient value cj .
obc0 Best objective value w.r.t. root reduced cost propagation of the root LP (Achterberg, 2007).
psd± Up-/downwards (+/−) pseudocost dive value (Achterberg, 2007):
      ψ^±_j = √(f^∓_j) (1 + Ψ^∓_j) / (1 + Ψ^±_j)    (A.6)
psh± Up-/downwards (+/−) average pseudocost Ψ^±_j, see section 3.2.
psc± Up-/downwards (+/−) pseudocost objective change estimate Ψ^±_j f^±_j, see section
      3.2.
pscs Pseudocost score, see section 3.2.
rcc^w_av Average relative weighted constraint coefficient:
      Ā′^(w)_j = (1/m) Σ^m_{i=1} r_i A_ij    (A.7)
      where r_i is as in eq. A.3.
rcc^{w(Q̌)}_av Same as rcc^w_av except that the relevance value is calculated only over
      variables that are not fixed to zero.
rccav Average relative constraint coefficient:
      Ā′_j = (1/m) Σ^m_{i=1} A_ij / ((1/n) Σ^n_{j=1} A_ij)    (A.8)
rccav^(Q̌) Same as rccav except that the summation is over only non-fixed variables.
A.1. List of Features with Names and Explanations
rccmax Maximum relative constraint coefficient:
      A′^(max)_j = max_i A_ij / ((1/n) Σ^n_{j=1} A_ij)    (A.9)
rccmax^(Q̌) Same as rccmax except that the summation is over only non-fixed variables.
rccmin Minimum relative constraint coefficient:
      A′^(min)_j = min_i A_ij / ((1/n) Σ^n_{j=1} A_ij)    (A.10)
rccmin^(Q̌) Same as rccmin except that the summation is over only non-fixed variables.
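The three relative-coefficient aggregates of eqs. A.8–A.10 share the same row scaling and can be computed together; a minimal sketch with our own naming:

```python
# Relative constraint coefficients (eqs. A.8-A.10): each coefficient A_ij is
# scaled by its row average before aggregating over the rows.

def relative_coefficients(A, j):
    scaled = [row[j] / (sum(row) / len(row)) for row in A]
    m = len(A)
    return sum(scaled) / m, max(scaled), min(scaled)  # rccav, rccmax, rccmin
```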
red Reduced cost of a variable.
red0 Variable’s best reduced cost w.r.t. root reduced cost propagation at the root
node.
rel± Up-/downwards (+/−) reliability pseudocost, see subsection 3.5.1.
rels Reliability pseudocost score, see subsection 3.5.1.
w(Q̌)
ncca
Weighted count of active constraints the variable is involved in:
−1
X X
(w)
|Aj (Q̌)| =
Aij
(a)
i∈C(Q̌)j
(A.11)
j∈I(Q̌)(a)
(a)
where C(Q̌)j = {i : Aij > 0 ∧ Ai x̂T − bi = 0} is the set of active constraints, at
the current node Q̌, of which the j-th variable is involved in. I(Q̌)(a) is the set
of active (non-fixed) integer/binary variables at the current node Q̌.
sol The LP solution value x̂j of the variable j.
lpr0 Variable’s best LP objective value w.r.t. root reduced cost propagation at the
root node.
solav The average LP solution value x̂j over all nodes Q ∈ Q of the variable j.
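Several of the simpler features above reduce to one or two lines of code; for instance the fractionality feature fra, built from the down- and upwards fractional parts:

```python
# Fractionality of an LP value (feature "fra"): distance to the nearer integer.
import math

def fractionality(x):
    f_down = x - math.floor(x)  # f_j^-
    f_up = math.ceil(x) - x     # f_j^+
    return min(f_down, f_up)    # theta(x_j)
```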
A.2. Full Feature Listing Along With Relative
Performances
Table A.1 shows the full listing of features used for the learning methods of this study.
Shown are each feature's three performance measures according to eq. 4.19, relative to
the HIB rule. The con, conl, cut and inf features are not available at the root node. The
RND row shows the relative performance when the candidates are chosen at random.
Table A.1: Full list of features (excluding con, conl, cut, inf) along with performance
relative to the HIB rule
Name              t/t_HIB     η/η_HIB     ς/ς_HIB
rccmax^(Q̌)       0.992       0.985       0.988
rccmax            0.993       0.986       0.988
relmax            0.998       1.002       0.995
pscmax            0.998       1.002       0.995
pscs              0.999       0.999       1
psdmax            0.999       1.004       0.996
rels              1           1           1
psh−              1.001       1.003       0.997
psc−              1.001       1.004       0.995
red0              (−)1.001    (−)1.004    (−)0.995
rel−              1.001       1.004       0.995
che−              1.002       1.005       0.996
psd+              1.004       1.008       0.999
psh+              1.007       1.010       1.009
psc+              1.008       1.011       1.010
rel+              1.008       1.011       1.010
che+              1.010       1.013       1.012
relmin            1.011       1.011       1.016
pscmin            1.012       1.011       1.016
psd−              1.013       1.018       (−)1.014
solav             1.013       1.018       1.015
eff^{w(Q̌)}       1.015       1.015       1.013
lpv−              1.015       1.016       1.014
eff^w             1.015       1.015       1.014
psdmin            (−)1.017    (−)1.020    (−)1.015
eff               1.018       1.019       1.017
lpvmax            1.019       1.024       1.022
ncca^{w(Q̌)}      (−)1.020    (−)1.013    (−)1.016
obc               1.021       1.024       1.024
rccav^(Q̌)        1.022       1.027       1.030
rccav             1.023       1.028       1.030
ncc               (−)1.023    (−)1.010    (−)1.014
lpv+              1.023       1.027       1.030
ncca              (−)1.024    (−)1.010    (−)1.013
fra               (−)1.024    (−)1.030    (−)1.026
lpvmin            1.025       1.027       (−)1.033
sol               1.025       1.031       1.025
rcc^w_av          1.026       1.031       1.033
rcc^{w(Q̌)}_av    1.026       1.030       1.033
red               (−)1.029    (−)1.025    (−)1.025
RND               1.033       1.030       1.031
rccmin^(Q̌)       (−)1.042    (−)1.018    (−)1.019
rccmin            (−)1.042    (−)1.018    (−)1.019
lpr0              1.048       1.032       1.031
obc0              1.076       1.035       1.036
B. Model results
In Tables B.1 through B.3 below, all of the fitted parameters of eq. 6.10 for the t
location-scale model of eq. 6.6 are listed, including the fits for the interaction terms
included in the model.
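For reference, the density of a t location-scale distribution with location μ, scale σ and degrees of freedom ν, all three of which the GAMLSS fit models jointly, can be written down directly. This is only an illustrative sketch; the thesis itself uses the gamlss package in R.

```python
# Density of a t location-scale distribution: a standard Student's t in the
# standardised variable z = (y - mu) / sigma, rescaled by 1 / sigma.
import math

def t_ls_pdf(y, mu, sigma, nu):
    z = (y - mu) / sigma
    norm = math.gamma((nu + 1) / 2) / (
        math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma)
    return norm * (1 + z * z / nu) ** (-(nu + 1) / 2)
```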
Table B.1: Fitted coefficients β̂l of model 6.10
l             yυ   Set 1 (hard)   Set 2 (hard)   Set 3 (hard)   Set 4 (hard)   Set 5 (easy)
0             η    2.71e-03       -7.93e-02***   -7.36e-02***   -2.41e-01***   -6.01e-02***
              ς    -6.06e-03      -5.33e-02***   -6.24e-02***   -2.24e-01***   -9.38e-02***
              t    -1.51e-01***   1.98e-02*      -1.48e-01***   -2.47e-01***   -2.46e-01***
|P|5k         η    1.44e-02*      6.43e-02***    4.87e-02***    6.23e-02***    3.24e-02***
              ς    2.47e-02***    1.09e-01***    2.31e-02***    8.60e-02***    5.22e-02***
              t    1.40e-02*      5.75e-02***    2.45e-02***    6.45e-02***    4.30e-02***
|P|10k        η    1.55e-02       4.06e-02***    5.67e-02***    6.12e-02***    1.24e-02
              ς    5.67e-03       -9.73e-03      -4.33e-02***   1.24e-02       -5.83e-03
              t    9.54e-03       7.83e-02***    -6.99e-03      2.08e-02.      9.40e-03
EB2           η    -4.20e-02***   -7.25e-02***   -1.71e-01***   -5.31e-02***   -5.00e-02***
              ς    -3.29e-02***   -2.43e-02**    -3.79e-02***   -7.16e-03      -3.71e-02***
              t    -3.66e-02***   -2.35e-01***   -6.71e-01***   -2.48e-02*     -4.38e-02***
EB3           η    -4.48e-02**    -7.79e-02***   -1.29e-01***   -1.23e-01***   -2.42e-02
              ς    -2.32e-02.     -4.45e-02**    -3.95e-02*     -8.95e-02***   -4.83e-02**
              t    -5.27e-02**    -8.48e-02***   -7.13e-01***   -1.30e-01***   -1.04e-01***
EB4           η    -3.50e-02***   -1.80e-03      -4.04e-02***   2.93e-02***    1.93e-02.
              ς    -1.12e-02      9.62e-02***    1.11e-01***    1.10e-01***    7.23e-02***
              t    -7.21e-03      -2.81e-02**    4.51e-02***    8.52e-02***    5.44e-02***
υς            η    -4.49e-02***   -2.43e-02**    -2.16e-02*     -2.14e-02**    -3.89e-02***
              ς    -4.54e-03      1.08e-01***    8.34e-02***    6.54e-02***    1.48e-02
              t    -2.69e-02**    -3.92e-02***   3.03e-02**     3.53e-02***    4.44e-03
τ             η    1.08e-02***    7.60e-02***    -5.07e-02***   1.13e-01***    3.05e-01***
              ς    1.45e-02***    4.40e-02***    1.50e-01***    9.50e-02***    2.94e-01***
              t    1.59e-02***    1.25e-02***    4.14e-02***    8.46e-02***    1.96e-01***
n             η    5.29e-02***    5.77e-03       6.44e-01***    -1.83e-01***   -3.73e-01***
              ς    2.21e-01***    -7.80e-03      1.93e-01***    -1.39e-01***   -2.63e-01***
              t    1.06e-01***    9.64e-02***    1.59e-01***    -1.75e-01***   -5.59e-01***
ρ             η    4.08e-02***    -2.94e-02***   -2.69e-02***   NA             -7.66e-02***
              ς    3.89e-02***    -3.98e-02***   -8.93e-02***   NA             -6.34e-02***
              t    3.99e-02***    -4.61e-02***   1.29e-01***    NA             -8.33e-02***
m             η    -5.75e-02***   -6.37e-02***   -6.29e-01***   3.01e-02***    9.33e-02***
              ς    -1.82e-01***   -1.87e-02      1.46e-02       6.09e-02***    -1.56e-01***
              t    -4.37e-02***   -2.65e-01***   -1.61e-01***   4.97e-02***    1.89e-01***
n/m           η    -1.18e-01***   -3.31e-02.     -8.26e-01***   1.34e-01***    4.49e-01***
              ς    -2.96e-01***   -2.58e-03      -7.48e-02***   1.41e-01***    9.28e-02***
              t    -1.30e-01***   -2.75e-01***   -1.83e-01***   1.76e-01***    4.20e-01***
k∗            η    1.04e-02*      2.72e-02***    1.83e-02**     5.04e-02***    2.02e-02***
              ς    8.75e-03.      4.49e-02***    2.49e-02***    5.95e-02***    2.71e-02***
              t    1.72e-02**     3.00e-02***    2.07e-02***    5.80e-02***    2.61e-02***
ς̃∗           η    -1.89e-02***   -2.00e-02***   -7.97e-03.     -6.88e-03*     -1.23e-02**
              ς    -1.68e-02***   -2.27e-02***   -7.44e-03.     -1.70e-02***   -9.26e-03*
              t    -9.20e-03*     -1.60e-02***   -1.13e-02**    -1.26e-02**    -1.31e-02**
|P|5k × υς    η    2.26e-02**     1.54e-02.      1.45e-02       1.64e-02*      1.52e-02
              ς    9.60e-04       -7.24e-02***   -1.09e-02      -4.19e-02***   -2.82e-02**
              t    -2.22e-03      -1.26e-02      4.60e-04       -4.17e-03      -7.74e-03
|P|10k × υς   η    -3.05e-02*     -5.76e-02***   -5.16e-02**    -7.73e-02***   -2.94e-03
              ς    1.04e-02       7.27e-02***    7.14e-02***    2.55e-02*      5.64e-02***
              t    3.82e-02*      -9.11e-02***   1.92e-02       -1.08e-02      2.95e-02.
EB2 × υς      η    -1.09e-01***   -2.67e-01***   -9.34e-02***   -2.75e-01***   -1.98e-02
              ς    4.73e-02***    2.86e-01***    1.25e-01***    1.67e-01***    2.72e-01***
              t    3.93e-02**     8.04e-02***    -1.09e-02      6.18e-03       1.64e-01***
EB3 × υς      η    -7.30e-02***   -1.55e-01***   -1.12e-01***   -1.58e-01***   -2.58e-02.
              ς    3.94e-02***    2.71e-01***    1.08e-01***    2.01e-01***    2.18e-01***
              t    5.87e-02***    3.29e-02*      -8.37e-03      9.24e-02***    1.03e-01***
EB4 × υς      η    -1.21e-01***   -2.52e-01***   -1.67e-01***   -2.24e-01***   -8.58e-02***
              ς    2.93e-04       2.01e-01***    1.61e-02       1.22e-01***    1.59e-01***
              t    7.46e-03       9.49e-02***    -7.89e-02***   1.87e-02       8.07e-02***
Pr(>|t|): *** < 0.001, ** < 0.01, * < 0.05, . < 0.1
Table B.2: Fitted coefficients α̂l of model 6.10
l      yυ   Set 1 (hard)   Set 2 (hard)   Set 3 (hard)   Set 4 (hard)   Set 5 (easy)
0      η    -5.59e-01***   -5.56e-01***   -4.40e-01***   -5.71e-01***   -4.41e-01***
       ς    -5.99e-01***   -5.83e-01***   -4.76e-01***   -5.45e-01***   -4.21e-01***
       t    -6.77e-01***   -5.40e-01***   -4.75e-01***   -2.11e-01***   -3.56e-01***
τ      η    3.12e-02***    -6.90e-03*     5.38e-02***    6.99e-03.      -7.67e-03**
       ς    -4.71e-02***   -7.61e-03**    6.56e-02***    3.25e-02***    -4.29e-02***
       t    -2.89e-02***   -9.88e-03***   3.39e-02***    2.20e-03       -3.12e-02***
n      η    1.84e-01***    3.57e-02**     -2.33e-01***   3.47e-01***    -1.99e-01***
       ς    2.30e-01***    1.33e-02       -2.34e-01***   3.51e-01***    -5.35e-01***
       t    1.87e-01***    -2.48e-02*     -2.31e-01***   3.41e-01***    -2.46e-01***
m      η    -3.02e-01***   -1.49e-01***   2.58e-01***    1.08e-01***    1.57e-01***
       ς    -2.24e-01***   -1.23e-01***   3.25e-01***    9.95e-02***    6.09e-01***
       t    -2.31e-01***   -4.30e-02      2.80e-01***    5.05e-01***    2.67e-01***
ρ      η    1.76e-03       -3.70e-03      2.62e-02***    NA             -5.60e-02***
       ς    -7.72e-02***   -9.31e-03***   1.47e-02***    NA             -1.43e-01***
       t    -2.91e-02***   -1.14e-02***   3.41e-04       NA             -7.75e-02***
n/m    η    -2.94e-01***   -1.30e-01***   3.36e-01***    4.94e-01***    2.58e-01***
       ς    -2.37e-01***   -9.96e-02***   3.65e-01***    4.05e-01***    8.54e-01***
       t    -2.16e-01***   -1.98e-02      3.59e-01***    9.46e-01***    4.39e-01***
τ×n    η    2.26e-02***    3.05e-02***    3.62e-03       -2.74e-02***   -5.62e-03.
       ς    5.69e-02***    4.23e-02***    -9.79e-03***   -3.07e-02***   -6.15e-02***
       t    2.30e-02***    3.35e-02***    9.50e-03***    -7.13e-03      -3.76e-02***
τ×m    η    3.90e-02***    1.13e-02***    -2.45e-02***   1.11e-02*      -2.91e-02***
       ς    4.50e-02***    1.49e-02***    -2.06e-02***   1.31e-02**     -3.62e-02***
       t    2.50e-02***    7.21e-03*      -2.39e-03      4.71e-02***    -9.94e-03***
τ×ρ    η    8.03e-03**     1.63e-02***    -3.46e-02***   NA             3.29e-02***
       ς    1.63e-02***    1.19e-02***    -9.66e-03***   NA             4.47e-02***
       t    1.22e-02***    4.02e-03       -1.57e-03      NA             2.13e-02***
n×m    η    -2.19e-02***   -2.14e-02***   2.74e-02***    4.85e-01***    9.17e-03*
       ς    -1.48e-02***   -3.23e-02***   4.09e-02***    4.54e-01***    9.59e-02***
       t    -1.22e-02*     -1.03e-02**    4.56e-02***    9.54e-01***    4.04e-02***
n×ρ    η    -1.08e-02***   2.12e-02***    -3.72e-03      NA             -3.30e-03
       ς    -1.09e-02***   3.11e-02***    -5.55e-03*     NA             -2.67e-02***
       t    -2.18e-02***   1.30e-02***    -2.01e-03      NA             -5.67e-02***
m×ρ    η    1.25e-02***    -1.40e-04      4.40e-02***    NA             -3.03e-02***
       ς    1.37e-02***    2.67e-03       5.26e-02***    NA             -9.21e-02***
       t    -2.48e-03      4.00e-03       3.08e-02***    NA             -4.70e-02***
Pr(>|t|): *** < 0.001, ** < 0.01, * < 0.05, . < 0.1
Table B.3: Fitted coefficients γ̂l of model 6.10
l    yυ   Set 1 (hard)   Set 2 (hard)   Set 3 (hard)   Set 4 (hard)   Set 5 (easy)
0    η    3.26e+00***    3.34e+00***    3.48e+00***    2.91e+00***    3.54e+00***
     ς    3.65e+00***    3.69e+00***    3.86e+00***    3.08e+00***    5.73e+00***
     t    1.25e+00***    2.74e+00***    4.45e+00***    2.80e+00***    4.47e+00***
n    η    -2.01e-02      3.06e-02       -3.27e-01***   2.66e-01**     -3.33e-01***
     ς    1.25e-01.      -1.23e-02      -2.68e-01***   -7.11e-03      -2.60e-01*
     t    1.50e-01***    -2.02e-01***   -3.57e-01**    9.97e-01***    3.71e-02
m    η    -1.19e-01*     -1.25e-01*     -4.21e-01***   -4.03e-02      -7.70e-02
     ς    -1.01e-01      -2.21e-01**    -5.15e-01***   -8.84e-03      3.34e-01***
     t    -6.57e-02***   -5.21e-02      -5.49e-01**    7.12e-01***    4.12e-02
τ    η    3.76e-03       3.73e-01***    1.13e-01*      -8.06e-02      2.55e-01***
     ς    -1.32e-01.     4.28e-01***    2.77e-01***    6.35e-02       -3.23e-01***
     t    -4.74e-02***   3.97e-02       3.46e-01**     -3.66e-03      8.70e-01***
ρ    η    7.37e-02       4.26e-01***    1.34e-01*      NA             5.22e-01***
     ς    -2.11e-01**    3.86e-01***    -7.14e-02      NA             1.99e+00***
     t    -4.61e-02***   3.56e-02       1.26e-01       NA             2.12e-01.
Pr(>|t|): *** < 0.001, ** < 0.01, * < 0.05, . < 0.1
C. Full Test Results
C.1. Supervised Method
Table C.1: Test results for test set 1 for all tested rules and all tested maximum depths d̂
dˆ = 0
Rule
dˆ = 2
dˆ = 4
dˆ = ∞
η̄1000
ς¯1000
t̄
η̄1000
ς¯1000
t̄
η̄1000
ς¯1000
t̄
η̄1000
ς¯1000
t̄
RB
0.479
1.97
0.311
0.479
1.97
0.311
0.479
1.97
0.311
0.479
1.97
0.311
HIB
0.479
1.98
0.311
0.479
1.98
0.311
0.479
1.98
0.311
0.479
1.98
0.311
RCC 0.460
2.01
0.308 0.431
2.00
0.301 0.421
2.25
0.304
0.648
2.83
0.321
RDB 0.475
2.07
0.314
0.504
2.29
0.325
2.74
0.342
1.044
4.31
0.416
t1
0.460 1.95
0.311
0.451 1.91 0.310 0.462 1.94
0.313
0.524
1.99
0.348
t2
0.464 1.96
0.311
0.449 1.89 0.308 0.446 1.89 0.307
0.482
1.89 0.327
t3
0.459 1.94 0.309 0.456 1.95 0.310 0.453 1.91 0.310
0.480
1.94 0.332
t4
0.462 1.97 0.310 0.449 1.90 0.308 0.442 1.86 0.306 0.451 1.86 0.318
t5
0.456
η1
0.459 1.95 0.310 0.455 1.94
η2
0.461 1.96
0.311
η3
0.457
0.308 0.424
η4
0.459 1.95
0.311
η5
0.458
2.00
0.307 0.423 1.95 0.299 0.400
2.07
0.298
0.609
2.56
0.329
ς1
0.455
1.99
0.307 0.422 1.94 0.299 0.405
2.07
0.300
0.654
2.66
0.356
ς2
0.451
1.98
0.306 0.425 1.94 0.301 0.407
2.09
0.301
0.656
2.68
0.356
ς3
0.461 1.96 0.310 0.455 1.94 0.310 0.452 1.91
0.311
0.471 1.90 0.322
ς4
0.462 1.96
0.311
0.445 1.90 0.309 0.445 1.88 0.309
0.482
1.93 0.339
ς5
0.466
0.312
0.449 1.89 0.309 0.447 1.90 0.309
0.488
1.87 0.330
2.00
2.01
1.98
0.307 0.428
1.97
0.528
0.300 0.416
0.311
2.20
0.303
0.617
2.75
0.326
0.465 1.97
0.317
0.571
2.23
0.379
0.450 1.92 0.311 0.445 1.91 0.310 0.455 1.90 0.337
1.98
0.301 0.406
2.16
0.301
0.618
2.66
0.341
0.448 1.91 0.310 0.446 1.90 0.310 0.475 1.91 0.336
|Pte | = 1200
Table C.2: Test results for test set 2 for all tested rules and all tested maximum depths d̂
dˆ = 0
Rule
η̄1000 ς¯1000
dˆ = 2
t̄
η̄1000 ς¯1000
dˆ = 4
t̄
η̄1000 ς¯1000
t̄
η̄1000 ς¯1000
t̄
RB
1.17
4.53
0.531
1.17
4.53
0.531
1.17
4.53
0.531
1.17
4.53
0.531
HIB
1.17
4.54
0.532
1.17
4.54
0.532
1.17
4.54
0.532
1.17
4.54
0.532
RCC
1.15
4.55
0.530 1.11
4.65
0.527 1.10
5.00
0.532
1.48
5.79
0.575
RDB
1.20
4.76
0.544
1.24
5.17
0.564
1.31
5.93
0.596
2.34
8.78
0.766
t1
1.17
4.58
0.537
1.17 4.50
0.534
1.18
4.49
0.537
1.43
4.91
0.649
t2
1.17
4.58
0.536
1.16 4.50
0.533
1.18
4.50
0.536
1.34
4.62
0.596
t3
1.18
4.59
0.538
1.19
4.65
0.542
1.25
4.84
0.559
1.68
5.65
0.706
t4
1.17
4.59
0.537
1.16 4.49
0.534
1.18
4.53
0.538
1.28
4.70
0.588
t5
1.18
4.69
0.537
1.19
4.92
0.546
1.24
5.60
0.574
1.96
7.34
0.734
η1
1.18
4.60
0.538
1.19
4.58
0.540
1.24
4.73
0.557
1.63
5.49
0.734
η2
1.18
4.59
0.538
1.18
4.54
0.537
1.20
4.61
0.546
1.32
4.78
0.636
η3
1.15
4.55
0.529 1.10
4.60
0.524 1.10
4.97
0.535
1.49
5.68
0.637
η4
1.18
4.60
0.538
1.19
4.57
0.541
1.22
4.69
0.551
1.47
5.00
0.662
η5
1.13 4.51 0.526 1.13
4.64
0.530 1.15
5.05
0.547
1.80
6.45
0.710
ς1
1.15
0.531 1.14
4.71
0.534
1.17
5.11
0.552
2.04
6.62
0.805
ς2
1.14 4.52 0.527 1.13
4.67
0.531 1.17
5.09
0.552
1.95
6.46
0.776
ς3
1.16 4.53
1.15 4.46 0.529 1.16 4.50
0.535
1.26
4.49 0.573
ς4
1.15 4.50 0.531 1.12 4.35 0.522 1.12 4.36 0.526
1.21
4.31 0.594
ς5
1.18
2.06
6.36
4.57
4.61
0.533
0.539
1.21
4.71
0.548
1.28
|Pte | = 1200
dˆ = ∞
4.94
0.568
0.811
Table C.3: Test results for test set 3 for all tested rules and all tested maximum depths d̂
dˆ = 0
Rule
dˆ = 2
dˆ = 4
dˆ = ∞
η̄1000
ς¯1000
t̄
η̄1000
ς¯1000
t̄
η̄1000
ς¯1000
t̄
RB
0.837
2.72
0.381
0.837
2.72
0.381
0.837
2.72
0.381
0.837 2.72 0.381
HIB
0.830
2.68
0.376
0.830
2.68
0.376
0.830
2.68
0.376
0.830 2.68 0.376
RCC 0.817
2.72
0.378
0.798 2.67 0.373 0.785 2.71 0.372 1.106 2.82 0.431
RDB 0.809
2.72
0.380
0.859
2.90
0.398
0.907
3.15
0.416
1.722 4.99 0.635
t1
0.814
2.70
0.383
0.818 2.68
0.382
0.824 2.71
0.385
1.156 3.30 0.547
t2
0.821
2.73
0.385
0.811
2.69
0.381
0.814 2.70
0.382
1.129 3.31 0.520
t3
0.809
2.69
0.381
0.815
2.71
0.384
0.820 2.72
0.385
0.951 2.84 0.463
t4
0.796 2.65 0.375 0.812 2.68
0.381
0.829 2.74
0.387
1.028 3.08 0.472
t5
0.791 2.65 0.373 0.803
2.70
0.375 0.813 2.87
0.387
1.062 3.16 0.466
η1
0.803 2.66
2.74
0.384
0.841
2.78
0.393
1.199 3.29 0.567
η2
0.803 2.66 0.376 0.807 2.68
0.378
0.832
2.78
0.393
1.035 3.03 0.505
η3
0.800 2.67
0.801
2.71
0.378
0.818 2.82
0.386
1.217 3.12 0.524
η4
0.796 2.64 0.376 0.828
2.77
0.389
0.846
2.80
0.396
1.271 3.51 0.584
η5
0.810
2.71
0.379
0.811
2.73
0.378
0.809 2.83
0.384
1.131 3.13 0.503
ς1
0.805 2.68
0.378
0.817
2.75
0.384
0.819 2.81
0.387
1.193 3.20 0.548
ς2
0.809
2.70
0.380
0.800
2.69
0.376 0.802 2.74
0.382
1.155 3.14 0.535
ς3
0.821
2.72
0.384
0.822
2.70
0.383
0.829 2.74
0.387
1.047 3.00 0.470
ς4
0.822
2.72
0.385
0.823
2.72
0.386
0.847
2.77
0.393
1.151 3.16 0.539
ς5
0.820
2.72
0.384
0.825
2.72
0.385
0.828 2.72
0.388
1.144 3.10 0.525
0.378
0.376
0.826
η̄1000 ς¯1000
t̄
|Pte | = 1200
Table C.4: Test results for test set 4 for all tested rules and all tested maximum depths d̂
dˆ = 0
Rule
η̄1000 ς¯1000
dˆ = 2
t̄
η̄1000 ς¯1000
dˆ = 4
t̄
η̄1000 ς¯1000
t̄
η̄1000
ς¯1000
t̄
RB
15.0
64.2
5.60
15.0
64.2
5.60
15.0
64.2
5.60
15.0
64.2
5.60
HIB
14.9
63.3
5.53
14.9
63.3
5.53
14.9
63.3
5.53
14.9
63.3
5.53
RCC
14.9
64.1
5.59
15.0
65.0
5.65
14.9
64.3
5.58
23.0
78.6
8.26
RDB
15.2
65.2
5.68
15.9
68.7
5.93
16.5
71.7
6.17
37.7
129.4 11.50
t1
15.1
64.8
5.65
14.8 63.1
5.54
15.2
64.7
5.68
24.8
83.2
11.00
t2
14.9
63.8
5.59
14.8 62.8 5.52 14.8 62.4 5.51
20.5
71.3
8.35
t3
15.0
63.8
5.59
15.0
63.8
5.59
15.3
5.71
21.0
76.0
9.23
t4
15.0
64.2
5.61
14.8 63.1
5.54
14.8 62.9 5.52
18.2
70.3
7.73
t5
15.0
64.7
5.64
15.0
64.9
5.65
15.2
65.6
5.69
22.1
77.0
9.05
η1
14.9
63.9
5.59
14.9
63.9
5.60
15.3
65.6
5.74
26.8
96.2
12.94
η2
14.9
63.6
5.56
14.6 62.6 5.48 14.8 63.2
5.55
18.1
69.8
8.86
η3
14.9
63.8
5.57
14.4 61.7 5.40 14.6 63.1 5.49
20.6
69.3
9.32
η4
14.9
63.7
5.57
14.8 63.2
5.57
20.0
74.0
9.26
η5
14.8 63.3 5.53 14.5 62.3 5.44 14.5 62.5 5.45
19.8
66.5
8.95
ς1
14.8
14.5 62.1 5.43 14.4 62.2 5.43
20.6
67.9
10.36
ς2
14.7 63.3 5.52 14.5 62.4 5.44 14.5 62.7 5.47
20.2
67.8
10.16
ς3
14.8 63.2 5.53 14.9 63.1
5.60
19.3
70.5
7.68
ς4
14.8
63.3
5.54
14.7 62.4 5.48 14.8 62.8 5.52
19.7
70.2
9.44
ς5
14.9
63.3
5.55
14.7 62.7 5.50 14.8 63.1
20.5
69.8
8.76
63.6
5.56
5.54
5.54
14.8
15.0
|Pte | = 600
dˆ = ∞
65.2
63.6
63.9
5.55
C.2. Non-Supervised Method
Tables C.5 through C.7 show the test results for all of the rules R_te used for testing.
For each rule, the values shown correspond to the performance measure by which it
ranked among the highest. For comparison, the performances of the hybrid
inference/reliability branching rule (HIB), the reliability branching rule (RB) and, for test
sets 1 through 4, the random branching rule (RDB), are presented. Shown for each rule
considered are the rule type, the number of training problems |P| iterated over and the
type fυ of fitness measure optimized, denoted by η for nodes and ς for iterations. The
rules are ordered by increasing Oy of eq. 6.1, so that, according to the validation results,
the better performing rules appear above the less well performing rules.
Shown in bold are results for rules that outperform both the HIB and RB rules. Over-
scored numbers indicate a difference in median from the better performing SCIP rule at
the 95% significance level according to the Wilcoxon signed-rank test. The resulting
p-values from the significance tests are not adjusted for multiple comparisons. All values
for nodes and iterations are shown in thousands, while the runtime is in seconds and the
average gap in percentages.
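The paired Wilcoxon signed-rank comparison behind the significance markings can be sketched as follows; this is a minimal normal-approximation version with our own naming, not the exact routine used to produce the tables.

```python
# Paired Wilcoxon signed-rank test: rank the absolute per-problem differences
# between two rules, sum the ranks of the positive differences, and use the
# normal approximation for a two-sided p-value.
import math

def wilcoxon_signed_rank(a, b):
    d = [x - y for x, y in zip(a, b) if x != y]      # drop zero differences
    ranked = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    i = 0
    while i < len(ranked):                           # average ranks over ties
        j = i
        while j + 1 < len(ranked) and abs(d[ranked[j + 1]]) == abs(d[ranked[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[ranked[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    n = len(d)
    mean, var = n * (n + 1) / 4, n * (n + 1) * (2 * n + 1) / 24
    z = (w_plus - mean) / math.sqrt(var)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p
```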
C.2.1. Average Number of Nodes
Table C.5 shows the average number of nodes η̄ generated over all test sets, in thousands,
for twelve of the highest-ranking rules according to Oy=η of eq. 6.1.
Table C.5: Average number of generated nodes η̄ in thousands (time in seconds)/[proportional gap in percentages]
Set 5 (tmax = 500 sec)
Rule/
Type
|P|
RB
HIB
RDB
EB4
EB3
EB1
EB1
EB1
EB1
EB1
EB1
EB4
EB1
EB4
EB4
5k
5k
5k
1k
5k
5k
1k
5k
5k
1k
5k
10k
Rule/
Type
|P|
RB
HIB
RDB
EB4
EB3
EB1
EB1
EB1
EB1
EB1
EB1
EB4
EB1
EB4
EB4
5k
5k
5k
1k
5k
5k
1k
5k
5k
1k
5k
10k
dim ω fυ
5
16
5
5
5
5
5
5
5
5
5
5
η
η
η
η
η
ς
η
η
η
η
η
η
5
5
4
5
2
6
6
1
7
3
8
4
dim ω fυ
r
5
16
5
5
5
5
5
5
5
5
5
5
η
η
η
η
η
ς
η
η
η
η
η
η
Number of problems:
r
5
5
4
5
2
6
6
1
7
3
8
4
Set 1
easy
hard
0.054
0.054
0.196
0.051
0.051
0.054
0.055
0.055
0.054
0.055
0.054
0.052
0.054
0.053
0.051
0.506
0.512
1.076
0.490
0.478
0.502
0.500
0.497
0.499
0.502
0.505
0.489
0.506
0.485
0.490
Set 2
easy
hard
0.127
0.127
0.343
0.123
0.122
0.125
0.125
0.125
0.125
0.126
0.125
0.125
0.124
0.124
0.124
1.01
1.01
1.96
0.99
0.99
1.01
1.01
1.01
1.00
1.01
1.01
1.00
1.00
0.99
1.00
500
500
100×
{5, 10, 15}
100×
{20, 25, 30}
250 × {5, 10}
500 × 5
easy
Set 4
hard
1.96
1.99
4.35
1.97
1.96
2.04
2.01
2.10
2.07
2.03
2.02
1.97
2.08
1.94
1.98
14.4 (5.74)
14.4 (5.75)
33.2 (10.85)
14.1 (5.54)
14.2 (5.84)
14.6 (5.89)
14.3 (5.69)
14.7 (5.81)
14.4 (5.70)
14.5 (5.83)
14.3 (5.63)
14.1 (5.46)
14.6 (5.80)
14.2 (5.54)
14.2 (5.51)
33.8 (7.56)
32.2 (7.34)
NA
31.0 (6.94)
32.2 (7.26)
32.6 (7.25)
31.9 (7.25)
32.7 (7.25)
32.7 (7.34)
32.5 (7.36)
32.6 (7.17)
31.7 (6.82)
31.5 (7.14)
33.3 (7.36)
32.4 (7.41)
172 (49.3)
172 (48.1)
NA
178 (50.0)
176 (47.7)
176 (50.3)
173 (48.9)
172 (48.7)
175 (49.7)
175 (49.7)
175 (49.4)
181 (50.2)
174 (49.5)
180 (49.5)
178 (47.3)
285 (77.9)
295 (80.6)
NA
304 (77.8)
304 (76.2)
290 (79.0)
306 (78.3)
289 (78.0)
291 (77.5)
302 (79.5)
299 (79.9)
300 (76.1)
290 (77.5)
308 (77.9)
311 (76.4)
easy
Set 3
hard
250×
{15, 20, 25}
250 × 30
500 × {10, 15}
500×
{20, 25, 30}
921 [0.182]
928 [0.182]
NA
978 [0.183]
1126 [0.192]
996 [0.205]
867 [0.222]
956 [0.191]
966 [0.211]
842 [0.225]
957 [0.184]
1046 [0.184]
907 [0.230]
999 [0.190]
997 [0.194]
680 [0.150]
675 [0.155]
NA
744 [0.154]
826 [0.149]
666 [0.153]
705 [0.155]
675 [0.154]
666 [0.157]
685 [0.159]
676 [0.151]
793 [0.153]
672 [0.160]
756 [0.152]
787 [0.144]
349 [0.171]
348 [0.165]
NA
358 [0.174]
392 [0.157]
359 [0.154]
359 [0.173]
344 [0.155]
358 [0.164]
368 [0.171]
366 [0.152]
398 [0.176]
373 [0.173]
361 [0.176]
361 [0.177]
90
90
90
0.168 2.31 (0.975)
0.167 2.31 (0.984)
0.379 4.77 (1.744)
0.169 2.29 (0.970)
0.169 2.33 (1.067)
0.166 2.38 (1.015)
0.165 2.33 (0.985)
0.168 2.33 (0.981)
0.165 2.34 (0.988)
0.168 2.36 (1.010)
0.166 2.31 (0.979)
0.165 2.30 (0.974)
0.167 2.36 (0.996)
0.167 2.32 (0.968)
0.168 2.33 (0.968)
500
500
C.2.2. Average Number of Simplex Iterations
Table C.6 shows the average number of simplex iterations ς̄ performed over all test sets,
in thousands, for twelve of the highest-ranking rules according to Oy=ς of eq. 6.1.
Table C.6: Average number of performed simplex iterations ς¯ in thousands (time in seconds)/[proportional gap in percentages]
Set 5 (tmax = 500 sec)
Rule/
Type
|P|
RB
HIB
RDB
EB4
EB3
EB2
EB3
EB2
EB4
EB4
EB2
EB3
EB4
EB4
EB4
10k
5k
5k
5k
5k
10k
1k
5k
5k
5k
5k
10k
Rule/
Type
|P|
RB
HIB
RDB
EB4
EB3
EB2
EB3
EB2
EB4
EB4
EB2
EB3
EB4
EB4
EB4
10k
5k
5k
5k
5k
10k
1k
5k
5k
5k
5k
10k
dim ω fυ
5
16
8
16
8
5
5
8
16
5
5
5
ς
ς
ς
ς
ς
ς
ς
ς
ς
ς
ς
ς
dim ω fυ
5
16
8
16
8
5
5
8
16
5
5
5
ς
ς
ς
ς
ς
ς
ς
ς
ς
ς
ς
ς
Number of problems:
r
9
2
7
5
8
7
3
3
1
8
4
2
r
9
2
7
5
8
7
3
3
1
8
4
2
easy
Set 4
hard
100×
{5, 10, 15}
100×
{20, 25, 30}
250 × {5, 10}
500 × 5
1.98
2.01
4.32
1.88
1.89
1.90
1.95
1.85
1.87
1.87
1.89
1.87
1.89
1.89
1.89
8.25
8.33
16.35
7.68
7.64
7.60
7.74
7.49
7.55
7.77
7.61
7.60
7.59
7.76
7.51
60.5 (5.74)
60.6 (5.75)
112.9 (10.85)
54.0 (5.55)
54.6 (6.06)
54.8 (5.94)
54.1 (5.87)
54.0 (5.79)
54.7 (5.61)
55.1 (5.48)
55.4 (5.94)
54.4 (5.85)
54.2 (5.43)
54.9 (5.59)
53.7 (5.54)
165 (7.56)
156 (7.34)
NA
137 (6.79)
140 (7.21)
148 (7.30)
149 (7.41)
144 (6.99)
137 (6.76)
138 (6.69)
142 (7.00)
143 (7.03)
138 (6.93)
139 (7.07)
141 (7.00)
1057 (49.3)
1050 (48.1)
NA
961 (47.7)
962 (46.4)
1016 (50.3)
1000 (47.5)
968 (47.1)
979 (47.9)
990 (47.1)
978 (47.3)
964 (45.1)
970 (45.5)
980 (46.9)
972 (47.9)
1381 (77.9)
1398 (80.6)
NA
1360 (84.1)
1368 (84.5)
1306 (76.6)
1303 (79.9)
1280 (75.7)
1335 (81.9)
1304 (79.2)
1178 (72.8)
1302 (78.4)
1405 (81.2)
1373 (80.5)
1304 (81.0)
Set 2
easy hard
easy
Set 3
hard
250×
{15, 20, 25}
250 × 30
500 × {10, 15}
500×
{20, 25, 30}
Set 1
easy hard
0.343
0.344
0.948
0.353
0.350
0.353
0.355
0.355
0.355
0.349
0.352
0.350
0.352
0.357
0.353
0.543
0.544
1.610
0.542
0.535
0.539
0.538
0.546
0.537
0.544
0.541
0.546
0.547
0.545
0.543
3.80
3.79
7.45
3.52
3.54
3.54
3.56
3.55
3.47
3.53
3.54
3.59
3.51
3.50
3.51
0.680
0.675
1.337
0.661
0.660
0.674
0.686
0.680
0.659
0.674
0.684
0.673
0.669
0.666
0.683
6.99 (0.975)
7.01 (0.984)
12.66 (1.744)
6.52 (0.970)
6.81 (1.122)
6.56 (1.084)
6.52 (1.087)
6.54 (1.061)
6.54 (0.972)
6.58 (0.964)
6.85 (1.113)
6.64 (1.065)
6.64 (0.970)
6.51 (0.963)
6.72 (1.008)
8021 [0.182]
8174 [0.182]
NA
7814 [0.231]
7863 [0.196]
7921 [0.248]
8389 [0.208]
7827 [0.226]
7062 [0.242]
7370 [0.232]
7460 [0.229]
8024 [0.202]
7647 [0.234]
7364 [0.230]
7529 [0.221]
7125 [0.150]
7129 [0.155]
NA
6955 [0.153]
6809 [0.167]
NA
6790 [0.159]
NA
7303 [0.155]
NA
NA
6970 [0.162]
7378 [0.152]
7447 [0.156]
7042 [0.153]
5992 [0.171]
5949 [0.165]
NA
6515 [0.175]
6577 [0.169]
NA
6425 [0.166]
NA
6743 [0.182]
NA
NA
6522 [0.165]
6604 [0.177]
6656 [0.180]
6793 [0.178]
500
500
500
500
90
90
90
C.2.3. Average Runtime
Table C.7 shows the average runtime t̄ in seconds and the relative gap %̄_LP in percentages
over all test sets, for twelve of the highest-ranking rules according to Oy=t of eq. 6.1.
Table C.7: Average computation times t̄ in seconds/[proportional gap %̄LP in percentages]
Set 5 (tmax = 500 sec)
Rule/
Type
|P|
dim ω
RB
HIB
RDB
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
1k
5k
10k
10k
10k
5k
10k
5k
10k
10k
5k
5k
5
5
5
5
5
5
5
5
5
5
5
5
ς
ς
η
η
ς
ς
η
ς
ς
η
η
η
3
8
4
7
7
7
6
9
9
3
5
8
Rule/
Type
|P|
dim ω
fυ
r
RB
HIB
RDB
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
EB4
1k
5k
10k
10k
10k
5k
10k
5k
10k
10k
5k
5k
5
5
5
5
5
5
5
5
5
5
5
5
fυ
ς
ς
η
η
ς
ς
η
ς
ς
η
η
η
Number of problems:
r
3
8
4
7
7
7
6
9
9
3
5
8
Set 1
easy
hard
0.150
0.150
0.171
0.150
0.151
0.150
0.150
0.151
0.151
0.150
0.151
0.151
0.150
0.151
0.151
0.357
0.359
0.480
0.350
0.351
0.351
0.352
0.351
0.353
0.352
0.352
0.354
0.352
0.354
0.350
Set 3
easy
hard
Set 2
easy
hard
1.00
1.01
1.59
0.99
0.98
1.00
0.99
0.99
0.98
1.00
0.99
1.00
1.00
1.00
0.98
5.74
5.75
10.85
5.48
5.43
5.51
5.48
5.61
5.45
5.58
5.47
5.55
5.61
5.54
5.54
Set 4
easy
hard
100×
{5, 10, 15}
100×
{20, 25, 30}
250 × {5, 10}
500 × 5
7.56
7.34
NA
6.69
6.93
7.41
7.23
6.76
6.99
7.25
7.19
6.79
7.46
6.94
7.36
49.3
48.1
NA
47.1
45.5
47.3
48.1
47.9
45.9
48.1
45.1
47.7
47.8
50.0
49.5
77.9
80.6
NA
79.2
81.2
76.4
77.5
81.9
74.6
77.6
77.2
84.1
77.7
77.8
77.9
250×
{15, 20, 25}
250 × 30
500 × {10, 15}
500×
{20, 25, 30}
0.199
0.199
0.241
0.198
0.198
0.199
0.197
0.197
0.199
0.198
0.198
0.197
0.198
0.198
0.199
0.505
0.504
0.693
0.491
0.490
0.502
0.504
0.489
0.496
0.502
0.493
0.493
0.502
0.500
0.500
0.130
0.130
0.190
0.132
0.132
0.130
0.129
0.129
0.132
0.129
0.132
0.132
0.130
0.130
0.131
0.975
0.984
1.744
0.964
0.970
0.968
0.982
0.972
0.968
0.967
0.983
0.970
0.981
0.970
0.968
0.182 [0.182]
0.182 [0.182]
NA
0.232 [0.232]
0.234 [0.234]
0.194 [0.194]
0.186 [0.186]
0.242 [0.242]
0.231 [0.231]
0.188 [0.188]
0.239 [0.239]
0.231 [0.231]
0.190 [0.190]
0.183 [0.183]
0.190 [0.190]
0.150 [0.150]
0.155 [0.155]
NA
NA
0.152 [0.152]
0.144 [0.144]
0.157 [0.157]
0.155 [0.155]
0.154 [0.154]
0.157 [0.157]
0.152 [0.152]
0.153 [0.153]
0.151 [0.151]
0.154 [0.154]
0.152 [0.152]
0.171 [0.171]
0.165 [0.165]
NA
NA
0.177 [0.177]
0.177 [0.177]
0.176 [0.176]
0.182 [0.182]
0.174 [0.174]
0.171 [0.171]
0.180 [0.180]
0.175 [0.175]
0.175 [0.175]
0.174 [0.174]
0.176 [0.176]
500
500
500
500
90
90
90