Soft Comput
DOI 10.1007/s00500-015-1878-z
METHODOLOGIES AND APPLICATION

Analysing and characterising optimization problems using length scale

Rachael Morgan · Marcus Gallagher

© Springer-Verlag Berlin Heidelberg 2015
Abstract Analysis of optimization problem landscapes is
fundamental in the understanding and characterisation of
problems and the subsequent practical performance of algorithms. In this paper, a general framework is developed for
characterising black-box optimization problems based on
length scale, which measures the change in objective function
with respect to the distance between candidate solution pairs.
Both discrete and continuous problems can be analysed using
the framework; however, in this paper, we focus on continuous optimization. Length scale analysis aims to efficiently
and effectively utilise the information available in black-box
optimization. Analytical properties regarding length scale are
discussed and illustrated using simple example problems. A
rigorous sampling methodology is developed and demonstrated to improve upon current practice. The framework is
applied to the black-box optimization benchmarking problem set, and shows greater ability to discriminate between
the problems in comparison to seven well-known landscape
analysis techniques. Dimensionality reduction and clustering techniques are applied comparatively to an ensemble of
the seven techniques and the length scale information to gain
insight into the relationships between problems. A fundamental summary of length scale information is an estimate
of its probability density function, which is shown to capture
salient structural characteristics of the benchmark problems.
Communicated by V. Loia.

Rachael Morgan ([email protected])
Marcus Gallagher ([email protected])
School of Information Technology and Electrical Engineering,
University of Queensland, Brisbane 4072, Australia
Keywords Length scale · Fitness landscape analysis ·
Black-box optimization · Search space diagnostics
1 Introduction
The black-box optimization problem setting has been the
focus of a large amount of research in evolutionary computation and other fields. Theoretically, it is well-known that
under mild conditions, all algorithms perform equally on
average across all possible problems (Macready and Wolpert
1996). In stark contrast, thousands of algorithms have been
proposed in the literature and have shown good performance
on a range of benchmark and other problems. Indeed, it would
be expected that a hill-descending/local search algorithm
would easily outperform a hill-climbing algorithm on any
reasonable set of minimization problems. Here, the definition of reasonable is critical; it has been proven that the “No
Free Lunch” scenario does not hold if mild conditions (e.g.
smoothness) are imposed on the space of optimization problems (Whitley and Watson 2005). Therefore, understanding
the relationship between optimization algorithms and problems of interest emerges as a key research question for both
theory and practice. Well-defined and measurable problem
properties and characteristics are essential components in
facilitating this understanding.
There have been many attempts to characterise or measure
properties of optimization problems in the literature. In mathematical programming, key assumptions are made regarding
the objective function (e.g. convexity, differentiability) and
algorithms are developed to take full advantage of these properties (Fletcher 1987). By doing so, the algorithms implicitly
assume a “model” of the landscape. Since their behaviour
on the model problem can often be understood theoretically, their performance on real-world/black-box problems
can be viewed as a reflection of how closely the problem
landscape matches the problem model. In metamodel or
surrogate-based optimization (known as fitness approximation in evolutionary computation), an explicit model is built
(e.g. via regression) from a sample of solutions evaluated
using the true objective function (assumed to be expensive)
(Jin 2005; Forrester et al. 2008; Bartz-Beielstein et al. 2010).
Search is then performed more economically on the surrogate model. If the model matches the problem, good solutions
on the model should correspond to good solutions on the
true objective function. In metaheuristics, researchers have
developed practical fitness landscape analysis techniques and
derived other metrics for measuring problem “difficulty”.
As a longer-term goal, if problems can be characterised
and robust problem metrics developed, then this information
could be used to partially or fully automate the algorithm
selection problem (Rice 1976; Smith-Miles 2008).
In this paper, we propose a new approach to analysing
and characterising black-box optimization problems. The
essence of the work is remarkably simple: the notion of length
scale as the observed change in the objective function with
respect to the distance between two candidate solutions. In
contrast to previous research, our work builds on efficiently
and effectively utilising the information available in the
black-box setting, rather than deriving features based on specific characteristics of algorithms or conjecture about what
constitute important landscape properties. Section 2 gives
an overview of relevant literature on optimization problem
analysis. Since our work is focused on (but is not limited to)
continuous problems, we review and discuss the key issues in
this setting (extensive recent reviews for the discrete case are
cited). Section 3 formally introduces length scale, discusses
its fundamental properties and presents examples. Related
work based on similar quantities (i.e. finite differences and
Lipschitz constants) but with fundamentally different aims is
also discussed. Statistical and information-theoretic methods
to analyse, summarise and interpret length scale information
are discussed and illustrated on simple example problems in
Sect. 4. Considerations for sampling length scales in practice are discussed in Sect. 5, and a methodology is developed
and compared to commonly used sampling approaches. In
Sect. 6, the length scale framework is applied to the noiseless
black-box optimization benchmarking (BBOB) problem set
(Hansen et al. 2010) and compared with seven well-known
landscape analysis techniques. Conclusions and avenues for
future work are discussed in Sect. 7. This paper is an extension of the work presented in Morgan and Gallagher (2012),
with the addition of significant theoretical properties of
length scale, a rigorous methodology for sampling length
scales, experimental comparisons between landscape analysis techniques in the literature, and powerful extensions to the
framework that allow clustering and visualisations of problem similarity.
2 Analysing and characterising optimization problems
It is widely accepted that the comparative performance of
algorithms may vary across problems within a particular
problem class (Macready and Wolpert 1996). In light of
this, there has been an increasing interest in developing
features/properties to characterise problem structure and to
relate these features to the behaviour and performance of
algorithms (Hutter et al. 2006; Smith-Miles 2008).
A considerable amount of problem analysis has been
developed for combinatorial optimization problems. Predominantly, the analysis of the travelling salesman problem
(TSP) has resulted in a number of features that utilise
problem-specific knowledge and have been shown to contribute to problem difficulty (Zhang and Korf 1996; Zhang
2004; Gent and Walsh 1996; Stadler and Schnabl 1992;
van Hemert 2005). For example, Cheeseman et al. (1991)
and Ridge and Kudenko (2007) show that increasing the
standard deviation of the distances between cities for randomly generated TSP instances increases problem difficulty
for a number of algorithms. Features such as these are very
problem-specific and in most cases cannot be easily transferred to other problem classes. Hence, problem-specific
features have been developed for other problem classes such
as graph colouring, boolean satisfiability, time-tabling and
knapsack problems. Comprehensive reviews of discrete fitness landscape and problem-specific techniques can be found
in (Pitzer and Affenzeller 2012; Reidys and Stadler 2002;
Smith-Miles and Lopes 2011; Talbi 2009).
Problem-specific features are very insightful and useful,
however, they do not allow comparisons between black-box problems. Typically, problem-specific features cannot be
applied to different problem classes. No domain knowledge
is available in the black-box scenario; therefore, analysis
should be based on candidate solutions, x, from the feasible
search space, S, and their respective objective function
values, f(x). The notion of the objective function as a
(d-dimensional) landscape or surface defined over S is intuitive
and can be found across the literature in optimization and
related areas. A discrete fitness landscape is formally defined
by f and a graph, G, representing S. The edges in G can
be defined by a given move operator and induce a neighbourhood
in the space. Topological properties of the fitness
landscape can be precisely defined in this framework; for
example, a (strict) local optimum is a point x where all
neighbouring solutions have worse fitness than f(x). For
any discrete or combinatorial problem instance, it is possible to computationally determine whether or not x is a
local optimum by exhaustive evaluation of all neighbouring
points. However, complete enumeration of the search space
is often impractical due to the finite, but very large number of
candidate solutions. Hence, fitness landscape analysis techniques
typically employ random, statistical or other sampling
methods to examine a set of points of interest (and/or their
fitness values) from a landscape. Features based on the landscape metaphor are inherently very general and can also be
applied to problems of different classes and/or non-black-box
problems.
In a continuous search space, topological landscape features that are conceptually similar to the discrete case can be
defined mathematically [see Pitzer and Affenzeller (2012)
for a review], but evaluating these features on a real problem instance is problematic. Contrary to discrete problems,
each solution in continuous space has an infinite number of neighbours in theory, and a finite but extremely
large number in practice due to finite-precision representation of floating-point numbers. Hence, discrete problem
features that are reliant on neighbourhood information—
including autocorrelation (Weinberger 1990), correlation
length (Stadler 1996), fitness evolvability portraits (Smith
et al. 2002), fitness clouds (Collard et al. 2004) and variants of information content (Vassilev et al. 2000; Steer et al.
2008)—must introduce additional assumptions and parameters to be used in a continuous space, such as the size
and distribution of the neighbourhood, as well as methods for adequately sampling the neighbourhood. Common
recommendations and approaches include sampling from a
(bounded) uniform neighbourhood distribution, discretising
the space (Smith et al. 2002), varying the neighbourhood size
(Malan and Engelbrecht 2009) and utilising algorithm trajectories (Muñoz et al. 2012b). However, the validity of these
assumptions and their effect on empirical results is not well
understood.
Another significant difference between discrete and continuous landscapes is tied to the distance between points in
S (using some appropriate metric). For a discrete landscape,
the minimum possible pairwise distance will occur between a
point and one of its neighbours. There will also be a finite set
of possible distance values between all pairs of vertices in G.
For a continuous landscape, the minimum distance between
points can be made arbitrarily small (in practice until the
limit of precision is reached) and the number of possible
distance values is infinite. Consider a problem with binary
representation, S = {0, 1}^d. To solve the problem we determine
whether each variable x_i ∈ x should take the value 0 or
1, and there is no notion of the scale of x_i. For a continuous
problem, however, finding an appropriate scale for each x_i
is important (e.g. does f(x) vary in a significant way with
changes in x_i of order 10^3, 10^{-3} or 10^{-30}?). Discrete fitness
landscape analysis techniques do not capture such information because it is not relevant for the discrete case.
Some researchers have adapted landscape analysis techniques by modifying the distance metrics used. For example,
fitness distance correlation (FDC) was originally proposed
using Hamming distance (Jones and Forrest 1995), however,
it is generally used with Euclidean distance in the continuous setting (Gallagher 2000; Müller and Sbalzarini 2011;
Muñoz et al. 2012a). Problem analysis techniques originating in the continuous domain—e.g. Dispersion, a measure of
the average distance between pairs of high quality solutions
(Lunacek and Whitley 2006)—also typically utilise Euclidean distance.
Gallagher calculated FDC for the error surface (i.e. fitness landscape) of the training problem for a multi-layer
perceptron neural network (Gallagher 2000, 2001). For the
specific learning task considered (student–teacher model),
the global optimum was known, however, this would not
normally be the case for a neural network training problem. Solutions were sampled within a specified range around
the global optimum. Wang and Li calculated the FDC of
a continuous NK-landscape model and on some standard
test problems (Wang and Li 2008). The FDC of the CEC
2005 benchmark problem set calculated on solutions uniformly sampled from S has also been analysed (Müller and
Sbalzarini 2011). While the results from these papers show
some interesting structure and differences between problems,
the limitations of FDC that have been noted for discrete problems remain (e.g. Müller and Sbalzarini (2011) conclude that
FDC alone is not sufficient for problem design or measuring problem difficulty). How the results are impacted by the
numerous experimental/implementation details that are specific to the continuous case (see the discussion above) remain
unknown. Information content and its variants have also been
adapted and analysed on a variety of continuous problems.
In Malan and Engelbrecht (2009), variants of information
content are estimated on seven continuous benchmark problems in 1 and 30 dimensions using an increasing-step random
walk, while Muñoz et al. (2012b) used samples generated by
instances of a (1+1) covariance matrix adaptation evolution
strategy (CMA-ES), particle swarm optimization (PSO) and
random search to estimate variants of information content
on 2D continuous problems from the BBOB problem set
(Hansen et al. 2010).
While many techniques have been adapted from discrete to continuous, some landscape analysis techniques
have originated from the continuous problem domain. One
such technique is dispersion, which indicates the degree
to which high-quality solutions are concentrated/clustered,
where quality is determined by sampling n solutions and
using truncation selection to retain the fittest tn solutions,
where t ∈ (0, 1]. Dispersion has been shown to be a useful metric in studying the performance of algorithms relative
to particular problems (and their structure); CMA-ES, hybrid
PSO / CMA-ES algorithms, pattern search methods and local
search have been analysed on a number of benchmark problems (Lunacek and Whitley 2006; Whitley et al. 2006; Müller
et al. 2009). The dispersion values of 2D, 5D, 10D and
20D problems from BBOB’10 have also been used in the
feature-set of an algorithm prediction model (Muñoz et al.
2012b). Dispersion makes only limited use of the fitness
values of solutions via the value of t used to produce the
sample of solutions. Pairwise distances between solutions
have also been analysed in samples of approximate local
minima for multi-layer perceptron training (Gallagher 2000;
Gallagher et al. 2002) and for combinatorial problems (Solla
et al. 1986). Mersmann et al. (2011) proposed six feature
classes (with 50 features in total) that aim to characterise
the structure of continuous problems. The features use various statistical and machine learning techniques to indicate
the degree of separability, global structure, modality, global
to local optima contrast, basin size homogeneity, search
space homogeneity, variable scaling and plateaus in the landscape. The 50 features were used to discriminate between the
expert-assigned problem categories in the BBOB problem
set.
Regardless of whether techniques originate in the discrete
or continuous domain, there are some important issues that
limit the ability of the techniques to characterise problems.
Arguably the most pertinent is the failure of existing techniques to utilise all information available in the black-box
scenario. This stems from the approaches commonly used
in the development of new landscape analysis techniques.
Designers typically aim to capture a particular problem structure of interest. For example, the density of states (Rosé
et al. 1996) estimates a probability distribution of objective
function values—no other structures or topological features
within the landscape are intended to be captured. With a
given structure/topological property of interest, designers
then decide how best to capture the structure (e.g. what
type of sample should be used as well as how the sample
should be filtered and analysed). Herein lies a major issue;
by only considering a certain aspect of landscape structure
and by filtering the information available, information that
describes (and potentially differentiates) a landscape may
be lost. Furthermore, many techniques compress the information gleaned into a single scalar value (e.g. correlation
length) representing the structural feature of interest (e.g.
ruggedness). Such compression leads to further information
loss. Hence, in the context of characterising and differentiating problems, existing problem features (when considered
individually) are inherently limited. Features have been combined in an attempt to characterise problems [see Mersmann
et al. (2011) or the literature concerning automatic algorithm
selection, e.g. (Hutter et al. 2006)], however, the coverage
of structural information and interaction between features
is complex and remains unclear. Open questions include
whether existing techniques encapsulate the same information (leading to redundancy in the ensemble of features), and
whether the ensemble of features captures all of the information required to characterise problems. Indeed, selecting
a subset of relevant landscape features is analogous to the
feature selection problem in machine learning (Guyon and
Elisseeff 2003).
In summary, there is significant need for developing more
powerful analysis techniques for black-box continuous optimization problems. Existing methods have a number of
limitations due to their (original) discrete formulation and/or
their specific foci. In the next section, we propose a new
approach towards addressing these concerns.
3 Length scale
A continuous fitness landscape is defined by three components: a search space, S ⊆ R^n, a distance metric, d :
S × S → R, that defines a notion of similarity between
solutions, and an objective function, f : S → R (Stadler
2002; Pitzer et al. 2012). Together, the search space and
distance metric form a metric space. We aim to develop a
framework that can be used to study the characteristics of a
landscape independent of any particular algorithm, but that
can also be potentially used to evaluate algorithm behaviour.
We make no assumptions regarding the structure within landscapes, nor do we intend to capture pre-defined landscape
properties. Rather, we utilise all information provided in the
black-box setting (i.e. the ability to evaluate f at any point
x ∈ S) to implicitly capture the landscape structure required
to characterise and differentiate problems. The framework
can be applied in practice and is amenable to statistical and
information-theoretic analysis.
Definition 1 Let x_i and x_j be two distinct (x_i ≠ x_j) solutions
in S with corresponding objective function values f(x_i)
and f(x_j). The length scale, r ∈ [0, ∞), is defined as

r = \frac{|f(x_i) - f(x_j)|}{d(x_i, x_j)}    (1)
The length scale intuitively measures how much the objective function changes with respect to a step between two
points in the search space. Because r relies on two feasible solutions, it is applicable to both unconstrained and
constrained problems (assuming feasible solution pairs can
be sampled), as well as bounded and unbounded problems.
Length scale is defined as a magnitude over a finite interval
in the search space; directional information regarding a step
from xi to/from x j is not considered. The development of
the length scale framework is motivated by the analysis of
continuous problems, however, because r relies solely on S,
d and f , length scale analysis of discrete problems is equally
possible.
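To make Eq. 1 concrete, the following minimal Python sketch (our own illustration, not the authors' released code; NumPy and a Euclidean metric are assumed) computes r for a pair of candidate solutions:

```python
import numpy as np

def length_scale(x_i, x_j, f, d=None):
    """Length scale r between two distinct solutions (Eq. 1).

    f is the objective function; d is the distance metric
    (Euclidean if not supplied)."""
    if d is None:
        d = lambda a, b: np.linalg.norm(a - b)
    dist = d(x_i, x_j)
    assert dist > 0, "solutions must be distinct"
    return abs(f(x_i) - f(x_j)) / dist

# Example 2 check: for f(x) = x^2, r should equal |x_i + x_j|
f = lambda x: float(x[0] ** 2)
r = length_scale(np.array([0.2]), np.array([0.5]), f)
print(r)  # 0.7 = |0.2 + 0.5|
```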
In some cases, a simple expression for all length scales
of a landscape can be derived, as illustrated by the following
examples. The resulting expressions may be either used to
infer structure directly (as in the following examples), or
indirectly via further analysis techniques (see Sect. 4.2).
Example 1 1D linear function
Given f(x) = ax where x, a ∈ R and d(x^i, x^j) = |x^i − x^j|,
the length scale between x^i and x^j is |a|. Here,
r captures the intuition that any step in S will be accompanied by a proportional change in f . The length scale of a
d-dimensional neutral/flat landscape (i.e. f (x) = c, c ∈ R)
is also a special case of this where r = 0.
For most continuous problems, r will not be a constant
over S. In different regions of the space, r will depend on the
local topology of the fitness landscape (varying slope, basins
of attraction, ridges, saddle points, etc.).
Example 2 1D quadratic (sphere) function
Given f(x) = ax^2 where x, a ∈ R and d(x^i, x^j) = |x^i − x^j|,
the length scale between x^i and x^j is:

r = \frac{|f(x^i) - f(x^j)|}{d(x^i, x^j)}
  = \frac{|a(x^i)^2 - a(x^j)^2|}{|x^i - x^j|}
  = \frac{|a||(x^i - x^j)(x^i + x^j)|}{|x^i - x^j|}
  = |a||x^i + x^j|
The derived expression for r indicates that the relative
change in fitness is dependent on the location of x i and x j .
For example, r is small only when both |x i | and |x j | are small
(near the optimum, x = 0) or x i ≈ −x j (directly across the
basin). This suggests that an algorithm needs to reduce its
step size to successfully approach the optimum (e.g. gradient
descent can work well on such a problem because the gradient
smoothly approaches 0 at the optimum).
The derivation of a simple and intuitive length scale
expression is not always feasible, in particular if the problem
is black-box and its definition is unknown. Instead, length
scale values can be sampled from the landscape, and the
sample analysed to gain insight into the problem. A number of suitable analysis techniques are given in Sect. 4, and
sampling considerations and methodologies are discussed in
Sect. 5.
3.1 Length scale distribution
By considering r as a continuous random variable, the length
scales in a landscape can be summarised by their distribution.
Definition 2 The length scale distribution is defined as the
probability density function p(r ).
The length scale distribution, p(r ), describes the probability of observing different length scale values for a given
fitness landscape. Consider Example 1 (1D linear function);
since r = |a|, p(r) is a Dirac delta function with a spike at
r = |a|. The d-dimensional flat function also results in a
Dirac delta function with a spike at r = 0.
Now reconsider Example 2 (1D quadratic function), where
r = |a||x^i + x^j|. Assuming uniform enumeration of S, r is
(up to the factor |a|) the absolute value of the sum of two
independent, continuous uniform random variables. Introducing
lower (b_l) and upper (b_u) bounds for S, let Z = X + Y, where
X, Y ∼ U[b_l, b_u] are independent. The distribution for Z is
triangular (Grinstead and Snell 2012):
p_Z(z) = \begin{cases}
\frac{z - 2b_l}{(b_u - b_l)^2}, & 2b_l \le z \le (b_l + b_u) \\
\frac{2b_u - z}{(b_u - b_l)^2}, & (b_l + b_u) < z \le 2b_u \\
0, & \text{otherwise}
\end{cases}
Hence, p(r) for Example 2 is a "folded" triangular distribution:

p(r) = p_{|Z|}(r) = p_Z(r) + p_Z(−r).
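As a numerical sanity check of this derivation, the following sketch (our own, assuming a = 1 and bounds b_l = −1, b_u = 1) compares a Monte Carlo histogram of r = |X + Y| with the analytic folded density:

```python
import numpy as np

bl, bu = -1.0, 1.0
rng = np.random.default_rng(0)

# Sample r = |X + Y| with X, Y ~ U[bl, bu] (a = 1)
x, y = rng.uniform(bl, bu, 100_000), rng.uniform(bl, bu, 100_000)
r = np.abs(x + y)

def p_z(z):
    """Triangular density of Z = X + Y."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    mid = bl + bu
    left = (2 * bl <= z) & (z <= mid)
    right = (mid < z) & (z <= 2 * bu)
    out[left] = (z[left] - 2 * bl) / (bu - bl) ** 2
    out[right] = (2 * bu - z[right]) / (bu - bl) ** 2
    return out

def p_r(r_vals):
    """Folded density p(r) = p_Z(r) + p_Z(-r)."""
    return p_z(r_vals) + p_z(-np.asarray(r_vals))

# Empirical vs analytic density on [0, 2]
hist, edges = np.histogram(r, bins=50, range=(0, 2), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.abs(hist - p_r(centres)).max())  # close to 0 (sampling noise only)
```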
For simple problems such as these where the problem
definition is known, the exact form of p(r ) can be derived.
Generally, it can be approximated by fitting a probability
density estimator, p̂(r ), to a finite sample of length scales.
Indeed, in Sect. 6.3, the length scale distributions of problems
in the BBOB problem set are estimated and analysed in detail.
3.2 Properties of length scale
From a structural perspective, the regularities and structures that fundamentally define a landscape are captured
by distance, and are therefore, invariant to isometric (distance preserving) mappings such as translation, rotation and
reflection (Borenstein and Poli 2006). We define an equivalence relation (Rosen 1999) (in terms of structure and
information) between two problems if there is an isometric
mapping between the search spaces as well as between the
objective functions. For example, consider f : R^D → R,
and let g(x) = f(x − α_1) + α_2 where x, α_1 ∈ R^D and
α_2 ∈ R. While both S and f have been translated (by α_1
and α_2, respectively) to define g, the structure of the landscape
has not changed, and so from a landscape analysis
perspective—and from the point of view of any reasonable
optimization algorithm—f and g are equivalent (denoted
f ≡ g). Because equivalent problems share the same structure, they should be characterised equivalently, and so it
follows that problem characteristics with an invariance to isometric mappings are attractive. This is analogous to algorithm
design, where algorithms are invariant to transformed, but
equivalent problems [e.g. the invariance of CMA-ES (Hansen
2000)].
Length scale is the ratio of two distance functions, hence
it is invariant to isometric mappings including translation,
rotation and reflection. Length scale is sensitive to other nondistance preserving transformations, such as shearing and
scaling of S. For uniform scaling of S by α ∈ R \ {0}, the
length scale values are scaled by a factor of 1/α. Likewise, for
scaling of f by α, the length scale values are scaled by α.
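These transformation rules are easy to verify numerically; a small check of our own using the 1D quadratic of Example 2:

```python
import numpy as np

f = lambda x: x ** 2                      # 1D quadratic, a = 1
xi, xj = 0.3, 1.1
r = abs(f(xi) - f(xj)) / abs(xi - xj)

# Uniform scaling of S by alpha: points become alpha * x and the
# objective g(y) = f(y / alpha), so objective differences are
# unchanged while distances grow by alpha: r is scaled by 1/alpha.
alpha = 4.0
g = lambda y: f(y / alpha)
r_s = abs(g(alpha * xi) - g(alpha * xj)) / abs(alpha * xi - alpha * xj)
assert np.isclose(r_s, r / alpha)

# Scaling f by alpha scales r by alpha.
r_f = abs(alpha * f(xi) - alpha * f(xj)) / abs(xi - xj)
assert np.isclose(r_f, alpha * r)
```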
From Definition 1, if two length scales are identical, then
the objective functions are equivalent between the pairs of
solutions under consideration. That is, given solution pairs
S_a = {x_i, x_j} and S_b = {x_p, x_q}, and functions f_a : S_a →
R and f_b : S_b → R, if

\frac{|f_a(x_i) - f_a(x_j)|}{d(x_i, x_j)} = \frac{|f_b(x_p) - f_b(x_q)|}{d(x_p, x_q)}

then f_a ≡ f_b. Additionally, two non-equivalent functions
will always produce non-equal length scales. Hence length
scale is a very useful indicator of functional equivalence (over
pairs of solutions) in the context of optimization. Assuming enumeration of all solutions (which is possible in the
discrete case), two equivalent problems will yield identical
length scales, and therefore, identical length scale distributions. Thus, p(r ) is a unique characteristic of landscapes.
Problems with local regions of equivalence will have length
scales in common, and so problem similarity can be measured by the degree to which problems share common length
scales. Indeed, in Sect. 6 we exploit this idea and use statistical techniques based on p(r ) to measure the similarity of
problems within the BBOB problem set.
3.3 Related work
Our definition of r is related to the difference quotient (also
known as Newton's quotient, and a generalisation of finite
difference techniques) from calculus and numerical analysis.
The difference quotient is defined as \frac{f(x+h) - f(x)}{h}, and can
be used to estimate the gradient at a point x, as h → 0
(Overton 2001). Implementations of gradient-based algorithms utilise approximations of this form if the gradient of
f is not available. Finite difference methods are widely used
in the solution of differential equations, but are not directly
related to this paper.
Length scale is also related to the Lipschitz constant,
defined as a constant, L ≥ 0, such that the Lipschitz condition
|f(x_i) − f(x_j)| ≤ L \|x_i − x_j\|, ∀ x_i, x_j, is satisfied (Sergeyev
and Kvasov 2010). The ideal Lipschitz constant is the smallest
L for which the Lipschitz condition holds, and functions
satisfying the condition are known as Lipschitz continuous.
The definition of the Lipschitz constant is very similar to
that of length scale; indeed, max(r) is equivalent to the ideal
Lipschitz constant. However, the differences in the definitions
of r and L define vastly different values; valid Lipschitz
constants may overestimate the largest rate of change, while
length scales will always underestimate or be equal to the
largest rate of change. Consider again Example 2: the 1D
quadratic function. From above, r = |a||x i + x j |. For convenience, we introduce bounds x ∈ [0, 1] and let a = 1
(hence f = x 2 ). Given that x i , x j ∈ [0, 1], valid r values
are in (0, 2). For bounds [0, 1], Lipschitz constants in [1, ∞)
satisfy the Lipschitz condition. Hence, values for r and L are
very different for this simple function.
Lipschitzian optimization algorithms are a class of global
optimization algorithms that use knowledge of Lipschitz constants to solve Lipschitz continuous problems (Horst and Tuy
1996). A function’s Lipschitz constant is not always known a
priori (e.g. in the black-box scenario), and so various methods
have been developed in order to estimate L from the landscape. Estimating L is itself a global optimization problem,
and so heuristics are used to actively search the landscape
for L. Heuristics vary between methods, with some calculating what are essentially length scales between solution pairs,
and refining the search to solutions with large/promising
length scales (Strongin 1973; Wood and Zhang 1996; Beliakov 2006). Because the aim is to accurately estimate L,
only the supremum of the length scales in the search data
is of interest and retained. The remaining length scale samples are not used in Lipschitzian optimization, and to our
knowledge have never been analysed in the context of problem analysis. Consequently, notions and concepts similar to
length scale analysis are absent from the Lipschitzian optimization literature. While the techniques used to estimate L
illustrate potential approaches for sampling r , the sample is
biased due to the heuristics used to search for L.
To summarise, length scale analysis captures information
regarding all rates of change, over a wide variety of intervals (distances) on the problem. While the concept of length
scale is similar to the difference quotient and the Lipschitz
constant, the utilisation of this information at all scales is
novel in the context of optimization problem analysis.
4 Analysing and interpreting length scales
Length scale values capture information about the structure
of a fitness landscape. In the following, we outline several
methods to analyse, summarise and interpret the length scales
obtained from a landscape.
4.1 Length scale heatmaps of 1D problems
When S ⊆ R, we can visualise the length scale data using a
heatmap, where the axes correspond to candidate solutions
x^i and x^j, and the cells on the map are coloured according to
the respective r values. To obtain the length scales, the objective
function must be evaluated at x^i and x^j, and the values of x^i
and x^j can be enumerated within specified bounds and precision.
Heatmaps will appear symmetric across the diagonal (drawn from
top left to bottom right), which follows from the symmetric
definition of r.

Fig. 1 Problem analysis using a heatmap of r. a 1D "mixed-structure" function. b r enumerated over the "mixed-structure" function
To illustrate the richness of length scale information, we
construct an artificial 1D function with a variety of different
topological features, including neutrality, linear slopes and
both convex and concave basins of attraction.
Example 3 1D "mixed-structure" function, defined as follows
and shown in Fig. 1a:

f(x) = \begin{cases}
-1, & 1 \le x < 1.5 \\
50(x - 1.75)^2 - 4.15, & 1.5 \le x < 2 \\
5.125x - 11.25, & 2 \le x < 3 \\
50(x - 3.25)^2 + 1, & 3 \le x < 3.5 \\
0.75(x - 4.35)^2 + 3.583, & 3.5 \le x < 5 \\
3\log(|x - 5.6|) + 5.5, & 5 \le x < 5.5 \\
3\log(|x - 5.4|) + 5.5, & 5.5 \le x < 6 \\
0, & \text{otherwise}
\end{cases}

where S = [0, 6].
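For reference, a direct Python transcription of this function (our own sketch; the natural logarithm is assumed for the log terms, consistent with the pieces joining continuously):

```python
import numpy as np

def mixed_structure(x):
    """1D "mixed-structure" function of Example 3, S = [0, 6]."""
    if 1 <= x < 1.5:
        return -1.0                              # neutral region
    if 1.5 <= x < 2:
        return 50 * (x - 1.75) ** 2 - 4.15       # narrow convex basin
    if 2 <= x < 3:
        return 5.125 * x - 11.25                 # linear slope
    if 3 <= x < 3.5:
        return 50 * (x - 3.25) ** 2 + 1          # narrow convex basin
    if 3.5 <= x < 5:
        return 0.75 * (x - 4.35) ** 2 + 3.583    # broad convex basin
    if 5 <= x < 5.5:
        return 3 * np.log(abs(x - 5.6)) + 5.5    # concave region
    if 5.5 <= x < 6:
        return 3 * np.log(abs(x - 5.4)) + 5.5    # concave region
    return 0.0

# Enumerating r over x^i, x^j in steps of 10^-3 reproduces Fig. 1b:
xs = np.arange(0, 6, 1e-3)
fs = np.array([mixed_structure(x) for x in xs])
# r[i, j] = |fs[i] - fs[j]| / |xs[i] - xs[j]| for i != j
```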
Figure 1b shows the length scales calculated between pairs
of points, x^i, x^j, at increments of 10^{-3} across the search
space. We have shaded/coloured the values using a logarithmic scale to better visualise magnitudes of change in r.
It is clear from the length scales in Fig. 1b that there
are many different structures within the landscape. While
it may not be immediately obvious what the structures are,
the boundaries between them can be identified by the sudden transitions in shade/colour and pattern. For example, the
transition from black to light at x = 1.5 on the diagonal of
the figure clearly indicates a change in structure. Likewise,
the subtle transition at x = 5 shows a change in structure.
The nature of the structures within the landscape can be
further understood by analysing the lower (or upper) diagonal
in Fig. 1b. Darker shades represent small changes in objective
function values, while lighter shades indicate large jumps
in objective function value. Solid blocks of colour, such as
between [0, 1], signal a constant change in objective function
over a region. The dark lines and curves show steps in the
space where f(x^i) ≈ f(x^j), e.g. moving from a point on
one side of a basin to a point on the other side of the minimum
at the same fitness. Approximate locations of optima, such
as x = 3.25, may be located by observing dark lines where
the area on either side of the line changes to lighter shades.
The visualisation in Fig. 1b also shows how simple structures, such as convex and concave basins of attraction, can
combine to give a complex objective function. Consider steps
within [3, 6]; the change in objective function values varies
significantly in a complex, unpredictable manner, which is
the type of information that black-box optimization algorithms contend with when solving a given problem.
4.2 Length scale distribution and its related summaries
Visual analysis of r over a large set of values is challenging
for higher dimensions. One possibility is to summarise the
values of r that occur over a given problem landscape. This
summary may be used to compare and characterise problems as well as predict the values of r obtained if further
exploration of the landscape is conducted (particularly if the
sampling technique is fixed).
Our hypothesis is that problems with structure of similar
complexities should yield similar length scale distributions.
Subsequently, statistical summaries of p(r )—such as measures of central tendency (e.g. mean, median and mode),
shape (e.g. skewness and kurtosis) and variability (e.g. range,
percentiles and standard deviation)—are potentially very
useful for characterising problems. However, such statistics
are compressed representations of r values, and so it follows
that they may not be unique and will vary depending on the
structure present in a problem.
Concepts from information theory can also be used to
characterise p(r ). Shannon entropy is used as a measure of
the uncertainty of a random variable (Cover and Thomas
1991). The entropy, h(r ), measures the expected amount of
information needed to describe a random variable:
h(r) = -\int_0^{\infty} p(r) \log_2 p(r) \, dr.    (2)
The Dirac delta function minimises differential entropy,
meaning that the d-dimensional flat and 1D linear functions
minimise h(r). The uniform density function (over a bounded
region) has larger entropy than any other density function
bounded within the same region. To obtain a uniform p(r),
there must be length scales of uniformly varying size, e.g.
random noise functions. Other landscapes will yield h(r )
within these extremes.
Information theory also provides tools to directly compare
two length scale distributions. The similarity between two
length scale distributions can be calculated using the
KL-divergence (Kullback and Leibler 1951):

D_{KL}(p \,||\, q) = \int_0^{\infty} p(r) \log_2 \frac{p(r)}{q(r)} \, dr.    (3)
KL-divergence is not a symmetric measure; in general,
D_{KL}(p \,||\, q) ≠ D_{KL}(q \,||\, p). A symmetric alternative,
known as J-divergence, can be used instead:

D_J(p \,||\, q) = D_{KL}(p \,||\, q) + D_{KL}(q \,||\, p).    (4)
Given that the J-divergence can be used to measure
the similarity between two distributions, the J-divergence
between length scale distributions is a proxy for comparing the structural characteristics between landscapes. This is
a general and potentially very powerful tool; the structural
similarity between two problems can be quantified without explicitly measuring and comparing specific structural
properties.
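In practice these quantities are estimated from finite samples. A minimal sketch of our own (Gaussian KDE with SciPy's default bandwidth — the paper later uses the Sheather–Jones plug-in — and Riemann-sum approximations of Eqs. 2–4):

```python
import numpy as np
from scipy.stats import gaussian_kde

def entropy_and_j_divergence(r_a, r_b, grid_size=2000, eps=1e-12):
    """Estimate h(r) for sample r_a, and D_J between samples r_a and r_b.

    Densities are fitted with Gaussian KDE and the integrals of
    Eqs. 2-4 approximated by Riemann sums on a shared grid."""
    p_hat, q_hat = gaussian_kde(r_a), gaussian_kde(r_b)
    grid = np.linspace(0.0, 1.2 * max(r_a.max(), r_b.max()), grid_size)
    dx = grid[1] - grid[0]
    p = p_hat(grid) + eps
    q = q_hat(grid) + eps

    h = -np.sum(p * np.log2(p)) * dx            # Eq. 2
    d_pq = np.sum(p * np.log2(p / q)) * dx      # Eq. 3
    d_qp = np.sum(q * np.log2(q / p)) * dx
    return h, d_pq + d_qp                       # Eq. 4

rng = np.random.default_rng(1)
r_a = np.abs(rng.normal(1.0, 0.2, 5000))  # stand-in length scale samples
r_b = np.abs(rng.normal(1.5, 0.4, 5000))
print(entropy_and_j_divergence(r_a, r_b))
```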
In conclusion, the length scale distribution provides a
unique summary of the landscape that is amenable to
numerous statistical and information-theoretical analysis
techniques. The analysis techniques discussed in this section
are not exhaustive, and we expect that additional techniques
and/or extensions may provide further insights into problem
structure.
5 Sampling length scales
When exact derivation and/or enumeration of the length
scales for a problem is infeasible, a representative sample
of length scales can be analysed instead. Let p(r ) be the
true length scale distribution, and let p̂(r ) be the length
scale distribution estimated from a finite sample of length
scales. Intuitively, as the number of sampled length scales
increases towards complete enumeration, p̂(r ) converges to
p(r ). However, in practice, the methodology used to sample r
and the overall size of the sample will affect the convergence
of p̂(r ) to p(r ).
Two solutions are required to compute a single r value, and
so a sample of solution pairs is required to construct a sample
of r . In Morgan and Gallagher (2012), a sample of solution
pairs was generated from all unique pairwise combinations of
a sample of solutions. More specifically, with an initial sample
of m solutions (assumed to adequately cover S), all \binom{m}{2}
unique combinations of pairs are used to construct a sample
of length scales. Using this technique, m(m−1)/2 length scales
are sampled from m unique solutions in the landscape. While
this approach uses the maximum information available from
m solutions, it is limited in that a length scale sample of size
O(n) is based on only O(\sqrt{n}) unique solutions. We conjecture
that to obtain a sample of length scales representative
of the true distribution, r should be sampled from as wide a
variety of solutions in S as possible. In the extreme case, n
length scales can be calculated using n pairs of unique solutions, i.e. 2n unique solutions. If computational effort/storage
is an important consideration, the number of unique solutions used to generate the length scales can be reduced. For
example, n length scales can be generated via a sample of n
unique solutions by pairing each solution with exactly two
other solutions. One way to achieve this is by considering the
sampled solutions as a random tour and calculate r between
“steps” along the tour.
The method used to generate the solutions is an important
aspect of the length scale framework and deserves careful consideration. For example, uniform random sampling
of high-dimensional problems can yield a sparse sample
where the Euclidean distances between solutions are similar (Beyer et al. 1999; Aggarwal et al. 2001; Morgan and
Gallagher 2014). Hence, a uniform random sample is not
ideal for generating length scales in high dimensions; the
denominator of r (the distance between solutions) would
be similar across all sampled r , thereby essentially reducing r to the magnitude of change in the objective function.
The purpose of the length scale framework is to analyse the
objective function at a variety of scales (i.e. distances), and
so a sample of solutions at varying distances apart in S is
required. In this paper and Morgan and Gallagher (2012), we
obtain a sample of solutions by using a Lévy random walk,
where steps are taken in a random, isotropic direction and
step sizes are sampled from a Lévy distribution (Shlesinger
et al. 1987). The Lévy distribution pertaining to step size
is defined by scale (γ ) and location (δ) parameters. The
value of δ determines the minimum possible step size, and is
therefore set to 0 in our experiments. The value of γ essentially controls the magnitudes of step sizes generated. To
determine appropriate values of γ , we examined the distributions of distances between solutions generated and adjusted
γ to ensure that steps spanning the diameter of S were
obtained.
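A minimal sketch of such a walk (our own illustration; SciPy's one-sided levy distribution is used for step sizes with δ = 0 and scale γ, and steps are clipped to the bounds of S — one of several reasonable boundary-handling choices, not necessarily the authors'):

```python
import numpy as np
from scipy.stats import levy

def levy_walk(n_steps, dim, lower, upper, gamma=1e-3, seed=None):
    """Levy random walk of n_steps solutions in S = [lower, upper]^dim.

    Each step takes an isotropic (uniformly random) direction with a
    Levy-distributed step size (location delta = 0, scale gamma)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lower, upper, dim)           # random start
    walk = [x.copy()]
    for _ in range(n_steps - 1):
        direction = rng.normal(size=dim)
        direction /= np.linalg.norm(direction)
        step = levy.rvs(loc=0, scale=gamma, random_state=rng)
        x = np.clip(x + step * direction, lower, upper)  # stay in S
        walk.append(x.copy())
    return np.array(walk)

sample = levy_walk(n_steps=1000, dim=5, lower=-5, upper=5, seed=0)
```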
The sample size required to produce an adequate sample will vary based on the structure of the landscape. For
the 1D linear and constant objective functions, any pair
of solutions will yield the single length scale value that
completely characterises the problem. However, for a needle-in-a-haystack problem (where the objective function is a
constant value for all but one solution), both the needle
and solutions at unique distances away from the needle
must be sampled. Of course, the underlying structure of
the problem is unknown in the black-box scenario, and so
choosing an appropriate sample size is difficult in practice. We suggest that sample sizes are made as large as
practically possible and that p̂(r ) is examined across samples of increasing sizes in order to assess convergence, and
hence, adequacy of the sample. If p(r ) is known, the KLdivergence can be used to directly measure convergence,
since D K L p || p̂ = 0 when p̂(r ) = p(r ). Often, p(r )
is unknown, and so the KL-divergences
between different
sample sizes—i.e. D K L p̂n+1 || p̂n —can be assessed as an
indicator for convergence. That is, once an adequate sample
size is achieved, subsequent sampling will not drastically
alter the distribution, and so the KL-divergence between the
subsequent sample size and the current sample size will be
negligible.
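A sketch of this black-box adequacy check (our own illustration, mirroring the KDE-based divergence estimate sketched in Sect. 4.2; the grid and any convergence threshold are assumptions, and the paper uses the Sheather–Jones plug-in bandwidth rather than SciPy's default):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kl_divergence(r_p, r_q, grid_size=2000, eps=1e-12):
    """D_KL between Gaussian KDEs fitted to two length scale samples."""
    grid = np.linspace(0.0, 1.2 * max(r_p.max(), r_q.max()), grid_size)
    p = gaussian_kde(r_p)(grid) + eps
    q = gaussian_kde(r_q)(grid) + eps
    return np.sum(p * np.log2(p / q)) * (grid[1] - grid[0])

def adequacy_trace(samples_by_size):
    """D_KL(p-hat_{n+1} || p-hat_n) for successive sample sizes;
    small values suggest the estimated distribution has stopped moving."""
    return [kl_divergence(larger, smaller)
            for smaller, larger in zip(samples_by_size,
                                       samples_by_size[1:])]
```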
The experiments in Morgan and Gallagher (2012) calculated length scales using all \binom{n}{2} unique solution pairs in a
sample of n solutions. The following experiment investigates
our conjecture that length scale is affected by the variety of
the solutions in the sample. Specifically, we compare the following sampling methodologies:
– M_{U1}: generate a uniform random sample of n solutions
and calculate \binom{n}{2} length scales from all pairwise solution
combinations.
– M_{U2}: generate a uniform random sample of \binom{n}{2} solutions,
choose a random tour and calculate \binom{n}{2} length scales
between “steps” along the tour.
– M_{L1}: generate a Lévy random walk of n solutions and
calculate \binom{n}{2} length scales from all pairwise solution
combinations.
– M_{L2}: generate a Lévy random walk of \binom{n}{2} solutions,
choose a random tour and calculate \binom{n}{2} length scales
between “steps” along the tour.
Note that all four methodologies generate a sample of \binom{n}{2}
length scales; however, they differ in the type and size of the
original sample of candidate solutions.
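For concreteness, a sketch of our own of the two pairing schemes — all pairwise combinations (M_{U1}/M_{L1}) versus a random closed tour in which each solution is paired with exactly two others (M_{U2}/M_{L2}):

```python
import numpy as np

def tour_length_scales(X, f_vals, rng=None):
    """Length scales along a random closed tour of the sample.

    X: (m, dim) array of solutions; f_vals: (m,) objective values.
    Each solution is paired with exactly two others (a random cycle),
    giving m length scales from m unique solutions."""
    rng = rng or np.random.default_rng()
    tour = rng.permutation(len(X))
    nxt = np.roll(tour, -1)                      # close the cycle
    dists = np.linalg.norm(X[tour] - X[nxt], axis=1)
    return np.abs(f_vals[tour] - f_vals[nxt]) / dists

def all_pairs_length_scales(X, f_vals):
    """Length scales from all C(m, 2) solution pairs."""
    m = len(X)
    i, j = np.triu_indices(m, k=1)
    dists = np.linalg.norm(X[i] - X[j], axis=1)
    return np.abs(f_vals[i] - f_vals[j]) / dists
```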
To evaluate the different methodologies, length scales
are calculated for Example 2 (1D quadratic function),
where the analytic form of p(r) is known. Note that
D_{KL}(p || p̂) → 0 as the sampling adequacy increases.
Length scale samples are sized according to \binom{n}{2}, where
n = [10, 50, 100, 500, 1000, 5000, 10,000], and 30 different
samples are generated for each size. For both Lévy walks,
γ = 10^{-3}. p̂(r) is estimated from the samples via kernel
density estimation with a Gaussian kernel, using the
“solve-the-equation plug-in” method (Sheather and Jones 1991) for
bandwidth selection. The KL-divergence is estimated via
numerical approximation (over the sample) of Eq. 3.
Figure 2a shows the mean and standard deviation (as error bars)
of D_{KL}(p || p̂) for each sample size. In the black-box scenario,
p(r) is unknown, and so the KL-divergence between
the distributions for each sample size and its subsequent sample
size (e.g. n = 10 and n = 50) is calculated. The mean and
standard deviation of the divergences between sample sizes is
shown in Fig. 2b. Since KL-divergence is non-negative, error
bars yielding negative values are omitted from the figure.
Figure 2a shows that for small samples of r , both uniform
random sampling methods are superior to the Lévy random
walks on this problem. This is not surprising; large steps in a
Lévy walk are not as probable as small steps, and so it can take
a number of samples before Lévy walks adequately explore
the landscape. Interestingly, by conducting the Lévy walk
and calculating r between steps along a random tour of the
resulting sample, we are able to achieve a well-represented
length scale sample (i.e. D_{KL}(p || p̂) for M_{L2} is ≤ 1) much
faster than M_{L1}, which was used in Morgan and Gallagher
(2012). Furthermore, on this problem M_{L2} is competitive
with uniform random sampling after 10^3 samples. Thus our
conjecture that a variety of solutions yields well-represented
length scales also seems to be well-founded; for both uniform
random sampling and Lévy random walks, generating a sample, choosing a random tour and calculating r between steps
in the tour gives a more accurate sample for almost all sample
sizes. However, because uniform random sampling in high
dimensions yields solutions with similar Euclidean distances
between them (Beyer et al. 1999; Aggarwal et al. 2001; Morgan and Gallagher 2014), we advocate and use Lévy random
walks to sample the length scales of continuous problems.
The divergences of the p̂(r)'s from a black-box perspective
are shown in Fig. 2b. Both variants of the uniform random
sample have small divergences, and are therefore quite stable,
even for a low number of samples. M_{L2} achieves a very
low divergence (≤ 0.1 bits) after approximately 10^3 samples,
whereas M_{L1} does not achieve low divergence until 10^6 samples.
Small KL-divergences do not necessarily mean that
p̂(r) has converged to p(r); however, they do indicate how
much length scale information (in terms of bits) might be
gained by sampling further. If little can be gained, either all
the important structure has been sampled (in which case,
p̂(r) is a good estimate of p(r)), or there exists important
structure that is hard to find (e.g. a needle in a haystack). In
the former case the sample is adequate, but in the latter case
any landscape analysis technique would struggle, and length
scale is no exception.

Fig. 2 Estimating sampling adequacy via convergence of p̂(r). Lines represent the mean (over 30 trials), while error bars represent the standard deviation. Since KL-divergence is non-negative, error bars yielding negative values are omitted. a D_{KL}(p || p̂). b D_{KL}(p̂_{n+1} || p̂_n)
The trends in Fig. 2b closely follow those in Fig. 2a, suggesting that the black-box methodology proposed provides a
good summary of convergence, and hence can reliably assist
practitioners in determining the adequacy of their samples.
This technique is used in Sect. 6 to determine adequate sample sizes for the BBOB problem set.
6 Experimental analysis of BBOB problems
In this section, we analyse the noiseless BBOB problem set
using the length scale framework. In particular, we compare
the length scale analysis to other landscape features from
the literature, investigate the robustness of the framework,
extend the framework to visualise and classify the relationships between problems and examine problems’ length scale
distributions. The methodology used in these experiments is
general and can be easily applied to other black-box problems. Source code is available at https://github.com/RachM/length-scale.
6.1 Comparing length scale to existing features
The following experiment investigates how well a subset of
problem features from the literature—namely the correlation
length, FDC, information content, partial information content, information stability, and dispersion—characterise the
BBOB problem set. These features were chosen because they
yield scalar values, are easy to interpret and are widely used in
the problem analysis context. We then compare the features to
the entropy of the length scale distribution, h(r ) (other summaries of r —such as the mean and variance—could similarly
be calculated). 2D, 5D, 10D and 20D problems are analysed
and Euclidean distance is used as the distance metric between
solutions. Sample sizes of 1000d^2, 5000d^2 and 10,000d^2
were tested on instances from all the BBOB problems, across
2D, 5D, 10D and 20D, and there was only a negligible (≤ 1
bit) difference in sampling more than 1000d^2 solutions. Thus
in the following experiments, all features are calculated from
a sample of solutions obtained using a Lévy random walk
(γ = 10^{-3}) of 1000d^2 solutions in S = [−5, 5]^d.
The robustness of the features is investigated by examining them over varying instances of the problems, and varying
samples of those instances. 30 problem instances are produced by supplying seeds 1 to 30 to the BBOB problem
generator. For each instance, 30 different samples (of size
1000d^2) of S are generated, meaning for a given problem in
dimension d (e.g. 2D Sphere), there are 30 × 30 samples of
the problem. Hence each feature is calculated 900 times for
a single problem.
The BBOB problem generator applies several linear and
non-linear transformations to both x and f on many of
the problems. Consequently, the structure between problem
instances may vary (e.g. structure within the bounds on one
instance may be removed on another). Thus while we expect
the features to characterise the underlying nature of the problems, we do not expect the features to be invariant between
instances.
FDC is calculated using the global optimum, x*, as well
as the best solution in the sample, x̂* (the latter gives insight
into how well FDC performs in the black-box scenario).
Each variant is denoted FDC_{x*} and FDC_{x̂*} respectively.
The choice of ε for information content, partial information
content and information stability is highly dependent on the
problem, as results vary depending on the value of ε chosen.
In the black-box scenario, the search for an appropriate value
of ε is highly challenging, and is an optimization problem
in itself. In these experiments, information content, partial
information content and information stability are estimated
with ε = 0, meaning transitions in objective function are
“neutral” if the change in objective function value is 0. Dispersion is calculated using the best 5 % of solutions in the
sample, and normalised using bound-normalisation (Morgan
and Gallagher 2014). The correlation length, l, is calculated
from the random walk autocorrelation function r(d)
(Weinberger 1990). Given the variety of distances between
solutions in the samples, the correlation length is defined as
l = \sum_{d=0}^{\infty} r(d) (Stadler 1996). Length scale distributions
are estimated via kernel density estimation as described in
Sect. 5. Figure 3 displays the mean and one standard deviation (as error bars) for each feature across the 24 BBOB
problems and d.
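As an indication of how two of these features are computed from such a sample, a simplified sketch of our own (the paper's exact estimators, truncation handling and bound-normalisation may differ):

```python
import numpy as np

def fdc(X, f_vals, x_star):
    """Fitness distance correlation w.r.t. a reference optimum x_star
    (use the sample-best solution for the FDC_{x-hat*} variant)."""
    d = np.linalg.norm(X - x_star, axis=1)
    return np.corrcoef(f_vals, d)[0, 1]

def dispersion(X, f_vals, t=0.05):
    """Mean pairwise distance among the best t-fraction of the sample
    (one common formulation; normalisation schemes vary)."""
    k = max(2, int(t * len(X)))
    best = X[np.argsort(f_vals)[:k]]          # minimisation assumed
    i, j = np.triu_indices(k, k=1)
    return np.linalg.norm(best[i] - best[j], axis=1).mean()

# FDC_{x-hat*}: use the best sampled solution as the reference
# x_hat_star = X[np.argmin(f_vals)]
```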
Correlation length is typically used to indicate ruggedness,
and it specifically captures the maximum distance between
solutions such that the correlation between objective function values is significant. The correlation lengths shown in
Fig. 3a are generally high across the BBOB problem set,
and apart from problems F16 (Weierstrass), F21 (Gallagher’s
101), F22 (Gallagher’s 21) and F23 (Katsuura), the correlation lengths are of very similar values. While most values
of l are similar across d, F17 (Schaffers F7), F18 (Schaffers
F7 moderately ill-conditioned), F21 and F22 vary slightly.
The standard deviation in correlation length is small, mostly
around 0.05 (2D F18 varies the most, with a standard deviation of 0.11), and decreases slightly as d increases for all
problems.
High dispersion values indicate that the “good” solutions
in the sample are well separated and distributed throughout S, thus implying a rugged landscape. Since both l and
dispersion are measures of ruggedness, it is not surprising
that the problems in Fig. 3b with higher dispersion values
mostly correspond to the problems in Fig. 3a with low l.
Both Fig. 3a, b show F16 and F23 to be more rugged than the
other problems. Furthermore, F1 (Sphere) to F15 (Rastrigin)
are generally quite smooth compared to F16 to F24 (Lunacek
bi-Rastrigin). However, in contrast to l, the dispersion varies
with d. Because bound-normalisation is used, this is unlikely
to be an artefact of the “curse of dimensionality” discussed in
Morgan and Gallagher (2014) (indeed, the dispersions of F24
and F25 are invariant to d), and so the variations are likely
caused by structural changes as d increases. Interestingly, as
d increases, the differences in dispersion between the problems becomes less pronounced (e.g. the range of dispersions
for 2D problems is 0.26, compared to 0.09 for 20D). Overall,
l and dispersion provide very limited ability to characterise
and differentiate between the BBOB problems.
High FDC values indicate a strong correlation between
the fitness of solutions and their distance from the global
optimum. Figure 3c, d show considerably different values
across the problem set. For some problems, the FDC variants
actually indicate conflicting landscapes, e.g. FDC_{x*} indicates
F4 (Büche-Rastrigin) is slightly deceptive (which it is),
while this is not the case with FDC_{x̂*}. This is an important
result, as it demonstrates that FDC_{x̂*} is not always a reliable
approximation of FDC_{x*}, and so conclusions based on the
theory of FDC should not be drawn from FDC_{x̂*} results. F6
(Attractive Sector) and F24 are also deceptive problems for
some algorithms; however, neither FDC variant suggests this.
FDC_{x*} and FDC_{x̂*} are largely invariant across d; however,
FDC_{x̂*} is typically larger than FDC_{x*}, and FDC_{x*} has a
lot more variation between samples. FDC_{x̂*} exhibits similar
trends to l; F16, F21, F22 and F24 are far less correlated
than the other problems. The similarity between FDC_{x̂*} and
l is not surprising as they both correlate the fitness values of
points within the sample.
Information content and partial information content are
shown in Fig. 3e, f respectively, and it is clear that they are
largely unable to differentiate the BBOB problems. Information content measures the variety of fluctuations in f
along the sample. A value of log_6 2 (≈ 0.3869) indicates
a highly rugged landscape with no neutral regions, and the
results (erroneously) suggest that the BBOB problems are all
highly rugged. Indeed, F1 and F5 (linear slope) are extremely
smooth, and yet their information content values suggest otherwise. Partial information content indicates the degree of
modality by measuring the variety of non-neutral regions in
the sample (Vassilev et al. 2000), and the results in Fig. 3f are
very similar to information content. With the exception of F7
(Step Ellipsoidal), the partial information content is invariant
to both the problems and d, and has extremely small
variance between samples. F7 contains numerous neutral
regions, which both information content and partial information content appear to have captured. Figure 3f and further
analysis of F7’s information content (not shown here) suggest
that the 5D problem contains significantly more transitions
between neutral and non-neutral regions than the 2D, 10D
and 20D problems. Subsequent results for 7D, 9D, 11D and
19D F7 problems showed similar behaviour to 5D, suggesting that in general, F7 problems with odd-d contain fewer
neutral regions than even-d. Figure 3f also indicates F17,
F19 (Composite Griewank–Rosenbrock), F23 and F24 are
more multi-modal than the other problems.
The information stability is the largest transition in objective function values encountered along the walk, i.e. max |f(x_i) − f(x_j)|. Hence, it is conceptually very similar to max(r). Figure 3g, h show the information stability and h(r) respectively; both contain very similar trends, however, there are some minor differences (e.g. h(r) is more varied across d for F21 to F23). Both information stability and h(r) are generally well-correlated with the conditioning of the problem; problems with high conditioning, like F12 (Bent Cigar), have high information stability and h(r) values.
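Implemented directly from the formula above (a sketch; over all pairs of points on a single walk, the maximum transition equals the fitness range):

```python
import numpy as np

def information_stability(f_walk):
    """Largest transition in objective value along the walk,
    i.e. max |f(x_i) - f(x_j)| over all pairs of walk points."""
    f = np.asarray(f_walk, dtype=float)
    return float(f.max() - f.min())
```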
Fig. 3 Problem analysis features for the BBOB problems. a Correlation length. b Dispersion. c FDC_{x*}. d FDC_{x̂*}. e Information content. f Partial information content. g Information stability. h Length scale entropy
The length scale entropy will vary depending on the distribution's shape and location. The magnitudes of the length scale values vary significantly across the BBOB problems, however, the values generally yield unimodal, long-tailed distributions. Consequently, the similarity between h(r) and information stability is likely due to the similarity in the shapes of the length scale distributions across the BBOB problems. Statistical measures such as the maximum, median, mean and variance of r also exhibit similar trends to h(r) and information stability. While the results here show similar trends between h(r) and information stability, length scale distributions with different shapes and locations can produce different trends. Encouragingly, both features are clearly able to distinguish the similarities and differences between the BBOB problems. In addition, both features exhibit invariance to d and show very little variance between samples. Thus, in terms of characterising the BBOB problems, h(r) and information stability are clearly superior to the other techniques analysed.
The expected amount of information required to describe r is quantified by h(r), hence problems with similar h(r) values are likely to have similar degrees of structural complexity. For example, F8 (Rosenbrock) and F9 (Rotated Rosenbrock) are structurally identical (they differ only by a rotation of S), and Fig. 3h shows they have almost identical h(r) values. In practice, h(r) can be used to identify similar (or dissimilar) problems, and further information can be obtained by comparing the p(r) distributions. This is utilised in Sect. 6.3, where length scale distributions are compared both visually and using the J-divergence.
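As a sketch of the underlying computation (length scales taken between successive walk points; SciPy's Gaussian KDE with its default Scott's-rule bandwidth stands in for the solve-the-equation plug-in selector used in our experiments, and the base of the logarithm only rescales h(r)):

```python
import numpy as np
from scipy.stats import gaussian_kde

def length_scales(X, f):
    """r = |f(x_i) - f(x_{i+1})| / ||x_i - x_{i+1}|| for successive walk points."""
    X, f = np.asarray(X, dtype=float), np.asarray(f, dtype=float)
    step = np.linalg.norm(np.diff(X, axis=0), axis=1)
    return np.abs(np.diff(f)) / step

def length_scale_entropy(r, grid_size=2048):
    """Differential entropy h(r) of a Gaussian-KDE estimate of p(r)."""
    r = np.asarray(r, dtype=float)
    kde = gaussian_kde(r)                       # note: Scott's rule, not Sheather-Jones
    grid = np.linspace(0.0, r.max() * 1.1, grid_size)
    p = kde(grid)
    mask = p > 0                                # avoid log(0) on the tails
    return -np.trapz(p[mask] * np.log2(p[mask]), grid[mask])
```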
To summarise, the features compared in these experiments
show some ability to capture landscape structure, however,
most features are unable to detect the known differences
across the BBOB problem set. In terms of characterising and
distinguishing problems, the existing features seem limited.
There were two notable exceptions—information stability
and h(r)—that produced a wide range of consistent, reliable
values with clear relationships to the problems.
6.2 Comparing length scale with an ensemble of features
The length scale analysis in Sect. 6.1 was clearly able to
characterise the BBOB problems, while many of the existing features struggled. Collectively, the existing features may
offer a greater ability to characterise and distinguish problems (Hutter et al. 2006; Smith-Miles and Tan 2012), and
so in the following we evaluate the ability of an ensemble
of the features to characterise problems, and compare this
to the length scale framework. In addition, we propose a
novel—yet natural—extension to the length scale framework
that can be used to classify and visualise the problems, and
we compare this approach to the feature-ensemble approach.
To do so, we analyse the high-dimensional “feature space”
defined by the features, and the relationships between problems in such a space. Conceptually, the feature space is not
novel; e.g. Smith-Miles and Tan (2012) generated problem
feature vectors for a variety of TSP problems and used principal component analysis to visualise the instances in the
(reduced) feature space. The following experiments use the
20D BBOB results from Sect. 6.1.
In the feature-ensemble approach, each problem is represented by a 7D feature vector consisting of the correlation
length, dispersion, FDC_{x*}, FDC_{x̂*}, information content,
partial information content and information stability, averaged across the seeds/walks. The features are normalised
by their appropriate bounds, and because information stability is unbounded, we normalised the values by the range
of information stability values obtained across the problems.
The 20D BBOB problems are thus located in a 7D (unit
hypercube) feature space that can be further analysed via
clustering, classification and visualisation techniques.
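A sketch of this normalisation step (the bounds passed in for the first six features are their theoretical ranges, which are not restated here; treating information stability as the final column is an illustrative convention):

```python
import numpy as np

def normalise_features(F, lower, upper):
    """Min-max normalise a (24, 7) feature matrix into the unit hypercube.
    The last column (information stability) is unbounded, so its observed
    range across the problems is used instead of theoretical bounds."""
    F = np.asarray(F, dtype=float)
    lo = np.asarray(lower, dtype=float).copy()
    hi = np.asarray(upper, dtype=float).copy()
    lo[-1], hi[-1] = F[:, -1].min(), F[:, -1].max()  # observed range
    return (F - lo) / (hi - lo)
```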
As discussed in Sect. 4.2, the J-divergence between two
length scale distributions is a natural measure of problem similarity. Hence we propose using the J-divergences
between problems to implicitly define the problem space.
That is, the J-divergences can be used to infer certain properties of the problem space, without knowledge of the explicit
locations of problems within the space. In contrast to the
feature-ensemble approach, this approach does not explicitly define—and hence constrain—the problem space.
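Taking the J-divergence to be the symmetrised Kullback–Leibler divergence (Kullback and Leibler 1951) between two estimated length scale densities, a numerical sketch on a shared grid is:

```python
import numpy as np
from scipy.stats import gaussian_kde

def j_divergence(r1, r2, grid_size=2048):
    """J = KL(p||q) + KL(q||p) between Gaussian-KDE estimates of two
    length scale distributions, evaluated on a common grid."""
    p_kde, q_kde = gaussian_kde(r1), gaussian_kde(r2)
    hi = max(np.max(r1), np.max(r2)) * 1.1
    grid = np.linspace(0.0, hi, grid_size)
    eps = 1e-12                                  # guard against log(0)
    p = np.maximum(p_kde(grid), eps)
    q = np.maximum(q_kde(grid), eps)
    kl_pq = np.trapz(p * np.log(p / q), grid)
    kl_qp = np.trapz(q * np.log(q / p), grid)
    return kl_pq + kl_qp
```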
Dimensionality reduction techniques are frequently used
to visualise high-dimensional data in two or three dimensions (Lee and Verleysen 2007). To visualise the problem
spaces resulting from the length scale and the feature-ensemble approaches, we employ t-distributed stochastic
neighbourhood embedding (t-SNE) (Van der Maaten and
Hinton 2008). t-SNE is a state-of-the-art, probabilistic,
non-linear dimensionality reduction technique that aims to
distribute points in a lower-dimensional space such that
the original, high-dimensional neighbourhood relationships
are preserved. More specifically, a non-convex cost function modelling the discrepancy between the low- and high-dimensional relationships is minimised using a variant of
stochastic gradient descent. The algorithm is parameterised
by a perplexity term, which essentially controls the number
of effective neighbours near a given point. Based on the recommendations in Van der Maaten and Hinton (2008) and
exploratory experimentation, the perplexity was set to 5 for
all visualisations in this paper.
The explicit locations of the high-dimensional points are not
required by t-SNE; rather, t-SNE relies on a notion of dissimilarity between points. Thus, in the following, the average
J-divergences between the 24 BBOB problems (across the
different walks/seeds) were used to calculate a 24 × 24 dissimilarity matrix. To apply t-SNE with the feature-ensemble
approach, a 24 × 24 distance matrix was generated by calculating the Euclidean distance between problems' feature
vectors. Due to the stochastic nature of t-SNE, 1000 different trials were conducted, with a maximum of 5000 iterations
for each trial. Figure 4 shows the best t-SNE visualisation (in
terms of the final cost) from the trials for each approach. The
cost of a t-SNE solution indicates the discrepancy between
the neighbourhood relationships in the two-dimensional
visualisation and the original high-dimensional data; the costs were quite consistent across the 1000 trials. Specifically,
the feature-ensemble approach ranged between 0.1064 and
0.5522 with a median cost of 0.1314, while the length scale
approach ranged between 0.1979 and 0.3244 with a median
cost of 0.2034.
Fig. 4 Feature spaces of 20D BBOB problems reduced via t-SNE. a Feature-ensemble approach. b Length scale approach
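A sketch of this pipeline using scikit-learn's TSNE as a stand-in for the original implementation (the paper's experiments are not tied to this library, and the trial count below is illustrative); with metric="precomputed", t-SNE operates directly on the dissimilarity matrix and init must be "random":

```python
import numpy as np
from sklearn.manifold import TSNE

def embed(D, n_trials=10, seed=0):
    """Embed a (24, 24) symmetric dissimilarity matrix D (e.g. average
    pairwise J-divergences) in 2D, keeping the lowest-cost trial."""
    best, best_cost = None, np.inf
    for t in range(n_trials):
        tsne = TSNE(n_components=2, metric="precomputed", init="random",
                    perplexity=5, random_state=seed + t)  # perplexity as in the paper
        Y = tsne.fit_transform(D)
        if tsne.kl_divergence_ < best_cost:               # final cost of this trial
            best, best_cost = Y, tsne.kl_divergence_
    return best, best_cost
```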
While the feature-ensemble approach shown in Fig. 4a
has lower error, the relationships between structurally similar
problems are not evident, and overall, it lacks the ability to discriminate between problems. For example, F8 and
F9 (i.e. Rosenbrock problems) are well separated in the
space, despite the problems differing only by rotation. Figure 4b demonstrates the ability of the length scale framework
approach to capture structurally similar problems—not only
are F8 and F9 close, but so are F2 (ellipsoidal) and F10
(ellipsoidal) as well as F3 (Rastrigin) and F15 (i.e. Rastrigin problems).
To determine if specific cluster structure is evident in
the data, hierarchical clustering was applied to the J-divergence matrix and the feature vector distance matrix
using unweighted average distance linkages. The resulting
dendrograms for the 20D BBOB problems can be seen in
Fig. 5. Problems whose linkage is less than half the maximum linkage share common shaded/coloured lines.
Fig. 5 Dendrograms of the 20D BBOB problems. a Feature-ensemble approach. b Length scale approach
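A sketch of the clustering step using SciPy (method="average" is the unweighted average, i.e. UPGMA, linkage; colouring clusters below half the maximum linkage mirrors the shading convention in Fig. 5):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def plot_dendrogram(D, labels):
    """Cluster a (24, 24) symmetric dissimilarity matrix D (J-divergences,
    or Euclidean distances between feature vectors)."""
    condensed = squareform(D, checks=False)        # upper triangle as condensed form
    Z = linkage(condensed, method="average")       # unweighted average (UPGMA) linkage
    dendrogram(Z, labels=labels, color_threshold=0.5 * Z[:, 2].max())
    plt.ylabel("Distance")
    plt.show()
```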
Interestingly, the clusters yielded by hierarchical clustering for both approaches correspond well with the clusters in the visualisations produced by t-SNE. However, the clusters and
relationships suggested by the t-SNE visualisations and dendrograms do not correspond to the problem categories in the
original BBOB specification. The BBOB problems and categories were defined by researchers with the aim of providing
a wide variety of test problems with different landscape
structures. This was done using intuition in two and three
dimensions, by modifying previously proposed problems, and by drawing on experience with algorithms. Hence, the categories might not
capture the important underlying relationships between problems, or may not be the best categorisation for algorithm
evaluation. The feature-ensemble approach shown in Fig. 5a
lacks strong clusters, although there are perhaps three weak
clusters (F7, F16/F23 and the remaining problems). While
many similar problems are close in proximity (e.g. F3/F15,
F9/F19, F17/F18 and F21/F22), many structurally dissimilar
problems are also close in proximity (e.g. F1/F8, F3/F20,
F12/F18, F24/F9). In contrast, the J-divergence of length
scales (Fig. 5b) clusters the problems into many different
classes that correspond well with the underlying structures
of the problems. For example, F8 and F9 are together, as are
the elliptically structured F2, F10 and F11 (Discus), and the
Gaussian-constructed F21 and F22. There are some exceptions; F17 and F18 are separated despite both being Schaffer
F7 functions (F18 is moderately ill-conditioned).
6.3 Interpreting length scale distributions
The results above indicate that there are some strong similarities between problems, and so in this section, we examine
the length scale distributions, p(r), in greater detail. Similar to Sect. 5, each p(r) is a kernel density estimate with a Gaussian kernel and bandwidth selected with the “solve-the-equation plug-in” method. For each problem there are 900 (30 seeds × 30 walks) p(r) estimates, and so we show the average (with shading of one standard deviation) of the distributions across the seeds/walks. For most problems, the length
scales did not vary significantly between dimensions (e.g. see
Fig. 6b). F5 (linear slope), however, varied considerably between
2D and the other dimensions (see Fig. 6a). More specifically,
the J-divergences between F5 in 2D and in 5D, 10D
and 20D are 2.23, 4.55 and 5.36 respectively. In comparison,
F5’s next largest J-divergence is very low (0.68) between 5D
and 20D. The F5 problem is a linear slope with a gradient
that varies logarithmically between 1 and 10 for each x_i. If
we were to only consider length scales of axis-aligned steps,
the length scale values yielded are within [1, 10], with an
increasing presence of “small” length scales as dimensionality increases (due to the logarithmic distribution of gradient
values along the solution variables). Introducing non-axis-aligned steps simply means that length scales between the
extreme gradient values will also occur. The length scale distributions yielded for F5 closely reflect the structural change
over dimensionality. In 2D, many of the problem's length scales are
approximately 10, and the remaining length scales vary quite uniformly between 1 and 10. As dimensionality increases, p(r) reflects the presence of more “small” length scales.
The length scale distribution indicates the changes in
objective function values between solutions that can be
expected to occur. The scaling on the r axis of Fig. 6a, b
illustrates that r can be extremely different between two problems. From the length scales sampled, F20 (Schwefel) has
length scales that are 3 orders of magnitude larger than F5.
More specifically, F5 has changes in objective function values that are at most 20 times a given step size, while F20
has changes in objective function values that are at most 76,300 times a given step size. Such information may be useful
in the selection of algorithm parameters or interpretation of
algorithm search trajectories.
Given that the distributions for the BBOB problems were
found to be mainly similar across d, for the remainder of the
length scale distribution analysis we examine problems in
20D. The resulting length scale distributions are quite varied across the problems, however, there are a few notable
similarities, some of which are shown in Fig. 7.
Figure 7a shows p(r) for F3 (Rastrigin), F4 (Büche-Rastrigin) and F15 (Rastrigin). F15 is a non-separable and
less regular variant of F3, and as a result, they have almost
identical length scales and a low J-divergence of 1.14. On
the other hand, F4 contains slightly different structure to F3
and F15, however, since the global structure is similar across
the problems, we observe some similarity in p(r). The J-divergence between F4 and F3 is 0.92, compared to the high
value of 19.95 between F4 and F15.
The length scale distributions for F6 (attractive sector), F8
(Rosenbrock) and F9 (Rotated Rosenbrock) can be seen in
Fig. 7b. The larger peak in the F9 distribution indicates that
there are more low-valued length scales. Both F8 and F9 are
variations of Rosenbrock and we observe similar changes in
fitness, and hence similar distributions of length scales. The
J-divergence between F8 and F9 is very low (0.24), indicating
that the two problems are almost identical. Contrary to this,
the J-divergence between F6 and F8 is 1.56, and between F6 and F9 is 1.46.
Fig. 6 Examples of p(r) that have large and small variation with d. a F5. b F20
Fig. 7 Examples of similar p(r) estimated on BBOB problems. a 20D F3, F4 and F15. b 20D F6, F8 and F9
A number of other problems (e.g. F17 and F18) were also
found to have similar length scale distributions (with low
J-divergence). The difference in shape and scale between
Fig. 7a, b illustrates the potential differences between length
scale distributions. It is clear that problems with similar structure have similar p(r)s and low J-divergences between them, while problems with different structure have different p(r)s and large J-divergences. This demonstrates that p(r) is a powerful tool for characterising and differentiating optimization
problems. In addition, the distributions in Figs. 6 and 7 are
surrounded by very thin shading, indicating that the standard
deviation across samples is low.
7 Conclusions and future work
This paper has proposed length scale as a fundamental
property of optimization problems. The framework is very
general, and particularly applicable for black-box problems
as it is based on sampling candidate solutions and their
objective function values. The simplicity of length scale
means it can be readily calculated from samples of the
landscape and/or algorithm search data. Source code for all
experiments in this paper is available at https://github.com/
RachM/length-scale.
Analytical properties of length scale have been discussed
and techniques for problem analysis were proposed using statistical and information-theoretic summaries of length scale.
Length scale analysis on simple example problems illustrated
the framework’s ability to capture important problem structures and the complexity of their interactions. The framework
was also used to assess the adequacy of landscape samples of varying size, thereby providing insight into appropriate sample sizes. This methodology could be used in other
applications where a notion of sampling adequacy is needed.
Experimental results on the BBOB problem set show
that the length scale analysis provides a greater ability to
discriminate between the problems in comparison to seven
well-known landscape analysis techniques. In addition, summaries of r were shown to be statistically robust to different
samples of given problem instances. Valuable insights into
the complex relationships between the BBOB problems were
gained using dimensionality reduction and hierarchical clustering techniques applied to the length scale data.
In practice, using a sample of r values may result in different landscapes yielding the same sets of length scales, and
hence length scale summaries. Identical sets of r values can
be obtained from two landscapes sharing similar structure,
where the structure discriminating them is not captured in
the sample. However, any practical landscape analysis technique is limited to the information obtainable via sampling.
Compressing O(n) length scale values into a single summary value (e.g. the mean of sampled r values or h(r)) may
incur information loss. This too is an unavoidable issue for
many existing landscape techniques. Hence, while the use
of length scale summaries may aid in characterising and
analysing problems, they are not necessarily unique for individual problem instances.
We believe that there is considerable scope to apply a range
of other techniques for modelling and summarising length
scale information. It should be possible to explore the relationship between well-defined topological properties such as
landscape modality and the shape of p(r). Length scale is
sensitive to the scale of x and f . As discussed in Sect. 3.2,
functions that differ by a scaling factor, α ∈ R, yield length
scale values that also differ by a scale of α. Hence the current,
un-normalised approach yields values that distinguish the
scaling between functions, but also affords analysis that can
recognise when functions differ merely by a scaling factor. If
upper bounds on |f(x_i) − f(x_j)| and d(x_i, x_j) are known (which
is not usually the case in black-box optimization), then the
length scale could theoretically be normalised. For functions
differing by a scaling factor, α, the normalised length scale
values are equal. We envisage that the normalisation of length
scale values could be useful in situations where the scaling
of f or x is not important.
Our experiments used entropy to summarise p(r), but
other ideas from statistics and information theory (including
those already discussed in Sect. 4.2) deserve investigation.
In particular, the current methods used to analyse length
scales ignore spatial information, such as the locations of
the solutions. Our experimental results assume that the sampling methodology used produces a representative sample
of the search space. While the convergence of different samples’ length scale distributions and the low variance of length
scale between samples indicate the sample size was sufficient, whether or not the sample is an adequate representation
of the landscape requires further quantification. The set of
solutions evaluated by an algorithm during a run could also
be analysed in the length scale framework. Comparisons
between the landscape’s length scales and the algorithm’s
resulting length scales can be made, and possible insights into
algorithm behaviour may be drawn. In addition, this information could potentially be used to make online algorithm
parameter adjustments. Similar to the comparisons of problem length scale distributions conducted in this paper, the
length scale distributions of the solutions produced by algorithm instances could also be directly compared to each other.
We are also interested in exploring the relationship between
algorithm performance and length scale metrics (e.g. h(r)),
as it may be possible to incorporate the length scale framework into algorithm prediction models.
Compliance with ethical standards
Conflicts of interest None.
References
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings
of the 8th international conference on database theory. Springer,
London, pp 420–434
Bartz-Beielstein T, Chiarandini M, Paquete L, Preuss M (2010) Experimental methods for the analysis of optimization algorithms.
Springer, New York
Beliakov G (2006) Interpolation of Lipschitz functions. J Comput Appl
Math 196(1):20–44
Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proceedings of the 7th international
conference on database theory. Springer, London, pp 217–235
Borenstein Y, Poli R (2006) Kolmogorov complexity, optimization and
hardness. In: IEEE congress on evolutionary computation (CEC
2006), pp 112–119
Cheeseman P, Kanefsky B, Taylor W (1991) Where the really hard problems are. In: Proceedings of 12th international joint conference on
AI, Morgan Kauffman, pp 331–337
Collard P, Vérel S, Clergue M (2004) Local search heuristics: fitness
cloud versus fitness landscape. In: The 2004 European conference
on artificial intelligence, IOS Press, pp 973–974
Cover TM, Thomas JA (1991) Elements of information theory. Wiley,
New York
Fletcher R (1987) Practical methods of optimization, 2nd edn. Wiley,
Hoboken
Forrester A, Sóbester A, Keane A (2008) Engineering design via surrogate modelling: a practical guide. Wiley, Hoboken
Gallagher M (2000) Multi-layer perceptron error surfaces: visualization, structure and modelling. PhD thesis, Department of Computer
Science and Electrical Engineering, University of Queensland
Gallagher M (2001) Fitness distance correlation of neural network
error surfaces: a scalable, continuous optimization problem. In:
Raedt LD, Flach P (eds) European conference on machine learning, Singapore, Lecture notes in artificial intelligence, vol 2167,
pp 157–166
Gallagher M, Downs T, Wood I (2002) Empirical evidence for ultrametric structure in multi-layer perceptron error surfaces. Neural
Process Lett 16(2):177–186
Gent I, Walsh T (1996) The TSP phase transition. Artif Intell 88(1–
2):349–358
Grinstead CM, Snell JL (2012) Introduction to probability. American
Mathematical Society, Providence
Guyon I, Elisseeff A (2003) An introduction to variable and feature
selection. J Mach Learn Res 3:1157–1182
Hansen N (2000) Invariance, self-adaptation and correlated mutations
in evolution strategies. In: Schoenauer M et al (eds) Parallel problem solving from nature—PPSN VI. Lecture notes in computer
science, vol 1917, Springer, pp 355–364
Hansen N, Finck S, Ros R, Auger A (2010) Real-parameter black-box
optimization benchmarking 2010: noiseless functions definitions.
Technical Report, RR-6829, INRIA
Horst R, Tuy H (1996) Global optimization: deterministic approaches.
Springer, New York
Hutter F, Hamadi Y, Hoos H, Leyton-Brown K (2006) Performance
prediction and automated tuning of randomized and parametric
algorithms. In: Benhamou F (ed) Principles and practice of constraint programming. Lecture notes in computer science, vol 4204,
Springer, pp 213–228
Jin Y (2005) A comprehensive survey of fitness approximation in evolutionary computation. Soft Comput 9(1):3–12
Jones T, Forrest S (1995) Fitness distance correlation as a measure
of problem difficulty for genetic algorithms. In: Proceedings of
the 6th international conference on genetic algorithms. Morgan
Kaufmann, San Francisco, CA, pp 184–192
Kullback S, Leibler RA (1951) On information and sufficiency. Ann
Math Stat 22(1):79–86
Lee JA, Verleysen M (2007) Nonlinear dimensionality reduction.
Springer, New York
Lunacek M, Whitley D (2006) The dispersion metric and the CMA
evolution strategy. In: Proceedings of the 8th annual conference
on genetic and evolutionary computation. ACM, New York, pp
477–484
Macready W, Wolpert D (1996) What makes an optimization problem
hard. Complexity 5:40–46
Malan K, Engelbrecht A (2009) Quantifying ruggedness of continuous landscapes using entropy. In: IEEE congress on evolutionary
computation, pp 1440–1447
Mersmann O, Bischl B, Trautmann H, Preuss M, Weihs C, Rudolph
G (2011) Exploratory landscape analysis. In: Proceedings of the
13th annual conference on genetic and evolutionary computation.
ACM, New York, pp 829–836
Morgan R, Gallagher M (2012) Length scale for characterising continuous optimization problems. In: Coello CAC et al (eds) Parallel
problem solving from nature—PPSN XII. Lecture notes in computer science, vol 7491, Springer, pp 407–416
Morgan R, Gallagher M (2014) Sampling techniques and distance metrics in high dimensional continuous landscape analysis: limitations
and improvements. IEEE Trans Evol Comput 18(3):456–461
Muñoz MA, Kirley M, Halgamuge S (2012a) A meta-learning prediction model of algorithm performance for continuous optimization
problems. In: Coello CAC et al (eds) Parallel problem solving from
nature—PPSN XII. Lecture notes in computer science, vol 7491,
Springer, pp 226–235
Muñoz MA, Kirley M, Halgamuge SK (2012b) Landscape characterization of numerical optimization problems using biased scattered
data. In: IEEE congress on evolutionary computation, pp 1180–
1187
Müller C, Baumgartner B, Sbalzarini I (2009) Particle swarm CMA
evolution strategy for the optimization of multi-funnel landscapes.
In: IEEE congress on evolutionary computation, pp 2685–2692
Müller CL, Sbalzarini IF (2011) Global characterization of the CEC
2005 fitness landscapes using fitness-distance analysis. In: Proceedings of the 2011 international conference on applications of
evolutionary computation. vol Part I. Springer, Berlin, Heidelberg,
pp 294–303
Overton M (2001) Numerical computing with IEEE floating point arithmetic. Cambridge University Press, Cambridge
Pitzer E, Affenzeller M (2012) A comprehensive survey on fitness
landscape analysis. In: Fodor J, Klempous R, Suárez Araujo C
(eds) Recent advances in intelligent engineering systems, studies
in computational intelligence. Springer, New York, pp 161–191
Pitzer E, Affenzeller M, Beham A, Wagner S (2012) Comprehensive
and automatic fitness landscape analysis using HeuristicLab. In:
Moreno-Díaz R, Pichler F, Quesada-Arencibia A (eds) Computer
aided systems theory–EUROCAST 2011. Lecture notes in computer science, vol 6927, Springer, pp 424–431
Reidys C, Stadler P (2002) Combinatorial landscapes. SIAM Rev
44(1):3–54
Rice JR (1976) The algorithm selection problem. Adv Comput 15:65–
118
Ridge E, Kudenko D (2007) An analysis of problem difficulty for a
class of optimisation heuristics. In: Proceedings of the 7th european conference on evolutionary computation in combinatorial
optimization, Springer, pp 198–209
Rosé H, Ebeling W, Asselmeyer T (1996) The density of states—a
measure of the difficulty of optimisation problems. In: Voigt HM et al (eds) Parallel problem solving from nature—PPSN IV. Lecture
notes in computer science, vol 1141, Springer, pp 208–217
Rosen K (1999) Handbook of discrete and combinatorial mathematics, 2nd edn. Discrete mathematics and its applications. Taylor & Francis, Routledge
Sergeyev YD, Kvasov DE (2010) Lipschitz global optimization. Wiley
Encycl Oper Res Manag Sci 4:2812–2828
Sheather S, Jones M (1991) A reliable data-based bandwidth selection
method for kernel density estimation. J R Stat Soc Ser B (Methodological) 53:683–690
Shlesinger MF, West BJ, Klafter J (1987) Lévy dynamics of enhanced
diffusion: application to turbulence. Phys Rev Lett 58:1100–1103
Smith T, Husbands P, O’Shea M (2002) Fitness landscapes and evolvability. Evol Comput 10(1):1–34
Smith-Miles K (2008) Cross-disciplinary perspectives on meta-learning
for algorithm selection. ACM Comput Surv 41(1):1–25
Smith-Miles K, Lopes L (2011) Measuring instance difficulty for combinatorial optimization problems. Comput Oper Res 39(5):875–
889
Smith-Miles K, Tan TT (2012) Measuring algorithm footprints in
instance space. In: IEEE congress on evolutionary computation
(CEC), pp 1–8
Solla SA, Sorkin GB, White SR (1986) Configuration space analysis
for optimization problems. In: Bienenstock E et al (eds) Disordered systems
and biological organization, NATO ASI Series, vol F20, Springer,
Berlin, New York, pp 283–293
Stadler PF (1996) Landscapes and their correlation functions. J Math
Chem 20:1–45
Stadler PF (2002) Fitness landscapes. In: Lässig M, Valleriani A
(eds) Biological evolution and statistical physics. Lecture notes
in physics, vol 585, Springer, pp 183–204
Stadler PF, Schnabl W (1992) The landscape of the traveling salesman
problem. Phys Lett A 161(4):337–344
Steer K, Wirth A, Halgamuge S (2008) Information theoretic classification of problems for metaheuristics. In: Li et al X (ed) Simulated
evolution and learning, Lecture notes in computer science, vol
5361, Springer, pp 319–328
Strongin R (1973) On the convergence of an algorithm for finding a
global extremum. Eng Cybern 11:549–555
Talbi E (2009) Metaheuristics: from design to implementation. Wiley series on parallel and distributed computing. Wiley, Hoboken
Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J
Mach Learn Res 9(11):2579–2605
van Hemert J (2005) Property analysis of symmetric travelling salesman
problem instances acquired through evolution. Evol Comput Comb
Optim 3448:122–131
Vassilev VK, Fogarty TC, Miller JF (2000) Information characteristics
and the structure of landscapes. Evol Comput 8:31–60
Wang Y, Li B (2008) Understand behavior and performance of real
coded optimization algorithms via NK-linkage model. In: IEEE
world congress on computational intelligence, pp 801–808
Weinberger E (1990) Correlated and uncorrelated fitness landscapes
and how to tell the difference. Biol Cybern 63:325–336
Whitley D, Watson JP (2005) Complexity theory and the no free lunch
theorem. In: Search methodologies, Springer, pp 317–339
Whitley D, Lunacek M, Sokolov A (2006) Comparing the niches of
CMA-ES, CHC and pattern search using diverse benchmarks. In:
Runarsson TP et al (eds) Parallel problem solving from nature—
PPSN IX. Lecture notes in computer science, vol 4193, Springer,
pp 988–997
Wood GR, Zhang BP (1996) Estimation of the Lipschitz constant of a
function. J Glob Optim 8:91–103
Zhang W (2004) Phase transitions and backbones of the asymmetric
traveling salesman problem. J Artif Intell Res 21(1):471–497
Zhang W, Korf RE (1996) A study of complexity transitions on the
asymmetric traveling salesman problem. Artif Intell 81(1–2):223–
239