On the Use of Probabilistic Models and Coordinate Transforms in Real-Valued Evolutionary Algorithms

Petr Pošík
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technická 2, 166 27 Prague 6, Czech Republic
[email protected]

January, 2007

To my parents, Marcela and Vlastimil. Without you I would never have had the chance to start writing this thesis. To Evička and Vojtíšek. Without you I would never have finished it.

Acknowledgements

I would like to express my gratitude to the many people who supported me and helped me with the work on this thesis. First of all, I would like to thank Vojtěch Franc, whose expertise in kernel methods and machine learning allowed me to use them to such a great extent in this thesis. I would also like to thank Jiří Kubalík, who has supported me for many years on my quest to optimize evolutionary algorithms, and my supervisor Jiří Lažanský for giving me so much freedom in my research and for his generous guidance. I am also very grateful to the many people who inspired me in recent years: Nikolaus Hansen, Marcus Gallagher, Jiří Očenášek, Martin Pelikan, Peter Bosman, Jörn Grahl, Bo Yuan, and many others. I would also like to thank my colleague Filip Železný, who read the manuscript, provided many useful comments, suggested improvements, and pointed out many typos I made. Thanks also to my family and my friends. They were always ready to support me and to help me relax when I needed it.

Abstract

This thesis deals with black-box parametric optimization: it studies methods which allow us to find optimal (or near-optimal) solutions for optimization tasks which can be cast as a search for a minimum of a real function with real arguments. In recent decades, evolutionary algorithms (EAs) have gained much interest thanks to their unique features: they are less prone to getting stuck in local minima, they can find more than one optimum at a time, etc.
However, they are not very efficient when solving problems which contain interactions among variables. Recent years have shown that a subclass of EAs, estimation of distribution algorithms (EDAs), allows us to account for interactions among solution components in a unified way. It was shown that they are able to reliably solve hard optimization problems, at least in the discrete domain; their efficiency in the real domain was questionable, and there is still open space for research.

In this thesis, it is argued that a direct application of methods successful in the discrete domain does not lead to successful algorithms in the real domain. The requirements for a successful real-valued EDA are presented. Instead of learning the structure of dependencies among variables (as in the case of discrete EDAs), it is suggested to use coordinate transforms to make the solution components as independent as possible. Various types of linear and non-linear coordinate transforms are presented and compared to other EDAs and GAs. A special chapter is dedicated to methods for preserving diversity in the population.

The dependencies among variables in the real domain can take on many more forms than in the discrete domain. Although this thesis proposes a number of algorithms for various classes of objective functions, a control mechanism which would choose the right solver remains a topic for future research.

Contents

1 Introduction to EAs
  1.1 Global Black-Box Optimization
  1.2 Local Search and the Neighborhood Structure
    1.2.1 Discrete vs. Continuous Spaces
    1.2.2 Local Search in the Neighborhood of Several Points
  1.3 Genetic and Evolutionary Algorithms
    1.3.1 Conventional EAs
    1.3.2 EA: A Simple Problem Solver?
    1.3.3 EA: A Big Puzzle!
    1.3.4 Effect of the Genotype-Phenotype Mapping in EAs
    1.3.5 The Roles of Mutation and Crossover
    1.3.6 Two Ways of Linkage Learning
  1.4 The Topic of this Thesis
    1.4.1 Targeting Efforts
    1.4.2 Requirements for a Successful EA
    1.4.3 The Goals
    1.4.4 The Roadmap

2 State of the Art
  2.1 Estimation of Distribution Algorithms
    2.1.1 Basic Principles
    2.1.2 EDAs in Discrete Spaces
    2.1.3 EDAs in Continuous Spaces
  2.2 Coordinate Transforms

3 Envisioning the Structure
  3.1 Univariate Marginal Probability Models
    3.1.1 Histogram Formalization
    3.1.2 One-dimensional Probability Density Functions
    3.1.3 Empirical Comparison
    3.1.4 Comparison with Other Algorithms
    3.1.5 Summary
  3.2 UMDA with Linear Population Transforms
    3.2.1 Principal Components Analysis
    3.2.2 Toy Examples on PCA
    3.2.3 Independent Components Analysis
    3.2.4 Toy Examples of ICA
    3.2.5 Example Test of Independence
    3.2.6 Empirical Comparison
    3.2.7 Comparison with Other Algorithms
    3.2.8 Summary
4 Focusing the Search
  4.1 EDA with Distribution Tree Model
    4.1.1 Growing the Distribution Tree
    4.1.2 Sampling from the Distribution Tree
    4.1.3 Empirical Comparison
    4.1.4 Summary
  4.2 Kernel PCA
    4.2.1 KPCA Definition
    4.2.2 The Pre-Image Problem
    4.2.3 KPCA Model Usage
    4.2.4 Toy Examples of KPCA
    4.2.5 Empirical Comparison
    4.2.6 Summary

5 Population Shift
  5.1 The Danger of Premature Convergence
  5.2 Using a Model of Successful Mutation Steps
    5.2.1 Adaptation of the Covariance Matrix
    5.2.2 Empirical Comparison: CMA-ES vs. Nelder-Mead
    5.2.3 Summary
  5.3 Estimation of Contour Lines of the Fitness Function
    5.3.1 Principle and Methods
    5.3.2 Empirical Comparison
    5.3.3 Results and Discussion
    5.3.4 Summary

6 Conclusions and Future Work
  6.1 The Main Contributions
  6.2 Future Work

A Test Functions
  A.1 Sphere Function
  A.2 Ellipsoidal Function
  A.3 Two-Peaks Function
  A.4 Griewangk Function
  A.5 Rosenbrock Function
B Vestibulo-Ocular Reflex Analysis
  B.1 Introduction
  B.2 Problem Specification
  B.3 Minimizing the Loss Function
  B.4 Nature of the Loss Function

List of Tables

3.1 UMDA empirical comparison: factor settings
3.2 UMDA empirical comparison: number of bins
3.3 UMDA empirical comparison: results
3.4 PCA vs. ICA empirical comparison: factor settings
3.5 PCA vs. ICA: Griewangk function
3.6 PCA vs. ICA: Two Peaks function
4.1 DiT-EA empirical comparison: factor settings
4.2 DiT-EA empirical comparison: results
4.3 KPCA-EA empirical comparison: results
5.1 Settings for parameters of artificial VOR signal
5.2 Nelder-Mead vs. CMA-ES: success rates
5.3 CMA-ES and Separating Hyperellipsoid: population sizes

List of Figures

1.1 Example of binary and integer neighborhoods
1.2 The neighborhood of Nelder-Mead simplex
1.3 Neighborhoods induced by the bit-flip mutation
1.4 Neighborhoods induced by the 1-point crossover
1.5 Neighborhoods induced by the 2-point crossover
3.1 Equi-width histogram model
3.2 Equi-height histogram model
3.3 Max-diff histogram model
3.4 Mixture of Gaussians model
3.5 Bin boundaries evolution for the Two Peaks function
3.6 Bin boundaries evolution for the Griewangk function
3.7 Two linear principal components of the 2D toy data set
3.8 PCA and marginal equi-height histogram model
3.9 PCA and ICA components comparison
3.10 Principal and independent components of 2D Griewangk
3.11 Observed and expected frequency tables before PCA
3.12 Observed and expected frequency tables after PCA
3.13 Monitored statistics layout
4.1 DiT search space partitioning
4.2 Efficiency of the DiT-EA on the 2D Two Peaks
4.3 Efficiency of the DiT-EA on the 2D Griewangk
4.4 Efficiency of the DiT-EA on the 2D Rosenbrock
4.5 Efficiency of the DiT-EA on the 20D Two Peaks
4.6 Efficiency of the DiT-EA on the 10D Griewangk
4.7 Efficiency of the DiT-EA on the 10D Rosenbrock
4.8 First six nonlinear components of the toy data set
4.9 KPCA crossover for clustering and curve-modeling
4.10 Efficiency of the KPCA-EA on the 2D Two Peaks
4.11 Efficiency of the KPCA-EA on the 2D Griewangk
4.12 Efficiency of the KPCA-EA on the 2D Rosenbrock
5.1 Coefficients d and c in relation to the selection proportion τ
5.2 One iteration of EMNA and CMA-ES
5.3 Vestibulo-ocular reflex signal
5.4 Aligned VOR signal segments
5.5 Nelder-Mead vs. CMA-ES: number of needed evaluations
5.6 Nelder-Mead vs. CMA-ES: typical progress
5.7 CMA-ES vs. Separating Hyperellipsoid: covariance matrix
5.8 CMA-ES vs. Separating Hyperellipsoid: average progress
5.9 Modified Perceptron vs. SVM: separating hyperellipsoid
A.1 Two Peaks function
A.2 Griewangk function
A.3 Rosenbrock function
B.1 Simulated VOR signal
B.2 Aligned VOR signal segments
B.3 The fitness landscape cuts for the VOR analysis I.
B.4 The fitness landscape cuts for the VOR analysis II.
B.5 The fitness landscape cuts for the VOR analysis III.

List of Algorithms

1.1 Hill-Climbing
1.2 Steepest Descent
1.3 An iteration of Nelder-Mead downhill simplex algorithm
1.4 Genetic Algorithm
2.1 Estimation of Distribution Algorithm
2.2 Coordinate transform inside EA
3.1 Generic EDA
3.2 EDA with PCA or ICA preprocessing
4.1 Function SplitNode
4.2 Function FindBestSplit
4.3 KPCA model building and sampling
5.1 Perceptron Algorithm
5.2 Modified Perceptron Algorithm
5.3 Gaussian distribution sampling
5.4 Evolutionary model for the proposed method

Chapter 1

Introduction to Evolutionary Algorithms

This chapter presents an introduction to the field of optimization. It informally mentions the relevant conventional techniques used to solve optimization tasks. It gives a brief survey of ordinary genetic and evolutionary algorithms (GEAs), presents them in a unifying framework with the conventional optimization techniques, and points out some issues in their design. These issues directly led to the emergence of a class of evolutionary algorithms based on the creation of and sampling from probabilistic models, the so-called estimation of distribution algorithms (EDAs), which are described in the next chapter in more detail.

1.1 Global Black-Box Optimization

Optimization tasks can be found in almost all areas of human activity. A baker has to find the optimal composition of the dough to produce rolls which are popular among his customers. A company must create a manufacturing schedule which minimizes the cost of its production. An engineer has to find the settings of a controller which deliver the best response of the system under control.

The optimization task is usually defined as

    x* = arg min_{x ∈ S} f(x),    (1.1)

so that among the members x of the set of all possible objects S, x ∈ S, we have to find the optimal x* which minimizes the objective (cost[1]) function f.
The cost function usually tells us how good the individual objects are.[2] In the examples given above, the objects are the composition of the dough, the schedule, and the settings of the controller, respectively. Similarly, the objective functions are the satisfaction of the baker's customers, the cost of manufacturing, and, e.g., the speed and precision of the system being controlled.

[1] In this thesis, the terms cost function, objective function, and evaluation function are used rather interchangeably. If the distinction between them is important, it is pointed out.
[2] In many cases, it is sufficient to have a function which takes two objects and tells us which of them is better, so that we are able to produce a complete ordering of a set of objects.

There exist many algorithms which try to solve the optimization task. Following Neumaier (2004), optimization methods can be classified by the degree of rigor with which they approach the goal:

An incomplete method uses clever heuristics to guide the search, but may get stuck in a local minimum.

An asymptotically complete method reaches the global optimum with probability one if allowed to run indefinitely long, but has no means to verify whether the global optimum has already been found.

A complete method reaches the global optimum with certainty, assuming an indefinitely long run time, and knows after a finite time that an approximate global minimum has been found (within prescribed tolerances).

With an algorithm from the last of the three[3] categories, we can predict the amount of work needed to find the global minimum within some prescribed tolerances, but this estimate is usually very pessimistic (exponential in the problem size) and serves as a lower bound on the efficiency of an algorithm.

We speak of black-box optimization if the form of the particular objective function is completely hidden from us.
We do not know in advance whether the function is unimodal or multimodal, or whether it can be decomposed into subfunctions; we do not know its derivatives, etc. All we can do is sample some object from the space S and have it evaluated by the cost function. If the search space is infinite, then without any global information no algorithm is able to verify the global optimality of a solution in finite time. In such circumstances, the best algorithm we can construct for black-box optimization problems is some asymptotically complete method. In the rest of this thesis, an algorithm is said to perform a global search if the following condition holds:

    lim_{t→∞} P(x_BSF^t = x*) = 1,    (1.2)

where x_BSF^t is the best-so-far solution found by the algorithm from the start of the search to the time instant t (it is the current estimate the algorithm gives us for the global optimum), and x* is the true global optimum.

1.2 Local Search and the Neighborhood Structure

In this section, several incomplete local search methods are described. A plausible and broadly accepted assumption for any optimization task is that the objective function must be a 'reasonable' one, i.e. it must be able to guide the search, to give some clues where to find an optimum. If this assumption does not hold, e.g. if we are presented with an objective function with a very low signal-to-noise ratio, or if we face a needle-in-a-haystack type of problem, then we can expect that the best algorithm we can use is pure random search.

The most often used optimization procedures are various forms of local search, i.e. incomplete methods. The feature all local search procedures have in common is that they repeatedly look for increasingly better objects in the neighborhood of the current object, thus incrementally improving the solution. They start with an initial object from the search space, generated randomly or created as an initial guess of a solution.
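The pure random search just mentioned already satisfies the global-search condition (1.2), since every region of a bounded space S is sampled with positive probability. A minimal Python sketch, where the sphere function of Appendix A.1 stands in for the hidden black-box f (all names here are illustrative, not from the thesis):

```python
import random

def sphere(x):
    # Stand-in for the hidden black-box objective f.
    return sum(xi * xi for xi in x)

def random_search(f, dim, bounds, evaluations):
    """Sample S uniformly at random, keeping the best-so-far solution x_BSF."""
    lo, hi = bounds
    x_bsf, f_bsf = None, float("inf")
    for _ in range(evaluations):
        x = [random.uniform(lo, hi) for _ in range(dim)]
        fx = f(x)                      # the only access we have to f
        if fx < f_bsf:                 # keep the best-so-far solution
            x_bsf, f_bsf = x, fx
    return x_bsf, f_bsf

x_bsf, f_bsf = random_search(sphere, dim=2, bounds=(-5.0, 5.0), evaluations=10_000)
```

The best-so-far solution converges to the optimum in probability as the number of evaluations grows, but at a rate far too slow to be practical; the local search methods discussed next trade this guarantee for speed.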
The search is carried out until a termination criterion is met. The termination criterion can be, e.g., a limit on the number of iterations or a sufficient quality of the current object, or the search can be stopped when all possible perturbations result in objects worse than the current one. In that case, a local optimum of the cost function has been found.

[3] In the original paper, there were four categories, but the last one dealt with the presence of rounding errors, which is not relevant here, and thus was omitted.

Individual local search strategies differ

• in the definition of the neighborhood, and
• in the way they search the neighborhood and accept better solutions.

If the neighborhood is finite and precisely defined, it can be searched using strict rules. Very often, the neighbors are searched in random order; the local search algorithm can then accept the first improving neighbor, the best improving neighbor, etc. In one particular variant of local search, so-called hill-climbing, an iteration consists of a slight perturbation of the current object; if the result is better, it becomes the current object, i.e. the algorithm accepts the first improving neighbor (see Alg. 1.1).

Algorithm 1.1: Hill-Climbing
  begin
    x ← Initialize()
    while not TerminationCondition() do
      y ← Perturb(x)
      if BetterThan(y, x) then
        x ← y
  end

The cost function f(x) (which is hidden in the function BetterThan in Alg. 1.1) takes the object x and evaluates it. In order to do that, it must 'understand' the argument x, i.e. we have to choose the representation of the object. Many optimization tasks have their obvious 'natural' representations, e.g. amounts of individual ingredients, Gantt charts, vectors of real numbers, etc.
However, the natural representation is often not the most suitable one: some optimization algorithms require a specific representation because they are not able to work with others, and sometimes the task at hand can simply be solved more easily in a different representation (this feature is strongly related to the neighborhood structure of the search space, described below). In such cases, we need a mapping function which transforms the object description from the search space representation (the representation used by the optimization algorithm when searching for the best solution) to the natural representation (suitable for evaluation of the objects). We say that a point in the search space represents the object.

In all local search procedures, the concept of the neighborhood plays a fundamental role. The most straightforward and intuitive definition of the neighborhood of an object is the following:

    N(x) = { y | dist(x, y) < ε },    (1.3)

i.e. the neighborhood N(x) of an object x is the set of objects y whose distance from x is lower than some constant. The neighborhood and its structure, however, are very closely related to the chosen representation of an object and to the perturbation scheme used in the algorithm. From this point of view, a much better definition of the neighborhood is

    N(x) = { y | y = Perturb(x) },    (1.4)

i.e. the neighborhood is the set of objects y which are accessible by one application of the perturbation.

In Fig. 1.1, we can see an example of a small search space represented as binary and integer numbers (the objective values of the individual points are not given there). Now, suppose that the perturbation for the binary representation flips one randomly chosen bit of the point 101B (i.e.
the neighborhood is the set of points with Hamming distance equal to 1), while the perturbation for the integer representation randomly adds or subtracts 1 to/from the point 5D (i.e. the neighborhood is the set of points with Euclidean distance equal to 1). It is obvious that these two neighborhoods neither coincide, nor is one a subset of the other.

Figure 1.1: Example of binary and integer neighborhoods (the search space written as the binary strings 000–111 and as the integers 0–7).

Furthermore, imagine what happens if we enlarge the search space so that we need 4 or more bits to represent the points in the binary space. The number of binary neighbors increases, while the neighborhood in the integer space remains the same. It is worth mentioning that a larger neighborhood results in a lower number of local optima, but also in a greater amount of time spent searching the neighborhood. If we chose a perturbation operator able to generate all points of the search space, we would have only one local optimum (which would also be the global one), but we would have to search the whole space of candidate solutions. For the success of local search procedures, it is crucial to find a good compromise in the size of the neighborhood, i.e. to choose the right perturbation operators.

1.2.1 Discrete vs. Continuous Spaces

The optimization tasks defined in discrete spaces are fundamentally different from the tasks defined in continuous spaces. The number of neighbors in a discrete space is usually finite, and thus we are able, e.g., to test whether a local optimum was found: we can enumerate all neighbors of the current solution, and if none is better, we can finish. This usually cannot be done in a continuous space: there the neighborhood is infinite, and something like enumeration simply does not make sense.[4]
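In the discrete case, such an enumeration is a one-liner. The two neighborhoods of Fig. 1.1 can be listed directly; a small sketch (illustrative code, not from the thesis) for the point 101B = 5D:

```python
def binary_neighbors(x, n_bits):
    # Hamming-distance-1 neighborhood: flip each of the n_bits bits in turn.
    return {x ^ (1 << i) for i in range(n_bits)}

def integer_neighbors(x, lo=0, hi=7):
    # Euclidean-distance-1 neighborhood on the integers lo..hi.
    return {y for y in (x - 1, x + 1) if lo <= y <= hi}

b = binary_neighbors(0b101, n_bits=3)   # {001, 111, 100} = {1, 7, 4}
i = integer_neighbors(5)                # {4, 6}
```

As the text observes, the two sets neither coincide nor is one a subset of the other; and with 4 bits, binary_neighbors grows to four points while the integer neighborhood keeps its size.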
To ensure that we have found a local optimum in a continuous space, we would need to know the derivatives of the objective function.

These features can be observed in another type of local search algorithm, called steepest descent (see Alg. 1.2). This algorithm examines all points in the vicinity of the current point and jumps to the best of them (i.e. it accepts the best improving neighbor). In a discrete space, this can be done by enumeration. In continuous spaces, the neighborhood is given by a line going through the current point in the direction of the gradient. The points on this line are examined, and the best one is selected as the new current point.[5]

[4] Of course, even in continuous spaces we can define the neighborhood to be finite, but in that case (without any regularity assumptions) there is a high probability of missing better solutions.

Algorithm 1.2: Steepest Descent
  begin
    x ← Initialize()
    while not TerminationCondition() do
      Nx ← { y | y = Perturb(x) }
      x ← BestOf(Nx)
  end

It is interesting to note that the steepest descent algorithm is deterministic, while hill-climbing is not. If started from the same initial point over and over, steepest descent always gives the same solution (in both discrete and continuous spaces). In contrast, hill-climbing can give different results in each run, as it depends on the order of evaluation of the neighbors, which is usually random. Steepest descent also spends a lot of time on unnecessary evaluation of the whole neighborhood in the case of a discrete search space. We can expect that it will find a local optimum in fewer iterations, but very often with many more objective function evaluations than hill-climbing; this depends on the size of the neighborhood and on the complexity of the search problem.
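To make the contrast concrete, here is a minimal Python sketch of Alg. 1.1 and Alg. 1.2 on the integer search space of Fig. 1.1 (the function names and the toy objective are illustrative, not from the thesis):

```python
import random

def int_neighbors(x, lo=0, hi=7):
    # Euclidean-distance-1 neighborhood on the integers 0..7 (Fig. 1.1).
    return [y for y in (x - 1, x + 1) if lo <= y <= hi]

def hill_climbing(f, x0, neighbors, iterations=1000):
    """Alg. 1.1: accept the first improving neighbor (stochastic)."""
    x = x0
    for _ in range(iterations):
        y = random.choice(neighbors(x))   # Perturb(x)
        if f(y) < f(x):                   # BetterThan(y, x), minimization
            x = y
    return x

def steepest_descent(f, x0, neighbors):
    """Alg. 1.2: enumerate the whole neighborhood and jump to its best
    point; stop when no neighbor improves, i.e. at a local optimum."""
    x = x0
    while True:
        best = min(neighbors(x), key=f)   # BestOf(Nx)
        if f(best) >= f(x):               # no improving neighbor: finish
            return x
        x = best

def f(x):
    return (x - 3) ** 2                   # toy objective with optimum at 3

x_hc = hill_climbing(f, 0, int_neighbors)
x_sd = steepest_descent(f, 0, int_neighbors)
```

Restarted from the same point, steepest_descent always retraces the same path, whereas hill_climbing visits the neighbors in random order; here both end in the same optimum only because the toy objective is unimodal.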
1.2.2 Local Search in the Neighborhood of Several Points

The algorithms described above work with one current point and perform the search in its neighborhood. One possible generalization is to use a set of current points (instead of a single one) and search in their common neighborhood. Again, the neighborhood N(X) of a group of points X, X = {x1, x2, ..., xN}, can be defined using a 'combination' operator as

    N(X) = { y | y = Combine(X) },    (1.5)

i.e. as the set of all points y which can result from some combination of the points in X. Individual algorithms then differ in the definition of the Combine operation.

This approach is not very often pursued in conventional optimization techniques. However, one excellent and famous example of this technique can be mentioned here: the Nelder-Mead simplex search introduced in Nelder & Mead (1965). This algorithm is used to solve optimization problems in R^D. In D-dimensional space, the algorithm maintains a set of D + 1 points that define a simplex in that space (an interval in 1D, a triangle in 2D, a tetrahedron in 3D, etc.). Using this simplex, it creates a finite set of candidate points which constitute the neighborhood of the simplex (see Eqs. 1.7 to 1.11 and Fig. 1.2). This neighborhood is neither searched in random order, as in the case of hill-climbing, nor is the best point selected, as in the case of steepest descent. Instead, not all neighbors are necessarily evaluated; there is a predefined sequence for assessing the quality of the points, along with rules for how to incorporate them into the current set of points defining the simplex.

[5] The search for the best point on the line is usually done analytically. The gradient at the best point is perpendicular to the line.

Figure 1.2: Points forming the neighborhood in Nelder-Mead simplex search.
The defining simplex is plotted with a dashed line; the points in the neighborhood are marked with black dots.

Suppose now that the simplex is given by points x1, x2, ..., xD+1 ∈ R^D, ordered so that f(x1) ≤ f(x2) ≤ ... ≤ f(xD+1), where f is the cost function. During one iteration, the neighborhood of the simplex consists of points computed as follows:

    x̄   = (1/D) · (x1 + x2 + ... + xD),             (1.6)
    yr  = x̄ + ρ(x̄ − xD+1),                          (1.7)
    ye  = x̄ + χ(yr − x̄),                            (1.8)
    yoc = x̄ + γ(yr − x̄),                            (1.9)
    yic = x̄ − γ(x̄ − xD+1),                          (1.10)
    ysi = x1 + σ(xi − x1),   i ∈ {2, ..., D + 1},   (1.11)

where the coefficients should satisfy ρ > 0, χ > 1, χ > ρ, 0 < γ < 1, and 0 < σ < 1. The standard settings are ρ = 1, χ = 2, γ = 0.5, σ = 0.5.

One iteration of the algorithm is presented as Alg. 1.3. Although this algorithm maintains a set of current points and the neighborhood is defined dynamically during the run, it still has the characteristics of an incomplete method. Nelder-Mead search is applicable in black-box optimization (which does not hold for steepest descent). Hill-climbing is not able to adapt the size and shape of its neighborhood, which is the most important feature of the Nelder-Mead method. However, the author is not aware of any systematic empirical comparison of the above-mentioned methods.

1.3 Genetic and Evolutionary Algorithms

Genetic and evolutionary algorithms (GEAs) have been known for a few decades and have proved to be a powerful optimization and search tool in many research and application areas. They can be considered a stochastic generalization of the techniques described above which search in the neighborhood of several points. They maintain a set of potential solutions during the search and are able to search in many neighborhoods of various sizes and structures at the same time. They are inspired by processes that can be observed in nature, mainly by natural selection and variation.
Algorithm 1.3: An iteration of the Nelder-Mead downhill simplex algorithm
  Input: x1, x2, ..., xD+1 so that f(x1) ≤ f(x2) ≤ ... ≤ f(xD+1)
  begin
    /* Reflection */
    compute yr using Eq. 1.7
    if f(x1) ≤ f(yr) < f(xD) then Accept(yr); exit
    if f(yr) < f(x1) then
      /* Expansion */
      compute ye using Eq. 1.8
      if f(ye) < f(yr) then Accept(ye); exit
      else Accept(yr); exit
    if f(yr) ≥ f(xD) then
      /* Contraction */
      if f(yr) < f(xD+1) then
        /* Outside contraction */
        compute yoc using Eq. 1.9
        if f(yoc) ≤ f(yr) then Accept(yoc); exit
      else
        /* Inside contraction */
        compute yic using Eq. 1.10
        if f(yic) < f(xD+1) then Accept(yic); exit
    /* Shrinking */
    compute points ysi using Eq. 1.11
    MakeSimplex(x1, ys2, ..., ysD+1)
  end

The terminology used to describe GEAs is also borrowed from biology and genetics. A potential solution, or a point in the search space, is called a chromosome or an individual. Each component of the chromosome (each variable of the solution) is called a locus, and a particular value at a certain locus is called an allele. The set of potential solutions maintained by the algorithm is called a population. The cost function (which is usually minimized, as the name suggests) is translated into the fitness function, which describes the ability of an individual to survive in the current environment (and should thus be maximized). The search performed by GEAs is called an evolution. The evolution consists of many iterations called generations, and in each generation the algorithm performs operations like selection, crossover, mutation, and replacement, which are described below.

GEAs gained their popularity mainly for two reasons:

1. they are conceptually easy, i.e. it is easy to describe the processes taking place inside them, and thus they are easy to implement, and

2. they have shown better global search abilities[6] than conventional methods, i.e.
for many problems, GEAs are able to find even those local optima which are hard to find for local optimizers, and furthermore, the operators of GEAs are usually chosen in such a way that they ensure the asymptotic completeness of the algorithm.

^6 We cannot decide if a certain algorithm performs local or global search until we know the representation of solutions and the definition of the neighborhood induced by the variation operators. E.g. the hill-climbing algorithm is usually regarded as a local search method because its perturbation operator is usually local. However, it can perform global search as well if the neighborhood contained all possible solutions, i.e. if the probability of generating a solution is greater than 0 for all possible ones.

In subsequent sections, the above mentioned topics are discussed. First, the well-established types of GEAs are described, and finally the general structure of an evolutionary algorithm is presented and explained.

1.3.1 Conventional EAs

The EAs are sometimes classified by the evolutionary computation (EC) community into four groups:

• Evolutionary programming (EP), which works with finite automata and numeric representations and has been developed mainly in the United States since the 1960s,
• Evolution strategies (ES), which work with vectors of real (rational) numbers and were invented in Germany in the late 1960s,
• Genetic algorithms (GAs), which work with binary representations and have been known since the 1970s, and
• Genetic programming (GP), which works with programs represented in tree form and is the most recent member of the EA family.

There are other types of EAs which are not included in the above list, but these are quite recent and differ only in the operators used in them. In fact, there is no need to draw any hard boundaries between the above mentioned four types of EAs.
They influence each other and cooperate by exchanging ideas and various techniques. What all EAs have in common is the evolutionary cycle, in which some form of selection and variation must be present. A typical evolutionary scheme is presented as Alg. 1.4. Not all of the presented operations have to be included, nor must they follow this order.

Algorithm 1.4: Genetic Algorithm
 1 begin
 2    X^(0) ← Initialize()
 3    f^(0) ← Evaluate(X^(0))
 4    g ← 0
 5    while not TerminationCondition() do
 6       X_Par ← Select(X^(g), f^(g))
 7       X_Offs ← Crossover(X_Par)
 8       X_Offs ← Mutate(X_Offs)
 9       f_Offs ← Evaluate(X_Offs)
10       [X^(g+1), f^(g+1)] ← Replace(X^(g), X_Offs, f^(g), f_Offs)
11       g ← g + 1
12 end

The initialization phase creates the first generation of individuals. Usually, they are created randomly, but if the designer has some knowledge of the placement of good solutions in the search space, he can use it and initialize the population in a biased way. The population is then evaluated using the fitness function, and the main generational loop starts.

The selection phase designates the parents, i.e. it chooses some of the individuals in the current population to serve as a source of genetic material for the creation of new offspring. Selection should model the phenomenon that better individuals have higher odds of reproducing; it is thus driven by the fitness values. There are many selection schemes, based on the following three basic types:

• proportional, which takes the raw fitness values and from them calculates the probability of including each individual in the parental set;
• rank-based, which does not use the raw fitness values to compute the probabilities; instead, it sorts the individuals and (to simplify somewhat) the rank becomes the fitness value; and
• tournament, which takes small random groups of individuals from the current population and designates the best in each group as a parent.
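As an illustration of Alg. 1.4, the sketch below instantiates the template with tournament selection, 1-point crossover, bit-flip mutation, and full generational replacement on binary strings. All names and parameter values here are illustrative choices of one possible instantiation; the scheme itself prescribes none of them.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=60,
           p_mut=0.05, tour_size=3, seed=1):
    """A toy instantiation of the generational loop in Alg. 1.4
    (fitness is maximized)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fit = [fitness(x) for x in pop]
    for g in range(generations):
        offs = []
        while len(offs) < pop_size:
            # tournament selection: the best of `tour_size` random individuals
            p1 = max(rng.sample(range(pop_size), tour_size), key=lambda i: fit[i])
            p2 = max(rng.sample(range(pop_size), tour_size), key=lambda i: fit[i])
            cut = rng.randrange(1, n_bits)                       # 1-point crossover
            child = pop[p1][:cut] + pop[p2][cut:]
            child = [b ^ (rng.random() < p_mut) for b in child]  # bit-flip mutation
            offs.append(child)
        pop, fit = offs, [fitness(x) for x in offs]              # full replacement
    return max(pop, key=fitness)
```

On the OneMax function (fitness = number of ones), `evolve(sum)` typically returns a string close to all ones.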
After the set of parents is determined, the variational operators, crossover and mutation, are applied to create new, hopefully better, offspring individuals. Crossover performs the search in the neighborhood of several individuals: it usually takes two or more parents, combines them, and creates one or more offspring. The particular process used to combine the parents into new individuals is strongly dependent on the chosen representation and will not be described here (for more information see e.g. Goldberg (1989) or Michalewicz & Fogel (1999)). Mutation is equivalent to a local search strategy in the neighborhood of one point: it usually takes one individual, perturbs it a bit, and the resulting 'mutant' joins the other offspring.

After the creation of the offspring population is completed, all of its members are evaluated. Now we have the old population and the population of offspring solutions. Among all individuals of these two populations, the competition for natural resources takes place: not all parents survive to the next generation. Sometimes the principle of the survival of the fittest holds, at least in a probabilistic form, i.e. the fitter the individual, the higher the probability it will survive; sometimes the old population is replaced by the new one as a whole. The implementation of this procedure is called replacement. Although it is possible to use a dynamic population size, most often it is required to be constant, and the replacement then discards many generated solutions.

1.3.2 EA: A Simple Problem Solver?

The EA metaheuristic is very simple. For a great majority of problems, if one implements the fitness function and decides on the representation, he can simply use some of the 'standard' operators suitable for the chosen representation. The EA will then run and will produce some results. Using such an EA, one can easily evolve e.g.
matrices, graphs, nets, sets of rules, images, designs of industrial facilities, etc. Making EAs run without any special domain knowledge is thus a very easy job for a wide range of applications. This flexibility and wide applicability of EAs is their greatest strength and their greatest weakness at the same time. For all these reasons, it is very tempting to describe EAs as simple problem solvers. In fact, things are not that easy. If we apply this naive approach, our EA is very likely to suffer from several potential negative features described below.

Instantiation of EA

An EA constitutes a general framework, a template. To make it work, one must choose a representation of solutions and define some 'call-back' functions (initialization, selection and replacement methods, crossover and mutation operators, parameters of the methods, etc.) which are then plugged into the template. After that, an instance of the EA is created and can be used. The task of defining these functions can be called the instantiation of the EA. For some types of EAs, the amount of 'things to be set up' is really large, and this can be a source of frustration for many EA designers. Finding a really good instantiation is often a very hard problem in itself.

The problem of a good instantiation is closely related to the so-called No-Free-Lunch Theorem (Wolpert & Macready (1997)), which states: All search procedures based on sampling of the search space have, on average, the same efficiency across all possible problems. Something similar holds even for the representation and for the variational operators. When designing an EA with the aim of better efficiency, we have to apply domain-specific knowledge, which will help the algorithm to 'search in the right direction', but will deteriorate the algorithm's behavior in other domains. Similarly, if our aim is broader applicability, we have to develop an algorithm that will e.g.
learn the problem characteristics from its own evolution, at the expense of lower efficiency.^7 In that case we need very flexible variational operators that are able to adapt to changing living environments.

Linkage, Epistasis, Statistical Dependency

The basic versions of EAs usually treat the individual components of the solutions as statistically independent of each other. If this assumption does not hold, we speak of linkage, epistasis, or statistical dependency of the individual components. Linkage presents a severe issue for EAs. It is closely related to the issue of instantiation. Sometimes a problem at hand could be successfully solved using a basic version of an EA, but by choosing a bad representation we can accidentally introduce dependencies into the algorithm and greatly deteriorate the efficiency of the EA. Most often, however, it is not easy (or it is impossible) to come up with a representation which reduces or removes the dependencies, because we know nothing or very little about the fitness function.

Premature Convergence

During the evolution it can easily happen that one sub-optimal individual takes over the whole population, and the search almost stops. This phenomenon is called premature convergence. It is not an issue in itself; rather, it is a symptom of a bad instantiation of the EA. Some types of EAs are more resistant to this phenomenon, namely the algorithms that rely more on the mutation type of variational operator than on the crossover type. One of the EA parameters has a significant effect on premature convergence: the population size. If it is small, there is a much bigger chance that premature convergence emerges. A too large population, on the other hand, wastes computational resources.

^7 I hypothesize that the No-Free-Lunch theorem holds even for algorithms that learn problem characteristics during their run, i.e.
although they have the ability to adapt to one class of problems, they are misled by the 'complementary' class of problems. All the work on creating algorithms able to learn is carried out in the hope that the class of problems solvable by these algorithms corresponds to a great extent with the class of problems people try to solve in their real lives.

1.3.3 EA: A Big Puzzle!

To make an EA run is an easy task; to make it run successfully is a very hard one. If we are interested in creating an EA instance that is tuned (or is able to tune itself) to the problem being solved, we have to solve a big puzzle and carefully assemble the individual pieces so that they cooperate and do not harm each other, in order to overcome at least some of the above mentioned problems. This effort can be aimed at creating algorithms which have a lower number of parameters that must be set by the designer and are more robust, so that we can apply them to a broader range of problems without substantial changes. The effort can be aimed at automatic control of some EA parameters, so that the algorithm itself actively tries to fight premature convergence. The effort can be aimed at detecting the relations between variables and at incorporating this extra knowledge into the evolutionary scheme. The possibilities for optimizing the EA are countless; however, they can be classified into several groups:

Theory It would be great if we had a kind of EA model which would tell us how a particular instantiation of the EA would behave. Such general and practically applicable models, however, do not exist at present. The interactions between individual parts of an EA can be easily described, but hardly analyzed. In spite of that, there are some attempts to build theoretical foundations for EAs. Some attempts to build a theoretical apparatus for the optimal population sizing of GAs can be found in Goldberg (1989), Goldberg et al.
(1992), or Harik, Cantù-Paz, Goldberg & Miller (1997), and for optimal operator probabilities in Schaffer & Morishima (1987), or Thierens & Goldberg (1991). Eiben et al. (1999) present a judgment of these theoretical models: '. . . the complexities of evolutionary processes and characteristics of interesting problems allow theoretical analysis only after significant simplification in either the algorithm or the problem model. . . . the current EA theory is not seen as a useful basis for practitioners.'

Deterministic Heuristics The characteristics of the population change during the evolution, and thus a static setting of EA parameters is not optimal. Deterministic control changes some parameters using a fixed time schedule. The works listed in the next paragraph show that for many real-world problems, even a non-optimal choice of schedule often leads to better results than a near-optimal choice of a static value. A deterministic schedule for changing the mutation probability can be found in Fogarty (1989). Janikow & Michalewicz (1991) present a mutation operator which has the property of searching the space uniformly at first, and very locally at later stages. In Joines & Houck (1994), the penalties of solutions to a constrained optimization problem are transformed by a deterministic schedule, resulting in a dynamic evaluation function.

Feedback Heuristics With feedback heuristics, the changes of parameters are triggered by some event in the population. The typical feature of EAs using feedback heuristics is the monitoring of some population statistics (relative improvement of the population, diversity of the population, etc.). Julstrom (1995) presents an algorithm that uses one of the mutation or crossover operators more often than the other, based on their previous success. Davis (1991) used a similar principle for multiple crossover operators.
Lis & Lis (1996) present a parallel model where each subpopulation uses different settings of parameters. After a certain time period, the subpopulations are compared and the settings are shifted towards the values of the most successful subpopulation. Shaefer (1987) adapts the mapping of variables (resolution, range, position) based on convergence and variance of the genes. A co-evolutionary approach was used to adapt the evaluation function e.g. by Paredis (1995).

Self-Adaptation The roots of self-adaptation can be found in ES, where the mutation steps are part of the individuals. The parameters of the EA undergo the evolution together with the actual solution of the problem. Bäck (1992) self-adapts mutation rates in a GA. Fogel et al. (1995) use self-adaptation to control the relative probabilities of five mutation operators for the components of a finite state machine. Spears (1995) added one extra bit to the individuals to discriminate whether 2-point or uniform crossover should be used. Schaffer & Morishima (1987) present a self-adaptation of the number and locations of crossover points. Diploid chromosomes (each individual has one additional bit which determines whether one should use the chromosome itself or its inversion) are also a form of self-adaptation (see e.g. Goldberg & Smith (1987)). In the context of ES, Rudolph (2001) has proved that the self-adaptation of mutation steps can lead to premature convergence.

Meta Optimizers To adapt the settings of an EA, we can use another optimizer. This procedure usually involves letting the EA run for some time with certain settings and 'measuring' its performance. Similarly to self-adaptation, the optimization process usually takes a longer time, since we search a much larger space (object parameter space + EA parameter space). An example of this approach can be found e.g. in Grefenstette (1986).
New Types of EAs Although many researchers have reported promising results using adaptation in EAs, it is not clear whether adaptation can solve our problems completely. Eiben et al. (1999) state that one of the main obstacles to optimizing EA settings is formed by the interactions between these parameters. If we use some basic form of EA (which treats the variables independently of each other) to optimize the EA parameters (which usually interact with each other), we can hardly find the optimal settings. However, new types of EAs (or new types of variational operators and other EA components) can be of great help: algorithms that have fewer parameters to set (or optimize), and that address perhaps the most severe issue in EA design, the interactions among variables. Before describing methods that would allow us to construct such algorithms or operators accounting for dependencies among variables, we have to clarify the influence of the chosen representation on the operator-induced neighborhood structure, and the roles of crossover and mutation in EAs.

1.3.4 Effect of the Genotype-Phenotype Mapping in EAs

As stated earlier, optimization algorithms can work with two (or more) representations of possible solutions. The representation that can be directly interpreted as a solution and evaluated is called the phenotype; the other representation, on which the variational operators are applied, is called the genotype. This subsection briefly demonstrates the phenotype-level effect of this genotype-phenotype mapping when it is used with variational operators on the genotype level. The phenotype used in the examples below is a pair of integer numbers. Both are encoded as 5-bit binary numbers, i.e. the genotype is a bit string of length 10, and the mapping itself is of type B^10 → I^2. The first example in Fig. 1.3 shows all possible neighbors of a given individual induced by a single bit-flip mutation operator applied to the genotype.
The neighbors are located on lines parallel to the coordinate axes with the origin in the given point. The number of neighbors is independent of the given point. The second example in Fig. 1.4 shows the neighborhood structure of two parents induced by a 1-point crossover. With this kind of crossover, only one of the phenotype coordinates can differ from both parents. The number of neighbors varies depending on the pair of parents. The last example in Fig. 1.5 shows the neighborhood structure of two parents induced by a 2-point crossover. In this case the neighbors can differ from both parents in both phenotype coordinates, i.e. the neighborhood structure can be rather complex, and its size is variable as in the previous case.

These examples demonstrate that simple operations on the genotype level can induce rather complex patterns on the phenotype level. If some genotype-phenotype mapping is employed in the algorithm, then e.g. a local search procedure performed on the genotype level can hardly be described as a local search on the phenotype level; the genotype-level local search may seem very unstructured on the phenotype level. Using a genotype that induces a high number of neighbors with mutation and crossover operators (e.g. binary representation) enhances the global search abilities of the algorithm. However, the usefulness of the genotype-phenotype mapping (as used in the above examples) is questionable and more or less accidental, because the mapping is 'hard-wired' into the algorithm, i.e. for some instances of optimization tasks it can be advantageous, but for other ones of the same type and complexity it may fail completely. As designers of EAs, we should take control of this aspect and use adaptive genotype-phenotype mappings rather than the obvious hard-wired solutions.
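The mutation-induced neighborhood of Fig. 1.3 is easy to enumerate explicitly: flipping the k-th bit of a 5-bit encoded coordinate x replaces it with x XOR 2^k. A small sketch (the function name is illustrative, not from the thesis):

```python
def neighbors_bitflip(x, y, bits=5):
    """Phenotype neighbors of the integer pair (x, y) under a single bit-flip
    applied to the concatenated 5-bit genotypes (the B^10 -> I^2 mapping)."""
    return ([(x ^ (1 << k), y) for k in range(bits)] +   # flips inside x's bits
            [(x, y ^ (1 << k)) for k in range(bits)])    # flips inside y's bits
```

Each point always has exactly 10 neighbors, all lying on the two axis-parallel lines through (x, y), which matches the structure plotted in Fig. 1.3 and the observation that the neighborhood size does not depend on the point.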
1.3.5 The Roles of Mutation and Crossover

One of the EA's advantages compared to other techniques is its ability to escape from local optima and to use the population as a facility to learn the structure of the problem.^8 Is it the crossover, or the mutation, that ensures these features?^9

Mutation generates offspring in the local neighborhood of one point, i.e. it has only one point as its input. As explained in the previous subsection, if some genotype-phenotype mapping is present, mutation can generate new individuals that can hardly be described as neighbors in the sense of a phenotype distance. Without such a mapping, however, there is no chance for mutation to generate a non-neighbor; thus, it cannot escape from a local optimum. Based on one individual (with which the mutation works), we cannot learn anything important about the global structure of the problem.

^8 To learn the structure of the problem: create a variational operator which is able to modify itself based on the previous population members given as its input, with the aim of reaching a higher probability of creating good individuals compared to a non-adaptive operator.
^9 The ability to escape from local optima is also ensured by the stochastic selection process, i.e. by the ability to temporarily worsen the fitness of the population.

Figure 1.3: The neighborhoods induced by the binary mutation for points (18,12) and (16,16). [plots omitted]
Figure 1.4: The neighborhoods induced by the binary 1-point crossover for pairs (15,15)-(16,16) and (15,15)-(31,31). [plots omitted]
Figure 1.5: The neighborhoods induced by the binary 2-point crossover for pairs (15,15)-(16,16) and (18,17)-(12,10). [plots omitted]
With crossover, on the other hand, we can learn the problem structure. In the context of GAs, Harik (1999) described the role of crossover as follows:

'. . . The GA's population consists of chromosomes that have been favored by evolution and are thus in some sense good. The distribution that this population represents tells the algorithm where to find other good solutions. In that sense, the role of crossover is to generate new chromosomes that are very much like the ones found in the current population. . . . '

Harik emphasizes that the distribution of the selected individuals is the key element that should help us in creating new individuals. To take advantage of the distribution (to estimate it and use it in the creation process), we need an operator that takes more than one parent as input; we have to use crossover with more parents (even more than 2, perhaps the whole population). This enables us to account for the dependencies among variables. The exact meaning of '. . . very much like . . . ' depends on the particular operator, i.e. on the neighborhood structure that it induces. For such a crossover, some more complex models of the distribution of the selected individuals must be used. Each of them, however, has its own assumptions of varying strength. The two extremes are:

• A flexible model with weak structural assumptions is able to fit the data very well. The neighborhood of several points is then usually very similar to a union of the individual neighborhoods.

• An inflexible model with strong structural assumptions usually fits the data with less fidelity, but if the problem being solved satisfies the assumptions, this model can create offspring individuals even in promising areas far away from the individual neighborhoods of the parents.
Models from the whole range between the above mentioned extremes are able to learn the structure of the problem.^10 Models with strong assumptions, however, have a higher ability to escape from local optima if their assumptions fit the solved problem well. They can usually be considered parametric models, i.e. they enforce one particular type of model and we learn only the model parameters.

1.3.6 Two Ways of Linkage Learning

As stated above, linkage presents a serious issue for EAs, but there are some possibilities to overcome this limitation. Among other methods, I would like to point out the following two:

Probabilistic models In the framework of estimation of distribution algorithms (see Chap. 2.1), we can account for the statistical dependencies by using a probabilistic model that is flexible enough to cover these interactions. This approach originated in the field of GAs and was then transferred to real-valued spaces.

Coordinate-system transforms The individuals can be transformed from the search space into another one in which the linkage is greatly reduced. There we can apply ordinary variational operators. This method is much more usual in real-valued spaces; for ordinal (or even categorical) variables we could hardly find any profitable transform.

Probabilistic model + coordinate transform In real-valued spaces, many of the complex probabilistic models can be viewed as a combination of learning (1) a possibly non-linear coordinate transform and (2) a much simpler probabilistic model of the distribution of the transformed data points. Sampling from such a complex model then amounts to (1) sampling from the simple probabilistic model in the transformed space and (2) converting the new data points into the original space via the inverse transform.

^10 The meaning of the term problem structure is clarified in footnote 8 on page 13.
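The transform-plus-simple-model view just described can be sketched as follows: estimate a linear (PCA-like) decorrelating transform from the selected parents, fit an independent univariate Gaussian per rotated coordinate, sample, and map back. This is a minimal illustration of the principle under those assumptions, not one of the algorithms proposed later in the thesis.

```python
import numpy as np

def transform_sample(parents, n_new, rng=None):
    """Sample offspring by (1) rotating the parents into decorrelated
    coordinates, (2) fitting a univariate Gaussian per coordinate there,
    and (3) mapping new samples back via the inverse transform."""
    rng = rng or np.random.default_rng(0)
    mu = parents.mean(axis=0)
    cov = np.cov(parents, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # rotation that decorrelates
    z = (parents - mu) @ eigvecs               # transformed, ~independent coords
    std = z.std(axis=0)                        # simple univariate model per axis
    z_new = rng.standard_normal((n_new, parents.shape[1])) * std
    return z_new @ eigvecs.T + mu              # inverse transform to search space
```

Because the rotation is orthogonal, the inverse transform is just the transpose; the offspring inherit the linear correlation structure of the parents even though the model in the rotated space treats the coordinates independently.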
It is often difficult to distinguish which method an algorithm uses; they often cooperate. Both of the above mentioned methods of linkage learning can be used as parts of crossover and mutation operators, but with different purposes. In the case of mutation, they are used to estimate the structure of the local neighborhood of the current data point; in the case of crossover, some structural characteristics of the whole population are estimated.

1.4 The Topic of this Thesis

This section gives a bird's-eye view of the goals of this thesis. It describes the location of my efforts on the map of EA design.

1.4.1 Classification of EAs Targeting Efforts

EAs can be classified using many criteria. Here I present three that are relevant to the topic of this thesis:

• Representation. EAs can work with many different representations (binary strings, real vectors, trees, etc.).

• Linkage learning approach. The learning of the problem structure is usually based 1. on the analysis of the population distribution in the search space, or 2. on the analysis of the fitness landscape. The basic difference between these two approaches lies in the fact that the fitness landscape is usually fixed during the evolution, while the distribution of the population changes every generation.

• Presence of selection. Traditional EAs use selection in its original form, i.e. only a subset of the population is used to create new individuals; neither crossover nor mutation takes the fitness values into account. Recently, several algorithms able to work without selection (even though they may use it) have emerged. They use forms of variational operators that work even with bad individuals (which would otherwise be discarded by selection) and create offspring by weighting the population members.
Target area of this thesis

With regard to these criteria, the algorithms presented in this thesis can be described as follows:

• they use vectors of real numbers as the representation,
• they use distribution-based linkage learning, and
• they use the selection operator in its original sense.

If any presented algorithm does not match these specifications, it will be explicitly mentioned.

1.4.2 Requirements for a Successful EA

In the area of real-coded EAs, there are several 'empirical' features which are essential for a successful EA:

1. Envisioning the structure. The algorithm should be able to create individuals in previously unseen areas of the search space. This can be accomplished by a crossover operator with a less flexible model, if the population distribution suggests such a possibility. This is clearly the hardest part in real-valued spaces, as we do not know the type of the structure; we do not even know what complexity of the problem structure we should expect.

2. Focusing the search. The algorithm must be able to intensify the search in areas where several population members already reside. Such an operation is usually carried out by some interpolation or averaging. It requires several individuals to estimate this kind of structure and is thus accomplished by crossover with highly flexible models.

3. Population shift. The two previous requirements apply to discrete spaces as well. In real-valued spaces, the algorithm must additionally be able to efficiently shift the population members to better values. This usually amounts to searching the local neighborhoods of individual population members and is usually accomplished by mutation-like heuristics. In discrete spaces, 'shifting' the population does not make sense, since no individual can make several steps in the same direction, which is often a must for a real-valued EA.

4. Making it cooperate.
All the above mentioned features are essential, and they must work together to successfully solve the problem.

1.4.3 The Goals

The design of a simple algorithm which would satisfy all the above requirements is a very hard task and an overly ambitious goal. However, if we had efficient methods for the first three requirements, we could make them work together in several ways; the information interchange among them would be the crucial part. Making several heuristics cooperate is a very broad topic; it deserves its own dissertation (or even several of them). This thesis thus aims at the first three requirements, i.e. it aims at using probabilistic models and coordinate transforms in the following situations:

1. The problem has no (or weak) interactions among variables. Many empirical problems have been tackled by methods (even by local search methods) that assumed independence of the individual object variables. Population-based optimization algorithms should further improve the efficiency of such algorithms in terms of global search abilities, and should take advantage of the fact that by mixing good partial solutions a new, better solution is often obtained (even in previously unseen areas of the search space).

2. The problem has linear interactions among variables. Algorithms with the assumption of independence are too constrained. The use of coordinate transforms to decorrelate the problem variables should give more degrees of freedom to the model, which should in turn be useful for a broader range of problems.

3. The problem has higher-order interactions among variables. The solutions in promising areas form various clusters or curves in the search space. The models must be very flexible and data driven to be able to describe such distributions; this in turn limits their ability to shift the population.

4. The population members are placed in an area that does not contain any local optimum.
In this situation, many EAs have problems with premature convergence (their ability to search outside the area occupied by the population is limited) or with an insufficient speed of convergence. This situation does not usually arise when using conventional direct search methods (e.g. Nelder-Mead). A new type of EA should be able to traverse the search space efficiently.

1.4.4 The Roadmap

Chapter 1 introduced the notions of optimization, local search algorithms, and evolutionary algorithms (viewed as methods for local search in the neighborhood of several points), and thus showed the location of the topic of this thesis on the map of optimization methods. It emphasized the fact that the neighborhood structure is the crucial part of an optimization algorithm: if the neighborhood structure is not suitable for the particular optimization problem, the problem can hardly be solved efficiently, quickly, and reliably.

In Chapter 2, the estimation of distribution algorithms are introduced, along with the way of incorporating coordinate transforms into them. The chapter surveys the various kinds of probabilistic models used in EAs for continuous and discrete optimization problems, because the discrete models are often generalized for use in the continuous domain. A few explicit applications of coordinate transforms found in the literature are surveyed as well.

The univariate probabilistic models and the EAs that use them are elaborated in the first part of Chapter 3. Three types of histograms and a mixture of Gaussians are compared on a few test problems. In the second part of the chapter, the probabilistic models are coupled with linear coordinate transforms to decorrelate the object variables (or to make them as independent as possible). Principal component analysis and independent component analysis are compared and evaluated. Chapter 4 presents two flexible models able to describe higher-order interactions among variables.
The first one is a distribution tree model which builds on the classification and regression tree learning method, while the second model is built on kernel principal component analysis.

Chapter 5 presents a theoretical explanation of the premature convergence phenomenon for one particular instantiation of an EA. Then, it describes the evolutionary algorithm with covariance matrix adaptation—one of the most successful EAs that use a linear coordinate transform and successfully prevent premature convergence. A comparison of this method to the Nelder-Mead search strategy is presented. Finally, an improvement of the method is suggested and evaluated.

Chapter 6 concludes the thesis, lists its main contributions and provides guidelines for future work.

Chapters 3, 4, and 5 contain results of experimental comparisons of the presented methods with competitors which I consider to be relevant. The test problems are carefully chosen to reveal the main advantages and disadvantages of the individual optimizers.

Chapter 2: State of the Art

This chapter contains a survey of techniques for probabilistic modeling and coordinate transforms with respect to EAs. First, the general characteristics of estimation of distribution algorithms are described, along with some methods for probabilistic modeling in discrete spaces. Even though discrete probability distributions are not the main topic of this thesis, many of the continuous distributions are based on their discrete counterparts. Finally, a short section gives attention to the use of coordinate transforms in real domains.

2.1 Estimation of Distribution Algorithms

In Chapter 1, the basic functionality of evolutionary algorithms was briefly described.
The principle they use is very simple: an EA usually starts with a randomly initialized population of individuals, from which the more promising ones are selected to breed new individuals by means of (general or specialized) variation operators. This process is repeated until some termination condition is satisfied. Section 1.3.5 pointed out the necessity of using multi-parent variation operators (operators with two or more individuals as inputs), which offer us the possibility to learn the linkage information. At this point we can use many methods developed in statistics, machine learning, or data mining. We can, e.g., model the fitness landscape with a (possibly non-linear) regression model of a known type for which the place of the optimum is known. This estimate of the optimum can be taken as the offspring coming out of such a generalized crossover operator.

There is, however, another very general approach to the variation phase. Instead of the classic select-crossover-mutate strategy, we can use a select-model-sample approach (see Alg. 2.1). These algorithms are called Estimation of Distribution Algorithms (EDAs), Probabilistic Model Building Genetic Algorithms (PMBGAs), or Iterated Density Estimation Algorithms (IDEAs), and they form a substantial part of this thesis.

2.1.1 Basic Principles

During evolution, the EDAs build an explicit probabilistic model. After each selection phase, only the good-enough individuals serve as the basis for building the model, which (more or less) describes the distribution of good individuals in the search space. When the model is ready, the EDA produces new

Algorithm 2.1: Estimation of Distribution Algorithm
("Do not search for the optimum, search for the right distribution!")
begin
    X(0) ← Initialize()
    f(0) ← Evaluate(X(0))
    g ← 0
    while not TerminationCondition() do
        XPar ← Select(X(g), f(g))
        M ← CreateModel(XPar)
        XOffs ← Sample(M)
        fOffs ← Evaluate(XOffs)
        [X(g+1), f(g+1)] ← Replace(X(g), XOffs, f(g), fOffs)
        g ← g + 1
end

offspring individuals by sampling from this model, creating more individuals in the promising regions than elsewhere. The overall characteristics and features of an EDA as a whole depend mainly on the type of the probabilistic model employed. We can use a very simple model which can be built very quickly; the resulting EDA is very fast and accurate when solving easy problems, but it may not solve problems which involve interactions between variables (just like an ordinary EA). The main advantage of EDAs is the fact that we can use a model of arbitrary complexity. If we employ a sophisticated probabilistic model that can capture (possibly non-linear) dependencies in the population, the resulting EDA will be able to solve very hard problems. Although such an EDA will need large populations and a large number of evaluations, conventional EAs might not be able to solve such problems at all.

2.1.2 EDAs in Discrete Spaces

Probably the first algorithm for the binary representation which uses the principles of EDAs is the population-based incremental learning (PBIL) of Baluja (1994). PBIL works in a binary space and adapts a vector of probabilities p_i that the i-th component of the solution is zero. First, it generates a few individuals in accordance with the probability vector, then it selects the best one, and modifies the vector of probabilities with a kind of Hebbian learning rule, i.e. the model is built by exponentially forgetting the values of the old best vectors and moving towards the current best vector. A similar principle is used by the compact genetic algorithm (cGA) (Harik, Lobo & Goldberg 1997).
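The select-model-sample loop of Alg. 2.1 can be made concrete with a minimal runnable sketch: a continuous EDA with a per-dimension Gaussian model minimizing the Sphere function. All names, parameter values, and the generational replacement scheme are illustrative choices of this sketch, not taken from the thesis.

```python
import random
import statistics

def sphere(x):
    # Sphere function: global minimum 0 at the origin.
    return sum(xi * xi for xi in x)

def run_eda(dim=3, pop_size=100, generations=80, sel_frac=0.3, seed=0):
    rng = random.Random(seed)
    # Initialize(): uniform population in [-5, 5]^dim
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Select(): truncation selection of the best sel_frac fraction
        pop.sort(key=sphere)
        parents = pop[: max(2, int(sel_frac * pop_size))]
        # CreateModel(): per-dimension mean and standard deviation (univariate Gaussian model)
        means = [statistics.fmean(p[d] for p in parents) for d in range(dim)]
        stds = [statistics.pstdev([p[d] for p in parents]) + 1e-12 for d in range(dim)]
        # Sample() + Replace(): here, simple generational replacement by a freshly sampled population
        pop = [[rng.gauss(means[d], stds[d]) for d in range(dim)] for _ in range(pop_size)]
    return min(pop, key=sphere)

best = run_eda()
```

Note that the sketch re-estimates the model from scratch each generation; incremental schemes such as PBIL (described above) update a persistent model instead.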
Similarly to PBIL, the cGA uses a vector of probabilities, but adapts it by a different learning rule which resembles tournament selection. The univariate marginal distribution algorithm (UMDA) of Mühlenbein & Paass (1996) is already a proper EDA—it selects the promising solutions and models the distribution of zeros and ones by computing a vector of probabilities. All the above mentioned algorithms treat the variables independently of each other, and their behavior is similar.

There are also models which detect the dependencies between pairs of variables. They are the first non-trivial EDAs, as they are able (to a certain extent) to solve the linkage problem. They differ mainly in the form of the model they create; the models include a chain, a tree, or a forest. The algorithm called mutual-information-maximizing input clustering (MIMIC) presented by de Bonet et al. (1997) builds a model of bivariate dependencies in the form of a chain. The sampling is then based on the chain rule, i.e. on the probability of the first variable, on the conditional probability of the second variable given the value of the first one, etc. Baluja & Davies (1997) used dependency trees to encode the relationships between the variables, while Pelikan & Mühlenbein's (1999) bivariate marginal distribution algorithm (BMDA) employed a forest distribution (a set of mutually independent dependency trees). These algorithms are able to identify and properly mix building blocks[2] of size two, but that is still insufficient to solve more complex problems.

EDAs with more sophisticated probabilistic models can describe even multivariate dependencies between the variables. The bottleneck of these methods is the fact that they need complex algorithms to learn the distribution. This results in a significant learning time overhead. Furthermore, in almost all cases we end up only with a sub-optimal model of the joint probability distribution.
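The chain-rule sampling used by MIMIC can be illustrated with a small sketch for binary variables. The fixed chain order, the Laplace correction, and the toy data are assumptions of this sketch, not details of the original algorithm (which also searches for a good variable ordering).

```python
import random

def fit_chain(data, order):
    """Estimate p(x_first) and p(x_child = 1 | x_parent) along a fixed chain of binary variables."""
    n = len(data)
    p_first = sum(row[order[0]] for row in data) / n
    cond = []  # cond[i] = (p(child=1 | parent=0), p(child=1 | parent=1))
    for parent, child in zip(order, order[1:]):
        probs = []
        for v in (0, 1):
            subset = [row for row in data if row[parent] == v]
            # Laplace correction keeps the conditional defined even for empty subsets
            probs.append((sum(r[child] for r in subset) + 1) / (len(subset) + 2))
        cond.append(tuple(probs))
    return p_first, cond

def sample_chain(p_first, cond, order, rng):
    """Sample one binary string via the chain rule: first variable, then each conditional in turn."""
    x = [0] * len(order)
    x[order[0]] = int(rng.random() < p_first)
    for (parent, child), probs in zip(zip(order, order[1:]), cond):
        x[child] = int(rng.random() < probs[x[parent]])
    return x

rng = random.Random(1)
# Toy data: bit 1 copies bit 0, bit 2 is noise; the chain should capture the copy dependency
data = [[b, b, rng.randrange(2)] for b in [rng.randrange(2) for _ in range(200)]]
p0, cond = fit_chain(data, order=[0, 1, 2])
sample = sample_chain(p0, cond, [0, 1, 2], rng)
```

After fitting, the first conditional table reflects the copy dependency: p(x_1 = 1 | x_0 = 1) is close to 1 and p(x_1 = 1 | x_0 = 0) close to 0.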
The factorized distribution algorithm (FDA) of Mühlenbein et al. (1999) can contain multivariate marginal and conditional probabilities. However, the factorization of the distribution must be known a priori, e.g. given by an expert. The extended compact genetic algorithm (ECGA) (Harik 1999) uses different principles. It divides the variables into several mutually independent groups, and each of these groups is described by a complete joint distribution. The division into groups is carried out in a greedy manner with respect to the Bayesian information criterion.

The most complex model for discrete variables that can be learned is probably a general Bayesian network (the chain, tree and forest are special cases of a Bayesian network). The first EDA that used this complex model is the Bayesian optimization algorithm (BOA) of Pelikan et al. (1998). The network for a population of good solutions is learned every generation by a greedy algorithm which tries to minimize the Bayesian-Dirichlet metric between the distribution encoded in the model and the observed distribution. This network is then sampled to create new individuals. BOA is able to solve problems of bounded difficulty in subquadratic time. The same model was independently used in an EDA by Etxeberria & Larrañaga (1999) under the name estimation of Bayesian network algorithm (EBNA). The FDA was later improved to be able to learn the factorization; the modified version is called the learning factorized distribution algorithm (LFDA).

2.1.3 EDAs in Continuous Spaces

As stated in Sec. 1.2.1, there are fundamental differences between discrete and continuous spaces. Another example of this difference:

• All possible conditional PDFs p(X2|X1) with two binary variables X1 and X2 can be described with only two numbers: p(X2 = 1|X1 = 0) and p(X2 = 1|X1 = 1).
• A model of a conditional PDF p(X2|X1) with two continuous variables X1 and X2 can take much more complex forms, with different parameters for each possible value of X1—a mere two numbers (as in the case of the binary PDF mentioned above) usually do not suffice.

Thus, searching for the right probabilistic model in the continuous domain involves not only the factorization of the PDF, but also the search for the right type of distributions along with their parameters. The search for the right model structure often has to be limited to a class of (rather simple) models (or is carried out only in the space of model parameters). Nevertheless, many continuous EDAs use direct generalizations of the models used in discrete EDAs.

[2] Building blocks are defined as short, over-average schemata of low size. For more information, see e.g. Goldberg (1989).

The easiest EDAs in continuous spaces take no dependencies among variables into account. The joint density function is factorized as a product of univariate independent densities. For example, the univariate marginal distribution algorithm for continuous domains (UMDA_C) by Larrañaga et al. (1999) belongs to this type of EDAs. This algorithm performs statistical tests in order to determine which of the theoretical density functions fits each particular variable best. Then, it computes the maximum likelihood estimates of the parameters of the chosen univariate marginal PDFs. Another example of an EDA using a model without dependencies is the stochastic hill-climbing with learning by vectors of normal distributions (SHCLVND) of Rudlof & Köppen (1996).
In this algorithm, the joint density is formed as a product of univariate independent normal densities, where the vector of means is updated using a kind of Hebbian learning rule and the vector of variances is adapted by a reduction policy (at each iteration it is multiplied by a factor lower than 1). Sebag & Ducoulombier (1998) propose an extension of PBIL for continuous domains (PBIL_C); however, their approach is very similar to an instance of an ES (see Schwefel (1995)). Servet et al. (1997) present a progressive approach: for each dimension, an interval in which the search takes place is maintained, along with a probability that the respective component of the solution comes from the right half of the interval. In case the probability gets close to 1 (or 0), the interval is updated to its right (or left) half. Finally, there are several attempts to use histogram models (see e.g. Bosman & Thierens (1999), Tsutsui, Pelikan & Goldberg (2001), Pošík (2003)).

The next step is to take advantage of models which cover bivariate or multivariate dependencies among variables. They include probabilistic models based on bivariate normal distributions (MIMIC_C^G), on multivariate normal distributions where the complete covariance matrix is estimated (EMNA), on Gaussian networks (EGNA), etc. (Larrañaga & Lozano (2002)). Bosman & Thierens (2000) also used a multivariate normal distribution with a full covariance matrix. MBOA of Očenášek & Schwarz (2002) learns a Bayesian network with decision trees and univariate normal-kernel leaves: the search domain is decomposed into axis-parallel partitions, and within them the variables are approximated by univariate kernel distributions.

Sometimes the good solutions form multiple 'clouds' in the search space. For this type of distribution, a mixture of normal distributions is a very useful model. An example of this approach was presented by Gallagher et al.
(1999); the algorithm proposed there (AMix) uses an adaptive mixture of multivariate gaussians which can change the number of components during evolution. Learning such a complex class of models, however, requires a large number of samples, and is usually carried out by some kind of iterative algorithm (e.g. the EM algorithm).

Another approach is pursued by Bosman & Thierens (2000): a multivariate gaussian with a relatively small variance is centered around each individual. The joint probability density is then a superposition of all those kernels. This model has the advantage of capturing various types of distribution; furthermore, we can regard it as a non-parametric model (when we keep the variance of the kernels constant). This way of estimating the true PDF is in fact equivalent to Parzen windows. The behavior of such an EDA is exactly the same as the behavior of a mutation-only ES where the mutation is carried out by adding a normally distributed random vector to each selected parent individual.

2.2 Coordinate Transforms

Algorithm 2.2: The Use of Coordinate Transform Inside EA

begin
    X(0) ← Initialize()
    f(0) ← Evaluate(X(0))
    g ← 0
    while not TerminationCondition() do
        XPar ← Select(X(g), f(g))
        YPar ← DirectTransform(XPar)
        YOffs ← ProduceOffspring(YPar)
        XOffs ← InverseTransform(YOffs)
        fOffs ← Evaluate(XOffs)
        [X(g+1), f(g+1)] ← Replace(X(g), XOffs, f(g), fOffs)
        g ← g + 1
end

Sec. 1.3.6 states that in the continuous domain, probabilistic modeling is often accompanied by a kind of coordinate transform. However, the transforms can be used even with the ordinary reproduction operators of conventional GEAs (see Alg. 2.2). After the selection, the parents are transformed into a different coordinate system where the creation of offspring individuals might be easier or more successful.
The offspring are created either using the crossover and mutation operators (in the case of conventional GEAs) or by probabilistic model building and sampling (in the case of EDAs). Then the new individuals are transformed back to the original space, are evaluated, and can compete for survival with the old population members. From this it immediately follows that if the coordinate transform is to be adapted using the selected parents, it has to be reversible.

The possible transforms can be classified into several types with various levels of complexity and usefulness:

• Linear coordinate-wise transform. This group contains e.g. a change of origin, a change of scale of individual coordinates, standardization, etc. These methods are far too simple to yield any important profit for the algorithm.

• Non-linear coordinate-wise transform. These transforms allow for adaptive focusing on promising regions of the search space; however, they are not able to reduce the dependencies among variables.

• Linear transform with rotation. Methods which reduce the amount of dependencies in the population by transforming the selected individuals as a whole, like principal component analysis or independent component analysis, belong to this category. These methods are able to greatly improve the performance of the algorithm if the problem corresponds to the model.

• General non-linear transform. This category contains transforms which employ some sort of clustering, mainly mixture models. They are able to perform different coordinate transforms in different areas of the search space. The evolution then resembles a parallel model of an EA where the population is divided into several demes which can evolve independently of the others.

The survey of literature dealing with coordinate transforms in the context of EAs is very short. Only a few papers dealt with coordinate transforms in the way just described.
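The DirectTransform/InverseTransform pair of Alg. 2.2 can be sketched for the "linear transform with rotation" case using principal component analysis. The function names and the toy data are illustrative; numpy is assumed.

```python
import numpy as np

def direct_transform(X):
    """Rotate the parents into PCA coordinates, decorrelating linear dependencies."""
    mean = X.mean(axis=0)
    # Columns of `vecs` are eigenvectors of the sample covariance matrix
    _, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return (X - mean) @ vecs, (mean, vecs)

def inverse_transform(Y, params):
    """Map offspring created in the rotated space back to the original coordinates."""
    mean, vecs = params
    return Y @ vecs.T + mean

rng = np.random.default_rng(0)
# Strongly correlated 2-D 'parents'
base = rng.normal(size=(500, 1))
X = np.hstack([base, 2.0 * base + 0.1 * rng.normal(size=(500, 1))])
Y, params = direct_transform(X)
# In the rotated coordinates the off-diagonal covariance (nearly) vanishes
off_diag = abs(np.cov(Y, rowvar=False)[0, 1])
round_trip_error = np.abs(inverse_transform(Y, params) - X).max()
```

Because the rotation matrix is orthogonal, the transform is trivially reversible, which is exactly the requirement stated above.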
Much more often, the transforms are part of the probabilistic model. Thus:

• All the EDAs that use a PDF factorized into univariate components perform a non-linear coordinate-wise transform (although they do not mention it explicitly).

• All the EDAs that use some form of principal component analysis or independent component analysis perform a linear rotational transform. Again, these are usually used as part of the PDF model, but they can be used independently of the model.

• All the EDAs that use a kind of probabilistic mixture model perform a non-linear transform. The transform could be used as a standalone component of an EA.

Independent component analysis was explicitly used as a part of an EA by Zhang et al. (2000). Pošík (2005b) used principal component analysis in a co-evolutionary scheme, and Pošík (2005a) compared the usefulness of principal and independent component analysis. Finally, Pošík (2004) combined a non-linear transformation in the form of kernel principal component analysis with a pure random search to construct a generalized crossover operator with interesting characteristics.

Chapter 3: Envisioning the Structure

This chapter aims at the most difficult part of the optimization task—envisioning the structure of the distribution of good solutions in the search space. It is only fair to say that there is nothing oracular about the methods presented here. Since the structure can generally be very complex, there is no easy, fast and reliable method of envisioning it from a finite sample (the selected population members). Instead, models with strong structural assumptions are used, and if the structure of the problem does not correspond to them, the model performance is low. Although we cannot assume that the optimization task at hand has a suitable structure, an empirical justification of this approach is provided by conventionally used methods (e.g.
line search) which have similar assumptions and still are often employed to solve real-world problems. First, univariate marginal probability models are described and their efficiency is demonstrated on selected optimization problems. Then, they are combined with coordinate transforms to relax the strong assumptions, at least to some extent.

3.1 Univariate Marginal Probability Models

This section builds on the work published in Pošík (2003) and concentrates on models that assume the statistical independence of the individual variables. This allows us to factorize the joint probability distribution as a product of univariate marginal distributions:

p(x) = \prod_{d=1}^{D} p_d(x_d)    (3.1)

where p(x) is the joint multivariate probability distribution and the p_d(x_d) are the 1-D marginals. This kind of model is actually the simplest reasonable one, and algorithms which use this type of model fall into the class of Univariate Marginal Distribution Algorithms (UMDAs).

One source of the differences among the individual UMDA-class algorithms is the choice of a suitable 1-dimensional PDF. In the literature we can find examples of histogram models, (unimodal) gaussian PDFs, mixtures of gaussians, and Parzen-windows-based density estimators. The individual models differ in flexibility: when less flexible models are used (e.g. a gaussian PDF or a histogram model with a low number of bins), sampling can take place between the training points, while in the case of more flexible models (a mixture of gaussians or Parzen windows) the sampling is more limited to local neighborhoods of the training points. Which of the two cases is better depends on the type of problem we are about to solve.

In the following subsections, several types of marginal densities are described. The purpose is to evaluate the behavior of the UMDA algorithm (without mutation) using 4 types of marginal probability models.
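Sampling from a model factorized as in Eq. (3.1) reduces to sampling each coordinate independently from its own 1-D marginal. A minimal sketch follows; the two marginal samplers used here are placeholders, not the concrete models discussed in this chapter.

```python
import random

def sample_factorized(marginal_samplers, rng):
    """Draw one point from p(x) = prod_d p_d(x_d): each coordinate is sampled independently."""
    return [draw(rng) for draw in marginal_samplers]

# Two illustrative 1-D marginals: a standard normal for x_1, a uniform for x_2
marginals = [
    lambda rng: rng.gauss(0.0, 1.0),
    lambda rng: rng.uniform(-5.0, 5.0),
]
rng = random.Random(42)
points = [sample_factorized(marginals, rng) for _ in range(1000)]
```

Any 1-D density estimator (histogram, gaussian, mixture) can be plugged in per dimension without changing the sampling loop, which is what makes the UMDA class so modular.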
The first two were considered in the work of Tsutsui, Pelikan & Goldberg (2001), namely the equi-width and equi-height histograms. Pošík (2003) expanded the set of models with the max-diff histogram and the univariate mixture of gaussians.

3.1.1 Histogram in the Continuous Domain

Before describing the particular types of histograms, let us formalize the notion of a 1-D histogram model. Further subsections will describe various instances of this formalization. A histogram is a non-parametric model of a univariate PDF. It consists of C bins, and each c-th bin, c ∈ {1, ..., C}, is described by three parameters: b_{c-1} and b_c are the boundaries of the bin, and p_c is the value of the PDF if b_{c-1} < x ≤ b_c.

When constructing a histogram, we have a set \{x_n\}_{n=1}^{N} of N data points. Let n_c be the number of data points falling between b_{c-1} and b_c. Then the values p_c are proportional to n_c:

p_c = \alpha n_c    (3.2)

Let d_c be the width of the c-th bin, d_c = b_c - b_{c-1}. The following condition must hold if p(x) is a PDF:

1 = \int_{-\infty}^{\infty} p(x) \, dx = \sum_{c=1}^{C} p_c d_c = \alpha \sum_{c=1}^{C} n_c d_c    (3.3)

From this equation we immediately get the normalizing constant:

\alpha = \frac{1}{\sum_{c=1}^{C} n_c d_c}    (3.4)

3.1.2 One-dimensional Probability Density Functions

The individual models of the PDF are described here in detail.

Equi-Width Histogram (HEW)

This type of histogram, in which all bins have the same width, is probably the best known and the most widely used one, although it exhibits several unpleasant features. The domain of the respective variable is divided into C equally large bins. Since the size of each bin can be written as d_c = d = (b_C - b_0)/C, the equation for the values of the PDF reduces to

p_c(x) = \frac{n_c}{dN}    (3.5)

When constructing the histogram, we first determine the boundaries of each bin, count the respective number of data points in each bin, and use Equation 3.5 to compute the values of the PDF for this histogram model. Figure 3.1 shows an example of the equi-width histogram.
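Equations (3.2) to (3.5) for the equi-width histogram can be sketched directly; this is an illustrative implementation, and clamping the last bin so that x = b_C is included is an assumption of this sketch.

```python
def equi_width_histogram(data, b0, bC, C):
    """Equi-width histogram on [b0, bC] with C bins; returns boundaries and PDF values (Eq. 3.5)."""
    d = (bC - b0) / C                        # common bin width
    counts = [0] * C
    for x in data:
        # clamp the index so that x == bC falls into the last bin
        c = min(int((x - b0) / d), C - 1)
        counts[c] += 1
    N = len(data)
    pdf = [n_c / (d * N) for n_c in counts]  # p_c = n_c / (d N)
    boundaries = [b0 + c * d for c in range(C + 1)]
    return boundaries, pdf

data = [0.5, 1.5, 1.6, 3.2, 3.3, 3.4, 9.9]
bounds, pdf = equi_width_histogram(data, 0.0, 10.0, 5)
# The model integrates to one: sum_c p_c * d = 1
total = sum(p * 2.0 for p in pdf)
```

Note how the empty middle bins get a PDF value of exactly zero; this is the feature whose consequences are discussed in the empirical comparison below.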
Equi-Height Histogram (HEH)

This type of histogram fixes the probabilities of the individual bins instead of fixing their widths; each bin then contains the same number of data points. This means that in areas where the data points are sparse, the bin boundaries are far from each other, while in regions where the data points are dense, the boundaries are close together. Since the probability of each bin is constant and equal to p(b_{c-1} < x ≤ b_c) = n_c/N = 1/C, we can write

p_c(x) = \frac{1}{d_c C}    (3.6)

Equation 3.6 holds only theoretically. In real cases, we can hardly expect that the number of data points is divisible by the number of bins without any remainder. One simple method to construct this type of histogram is to compute the number of data points each bin should contain (these values differ by at most 1 across the bins). Then we put the boundary between consecutive bins at half the distance between the two neighboring data points of which the first should fall into one bin and the second into the other. It is also possible to build an empirical CDF with some kind of interpolation between points and use this estimate to determine the quantiles of the distribution at levels 1/C, 2/C, ..., (C-1)/C, which then serve as the bin boundaries b_1, ..., b_{C-1}. Then the number of data points in each bin can be exactly the same (and possibly fractional) for all the bins. An example of the equi-height histogram is shown in Fig. 3.2. As can be seen, it is actually not the height of the bins that is fixed, but rather their area (i.e. the probability of generating a point belonging to the bin).

Max-Diff Histogram (HMD)

This type of histogram is reported to have several good features (very small errors, short construction time), see e.g. Poosala et al. (1996). The max-diff histogram does not put any 'equi-something' constraint on the model.
Instead, it is defined by a very simple rule: put the bin boundaries into the largest gaps between the data points. When constructing a max-diff histogram with C bins, we need to specify C - 1 boundaries (the first and the last boundaries are equal to the minimum and maximum of the respective variable domain). If we compute all the N - 1 distances between the pairs of neighboring data points, we can select the C - 1 largest of them and put the boundaries at the midpoints of the respective gaps. This procedure has a natural explanation: we try to find consistent clusters in the data, and all data points in such a cluster are put into the same bin. It is just an approximation, and we can argue about the results this procedure gives when the data do not show any structure, but it turned out to be very precise and useful in many situations. Figure 3.3 shows an example of the max-diff histogram.

Univariate Mixture of Gaussians (MOG)

The univariate mixture of gaussians (MOG) is the last marginal PDF model compared in this section. It is included in this comparison for several reasons. All the histogram models can be considered a certain kind of mixture model as well—a mixture of uniforms. From this point of view, the mixture of gaussians can give interesting results when compared to the histograms. Histograms (viewed as mixtures of uniform distributions) are usually limited in the 'fidelity' with which they can model the actual distribution. Hence,

Figure 3.1: Left: equi-width histogram. Right: bivariate PDF as a product of two marginal equi-width histograms.

Figure 3.2: Left: equi-height histogram. Right: bivariate PDF as a product of two marginal equi-height histograms.

Figure 3.3: Left: max-diff histogram.
Right: bivariate PDF as a product of two marginal max-diff histograms.

they are constructed with a pretty large number of bins (i.e. components, in the language of mixture models). On the contrary, a mixture model is usually capable of describing the actual distribution with a much higher degree of 'fidelity', and thus it allows us to drastically decrease the number of components. The MOG is given in the form

p(x) = \sum_{c=1}^{C} \alpha_c N(x; \mu_c, \sigma_c^2)    (3.7)

where the \alpha_c are the coefficients of the individual gaussian components and N(x; \mu_c, \sigma_c^2) is the density of generating the value x using the c-th component with mean \mu_c and standard deviation \sigma_c.

Fitting a MOG to the data sample requires much more effort. In a MOG, the individual components overlap each other. We therefore cannot decide exactly which component each data point belongs to (all data points belong to all components with certain probabilities). As a result, to create the model we have to use an iterative learning scheme—the Expectation-Maximization (EM) algorithm (see e.g. Schlesinger & Hlaváč (2002), Moerland (2000))—which searches for the maximum likelihood estimates of the parameters \alpha_c, \mu_c and \sigma_c. Figure 3.4 shows an example of the MOG model.

Figure 3.4: Left: univariate mixture of gaussians. Right: bivariate PDF as a product of two marginal MOG models.

Sampling

To sample new points from the models described above, a sampling method from Tsutsui, Pelikan & Goldberg (2001) called Baker's stochastic universal sampling (SUS) was used. It uses the same principles as the (in the EC community perhaps better known) remainder stochastic sampling (RSS). The random values are generated independently for each dimension. First, the expected number of data points which should be generated by each bin is computed according to the bin (or component) probabilities given by Equation 3.2.
Then, the integer part of the expected number of offspring is generated by sampling the uniform (in the case of histograms) or the normal (in the case of MOG) distribution with the proper parameters. The rest of the points are then sampled by stochastically selecting bins (or components) with probabilities given by the fractional parts of the expected numbers of offspring for each bin (component).

3.1.3 Empirical Comparison

The settings of the full-factorial experiment are surveyed in Table 3.1.

Table 3.1: Settings of individual factors of the experiment

  Factor                 Levels
  Test Function          20D Two Peaks, 10D Griewangk
  Probability Model      HEW, HEH, HMD, MOG
  Number of Components   Low, High
  Population Size        200, 400, 600, 800

Two reference functions, namely the 20-dimensional Two Peaks (see Sec. A.3) and the 10-dimensional Griewangk function (see Sec. A.4), were selected for the comparison. For both test problems, all four UMDAs described in Sec. 3.1.2 were compared. For each of the models, the number of bins (components) has two levels, High and Low; the exact numbers vary (see Table 3.2).

Table 3.2: Number of bins (components) used for individual algorithms and test functions.

  Function    Model           Low   High
  Two Peaks   HEW, HEH, HMD   60    120
              MOG             3     6
  Griewangk   HEW, HEH, HMD   50    100
              MOG             3     6

The High number of bins was computed as the size of the variable domain divided by the chosen precision (for the equi-width model), i.e. 12/0.1 = 120 for the Two Peaks function and 10/0.1 = 100 for Griewangk. The Low number of bins is simply one half of that. The mixture models are much more flexible, and thus dramatically fewer components were used than in the case of the histograms. The values for this kind of model were determined by hand after a few experiments; the levels are 6 and 3 for the High and Low number of components, respectively.
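The expected-count sampling described in the Sampling subsection above (deterministic integer parts, stochastically assigned remainders) can be sketched as follows. This is a simplified, illustrative variant of SUS/RSS, not the exact procedure of Tsutsui, Pelikan & Goldberg: the remaining slots are drawn with replacement from a roulette wheel over the fractional parts.

```python
import math
import random

def sus_counts(bin_probs, n_offspring, rng):
    """Integer parts of the expected counts are kept deterministically; the remaining
    slots are assigned stochastically, proportionally to the fractional parts."""
    expected = [p * n_offspring for p in bin_probs]
    counts = [math.floor(e) for e in expected]
    fractions = [e - c for e, c in zip(expected, counts)]
    remainder = n_offspring - sum(counts)
    total_frac = sum(fractions) or 1.0
    for _ in range(remainder):
        # roulette wheel over the fractional parts (with replacement: a simplification)
        r = rng.random() * total_frac
        acc = 0.0
        for i, f in enumerate(fractions):
            acc += f
            if r <= acc:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

rng = random.Random(0)
counts_exact = sus_counts([0.5, 0.3, 0.2], 10, rng)    # all expected counts are whole numbers
counts_frac = sus_counts([0.5, 0.25, 0.25], 10, rng)   # one slot is decided stochastically
```

Once the per-bin counts are fixed, each bin generates its share of points from its own distribution (uniform for histogram bins, normal for MOG components).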
The initial values for the centers of the individual mixture components were constructed using the k-means clustering method.

In order to give some indication of how complex the test problems are, they were also optimized with the so-called line search (LS) method (see Whitley et al. (1996)). This method is not intended to be a universal optimization algorithm; rather, it is a heuristic which is quite effective for separable problems, and thus it can give us some insight into the complexity of the evaluation function.

Evolutionary Model

A very simple evolutionary model that is used for most of the experiments throughout this thesis is presented as Alg. 3.1. The differences compared with the general EDA algorithm (Alg. 2.1) are the following:

• selection of parents is not used at all (i.e. all population members become parents), and

• the replacement scheme simply takes the N best individuals from the joined parents and offspring.

Algorithm 3.1: Generic EDA

begin
    Initialize and evaluate the population
    while not TerminationCondition do
        Based on the current population, create N new offspring
        Evaluate them
        Join the old and the new population to get a data pool of size 2N
        Use truncation selection to select the better half of the data points
        (returning the population size back to N)
end

For each possible factor combination, the algorithm was run 20 times. Each run was allowed to continue until the maximum number of 50,000 evaluations was reached. We say that the algorithm found the global optimum if the condition |x_d^{best} - x_d^{opt}| < 0.1 holds for each variable d (i.e. if the difference between the best found solution x^{best} and the optimal solution x^{opt} is lower than 0.1 in each of the coordinates). In all experiments, three statistics were tracked:

1. The number of runs in which the algorithm succeeded in finding the global optimum (NoFoundOpts).

2.
The average number of evaluations needed to find the global optimum, computed from those runs in which the algorithm really found the optimum (AveNoEvals).

3. The fitness of the best solution the algorithm was able to find, averaged over all 20 runs (AveBest).

Results and Discussion

The results of our experiments are presented in Table 3.3. The results of the line search heuristic correspond to the fact that the Two Peaks function is separable and that one of the sampling points was set precisely in the optimum. The Griewangk function is more complex and the line search heuristic was not able to solve it in all experiments. This suggests that the test function does not lose all of its complexity when used in its higher-dimensional form (as mentioned in Sec. A.4). EDAs should be able to solve the easy (when the variables are known to be independent beforehand) Two Peaks problem.

Table 3.3: UMDA empirical comparison: results. For each combination of test function (Two Peaks, Griewangk), algorithm (LS, and UMDA with the HEW, HEH, HMD, and MOG models with Low and High numbers of bins/components) and population size (200, 400, 600, 800), the table reports the NoFoundOpts, AveNoEvals, and AveBest statistics.

Let us first consider the results for the HEW model. The results of the experiment are in accordance with the results reported by Tsutsui, Pelikan & Goldberg (2001): the HEW model is the least flexible one. Furthermore, it can be characterized by one important feature: if, at any stage of evolution, any bin of the histogram for any dimension happens to be empty, it is impossible to generate a value from this bin in the following evolution cycles. Generally, if the PDF is equal to zero somewhere in the search space, the algorithm has no way to overcome it and to start searching in these 'zero areas'.¹ This is also documented by the very weak results obtained by the UMDA with the HEW model. If the 'resolution' (the number of bins) of the histogram is lower than the precision required, the algorithm becomes very inaccurate. If the 'resolution' is sufficient, the efficiency of the algorithm is very dependent on the population size. This can be seen especially in the case of the Two Peaks evaluation function.

¹ This drawback can be avoided by the so-called Bernoulli's adjustment, i.e. by adding a small constant to all bin probabilities so that even the empty ones are not omitted from further use. However, other types of histograms do not suffer from this bottleneck, so we can use them instead of the HEW model.
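For concreteness, the way a quantile-based histogram avoids empty bins can be sketched as follows. This is only a minimal sketch under my own naming; the actual models of Sec. 3.1.2 may differ in details such as edge handling.

```python
import random

def fit_equiheight(xs, n_bins):
    """Equi-height marginal histogram: bin edges placed at empirical
    quantiles, so every bin holds roughly the same number of points
    and no bin is ever empty (unlike the equi-width HEW model)."""
    xs = sorted(xs)
    n = len(xs)
    edges = [xs[0]] + [xs[(k * n) // n_bins] for k in range(1, n_bins)] + [xs[-1]]
    return edges

def sample_equiheight(edges, rng=random.Random(0)):
    """All bins are equally probable: pick a bin uniformly, then
    sample uniformly inside it."""
    k = rng.randrange(len(edges) - 1)
    return rng.uniform(edges[k], edges[k + 1])
```

Because the edges follow the data, regions where the population concentrates automatically get narrower bins, i.e. higher probability density.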
The other three models, HEH, HMD, and MOG, seem to be much more robust with respect to the parameters. None of them exhibits the unpleasant feature of the HEW model (the 'zero areas') described in the previous paragraph. The models are constructed in such a way that they do not allow for a zero probability density (during the evolution, the value of the PDF can become 'very small' in some areas, but theoretically it never becomes zero).

The HEH and HMD models showed very similar behavior, although on these two test functions the HEH model seems to work slightly better. However, the differences are insignificant. Both are very successful in the number of runs in which they succeed in finding the global optimum. The average number of evaluations they need to find the optima is comparable for both models. However, looking at the results table, we can see that the average best achieved fitness scores are still highly dependent on the size of the population.

In Figures 3.5 and 3.6, the typical evolution of the bin boundaries (for histogram models) and component centers (for the MOG model) is presented.

Figure 3.5: Two Peaks function — Evolution of bin boundaries for HEH and HMD models and evolution of component centers for MOG model.

Figure 3.6: Griewangk function — Evolution of bin boundaries for HEH and HMD models and evolution of component centers for MOG model.

These typical representatives depict cases in which all the models converged to the right solution (for this particular dimension). However, to find the global optimum, the algorithm had to converge to the right solution in all the dimensions, and that is often not the case for the MOG model (as can be seen from the results table). For the MOG model, we can see that several components of the mixture simply stopped evolving. The reason for this is very simple: their mixture coefficients dropped to zero and the components simply vanished.² Sometimes, the component that resides in the area of the global optimum vanishes before it can take over the population. This can be thought of as an analogy of the premature convergence problem.

The UMDA with the equi-width histogram turned out to be a very unreliable and non-robust algorithm. The UMDA with the other two types of histogram models, equi-height and max-diff, seems to be effective and robust for separable functions and for functions with a small degree of interactions among variables. In the latter case (for the Griewangk function), they outperformed the simple line search heuristic. The mixture of gaussians model is a representative of a qualitatively different type of model. Although it exhibits a bit worse behavior than the two above-mentioned types of histograms, it is fair to say that it was used with a considerably smaller number of components and that it offers a few additional advantages. E.g., with almost no effort, it can be easily extended from a set of independent marginal mixtures of gaussians to a general multidimensional mixture of gaussians which will be able to cover some interactions among variables. (Extending the histogram models to more dimensions would require much more memory space than extending the mixture of gaussians.)
² This 'component vanishing' is an analogy to the zero areas of the HEW model, and the Bernoulli adjustment should work as a remedy.

3.1.4 Comparison with Other Algorithms

We have seen that the UMDA is able to solve order-1 decomposable problems (Two Peaks) reliably, and when used with a sufficiently large population, it can solve even non-decomposable functions (Griewangk). Of course, in that case its efficiency depends on the amount of dependencies. What should we expect from comparisons with other methods? Comparisons with several other algorithms are postponed to Sec. 4.1.3.

The Two Peaks function is very hard for any optimization algorithm which is not aware of the independence of individual variables. Local search strategies would find the global optimum in almost no run. GAs would have problems when performing crossover; nevertheless, if they knew the right crossover points (any multiple of the number of bits used to encode one variable), they should get close to the global optimum, but due to the discretization, their precision would be limited. The line search heuristic would, in the general case, have similar problems with insufficient precision when each line search is implemented as a grid search. When using e.g. Brent's method of line search, it would very often end up in a local optimum, not in the global one.

3.1.5 Summary

This section addressed the first goal of the thesis: to find an optimization method for situations when the problem has no or weak interactions among variables. The results of an empirical comparison of the UMDA type of EDA using several types of marginal probability models were described. All the UMDAs used (with the exception of the UMDA with the equi-width histogram) belong to the class of asymptotically complete search algorithms, provided the sought optimum lies inside the enclosing hypercube. The algorithms were viewed also from the robustness point of view, i.e. from the point of view of their (non-)dependency on the algorithm parameter settings. The efficiency of UMDAs is not expected to be very high for more complex functions, i.e. for functions with interactions among variables.

UMDA Advantages

Simplicity. The 1D PDFs used by UMDA are easy to learn and sample.

Scalability. The time and space demands of a probabilistic model factorized as a product of D univariate marginal distributions grow only linearly with the dimensionality of the problem.

UMDA Limitations

No interactions. The UMDAs are not able to deal with problems that exhibit dependencies among individual variables; they are not able to solve such problems reliably.

What's Next?

More flexible models have to be used if we want to solve problems with interactions. One possibility which preserves (more or less) the advantages of UMDAs and incorporates at least linear dependencies among variables is to preprocess the population of individuals with linear coordinate transforms.

3.2 UMDA with Linear Population Transforms

This section is based on the work of Pošík (2005a). The UMDA is enriched with linear coordinate transforms that should reduce or completely remove the linear dependencies found among the problem variables. The main requirement on such a transform is its reversibility. We need the 'forward' part of the transform to reduce the dependencies among variables so that the evolution can take place in the transformed space. The 'backward' part of the transform (or inverse transform) is needed to transform newly generated offspring back into the original space in order to be evaluated: the objective function is defined only in the original space. In this section, two possible coordinate transforms are shortly described. Both of them form the foundation for building more complex models and are used in some way in subsequent experiments.
3.2.1 Principal Components Analysis

Perhaps the best known linear coordinate transform is the so-called principal components analysis (PCA). It is used mainly for dimensionality reduction in multivariate analysis, and its applications range from data compression, through image processing, to visualisation, pattern recognition, or time series prediction.

The PCA is most commonly defined as a linear projection which maximizes the variance in the projected space (Hotelling (1933)). We have to find such an orthogonal coordinate system in which the variance of the data is maximized along the axes. This can be easily accomplished by performing the eigendecomposition of the data sample covariance matrix. One additional property of PCA is worth mentioning: among all orthogonal linear projections, the PCA projection minimizes the squared reconstruction error.

To formalize the computations, let us denote the set of centered data points at hand (the population) as X = (x_1, x_2, ..., x_N), where each x_n is a column vector. The population matrix X is of size D × N, where D is the dimension of the input space and N is the population size. The covariance matrix of size D × D is given by

    C = (1/N) X X^T.    (3.8)

If we find the eigendecomposition of this matrix, we find the linear transform which decorrelates the components of individuals, i.e. we need to compute a diagonal matrix λ of order D and a full-rank square matrix V of size D × D whose rows are the eigenvectors of C, i.e. such that the condition V C = λ V holds. Matrices λ and V contain the eigenvalues and eigenvectors of the covariance matrix C, respectively. Then, the transform

    Y = V × X    (3.9)

rotates the coordinate system of the population matrix X in such a way that the coordinates of individual data points in matrix Y are not correlated. The inverse transform can be done simply by inverting the eigenvectors matrix V, i.e.

    X = V^(-1) × Y = V^T × Y.
(3.10)

Moreover, a probabilistic formulation of PCA offers an alternative way of estimating the principal axes using the maximum-likelihood framework, in which we can use a computationally efficient, iterative expectation-maximization (EM) algorithm (Tipping & Bishop (1999)). This formulation is appealing also from other points of view: it allows us to compare the probabilistic PCA model with other probabilistic models, or to estimate a whole mixture of PCAs.

3.2.2 Toy Examples on PCA

As already stated, PCA is a linear coordinate transform which decorrelates the object variables. PCA has no means for recombining the data; if we first decorrelate the data and then transform them back using the inverse transform, we should get the same data points. That is why the PCA can be used only as a part of another variation technique (a crossover operator or a probabilistic model). In this section, the decorrelated data are modeled by a product of marginal equi-height histograms (see Sec. 3.1.2).

We can visualize the outcome of PCA by means of contour plots of the linear components extracted from the data (see Fig. 3.7 for a 2D example). The contours are lines with the same value of the respective linear component, i.e. the principal axes are perpendicular to the contour lines. We can see that individual contours are parallel to each other and the contours of the first principal component are perpendicular to the contour lines of the second principal component. PCA captures the high-level structure of the data set, but fails to describe the details.

Figure 3.7: Two linear principal components of the 2D toy data set

Nevertheless, PCA in conjunction with e.g. marginal histograms can be relatively successful in describing more details. Data sampled from such a kind of model can be very similar to those in the training data set (see Fig. 3.8).
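The decorrelation of Eqs. (3.8)–(3.10) can be sketched in a few lines of NumPy. This is a sketch only; note that numpy.linalg.eigh returns the eigenvectors as the columns of its output, so the transpose of that matrix plays the role of V in Eq. (3.9).

```python
import numpy as np

def pca_decorrelate(X):
    """Decorrelate a centered D x N population matrix X.
    Returns the rotated population Y and the matrix V whose
    columns are the eigenvectors of the covariance matrix C."""
    N = X.shape[1]
    C = X @ X.T / N                # covariance matrix, Eq. (3.8)
    lam, V = np.linalg.eigh(C)     # eigendecomposition of C
    Y = V.T @ X                    # forward transform: decorrelated coordinates
    return Y, V

# inverse transform: X = V @ Y, since V is orthogonal
```

A quick check on correlated toy data confirms that the off-diagonal entries of the covariance of Y vanish and that the round trip V @ Y reproduces X up to floating-point error.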
Results of some experiments with the PCA preprocessing are postponed to Sec. 3.2.6. It can be seen there that for some functions the PCA is useful and for other functions it is not. These results and their discussion suggest that we need a facility which would recognize if the PCA is useful in the particular situation. The discussion of this problem is presented in the next section.

Figure 3.8: An example of using the PCA with univariate marginal equi-height histograms

3.2.3 Independent Components Analysis

The PCA described in Sec. 3.2.1 can also be described in the following terms: it is a linear transform which minimizes the correlations³ (the 'first-order' dependencies) among variables. It would be very nice to have a similar algorithm minimizing a compound criterion of dependency which would take into account also the 'higher-order' interactions among variables.

It is very hard to test for the independence of continuous variables. The notion of independence is usually defined by the PDFs. Let p(X1, X2) be the joint PDF of variables X1 and X2. We say that X1 and X2 are independent if the joint PDF can be factorized as follows:

    p(X1, X2) = p1(X1) p2(X2),    (3.11)

where p1 and p2 are the marginal PDFs. This definition can be extended to any number of variables. The PCA makes the variables uncorrelated, but some part of the dependencies among variables is hidden also in higher-order moments, i.e. making the variables uncorrelated is not sufficient. If the variables are independent, they are also uncorrelated, but the opposite is not true.

³ In fact, PCA not only minimizes the correlations, but avoids them completely.

The independent components analysis (ICA) (see e.g. Hyvärinen (1999)) is a rather recent data analysis technique.
Its primary aim is to find such a linear transform which makes the transformed variables as independent of each other as possible. This goal makes the ICA a very appealing preprocessing technique for use in EAs.

ICA Basics

We can define the ICA in several ways. If we select the mutual information (MI) as the measure of dependency, we can define the ICA as a process of finding such a linear transform which minimizes the MI. Unfortunately, direct minimization of MI over all possible linear transforms would require estimating the density functions—very hard, very uncertain, and often very time-consuming work. Hyvärinen & Oja (2000) show that:

. . . ICA estimation by minimization of mutual information is equivalent to maximizing the sum of non-gaussianities of the estimates, when the estimates are constrained to be uncorrelated. . . .

From the above citation, it is clear that due to the constraint of uncorrelated projections, the ICA need not estimate the joint probability density—the problem is simplified to a great extent and can be solved just by searching for 1-dimensional subspaces with the greatest measures of non-gaussianity of the projections.

The general measure of non-gaussianity is usually the negentropy. If H(X) is the differential entropy of a random variable X, then the negentropy J(X) of a random variable X is defined as

    J(X) = H(X_Gauss) − H(X),    (3.12)

where X_Gauss is a random variable with normal distribution and the same variance as the variable X. It is a well-known fact that a gaussian variable has the largest entropy among all random variables with equal variance; thus the negentropy is always nonnegative and zero for the normal distribution, so that it can be used as a measure of non-gaussianity. Nevertheless, it remains only a theoretical measure because one still has to estimate the probability distributions.
In practice, we have to resort to some approximations of negentropy. Some of these approximations can be found in Hyvärinen & Oja (2000). They usually emphasize the differences from the normal distribution, i.e. they return greater values if the empirical distribution is spiky, multimodal, or heavy-tailed. Finding the most independent directions is thus judged only by the shape of the 1D projection distributions. During the EA run, it can easily happen that the distribution in some direction has a more non-gaussian shape than the distribution in a direction which is really independent. This can mislead the EA that uses ICA as a preprocessing step.

ICA Features

The fact that ICA can be performed by searching for the most non-gaussian directions is appealing also from an empirical point of view. The most non-gaussian projections (i.e. spiky, multimodal, clustered, etc.) are the most interesting ones. This principle was also used in a statistical method for visualization of the most interesting views of the data—in projection pursuit by Friedman (1987).

In the ICA model, only one of the independent components can have a normal distribution. A greater number of gaussian variables would make the ICA model unidentifiable because all rotations of a D-dimensional gaussian random cloud of data points are in fact equivalent.

3.2.4 Toy Examples of ICA

The first toy example emphasizes the fundamental difference between PCA and ICA. As can be seen in Fig. 3.9, the first principal component (PC) discovered by PCA (upper left picture) is the direction with the greatest variance. Only the second PC (upper right picture) describes the structure of the data set, i.e. the presence of two data clusters. In ICA, the situation is completely different. The first independent component (IC) describes the main information hidden in the data set—the cluster structure (lower left picture).
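To make the negentropy approximations mentioned above concrete, the following toy sketch uses the simplest log-cosh contrast from Hyvärinen & Oja (2000), J(y) ≈ (E[G(y)] − E[G(ν)])², with G(u) = log cosh u and ν standard normal. The function name is mine, and E[G(ν)] (≈ 0.37) is estimated here by a Monte Carlo draw rather than taken as a tabulated constant.

```python
import math
import random

def negentropy_logcosh(sample, rng=random.Random(1), n_ref=100_000):
    """Rough 1D negentropy estimate via the log-cosh approximation:
    J(y) ~ (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh u."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    y = [(x - mean) / math.sqrt(var) for x in sample]   # zero mean, unit variance

    def g(u):
        return math.log(math.cosh(u))

    e_y = sum(g(v) for v in y) / n
    # Monte Carlo estimate of E[G(nu)] for a standard normal nu
    e_nu = sum(g(rng.gauss(0.0, 1.0)) for _ in range(n_ref)) / n_ref
    return (e_y - e_nu) ** 2
```

A two-cluster (bimodal) sample scores visibly higher than a gaussian sample of the same size, which is exactly the property the ICA search for non-gaussian directions exploits.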
Although it is interesting to see which projections are of the main interest for PCA and ICA, when modeling this particular toy data set by a probabilistic model with marginal histograms, it does not matter if we preprocess the data set with PCA or ICA. We use the full set of the discovered components, and their order does not matter. Both models should behave almost equally.

Figure 3.9: Comparison of the two principal components (PCA, first row) and the two independent components (ICA, second row) for a toy data set.

However, the situation is different for the second toy data set. These data resemble the population that can be seen in certain phases of evolution when optimizing the 2D Griewangk function (see Sec. A.4). The effects of using PCA and ICA can be seen in Fig. 3.10. In this case, the components are completely different and a bit unexpected. It is the PCA that actually discovers the most independent components, while the ICA (operating on the basis of maximizing the non-gaussianity of the projections) barely rotates the data at all. This is an example of a data set where the ICA fails to find the independent components and PCA gives the better result. This phenomenon can also be observed in the results of the experiments in Sec. 3.2.6 and is described in more detail in the next section.

Although it is not presented here, imagine the same data set without the cloud in the middle. Both PCA and ICA would return almost the same components as in the case when the middle cloud is present. In this case, however, the ICA would return really independent directions, while PCA would make the estimation of the probability density using marginal histograms very inaccurate.

3.2.5 Example Test of Independence

Let us use a data pattern similar to that seen in Fig. 3.10, i.e.
the population observed during the evolution of the 2D Griewangk function (the data points form five clusters in a pattern which evokes the number 5 on a die). One way of assessing the independence of the individual coordinates is to discretize them, form a kind of frequency table, and use the χ²-test of independence. For the sake of simplicity, each of the two variables is divided into 3 equal intervals. The 100 data points are uniformly divided into the 5 clusters. The observed frequency table is then depicted in Fig. 3.11 (left).

Figure 3.10: Principal and independent components for a data set similar to a population when evolving the 2D Griewangk function.

The frequency tables are matrices and their cells contain the numbers of data points which belong to the respective D-dimensional interval (2-dimensional in this example). Let the observed frequency table be O with the entries O_ij, i, j = 1, 2, 3. Let us further describe the marginal sums of rows as O_i,: = Σ_j O_ij (the last column of the tables in Fig. 3.11) and the marginal sums of columns as O_:,j = Σ_i O_ij (the last row of the tables in Fig. 3.11). The lower right entry of the table is N, the overall number of data points, i.e. the sum of all table entries.

    Observed O           Expected E
    20   0  20 |  40     16   8  16 |  40
     0  20   0 |  20      8   4   8 |  20
    20   0  20 |  40     16   8  16 |  40
    -----------+----     -----------+----
    40  20  40 | 100     40  20  40 | 100

Figure 3.11: Observed (left) and expected (right) frequency table

The χ²-test only compares the observed frequencies with the expected ones. In order to use this test as the test of independence, we have to construct the frequency table expected when the assumption of independence holds. Let the expected frequency table be E. Its entries (see Fig. 3.11, right) can be easily computed as
    E_ij = O_i,: O_:,j / N.    (3.13)

The test statistic Chi2 will then be a measure of how much the observed frequency table differs from the expected one, and we can define it as follows:

    Chi2 = Σ_ij (O_ij − E_ij)² / E_ij.    (3.14)

The Chi2 random variable has the χ² distribution. If the number of rows is I and the number of columns is J, the number of degrees of freedom for the χ² distribution is (I − 1)(J − 1), i.e. for this example it is equal to 4. The p-value of this test (computed as p = 1 − CDF_χ²(dof, Chi2), where CDF_χ²(dof, Chi2) is the value of the cumulative distribution function of the χ² distribution with dof degrees of freedom computed at the point Chi2) is the probability of observing this or a greater Chi2 if the assumption of independence holds. Thus, the p-value can be interpreted as a measure of independence of the two variables. If the p-value is close to zero, we can be pretty sure that the variables are not independent. If the p-value is close to 1, the variables can be considered independent (strictly speaking, we do not have enough evidence to prove their dependence).

Coming back to our example, computing the test statistic and the p-value results in Chi2 = 4·(20−16)²/16 + 4·(0−8)²/8 + (20−4)²/4 = 4 + 32 + 64 = 100 and p = 1 − CDF_χ²(4, 100) ≈ 0; thus we can say that, based on our finite sample from the distribution, there is almost no chance that the variables are independent.

Now, we try to use the same test for the data rotated by the outcome of the PCA analysis. In that case, the frequency tables are depicted in Fig. 3.12.

    Observed O           Expected E
     0  20   0 |  20      4  12   4 |  20
    20  20  20 |  60     12  36  12 |  60
     0  20   0 |  20      4  12   4 |  20
    -----------+----     -----------+----
    20  60  20 | 100     20  60  20 | 100

Figure 3.12: Observed (left) and expected (right) frequency table for data rotated by PCA.

In this case, the results of the test are Chi2 = 4·(0−4)²/4 + 4·(20−12)²/12 + (20−36)²/36 = 16 + 21.33 + 7.11 = 44.44 and p = 1 − CDF_χ²(4, 44.44) = 5.2 × 10⁻⁹.
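Both worked computations above can be reproduced with a short script. This is a sketch with my own function names; the closed-form survival function used here is valid only for even numbers of degrees of freedom, which suffices for the 3 × 3 tables of this example.

```python
import math

def chi2_sf(x, dof):
    """Survival function 1 - CDF of the chi-square distribution;
    closed form exp(-x/2) * sum_{i < dof/2} (x/2)^i / i! for even dof."""
    assert dof % 2 == 0
    h = x / 2.0
    return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(dof // 2))

def chi2_independence(obs):
    """Chi-square test of independence on an observed frequency table,
    following Eqs. (3.13) and (3.14)."""
    rows = [sum(r) for r in obs]                   # marginal sums O_i,:
    cols = [sum(c) for c in zip(*obs)]             # marginal sums O_:,j
    n = sum(rows)
    chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(obs)) for j in range(len(obs[0])))
    dof = (len(obs) - 1) * (len(obs[0]) - 1)
    return chi2, chi2_sf(chi2, dof)
```

For the unrotated table this yields Chi2 = 100 with p ≈ 0; for the PCA-rotated table, Chi2 ≈ 44.44 with p ≈ 5.2 × 10⁻⁹, matching the values computed above.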
Even in this case, we can hardly describe the variables as independent in an absolute sense, but we can state that they are 'more independent' than in the previous case (unrotated, or rotated by ICA).

3.2.6 Empirical Comparison

The settings of the whole full-factorial experiment are surveyed in Table 3.4.

Table 3.4: Settings of individual factors of the experiment comparing PCA and ICA population preprocessing

  Factor                 Levels
  ---------------------  -------------------------------------------
  Test Function          2D and 20D Two Peaks, 2D and 10D Griewangk
  Probability Model      HEH
  Transform              None, PCA, ICA
  Number of Components   120 for Two Peaks, 100 for Griewangk
  Population Size        20, 50, 100, 200, 400, 600, 800

This experiment demonstrates the influence of population preprocessing using PCA and ICA. The 2- and 10-dimensional Griewangk function together with the 2- and 20-dimensional Two Peaks function were selected for this demonstration. For each factorial setting, 20 runs were carried out; each was allowed to run for 50,000 evaluations.

Monitored Statistics

During the experiment, several measures of the algorithm efficiency were tracked:

• BSF (Best-so-far fitness). The average fitness of the best individual after 50,000 evaluations over all 20 runs.

• StdevBSF. The standard deviation of the best fitness after 50,000 evaluations over all 20 runs.

• Found0.1 (Found0.01, Found0.001). As the evolution progresses, it is checked how many times (out of the 20 runs) the best solution is in the '0.1 neighborhood' (0.01, 0.001, respectively) of the global optimum. E.g., for the 0.1 neighborhood, the condition |x_d^BSF − x_d^OPT| < 0.1 must hold for all d.

• WhenFound0.1 (WhenFound0.01, WhenFound0.001). The average number of evaluations needed to get to the 0.1 neighborhood (0.01, 0.001, respectively), computed only from the runs in which the algorithm succeeded in getting that close to the global optimum.
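The neighborhood condition used by the FoundX statistics is just a per-coordinate check, which can be expressed as a one-line helper (illustrative naming):

```python
def in_neighborhood(x_best, x_opt, eps):
    """True iff every coordinate of the best solution lies within
    eps of the corresponding coordinate of the optimum."""
    return all(abs(b - o) < eps for b, o in zip(x_best, x_opt))
```

Note that this is a box-shaped (max-norm) neighborhood, not a Euclidean ball, so the condition gets harder to satisfy as the dimensionality grows.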
In the results tables, the statistics are organized in cells with the following layout:

    BSF (StdevBSF)
    Found0.1       WhenFound0.1
    Found0.01      WhenFound0.01
    Found0.001     WhenFound0.001

Figure 3.13: Layout of monitored statistics in the results tables.

Evolutionary Model

For the comparison, the UMDA with marginal histogram models was used. As stated in Sec. 3.1, this kind of algorithm assumes the individual variables to be statistically independent of each other. The first set of experiments was carried out using the evolutionary model described in Sec. 3.1.3. The second and third sets of experiments use the PCA and ICA preprocessing, respectively (the evolutionary model is shown as Alg. 3.2). The only difference is in steps 4 and 7, which perform the forward and inverse transforms; otherwise both evolutionary models are identical.

Results and Discussion

Table 3.5 shows the effect of the PCA preprocessing when optimizing the Griewangk function. We can see that for the 2D version of the function, the algorithm with PCA is better in the quality of the final solution, in the speed of finding the solution, and even in the reliability of finding the global solution. The reason for this superior behavior is that the PCA makes the individual features of the chromosomes uncorrelated (and, in this case, the most independent) so that the marginal models make smaller errors when estimating the joint probability density. The ICA, on the other hand, is often misled by the population (see Fig. 3.10) and gives only slightly better solutions (but more often) than the UMDA without any preprocessing. In the case of the 10D Griewangk function, the population has a very different shape and the advantages of ICA become apparent. UMDA with ICA preprocessing is more precise and more reliable than UMDA with PCA.
Algorithm 3.2: EDA with PCA or ICA preprocessing
1   begin
2     Initialize and evaluate the population
3     while not TerminationCondition do
4       Based on the current population, perform the PCA or ICA of the parents
5       Model the transformed parents by marginal histograms
6       Sample N new offspring from the model
7       Transform the new offspring to the original space using the inverse PCA or ICA transform
8       Evaluate them
9       Join the old and the new population to get a data pool of size 2N
10      Use the truncation selection to select the better half of the data points (returning the population size back to N)
11  end

Table 3.5: Influence of PCA and ICA preprocessing when optimizing the Griewangk function. The shaded cells are the best performers. For the 2D and 10D Griewangk functions and each of the Hist, PCA-Hist, and ICA-Hist algorithms, the table reports the statistics of Fig. 3.13 for population sizes 20, 50, 100, 200, 400, 600, and 800.
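Algorithm 3.2 can be sketched end-to-end as follows. This is a toy NumPy sketch, not the thesis implementation: only the PCA variant is shown, and the marginal model is a simple equi-height histogram per rotated coordinate.

```python
import numpy as np

def eda_pca(f, D, N, n_gen, n_bins=20, seed=0):
    """Toy UMDA with PCA preprocessing (cf. Alg. 3.2): rotate the
    population, model each rotated coordinate by an equi-height
    histogram, sample, rotate back, and apply truncation selection."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(N, D))      # rows are individuals
    fit = np.apply_along_axis(f, 1, pop)
    for _ in range(n_gen):
        mu = pop.mean(axis=0)
        C = np.atleast_2d(np.cov((pop - mu).T))
        _, V = np.linalg.eigh(C)
        Y = (pop - mu) @ V                         # step 4: forward PCA transform
        new_y = np.empty_like(Y)
        for d in range(D):                         # steps 5-6: model and sample
            edges = np.quantile(Y[:, d], np.linspace(0.0, 1.0, n_bins + 1))
            bins = rng.integers(n_bins, size=N)    # equi-height: bins equally likely
            new_y[:, d] = rng.uniform(edges[bins], edges[bins + 1])
        off = new_y @ V.T + mu                     # step 7: inverse transform
        off_fit = np.apply_along_axis(f, 1, off)   # step 8: evaluate
        pool = np.vstack([pop, off])               # step 9: data pool of size 2N
        pool_fit = np.concatenate([fit, off_fit])
        keep = np.argsort(pool_fit)[:N]            # step 10: truncation selection
        pop, fit = pool[keep], pool_fit[keep]
    return pop[0], fit[0]
```

Thanks to the elitist truncation replacement, the best fitness in this sketch decreases monotonically; on a simple 2D sphere function the population quickly collapses around the optimum.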
For other functions, the preprocessing can have different influences. As an example, the results of the same algorithms optimizing the 2D and 20D Two Peaks function are presented in Tab. 3.6. In this case, UMDA without any preprocessing is the best solver because it assumes independence, which actually holds here. PCA and ICA rotate the coordinate system, and in this case they can only make the problem harder for UMDA to solve. UMDA using the PCA is thus much worse overall than UMDA without it. The PCA preprocessing has no chance to be successful in this case—in fact, the PCA actually prevents the algorithm from finding good enough solutions. The 20D Two Peaks function is very hard to optimize if we do not assume the independence of the individual variables. Even for ICA, it is very hard to estimate the right structure of the problem from a finite sample (the population). The UMDA with ICA preprocessing does not achieve results as good as UMDA without it, but it outperforms the UMDA with the PCA. This suggests that the rotational structure discovered by ICA is a better fit for the UMDA than the one discovered by PCA.
[Table 3.6: Influence of PCA and ICA preprocessing when optimizing the Two Peaks function (Hist, PCA-Hist, and ICA-Hist on the 2D and 20D Two Peaks function for population sizes 50 to 800).]

3.2.7 Comparison with Other Algorithms

We can expect that the efficiency of the local optimizers will not be affected by any rotational transform. In fact, many local optimizers for the real domain incorporate their own on-line version of PCA. The Two Peaks function remains hard for them, but they should be faster than the UMDAs when solving the Griewangk function.
The line search heuristic should be influenced by population preprocessing in a similar, or even more severe, way than the UMDAs. When used with PCA or ICA, it would only seldom be able to successfully solve either of the two problems. The use of rotational transforms on the phenotype level would make GAs rotationally invariant (similarly to any other EA). Their performance would probably be worse than in the unrotated case—an inaccurately estimated rotation would break even the separability of groups of bits.

3.2.8 Summary

The second part of this chapter suggests a solution to the second aim of this thesis—it provides a way of adapting the search algorithms to situations when there are linear dependencies among the object variables. Generally speaking, it is the independence of the individual variables which plays the major role in the EA efficiency if we use the UMDA. From the experimental results, it is obvious that we need a tool which evaluates the quality of the transforms suggested by the PCA and ICA and enables us to choose the better one (or to choose neither of them). This role might be played by tests of independence. If we wanted to test the independence on the current data set directly using Definition 3.11, it would require building some approximations of the PDFs. There would be many problems related to this approach, e.g. choosing the right form of a PDF model is an intricate issue (selecting a wrong model can make the results unusable). The mutual information can also be used as a measure of independence: the lower the mutual information, the weaker the dependence among the variables. Similar features characterize the Kullback-Leibler (KL) divergence, which is very often used as a measure of difference between two distributions. It can be used to measure the dependency among variables if we measure the KL divergence of the joint density and the product of marginal densities.
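The mutual information just described, i.e. the KL divergence between the joint density and the product of the marginals, can be approximated from a finite sample by discretizing the data. A minimal numpy sketch (the bin count and sample sizes are arbitrary illustrative choices, not values from the thesis):

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X;Y) = KL(p(x,y) || p(x)p(y)) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)     # marginal of X
    py = pxy.sum(axis=0, keepdims=True)     # marginal of Y
    nz = pxy > 0                            # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)               # independent of a
c = a + 0.1 * rng.normal(size=5000)     # strongly dependent on a
assert mutual_information(a, b) < mutual_information(a, c)
```

For independent variables the estimate is close to zero (up to a positive bias of roughly (bins − 1)²/2N nats), while for dependent variables it is large, which is exactly the ordering a transform-selection mechanism would need.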
However, the main problem still remains—we need to estimate the densities. Some authors have used polynomial density expansions (Comon (1994), Amari et al. (1996)), but the estimates are not very precise if the distributions differ greatly from the normal one. It is also possible to use a method similar to that presented in Sec. 3.2.5: discretize the domain of each variable, treat the data points as measured on a categorical scale, and then use the χ2-test of independence for frequency tables. Of course, this test is only an approximation, but its value lies mainly in the fact that it is relatively fast, uncomplicated, and easily generalized to more dimensions. However, this approach has several practical limitations. First, one of the assumptions of the χ2-test is that all entries of the expected frequency table should be at least 5. An increasing number of dimensions (possibly together with the number of bins in each dimension) very quickly leads to a situation in which we simply do not have enough data to perform the test. Second, the multivariate test of independence answers the question 'Are all the variables independent of each other?', i.e. it gives no indication of situations in which the variables form two or more groups such that the variables within a group are dependent while the groups themselves are mutually independent. This information would be of great value because it would allow for the factorization of the joint PDF, i.e. it would allow us to efficiently fight the curse of dimensionality. For the time being, I do not offer any precise and efficient method for choosing the best way of population preprocessing—it remains an open issue for further research. An empirical rule of thumb: if time demands are not an issue, try all variants—without preprocessing, with PCA, and with ICA. If a scalable optimizer is needed, do not preprocess the population.
If the optimization algorithm has to be rotationally invariant, use the ICA preprocessing—in the general case, it has better chances of success than PCA (however, it is also more time demanding).

Advantages of UMDA with Linear Population Preprocessing

Simplicity Still a relatively simple approach.

Higher flexibility The model can encompass a larger class of probability distributions; it reduces linear dependencies.

Rotational invariance All the rotated instances of the same problem should be solved, on average, in similar time.

Limitations of UMDA with Linear Population Preprocessing

Higher time and space demands The transforms induce an additional computational burden.

Worse scalability The demands of the transforms increase more rapidly with the dimensionality of the problem than those of the pure UMDA.

The practical applicability of UMDA with population preprocessing depends on the complexity of the fitness evaluation: if the individuals are evaluated quickly, we do not want to wait for the preprocessing; if the evaluation of one individual takes several minutes, we can afford to spend some time on it.

Chapter 4

Focusing the Search

The previous chapter, Envisioning the Structure, described methods with strong structural assumptions—they assumed that the individual variables are independent, or that they can be made independent by removing the dependencies using linear coordinate transforms. If this assumption was satisfied, the majority of the probability models were also able to focus the search to the area where promising solutions lay (the only exception being the model based on equi-width histograms). This chapter presents two models with weaker structural assumptions, i.e. two EAs that are able to find and use some (even non-linear) dependencies among variables. This also implies that they generally need larger populations to estimate the dependencies correctly.
4.1 EDA with Distribution Tree Model

The distribution tree (DiT¹) is an attempt to construct a model which would be

• flexible enough to cover some non-linear interactions between variables,
• generative, so that new population members can be created in accordance with the distribution encoded by the model, and
• still simple and fast to build.

It is based mainly on the Classification and Regression Trees (CART) framework introduced by Breiman et al. (1984), but it differs from CART in several important aspects. The primary objective of DiT is not to provide a model for classifying or predicting new, previously unseen data points, but rather to generate new individuals so that their distribution in the search space is very similar to that of the original data points. When presented with the training data set (the population members selected for mating), the distribution tree is built by recursively searching for the 'best' axis-parallel split. The leaf nodes of the DiT represent a partitioning of the search space. Each partition is represented by a multidimensional hyper-rectangle whose edges are aligned with the coordinate axes. Each leaf node contains a certain number of individuals; based on it, the number of offspring for the respective node is calculated. The model can thus be considered a kind of multidimensional histogram. Two examples of the DiT partitioning of the search space can be seen in Fig. 4.1.

¹ The abbreviation 'DiT' is used here for the distribution tree model in order to prevent confusing it with the abbreviation 'DT', commonly used for decision trees.

[Figure 4.1: DiT partitioning of the search space for the Griewangk function (left) and for the Rosenbrock function (right)]

When compared to other structure-building algorithms, e.g.
the MBOA of Očenášek & Schwarz (2002), the DiT is a much simpler model. The tree here is a kind of temporary structure; after the model is created, we can forget the tree. Only the leaf nodes are important because they form a mixture of uniform distributions. The structure in MBOA plays a much more important role, as it captures the dependency structure of the individual variables. The DiT algorithm is able to cover some kinds of interactions, but this information is hidden solely in the leaves.

4.1.1 Growing the Distribution Tree

This subsection gives a more detailed description of the algorithm used to grow the DiT. As already stated, it is basically the CART algorithm with different split criteria. Let us denote the population as X = {x1, x2, ..., xN}, where N is the population size, xn = (xn1, xn2, ..., xnD) is the n-th individual, n = 1, ..., N, and D is the dimensionality of the search space. Further, there are three parameters of the algorithm: minToSplit, the minimal node size² allowed to be split; minInNode, the minimal number of data points that must remain in the left and right parts of the node after splitting; and maxPValue, the maximal p-value for which we consider a split statistically significant. Furthermore, we have a matrix consisting of the minimal and maximal values for each dimension (the box constraints of the search space). The tree building algorithm is very simple (see Alg. 4.1). First, the set of individuals belonging to the node being processed is determined (in the beginning, all individuals belong to the root node). Then, it is checked whether the node has enough data points to be split (by comparing the node size with the minToSplit parameter). If there are not enough individuals, the node processing is stopped. Otherwise, the best split point is found.
If the split fulfills all the conditions (it is statistically significant enough, and the sizes of the potential left and right nodes are going to be at least minInNode), the node data points are divided into two sets belonging to the left and the right node, respectively. Finally, based on these two subsets of individuals, the procedure is recursively applied to create the left and right subtrees.

² To prevent misunderstanding, let me point out that 'size of node' or 'node size' refers to the number of data points which belong to the respective node. When describing the 'physical size' of a node in the DiT, I use the term 'node volume'.

Algorithm 4.1: Function SplitNode
1   begin
2       if NumberOfPointsInCurrentNode() < minToSplit then
3           exit
4       split ← FindBestSplit(minInNode)
5       if GoodEnoughSplit(split.Chi2, maxPValue) then
6           node.leftSubtree ← SplitNode(split.left)
7           node.rightSubtree ← SplitNode(split.right)
8       return node
9   end

Searching for the Best Split

The heart of this procedure is the way the search for the best split is performed. A split can be placed between each pair of successive data points in each dimension; if we have N D-dimensional data points in the node, we have D × (N − 1) possible splits. Each of these candidate splits is evaluated via hypothesis testing. The null hypothesis is the assumption that in each leaf node the data points have an approximately uniform distribution. This assumption is tested against the actually observed situation via the χ2-test. Eventually, the best split is selected, provided that for this split the test says 'there is only a negligibly small probability of no difference between the uniform and the observed distribution'. Otherwise, we have no reason to split the node—even if its size is large.
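The evaluation of a single candidate split can be sketched as follows. This is an illustrative reimplementation under the uniform-null assumption described above, with `scipy.stats.chi2` supplying the survival function; the example data are made up:

```python
import numpy as np
from scipy.stats import chi2

def split_p_value(values, cut, lo, hi):
    """p-value of one candidate split of a node along one dimension.

    values: coordinates of the node's points in that dimension,
    cut:    candidate split position, lo/hi: node limits in that dimension.
    Null hypothesis: the points are uniformly distributed in the node.
    """
    n = len(values)
    n_left_obs = int(np.sum(values < cut))
    n_right_obs = n - n_left_obs
    n_left_exp = n * (cut - lo) / (hi - lo)   # proportional to subnode volume
    n_right_exp = n - n_left_exp
    stat = ((n_left_exp - n_left_obs) ** 2 / n_left_exp
            + (n_right_exp - n_right_obs) ** 2 / n_right_exp)
    return chi2.sf(stat, df=1)                # chi-square test, 1 degree of freedom

# Points crowded into the left half of [0, 1]: the middle split is significant.
pts = np.concatenate([np.full(40, 0.2), np.full(10, 0.8)])
print(split_p_value(pts, 0.5, 0.0, 1.0))   # far below a maxPValue of 0.001
```

FindBestSplit would call this for every candidate cut in every dimension and keep the split with the lowest p-value, realizing it only if that p-value is below maxPValue.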
Algorithm 4.2: Function FindBestSplit
1   begin
2       for all dimensions do
3           Build a list of candidate split points
4           for all candidate split points do
5               Determine expected frequencies in the left and right subnode
6               Determine observed frequencies in the left and right subnode
7               Compute the test statistic Chi2
8               Perform the χ2-test, determine the p-value
9               if the current split is better than the best split found so far then
10                  Remember it
11      return the best split found
12  end

Performing the χ2-test

For each particular candidate split, the observed sizes of the left and the right subnode are simply the numbers of points which belong to them; let us denote them NLobs and NRobs. For both subnodes we can also directly compute the expected sizes, NLexp and NRexp, knowing the size of the node as a whole and the relative volumes of the left and right subnodes. The χ2-test allows us to compare the observed sizes with the expected ones. First, we compute the test statistic

Chi2 = (NLexp − NLobs)² / NLexp + (NRexp − NRobs)² / NRexp.  (4.1)

The random variable Chi2 has the χ2 distribution with 1 degree of freedom. It describes to what extent the observed and expected frequencies differ from each other. Thus, we can compute the p-value of the test (the probability of observing this or a greater Chi2, assuming the uniform distribution in the node) as 1 minus the value of the cumulative distribution function of the χ2 distribution with 1 degree of freedom at the point Chi2:

p = 1 − CDFχ2(1, Chi2).  (4.2)

This way we can select the split that gives us the highest discrimination from the uniform distribution with a sufficiently high probability, i.e. among all candidate splits, the one with the lowest p-value is selected, and it is realized if its p-value is lower than maxPValue.

4.1.2 Sampling from the Distribution Tree

The process of creating new individuals, i.e.
sampling from the distribution tree, is straightforward. The leaf nodes of the tree fully cover the whole search space. For each node, its limits are known, as well as its number of offspring. We can thus apply the SUS described in Sec. 3.1.2. Simply stated, in each node the number of offspring generated using a D-dimensional uniform number generator is the same as the number of parents belonging to the node. Since an empty leaf node is not possible, this ensures that the search will not stop in any area of the search space.

4.1.3 Empirical Comparison

The experiments carried out compare the DiT evolutionary algorithm to several other evolutionary techniques. The same evolutionary model was used for all experiments; it is described in Sec. 3.1.3. For the performance evaluation of the EA using the DiT, several reference optimization problems were used: the 2- and 20-dimensional Two Peaks function (Sec. A.3), the 2- and 10-dimensional Griewangk function (Sec. A.4), and finally the 2- and 10-dimensional Rosenbrock function (Sec. A.5).

Involved Evolutionary Techniques

The distribution tree evolutionary algorithm (DiT-EA) is compared to the genetic algorithm (with low and high resolution), to the histogram UMDA, to random search, and to the line search heuristic. The settings of all algorithms are summarized in Tab. 4.1. A short description of each of the compared algorithms follows:

Random Search Sampling individuals one by one with uniform distribution over the whole search space—that is simple random search. It is the most naive way to solve a given task; in other words, it is the 'trial and error' approach without any learning. Although it is the most naive way, it is not always the worst one, as the experiments showed.

Table 4.1: Summary of settings of individual algorithms involved in the study

Algorithm   Setting             Two Peaks     Griewangk     Rosenbrock
All EA      Model               Steady-state
            Selection           Truncation
            No. of Evaluations  50,000
LS          Step                0.01
GA-low      Crossover           Uniform
            Crossover Rate      100%
            Resolution          14 bits       14 bits       12 bits
GA-high     Crossover           Uniform
            Crossover Rate      100%
            Resolution          56 bits       56 bits       55 bits
Hist        Histogram type      Equi-height
            No. of Bins         120           100           41
DiT         Min to Split        5
            Min in Node         3
            Max p-value         0.001

Line Search In order to give some indication of how complex the test problems are, they were also optimized with the so-called line search (LS) method (see Whitley et al. (1996)). This method is not intended to be a universal optimization algorithm; rather, it is a heuristic which is very effective for separable problems, and thus it can give us some insight into the complexity of the evaluation function. Its principle is very simple: start from a randomly generated data point, randomly select a dimension, discretize and enumerate it, move to the best data point in the one-dimensional enumerated space, and repeat for another, yet unused dimension. The discretization step was set to 0.01. This means that for the Two Peaks function the LS algorithm evaluates 1201 data points in each dimension. If the LS gets stuck (no further improvement is possible), it is restarted from another randomly generated point.

Genetic Algorithms It is very hard to compare the GA to an EDA in the continuous domain. While the EDA searches the space of real numbers (or rational numbers when evolving in silico), the GA evolves binary strings and searches a discretized space. The discretization, i.e. the resolution, must be specified by the user in advance. This way the user can deteriorate the GA performance with an insufficient precision setting.
In the experiments, two GAs were used—one with low resolution (allowing the GA to find a solution whose distance from the global optimum is not larger than 0.001), and one with high resolution (allowing the GA to find a solution whose distance from the global optimum is not larger than 2.22e−16, which is the distance from 1.0 to the next larger floating point number in MATLAB). Similarly to the other algorithms, the truncation selection was used. It is not the best setting for a GA because this selection type has a relatively high selection pressure and can result in premature convergence. Uniform crossover was used for the GA in all experiments. In EDAs, the model building and sampling phase of the evolution can be described as a generalized crossover operation. Since mutation was not used in the other types of algorithms, I did not use it for the GAs either.

UMDA with Equi-height Histogram The univariate marginal distribution algorithm (UMDA) with the equi-height histogram model was described in Sec. 3.1. Similarly to the line search heuristic, the histogram UMDA should be able to solve separable problems very efficiently, but it should run into trouble when solving problems with a higher degree of dependency between variables. The number of bins for each histogram is set in such a way that if all the bins had equal width, none of them would be larger than 0.1.

Distribution Tree Evolutionary Algorithm The distribution tree, its construction and sampling, is described in Section 4.1. The only parameters to be set are the minimal number of individuals that must remain in each node (minInNode, set to 3), the minimal node size allowing a split (minToSplit, set to 5), and the maximal p-value for which we consider a split statistically significant (maxPValue, set to 0.001).

Results and Discussion

The statistics tracked during the evolution are described in Sec. 3.2.6.
The overall results for all problems are shown in Table 4.2. For the population based algorithms, experiments with population sizes of 20, 50, 100, 200, 400, 600, and 800 were carried out. Each experiment was repeated 20 times, and the average best fitness and its standard deviation over all 20 runs were measured. The population size which resulted in the best average score after 50,000 evaluations was selected to be included in the results table. The results for the 2D Two Peaks function are presented in Fig. 4.2. We can see that the GA with the low resolution is not able to overcome the limitation of the insufficient bit string length. It quickly finds the optimum, but only within its own small search space. The GA with the high resolution solves the problem much better. The histogram UMDA is very efficient at solving this task (already for small populations)—the Two Peaks function is separable. The behavior of the DiT-EA is comparable to the GA with the high resolution, but it can reach more precise results. The results for the 2D Griewangk function are shown in Fig. 4.3. We can clearly see the difference between the Two Peaks and the Griewangk function. While the histogram UMDA was the best solver in the case of the separable Two Peaks function, it completely fails to solve the Griewangk function, which is non-separable. Neither of the GAs solves this function particularly well. The DiT-EA is clearly the winner for this function. In Fig. 4.4, the behavior of the tested algorithms on the Rosenbrock function is depicted. The histogram UMDA failed to solve this task because it is non-separable. The GA with the high resolution is the best algorithm for this function. The DiT-EA and the GA with the low resolution obtained similar results. For this function, the DiT-EA suffers from the fact that the valley of the Rosenbrock function is not aligned with the coordinate axes. Axis-parallel splits made by the DiT cannot cover this kind of interaction sufficiently well.
[Table 4.2: Results of experiments on all test functions (Random, Line Search, GA low res., GA high res., Hist UMDA, and DiT EA on the 2D and 20D Two Peaks, 2D and 10D Griewangk, and 2D and 10D Rosenbrock functions). The layout of the individual statistics in each table cell is described in Fig. 3.13; the number presented in the last line of each table cell is the population size which produced the best average results.]

However, let me point out that
even the GA with the high resolution did not find a solution of such quality when used with lower population sizes. For population sizes lower than 800, it found solutions comparable to those of the DiT-EA and the GA with the lower resolution.

[Figure 4.2: 2D Two Peaks function—Comparison of DiT-EA against other algorithms]

[Figure 4.3: 2D Griewangk function—Comparison of DiT-EA against other algorithms]

In Fig. 4.5 we can see the evolution on the 20-dimensional Two Peaks function. Both GAs exhibit almost the same behavior independently of the resolution. Again, since this is a separable function, the histogram UMDA is able to solve it very quickly and precisely. The DiT-EA failed to solve this problem and its performance is only slightly better than that of random search. Figure 4.6 shows the efficiency of all algorithms when solving the 10-dimensional Griewangk function. Both GAs behave similarly (as in the case of the 20D Two Peaks function) and did not discover a good solution. Overcoming the curse of dimensionality with its 'univariate look on the world', the histogram UMDA was able to find solutions similar to those of the DiT-EA; they are the winners in this case. In Fig. 4.7 we can see that the way in which the 10-dimensional Rosenbrock function is constructed still ensures a great amount of separability—it is separable by pairs. The histogram UMDA can take advantage of this fact and exhibits the best behavior among all algorithms. The DiT-EA is only slightly better than both GAs, which (again) behave almost identically independently of the resolution.
None of the presented algorithms, however, was able to adapt the distribution to the curved valley of the Rosenbrock function. As expected, the Hist-UMDA was very efficient when solving the separable problems, i.e. the 2D and 20D Two Peaks functions, as well as the 10D Rosenbrock function. Although the variables in the basic 2D Rosenbrock function are not independent, due to the way the 10D version was constructed, each pair of variables in the 10D Rosenbrock function is independent of the rest.

[Figure 4.4: 2D Rosenbrock function—Comparison of DiT-EA against other algorithms]

[Figure 4.5: 20D Two Peaks function—Comparison of DiT-EA against other algorithms]

[Figure 4.6: 10D Griewangk function—Comparison of DiT-EA against other algorithms]

[Figure 4.7: 10D Rosenbrock function—Comparison of DiT-EA against other algorithms]

The DiT-EA was the best competitor when solving both variants of the Griewangk function (see Fig.
4.3 for an illustration), although the amount of dependencies among the variables is greatly reduced for the 10D version when compared to the 2D version. In both cases it was able to find a solution of significantly better quality. In the majority of the remaining test problems, its efficiency was better than or comparable to that of the GAs. The 20D Two Peaks function is the only exception (see Fig. 4.5): the DiT-EA was able to find solutions only slightly better than those obtained by the random search. I hypothesize that this poor behavior is caused by the great number of local optima of this function (to be precise, the 20D Two Peaks function has 2^20 local optima). To capture the problem structure, the DiT-EA would need a huge population; hundreds of individuals allow the algorithm to make only a limited number of leaf nodes with 'almost the same' density. For the 2D functions, the GA-high found better solutions than the GA-low; the GA with the low resolution suffered from the insufficient bit string length. For the multidimensional functions, however, the difference between the two GAs vanished and they performed almost identically along the whole evolution when compared by a human eye.

4.1.4 Summary

This section of the thesis described and evaluated a new and original³ model of probability distribution—the distribution tree. The model is an attempt to solve the third aim of the thesis—to create a probabilistic model able to focus the search to promising areas of the search space in situations when the problem has higher-order interactions among variables. The results of the EA empowered with the DiT model are of mixed quality, yet they are very promising for certain kinds of optimization problems. The way the DiT is constructed suggests that the DiT-EA should be very efficient when solving problems with several well-distinguished local optima of a 'symmetric' shape (rectangular, spherical, . . . ), given that the population size is sufficient to reveal them.
The population members after selection then form clusters in the search space and the DiT-EA is able to identify them and to use this information during the evolution (the DiT model tries to identify rectangular areas of the search space with significantly different densities). This hypothesis is supported by the results observed for the Griewangk function. Furthermore, the DiT is a model built on the basis of rigorous statistical testing, which is rather uncommon in the field of EAs. The experiments also showed that the members of the EDA class of algorithms (the histogram UMDA and the DiT-EA) outperformed the GAs in almost all cases: they were able to find more precise solutions in less time. An open area for future work stems from the fact that the model-building algorithm generalizes very simply to discrete variables; only the procedure for creating the set of candidate splits would have to be changed.

Advantages of DiT-EA

Simplicity: The building procedure is based on the well-known CART algorithm.

Non-parametric model: The DiT is non-parametric in the sense that it does not estimate parameters of any theoretical distribution—it is driven mainly by the data.

Clustering capable: The DiT model can identify and use a certain kind of clusters in the population—clusters which are recognizable by univariate views of the population. Thus, the interactions represented by such clusters can be covered by this model.

Limitations of DiT-EA

Axis-parallel splits only: Creating only axis-parallel splits results in a limited ability to search the promising areas of the space if they look e.g. like valleys which are not parallel to the coordinate axes. This drawback can be reduced by employing some coordinate transform in each node. The PCA or ICA presented in Sec. 3.2 seem to be natural candidates for making the DiT model rotationally invariant and capable of oblique splits.
This possibility, however, remains an open topic for future work and is not pursued in this thesis.

Lack of decomposition ability: The DiT model is not able to decompose the given problem into several independent subproblems, which is exemplified by the results for the Two Peaks function.

What's Next? The next section presents a model able to describe even elongated clusters not aligned with the coordinate axes, i.e. a situation which the DiT model cannot describe successfully.

4.2 Kernel PCA

In this section, another model capable of covering a very general kind of interactions is presented. As the name suggests, kernel principal component analysis (KPCA) builds on the PCA method (described in Sec. 3.2.1) and preserves several of its important characteristics. Furthermore, it uses techniques from the kernel methods which have become very popular in recent years.

4.2.1 PCA in terms of dot products

As stated earlier, the PCA usually amounts to performing the eigendecomposition of the data sample covariance matrix C. However, it can be shown (see e.g. Schölkopf et al. (1996), Moerland (2000)) that after performing the eigendecomposition of the dot-product matrix of size N × N computed as

K = (1/N) X^T X,   (4.3)

we arrive at the same non-zero eigenvalues and corresponding eigenvectors. Thus, we can express the whole PCA in terms of dot products.

The kernel methods gained their popularity mainly due to the simplicity with which they allow a broad range of originally linear methods to become non-linear: we can make non-linear any linear algorithm which is expressed in terms of dot products. Suppose we have a function Φ : R^D → F which represents a non-linear mapping from the input space to the so-called feature space.
Then, we can define a function k : R^D × R^D → R as a dot product of two data points transformed to the feature space:

k(x^i, x^j) = ⟨Φ(x^i), Φ(x^j)⟩   (4.4)

If we compute a so-called kernel matrix K so that the individual elements of the matrix are K_ij = k(x^i, x^j), we have a dot-product matrix of the data points transformed to the feature space. We can now perform the linear PCA in the feature space via the eigendecomposition of the kernel matrix; this whole process is called kernel PCA.

Suppose now we want to use the KPCA for feature extraction, i.e. for an input point x we want to find the projection of the image Φ(x) onto the principal axes of the feature space F. The i-th element y_i of the image y (the i-th non-linear feature of x) can be computed as a projection of the image Φ(x) onto the i-th eigenvector of the kernel matrix K, i.e.

f_i(x) = y_i = ⟨v^i, Φ(x)⟩ = Σ_{j=1}^{N} v^i_j k(x^j, x),   (4.5)

where v^i = (v^i_1, v^i_2, . . . , v^i_N) is the i-th normalized eigenvector of the kernel matrix K.

The best thing about the kernel methods is that we need not know the mapping function Φ. It can be shown that if the function k satisfies certain conditions (Mercer's conditions, see e.g. Schölkopf & Smola (2002)), it corresponds to a dot product in some non-linearly mapped feature space F. Thus, we can compute the KPCA just by selecting any valid kernel, without explicitly prescribing the mapping Φ. In the literature, the most often used kernels are the polynomial kernel, the radial basis function (RBF) kernel and the sigmoidal kernel. In the rest of this section, the RBF kernel of the form

k(x^i, x^j) = exp( −‖x^i − x^j‖² / (2σ²) )   (4.6)

is used.

4.2.2 The Pre-Image Problem

Equation 4.5 provides a way to carry out the projection of input data points onto the non-linear principal components of the feature space.
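The feature extraction of Eq. 4.5 is straightforward to prototype. Below is a minimal numpy sketch (function names are mine; centering of the data in the feature space is omitted for brevity, and the eigenvector normalization follows the common convention of dividing a unit-norm eigenvector of K by √λ_i):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)); rows are data points (Eq. 4.6)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_features(X, x_new, n_comp=2, sigma=1.0):
    """Project the rows of x_new onto the first n_comp non-linear components (Eq. 4.5)."""
    K = rbf_kernel(X, X, sigma)
    lam, A = np.linalg.eigh(K)               # eigenvalues in ascending order
    lam, A = lam[::-1][:n_comp], A[:, ::-1][:, :n_comp]
    V = A / np.sqrt(lam)                     # normalized eigenvectors v^i
    return rbf_kernel(x_new, X, sigma) @ V   # y_i = sum_j v^i_j k(x^j, x)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))             # training points
Y = kpca_features(X, X[:5], n_comp=2)
print(Y.shape)                               # (5, 2)
```

For a Mercer kernel such as the RBF, the matrix K is symmetric positive semi-definite, so the retained eigenvalues are non-negative and the square root above is well defined.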
To be able to use the KPCA as a generative model inside an EDA (or as a generalized crossover operator), we need to address the inverse transform as well, i.e. we should be able to find an unknown pre-image x, x ∈ R^D, of a given image y, y ∈ F, so that y = Φ(x). In general, this is a rather intricate issue. The mapping Φ induced by the selected kernel k usually maps the low-dimensional input data points to a high-dimensional (possibly even infinite-dimensional) feature space F. The feature space is usually of much higher dimension, so that a one-to-one correspondence between the two spaces almost never exists. This means that for an input point x there always exists its image Φ(x) in the feature space, but the opposite is not true: for a randomly chosen image in the feature space, a precise pre-image in the input space only seldom exists. However, we can try to look for at least approximate pre-images. The idea is really simple: if we have an image y given as an expansion of the images of the training points, y = Σ_{j=1}^{N} α_j Φ(x^j), we can find its approximate pre-image z in such a way that the image of z, Φ(z), minimizes the Euclidean distance to y, i.e. we solve the following optimization problem:

z = arg min_{x ∈ R^D} ‖y − Φ(x)‖   (4.7)

Schölkopf et al. (1996) provide a solution to this optimization problem for the RBF kernel in the form of the following fixed-point iterative algorithm:

z_{n+1} = ( Σ_{j=1}^{N} α_j exp(−‖x^j − z_n‖² / (2σ²)) · x^j ) / ( Σ_{j=1}^{N} α_j exp(−‖x^j − z_n‖² / (2σ²)) )   (4.8)

4.2.3 KPCA Model Usage

The KPCA differs from the linear transforms (PCA, ICA) in a substantial way: it is a much more flexible model and is able to cover very complex interactions. Thus, in the case of the KPCA, it almost does not matter what kind of 'variation' we use in the feature space, because the final distribution of the offspring depends mainly on the non-linear transform learned by the KPCA from the training data (i.e. from the selected parents).
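The fixed-point iteration of Eq. 4.8 takes only a few lines; here is a minimal sketch (function and variable names are mine). As a sanity check, for y = Φ(x^1), i.e. α = (1, 0, . . . , 0), the iteration recovers x^1 exactly:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """RBF kernel values between the rows of a and the point b (Eq. 4.6)."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def preimage(alpha, X, sigma=1.0, n_iter=100):
    """Approximate pre-image of y = sum_j alpha_j Phi(x^j) via Eq. 4.8."""
    z = X.mean(axis=0)                   # z_0: start at the data centroid
    for _ in range(n_iter):
        w = alpha * rbf(X, z, sigma)     # alpha_j k(x^j, z_n)
        z = (w @ X) / w.sum()            # z_{n+1}: weighted mean of the x^j
    return z

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))          # five 2-D training points
alpha = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # y is exactly Phi(x^1)
z = preimage(alpha, X)
print(np.allclose(z, X[0]))              # True: the exact pre-image is recovered
```

For a general expansion the denominator can become small, so a practical implementation should guard against division by (near) zero and restart from a different z_0 when the iteration stalls.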
In my implementation of the KPCA model, simple uniform random sampling in the feature space is carried out. The principle of the KPCA model building and sampling (the KPCA generalized crossover) as used in this thesis is described in Alg. 4.3.

Algorithm 4.3: KPCA model building and sampling
1. Standardize the population.
2. Train the KPCA model on the whole population.
3. Transform the population to the feature space F via Eq. 4.5.
4. Determine the bounding hypercube in the feature space.
5. Randomly sample N images from the hypercube.
6. Find the pre-images of the new individuals using Eq. 4.8 iteratively.
7. Destandardize the offspring.

4.2.4 Toy Examples of KPCA

An EDA which uses the KPCA model in the way shown in Alg. 4.3 can be (a bit simplistically) described as a random search in the feature space. The KPCA amounts to performing the linear PCA in the non-linearly transformed input space; thus, in the feature space it preserves all the characteristics of the PCA. If we wanted to visualize the non-linear components extracted from the data, we could do it e.g. via the contour plots of the components. In the case of PCA we would see a set of parallel contour lines perpendicular to the respective principal axis (as can be seen in Fig. 3.7). In the case of KPCA we see a rather complex set of curves which describe the data with much greater fidelity (see Fig. 4.8). The majority of kernels require a few parameters; however, the precise form of the model is mainly data-driven—the distribution of the training data set is the main factor influencing the final form of the KPCA model. The KPCA is thus very efficient in modeling various types of dependencies—clusters, curves, etc. In Fig.
4.9 we can see that the KPCA is able to perform a kind of clustering (modeling two or more clusters of data points) and a kind of 'curve modeling' at the same time.

Figure 4.8: First six non-linear components of the toy data set (each cluster is described as a ring)

Figure 4.9 is an example of the ability of the KPCA model to generate new data points which are 'very similar' to the training data points. This makes it an ideal crossover operator for variables which are closely dependent on each other; at the same time, the KPCA crossover is not suitable for variables which are not closely related. In that case, the KPCA is very limited in introducing new genetic material to help the evolution (as can be seen in Fig. 4.9, the KPCA crossover does not generate any data points in areas where no training points are present). The search based solely on the KPCA is then strongly biased by the individuals from the initialization phase of the evolution.

4.2.5 Empirical Comparison

The experiments carried out were aimed at comparing the KPCA model used as a generalized crossover operator against several other algorithms with different means of generating new individuals. No mutation was used in any of the experiments. The evolutionary model described in Sec. 3.1.3 was used and the population was allowed to evolve for 50,000 evaluations. All experiments were repeated 20 times. For the comparison, the 2-dimensional versions of the Two Peaks, Griewangk and Rosenbrock functions were chosen. Again, population sizes of 20, 50, 100, 200, 400, 600, and 800 were tested for all the algorithms and the best achieved result is reported.
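As a concrete reference for the KPCA crossover used in these experiments, the whole loop of Alg. 4.3 can be sketched in plain numpy. This is a simplified reconstruction under stated assumptions: all names are mine, feature-space centering and the '99.99% of variance, but not less than 10 components' heuristic are omitted, and a plain guarded fixed-point loop implements Eq. 4.8:

```python
import numpy as np

def rbf_mat(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B (Eq. 4.6)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_crossover(pop, n_offspring, sigma=1.0, n_iter=50, seed=0):
    """KPCA model building and sampling (Alg. 4.3), without feature-space centering."""
    rng = np.random.default_rng(seed)
    mu, sd = pop.mean(axis=0), pop.std(axis=0) + 1e-12
    Z = (pop - mu) / sd                          # 1. standardize the population
    N, D = Z.shape
    K = rbf_mat(Z, Z, sigma)                     # 2. train the KPCA model:
    lam, A = np.linalg.eigh(K)                   #    eigendecompose the kernel matrix
    lam, A = lam[::-1], A[:, ::-1]               #    (descending eigenvalues)
    keep = lam > 1e-10                           #    drop numerically zero components
    V = A[:, keep] / np.sqrt(lam[keep])          #    normalized eigenvectors of Eq. 4.5
    Y = K @ V                                    # 3. transform the population (Eq. 4.5)
    low, high = Y.min(axis=0), Y.max(axis=0)     # 4. bounding hypercube in F
    Ynew = rng.uniform(low, high, (n_offspring, V.shape[1]))  # 5. sample images
    offspring = np.empty((n_offspring, D))
    for i, y in enumerate(Ynew):                 # 6. pre-images via Eq. 4.8
        alpha = V @ y                            #    expansion coefficients of the image
        z = Z[rng.integers(N)]                   #    start from a random parent
        for _ in range(n_iter):
            w = alpha * rbf_mat(Z, z[None, :], sigma).ravel()
            s = w.sum()
            if abs(s) < 1e-12:
                break                            # degenerate denominator: keep current z
            z_new = (w @ Z) / s
            if not np.all(np.isfinite(z_new)):
                break
            z = z_new
        offspring[i] = z
    return offspring * sd + mu                   # 7. destandardize the offspring

rng = np.random.default_rng(3)
pop = np.vstack([rng.normal([0, 0], 0.3, (15, 2)),   # two well-separated clusters
                 rng.normal([5, 5], 0.3, (15, 2))])
kids = kpca_crossover(pop, n_offspring=10)
print(kids.shape)                                # (10, 2)
```

On clustered training data like the above, the sampled offspring tend to fall near the clusters of the parents, which is exactly the behavior illustrated in Fig. 4.9.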
Figure 4.9: Example of using the KPCA crossover (training data points and data points sampled from the KPCA). The KPCA is able to perform clustering and 'curve fitting' at the same time.

Involved Crossover Operators

Experiments were carried out on EAs with the following crossover operators:

GA: A genetic algorithm with the uniform crossover. Binary representation with a resolution allowing the GA to find a solution whose distance from the global optimum is not larger than 2.22 · 10^−16.

UMDA: The univariate marginal distribution algorithm, described in more detail in Sec. 3.1. Empirical equi-height histograms were used as the 1D PDF models. The number of histogram bins was chosen so that if all the bins had equal width, none of them would be wider than 0.1.

DiT-EA: The EDA with the DiT model (the distribution tree model, described in more detail in Sec. 4.1). The parameters of the model used for the comparison were minToSplit = 5, minInNode = 3, and maxPValue = 0.001.

Real-valued EA with SPX: An EA with the simplex crossover operator (described in Tsutsui, Goldberg & Sastry (2001)). Each new individual is created by randomly sampling D + 1 points from the population and generating one point uniformly from the simplex formed by the selected D + 1 points.

EDA with KPCA: An EA with the KPCA crossover presented in this section. The kernel parameter σ was set to 1. When performing the inverse transform, the number of principal components used was set to the minimal number of components able to describe at least 99.99% of the variance in the feature space, but not less than 10.

Results and Discussion

The statistics tracked along the course of evolution are described in Sec. 3.2.6. The overall results are shown in Tab. 4.3.

Table 4.3: Comparison of the EA with the KPCA model against algorithms with different crossover operators (KPCA, SPX, DiT, UMDA, GA) on the 2D Two Peaks, Griewangk, and Rosenbrock functions.
The Two Peaks function is separable (having no interactions between the variables) and thus the UMDA was expected to solve such a problem very efficiently. Surprisingly, the EA using the KPCA crossover was able to solve this problem with efficiency and reliability comparable to those of the UMDA (see Fig. 4.10). This is, however, caused by the low dimensionality of the problem and by the ability of the KPCA to implicitly perform clustering. With increasing dimensionality of this problem the UMDA scales well (see Sec. 3.1.3), but the KPCA is not expected to preserve its superior results in that case (larger population sizes would be needed). The other algorithms solved this problem much worse than the UMDA and the KPCA; the SPX crossover failed at this task, ending in a deceptive attractor of the Two Peaks function in half of all the runs.

The Griewangk function is not separable and the UMDA is not able to solve this function (see Fig. 4.11). The SPX crossover solves this problem best; however, this is due to the fact that the global optimum lies in the middle of the other local optima. SPX creates many offspring in between the parents and thus in the neighborhood of the global optimum. As the Two Peaks function showed in the previous paragraph, the SPX is not very efficient when solving a general multimodal function.
The EDA using the DiT model is very efficient at solving this task as well, although it needs a larger population than the SPX and the KPCA. The KPCA solves this task reliably with a relatively small population; however, it needs a long time while it preserves a 'subpopulation' on each of the local optima. After the 'subpopulations' vanish due to the selection, the global optimum is reached very quickly.

The Rosenbrock function has the strongest dependency between variables of all three test problems. The SPX and the KPCA were the only algorithms able to solve this task. The KPCA-EA needs a much smaller population and is able to solve this 'EA-hard' problem very quickly. It needs only hundreds of fitness function evaluations, a number that is still worse than, but already comparable to, the numbers achieved by more classic, gradient-based optimization techniques.

Figure 4.10: 2D Two Peaks function—comparison of the KPCA-EA against other algorithms (GA, pop. size 600; UMDA, pop. size 200; DiT-EA, pop. size 600; SPX, pop. size 100; KPCA-EA, pop. size 100)

Figure 4.11: 2D Griewangk function—comparison of the KPCA-EA against other algorithms (GA, pop. size 600; UMDA, pop. size 800; DiT-EA, pop. size 400; SPX, pop. size 100; KPCA-EA, pop. size 100)
Figure 4.12: 2D Rosenbrock function—comparison of the KPCA-EA against other algorithms (GA, pop. size 800; UMDA, pop. size 800; DiT-EA, pop. size 600; SPX, pop. size 200; KPCA-EA, pop. size 50)

4.2.6 Summary

The KPCA model introduced in this section provides an alternative solution to the third aim of this thesis: to create a probabilistic model able to focus the search on promising areas of the search space in situations when the problem has higher-order interactions among variables. Its advantages are most distinguishable in situations when the distribution of a small group of tightly linked variables should be described (that is also the reason for using only low-dimensional test problems). When solving practical problems of higher dimensionality, the KPCA should be accompanied by a 'device' for detecting the dependencies between variables (i.e. for decomposing the problem at hand). I hypothesize that the KPCA will not scale well in the general case, and that without problem-structure learning and decomposition its superior efficiency and reliability will vanish.

No special parameter tuning for the KPCA model was performed, and the KPCA crossover solved all the low-dimensional problems with the same parameter settings. There is an open space for research in the field of creating good heuristics to set the parameters, or of adapting the parameters during the EA run. The experiments presented in this section suggest that the KPCA crossover has a great potential and that, when used appropriately, it belongs among the best operators for competent evolutionary algorithms.

Advantages of KPCA-EA

Flexibility: The model is mainly data-driven, and as such it can describe a huge class of dependencies among variables.

Clustering capable: The KPCA model can identify and reproduce the distribution of promising individuals in the form of clusters.

Curve modeling capable: The KPCA model can learn and reproduce the distribution of promising individuals in the form of curves.
Rotational invariance: The model is not sensitive to any rotation of the coordinate system.

Limitations of KPCA-EA

Need for proper initialization: The KPCA model is very precise and it usually does not generate new individuals in areas where no training points were present. It is thus critical to supply in the training data set such individuals which form a good representation of the promising solutions.

Bad scalability: The KPCA model generally requires larger populations when used in higher dimensions, i.e. when modeling the distribution of a larger group of tightly linked variables.

Lack of decomposition ability: The model is not able to decompose the problem at hand into several simple subproblems.

High time and space demands: The model stores a portion of the training data and performs the eigendecomposition of the kernel matrix, which is of size N × N. Most of the time, however, is demanded by the generation of new individuals: the creation of each new individual is itself an optimization task.

Chapter 5

Population Shift

Chapters 3 and 4 presented probabilistic models with strong and weaker structural assumptions, respectively. EDAs using these models (when properly set up) are able to solve optimization tasks precisely. However, the crucial assumption of all these models is that the initial population and all subsequent populations must lie in the vicinity of the target solution, i.e. the population must surround the solution. If this requirement is broken during the evolution, it is usually very hard for these algorithms to recover; they usually converge prematurely to locally optimal solutions (or even to solutions which are not locally optimal).

When using the histogram models or the distribution tree model, we need to set up the box constraints, i.e. the limits of the hypercube in which the search should be carried out. They can be set up before the evolution, or they can be set up adaptively during the evolution.
In both cases, we can make an error: we can set the hypercube in such a way that it will not surround the best solution. These algorithms do not have any chance to search beyond the box boundaries. If we decide to use Gaussian distributions, their mixtures, or the KPCA model, the situation is better, because we do not have any hard boundaries of the search space. However, even with these models, the chance of generating new individuals outside the areas in which the parent individuals lie is very low. This chapter describes the reasons for these phenomena, and presents methods able not only to focus the search on promising areas but also to shift the population to such areas. These methods, however, have the form of local optimizers, i.e. they are not able to cope with several local optima at the same time.

5.1 The Danger of Premature Convergence

Premature convergence is a phenomenon that has been observed in EAs for a long time. It arises if the search algorithm somehow misses the true global optimum, another suboptimal solution takes over the population, the population loses its diversity, and the search stops. Rudolph (2001) shows that self-adaptive mutations may lead to premature convergence. In that article, the mutations are created by sampling random mutation vectors from the Gaussian distribution whose parameters are self-adapted. The conclusions of that article hold for the EDAs which use the Gaussian distribution as well.

Bosman & Grahl (2006) present a theoretical model of the progress of a simple EDA solving a 1D problem. The algorithm they analyze is an ordinary EDA (see Alg. 2.1) with the following instantiation:

• the truncation selection with parameter τ is used, i.e. the best τ × 100% of the individuals are used to create the model,
• the Gaussian distribution N(µ, σ²) is used with maximum likelihood estimates of µ and σ.
They analyze the behavior of this algorithm when searching for an optimum of a linear (or any monotonic) fitness function defined on the interval (−∞, ∞). What would we expect from any search algorithm applied to such an ever increasing (or ever decreasing) function? The algorithm should never stop searching and its best-so-far solution x should go to infinity. Bosman et al., however, proved that such an algorithm is not able to move to infinity, because the distance it can 'travel' is limited. They derived the following difference equations describing the evolution of the parameters of the Gaussian distribution:

µ^(t+1) = µ^(t) + σ^(t) · d(τ),   (5.1)
(σ^(t+1))² = (σ^(t))² · c(τ),   (5.2)

where µ^(0) and (σ^(0))² are the parameters of the Gaussian distribution used to generate the first population, and d(τ) and c(τ) are coefficients describing the change of the parameters µ and σ², given by the following equations:

d(τ) = φ(Φ⁻¹(τ)) / τ,   (5.3)
c(τ) = 1 + Φ⁻¹(1 − τ) · φ(Φ⁻¹(τ)) / τ − ( φ(Φ⁻¹(τ)) / τ )²,   (5.4)

where Φ and φ are the CDF and PDF of the standard normal distribution, respectively.

Figure 5.1: Coefficients d (left) and c (right) in relation to the selection proportion τ.

The dependence of the coefficients d(τ) and c(τ) on τ is depicted in Fig. 5.1. We can see that the only way to prevent the variance σ² from diminishing is to select the whole population (τ = 1), but then the movement of the center µ would be zero (d(1) = 0). Or, we can force the algorithm to take large steps by selecting only a few best individuals (τ is small), but then the variance σ² drops to a very small number, which limits the exploration ability. If these facts are considered altogether, the following formulas can be derived:

lim_{t→∞} µ^(t) = µ^(0) + σ^(0) · d(τ) · 1 / (1 − √c(τ)),   (5.5)
lim_{t→∞} (σ^(t))² = 0.   (5.6)
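These difference equations are easy to check by simulation. The sketch below (all names are mine) iterates the analyzed EDA instantiation on the monotonic fitness f(x) = x and compares the point at which the mean stalls with the prediction of Eq. 5.5:

```python
import math
from statistics import NormalDist
import numpy as np

SN = NormalDist()

def d_coef(tau):
    """Eq. 5.3: expected standardized step of the mean under truncation selection."""
    return SN.pdf(SN.inv_cdf(tau)) / tau

def c_coef(tau):
    """Eq. 5.4: multiplicative change of the variance under truncation selection."""
    r = SN.pdf(SN.inv_cdf(tau)) / tau
    return 1.0 + SN.inv_cdf(1.0 - tau) * r - r ** 2

tau, n = 0.3, 100_000
mu, var = 0.0, 1.0                      # mu^(0) = 0, (sigma^(0))^2 = 1
rng = np.random.default_rng(0)
for _ in range(200):                    # the EDA of Bosman & Grahl on f(x) = x
    x = rng.normal(mu, math.sqrt(var), n)
    sel = np.sort(x)[-int(tau * n):]    # truncation selection (maximize f)
    mu, var = sel.mean(), sel.var()     # maximum likelihood re-estimation

predicted = d_coef(tau) / (1.0 - math.sqrt(c_coef(tau)))  # Eq. 5.5 with mu0=0, sigma0=1
print(round(predicted, 2))              # approx. 2.39; the simulated mean stalls nearby
```

Even though the fitness grows without bound, the simulated mean stops within a small constant multiple of the initial standard deviation of the search distribution, while the variance collapses toward zero, exactly as Eqs. 5.5 and 5.6 predict.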
This means that the population of the studied EDA eventually loses its diversity, so that the search stops. The distance to which the Gaussian distribution can travel is limited and is given only by the initial variance (σ^(0))² and the selection ratio τ.

The analyzed EDA is a typical example of a badly instantiated EDA. Very often, if the truncation selection is used with a model estimated by maximum likelihood, premature convergence occurs. Although we do not have such theoretical analyses for other instantiations, there is strong empirical evidence that premature convergence can arise even for other learning methods and other selection schemes. Researchers use a variety of remedies for the premature convergence phenomenon, but they usually do not provide any analysis of the algorithm behavior, or any proof that their remedy is generally applicable. The most often used remedies are the following:

• Use truncation selection and maximum likelihood estimates, but set a lower bound on the distribution variance. This often ensures that the algorithm does not stop searching, but the speed of progress and the precision of the solution are limited.

• Use maximum likelihood estimates but a different selection scheme ensuring higher variability of the selected individuals. It is unclear under what circumstances this remedy works well.

• Use truncation selection and maximum likelihood estimation, but increase the variance by a selected factor. If the factor is not selected properly, the algorithm can diverge, i.e. the variance can increase in each step even if the population is situated around the optimum.

• Use a special kind of selection and a carefully chosen procedure for updating the model (at least one parameter of the model is not set up using maximum likelihood estimation).
• Do not estimate the distribution of the selected individuals, but rather the distribution of successful mutation steps.

The next section describes two methods which use the fifth remedy and a combination of the fourth and fifth remedies, respectively.

5.2 Using a Model of Successful Mutation Steps

Instead of building a model of the selected individuals, the methods presented in this section use the successful mutation steps to generate new ones. This approach is somewhat similar to evolutionary strategies, where the mutation vectors are attached to each individual. On the contrary, in the methods presented here, the mutation vectors are detached from the individuals to which they were applied. This implies that the algorithm performs a local search (if the mutation steps are applied to one prototype individual in each iteration), or a kind of parallel local search (if the mutation vectors are applied to more individuals in the population; in that case, the shapes of the individual basins of attraction should be similar for the distribution to adapt successfully).

One of the most straightforward ways to utilize the successful mutation steps was presented by Pošík (2005b), who used a cooperative co-evolution approach. His EA has two populations: one with the solution vectors and one with the successful mutation steps. The solution vectors are evaluated by the fitness function and are created by randomly pairing them with the mutation steps that were useful in previous generations. The mutation steps are evaluated by the fitness increase they caused and are created by mutating the old mutation vectors. This approach proved to be flexible in such a way that it can adapt to the changing local neighborhoods of the fitness landscape. However, its convergence is slow and it is not able to efficiently reintroduce diversity into its population if needed.
A different and more successful approach is constituted by the evolutionary strategy with covariance matrix adaptation.

5.2.1 Adaptation of the Covariance Matrix

Many algorithms from the class of continuous EDAs use the normal distribution as the model used to generate new individuals. A few of them use the multivariate normal distribution with a full covariance matrix; the two most typical examples are EMNA (the Estimation of Multivariate Normal Algorithm, see e.g. Larrañaga & Lozano (2002)) and CMA-ES (the Evolutionary Strategy with Covariance Matrix Adaptation, see e.g. Hansen & Ostermeier (2001)). Both algorithms actually use the same formula to generate new individuals:

x^(g+1) ∼ N(µ^(g), Σ^(g)).   (5.7)

They differ in the way the parameters of the normal distribution are estimated.

The main feature of EMNA is that it uses the multivariate normal distribution to encode the distribution of the selected individuals. Its parameters are estimated via the maximum likelihood method (see Fig. 5.2, left). As can be seen, the variance in the direction of the gradient has dropped; EMNA is very susceptible to premature convergence, as discussed in Sec. 5.1. In practical applications it must be equipped with some of the above-mentioned premature convergence remedies.

Figure 5.2: An example of one iteration of EMNA (left) and CMA-ES (right). From the initial distribution (dashed line) new points are sampled, evaluated, and divided into selected points (blue dots) and discarded points (red crosses). EMNA then fits a Gaussian model to the selected points and places it at the mean of the selected points. CMA-ES creates a Gaussian model of the selected mutation steps and places it at the best individual.

The evolutionary strategy with covariance matrix adaptation tries to view the neighborhood of the current point as a convex quadratic function.
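A single EMNA iteration as depicted in Fig. 5.2 (left) takes only a few lines; the sketch below (names mine) uses a linear fitness, where the maximum likelihood re-fit visibly shrinks the variance in the direction of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, C = np.zeros(2), np.eye(2)                  # current search distribution (Eq. 5.7)

x = rng.multivariate_normal(mu, C, size=500)    # sample new individuals
fitness = x[:, 0]                               # linear fitness: maximize x_1
sel = x[np.argsort(fitness)[-150:]]             # truncation selection, tau = 0.3

mu_new = sel.mean(axis=0)                       # EMNA: ML estimate of the mean ...
C_new = np.cov(sel.T, bias=True)                # ... and of the full covariance matrix

print(C_new[0, 0] < C[0, 0])                    # True: variance along the gradient drops
```

Repeating this step drives the variance along the direction of progress toward zero, which is precisely the premature-convergence mechanism analyzed in Sec. 5.1.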
The aim of updating the covariance matrix is to estimate on-line the inverse Hessian matrix of the quadratic model of the neighborhood. CMA-ES uses a rather intricate update scheme for the parameters of the normal distribution N(µ^(g), σ^(g) · C^(g)) which is used to sample new individuals. CMA-ES updates all three parameters separately:

The center of the distribution, µ: The basic update scheme is based on the maximum likelihood method, i.e.

µ^(g+1) = arg max_µ P( x_selected^(g) | µ^(g+1) ).   (5.8)

Individual variants of CMA-ES thus place the center at the best selected individual, or at the average of several best individuals (sometimes weighted by their relative fitness).

The covariance matrix, C: Again, the update scheme is based on maximum likelihood estimation, i.e.

C^(g+1) = arg max_C P( (x_selected^(g) − µ^(g)) / σ^(g) | C^(g+1) ).   (5.9)

The covariance matrix is computed for the successful mutation steps x_selected^(g) − µ^(g). Again, CMA-ES can use a weighted covariance matrix, as in the case of the distribution center.

The global step size, σ: The step size is conceptually adapted in such a way that the two most recent consecutive steps of µ are conjugate, i.e. perpendicular with respect to the metric given by (C^(g))^−1:

( (µ^(g+2) − µ^(g+1))^T × (C^(g))^−1 × (µ^(g+1) − µ^(g)) ) / (σ^(g+1))² → 0.   (5.10)

Of course, this equation is only conceptual and cannot be used to set σ^(g+1), because µ^(g+2) is not known. However, it reveals that CMA-ES performs a kind of principal components analysis over the steps of the distribution center. Moreover, CMA-ES also takes into account the previous values of the strategy parameters µ, C, and σ (taking a weighted average of the old and new values), and uses cumulation to filter out possibly erratic moves of the distribution center. For the implementation details, see e.g. Hansen (2006).

5.2.2 Empirical Comparison: CMA-ES vs. Nelder-Mead

The CMA-ES is considered the cutting-edge method for real-valued optimization.
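The only difference between Eq. 5.9 and the plain maximum likelihood fit of EMNA is the reference point: the covariance is taken over the successful steps x_selected^(g) − µ^(g), i.e. around the old mean, not around the new one. A small sketch makes the contrast of Fig. 5.2 concrete (names mine; weighting, cumulation and the step-size adaptation are omitted, σ = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_old = np.zeros(2)                            # mu^(g)
x = rng.multivariate_normal(mu_old, np.eye(2), size=500)
sel = x[np.argsort(x[:, 0])[-150:]]             # selection on a linear fitness

steps = sel - mu_old                            # successful mutation steps (Eq. 5.9)
C_steps = steps.T @ steps / len(steps)          # covariance around the OLD mean
C_emna = np.cov(sel.T, bias=True)               # EMNA: covariance around the NEW mean

# Measuring from the old mean preserves variance in the direction of progress:
print(C_steps[0, 0] > C_emna[0, 0])             # True
```

This is why the right-hand panel of Fig. 5.2 keeps an elongated distribution along the gradient while the left-hand (EMNA) panel collapses in that direction.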
Its superiority over other evolutionary optimization methods was shown in many comparative studies (among others, see e.g. the IEEE CEC 2005 Special Session on Real-Parameter Optimization). For a list of successful applications, see Hansen (2005).

The CMA-ES has features of local optimizers. Thus, the superiority of CMA-ES in all the above comparisons may only indicate that the functions used in the comparisons are better optimized with local optimizers. Although this is not entirely true, let me present here a comparison of CMA-ES with a well-known local optimization technique, the Nelder-Mead simplex search described in Sec. 1.2.2.

Vestibulo-ocular Reflex Analysis

The following results were reported by Pošík (2005a). He used the Nelder-Mead simplex method and CMA-ES for the analysis of the vestibulo-ocular reflex (VOR) signal. The details of the task can be found in Appendix B. Here, it suffices to state the fundamental aim of the optimization task. We have a system (the vestibulo-ocular system) that is fed with an input signal, a mixture of sine waves (MOS) of different frequencies, amplitudes, and phase shifts. On the system output we observe a signal depicted in Fig. 5.3: many signal segments with gaps between them.

Figure 5.3: Vestibulo-ocular reflex signal (simulated slow-phase segments), the input of the analysis.

These segments are assumed to come from another MOS signal which has the same frequency components as the input MOS signal; however, these components can have different amplitudes and phase shifts. The aim of the analysis is to estimate the parameters (amplitudes and phase shifts) of the underlying MOS signal, and to align the VOR signal segments with the estimated MOS signal, as can be seen in Fig. 5.4.

Figure 5.4: The VOR signal segments aligned with the underlying MOS signal, the output of the analysis.

After that, we can compare the amplitudes and phase shifts of the individual sine components of the input MOS and the estimated output MOS signals, i.e. we can construct several points of the frequency response of the vestibulo-ocular system.

The fitness function is described in Appendix B. It is a quadratic loss function, computed along the time domain after the individual VOR signal segments were moved 'vertically' to coincide with the estimated MOS signal.

Experimental Setup

The two methods were tested on artificially generated VOR signals to assess their success and precision and to decide which of the optimization algorithms is more suitable for this task. The tests were carried out for MOS signals consisting of 1 to 5 sine components, i.e. the search was carried out in 2-, 4-, 6-, 8-, and 10-dimensional parameter spaces (the amplitudes and phase shifts of the individual sine components are sought).

First, for each sine component of the signal, the values of frequency, amplitude, and phase shift were randomly generated. The ranges for the individual parameters can be found in Table 5.1. Using these randomly generated values, a continuous MOS signal (which is to be estimated) is created. This signal then undergoes a disruption process which cuts it into individual segments with gaps between them. The segments are then placed approximately at the same level (see Fig. 5.3).

Table 5.1: Settings for the parameters of the artificial VOR signal

    Parameter         Value (Range)
    f_i               ⟨0.05, 2⟩
    a_i               ⟨0.2, 2⟩
    φ_i               ⟨0, π/2⟩
    Sampling Freq.    500 Hz
    Signal Duration   20 s
    Saccade Duration  0.05 s

For each number of components, 9 different VOR signals were generated. For each of them, the parameters of the underlying MOS were estimated by minimizing the loss function using both the Nelder-Mead simplex search and the CMA-ES. In each run, the algorithms were allowed to perform 10,000 evaluations of the loss function.
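For illustration, the artificial input signal can be generated as a mixture of sine waves with parameters drawn from the ranges of Table 5.1. The following is a minimal sketch only; the disruption process that cuts the signal into segments is omitted, and all function and variable names are mine:

```python
import math
import random

def generate_mos(n_components, fs=500.0, duration=20.0, rng=random):
    """Sample MOS parameters from the ranges in Table 5.1 and return
    both the parameters and the sampled mixture-of-sines signal."""
    freqs  = [rng.uniform(0.05, 2.0)        for _ in range(n_components)]
    amps   = [rng.uniform(0.2, 2.0)         for _ in range(n_components)]
    phases = [rng.uniform(0.0, math.pi / 2) for _ in range(n_components)]
    n_samples = int(fs * duration)
    signal = [
        sum(a * math.sin(2 * math.pi * f * (k / fs) + p)
            for f, a, p in zip(freqs, amps, phases))
        for k in range(n_samples)
    ]
    return (freqs, amps, phases), signal

random.seed(0)
params, signal = generate_mos(3)  # 3 components -> 6-dimensional search space
```

The optimizers then search only over the amplitudes and phase shifts; the frequencies of the underlying signal are assumed known.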
A particular run was considered successful if the algorithm found a parameter set with a loss function value lower than 10^−8.

Results and Discussion

First, let us review the success rates of both algorithms when estimating the parameters of the MOS signal with the number of components ranging from 1 to 5 (see Tab. 5.2).

Table 5.2: Success rates (in percent) of the Simplex and CMA-ES algorithms

    Components    1      2      3      4      5
    Simplex     100.0  100.0  100.0   44.4    0.0
    CMA-ES      100.0  100.0  100.0  100.0  100.0

As we can see, the simplex algorithm has difficulties finding the optimum of the loss function in less than 10,000 evaluations when the underlying MOS signal has 4 or more components.

The comparison of speed is based on the number of evaluations needed to find a solution with loss lower than 10^−8, i.e. only successful runs are considered. The results are summarized in Fig. 5.5. The two graphs reveal that the number of needed evaluations increases with the number of components (i.e. with the dimensionality of the search space) much faster for the simplex search method than for the CMA-ES, for which the increase is almost linear (at least subquadratic). CMA-ES is a clearly more scalable solution than the simplex search.

The progress of evolution is depicted in Fig. 5.6. It presents the loss function value of the best solution found so far, averaged over all successful runs. Again, there is no line for the simplex method searching for parameters of 5 components.

Figure 5.5: Number of evaluations needed to find a solution with quality better than 10^−8 as a function of the number of components of the underlying MOS signal (left: Nelder-Mead simplex search, right: CMA-ES).
Middle line: median, box: interquartile range, whiskers: minimum and maximum.

Figure 5.6: Typical progress of successful search runs as performed by the simplex method and by the CMA-ES. Both of the leftmost lines (solid and dashed) belong to 1 component; the rightmost dashed line belongs to the simplex searching for parameters of 4 components, while the rightmost solid line belongs to CMA-ES searching for parameters of 5 components.

Based on these results, a recommendation can be given: do not use the simplex method when searching for parameters of a MOS signal with more than 2 components. The CMA-ES solves such tasks faster and more reliably.

5.2.3 Summary

So far, this chapter has discussed the role of adaptivity in situations when a population shift is needed, i.e. in situations when the population does not surround the target optimal solution. It was shown that the most straightforward type of model adaptation using the maximum likelihood method often leads to premature convergence, so that some additional 'tweaks' are needed. Then, the CMA-ES was conceptually introduced: a clever algorithm on the boundary between evolutionary strategies and EDAs that uses the normal distribution as the model of successful mutation steps and adapts the covariance matrix in a way that prevents premature convergence. The CMA-ES was then compared with the Nelder-Mead simplex search method; when increasing the dimensionality of the search space, the CMA-ES quickly outperforms the simplex search and is a more scalable solution.

Advantages of CMA-ES

Scalability: The population size grows only logarithmically with the search space dimension; the time scale for adapting the covariance matrix is approximately quadratic.
Flexibility: Due to the adaptation mechanism, CMA-ES is able to solve unimodal non-separable problems with non-linear interactions among variables.

Low population sizes: Compared to other EAs, CMA-ES uses rather low population sizes, which enables quick adaptation to local neighborhoods and an effective population shift.

Resistance to premature convergence: CMA-ES uses the step size control to prevent premature convergence.

Rotational invariance: The model is not sensitive to any rotation of the coordinate system.

Stationarity: The model parameters (µ, C) are unbiased under random selection. The step size σ is the exception: E(σ^(g+1) | σ^(g)) > σ^(g).

Limitations of CMA-ES

Local search: Although, with regard to the taxonomy of optimization methods presented in Sec. 1.1, the CMA-ES belongs to the class of asymptotically complete methods, it exhibits features of a local search in the neighborhood of one point. If not properly initialized (the center of the distribution, µ, and the step size, σ), it will not be able to escape from the basin of attraction of some local optimum.

Locally linear dependencies only: The multivariate normal distribution can describe only linear dependencies (the model has a similar decomposition ability to UMDA with linear coordinate transforms, described in Sec. 3.2). It is not able to describe valleys in the form of curves (however, it is able to traverse them), and it cannot perform clustering (i.e. it cannot reproduce the distribution of the population members if they form clusters in the search space).

What's Next?

The technique the CMA-ES uses for the covariance matrix adaptation is not the only one possible, and it is not optimal from several points of view. Higher-order methods can be used to estimate the covariance matrix of the search distribution; this is the topic of the next section.

5.3 Estimation of Contour Lines of the Fitness Function
The previous section described the CMA-ES, the state of the art among algorithms that use the normal distribution to sample new points. It adapts the step size separately from the 'directions' of the multivariate normal distribution. The adaptation is based on accumulation of the previous steps that the algorithm made.

Auger et al. (2004) proposed a method improving the CMA-ES covariance matrix adaptation using a quadratic regression model of the fitness function in the local neighborhood. Their approach, however, required at least D(D+3)/2 + 1 data vectors in D-dimensional space to learn the quadratic function. Moreover, it assumed that each point has its fitness value, i.e. it cannot use selection schemes based on pure comparisons of two individuals. The method is also not invariant with respect to order-preserving transformations of the fitness function.

This section describes an initial study of a novel algorithm of Pošík & Franc (2006) for learning the Gaussian distribution by modeling the fitness landscape contour line between the selected and discarded individuals. It uses a modified perceptron algorithm that finds an elliptic decision boundary if one exists. If it does not exist, the algorithm will not stop. From this, the main limitation of the algorithm immediately follows: in its present state, the algorithm is able to optimize only convex quadratic functions without noise.

5.3.1 Principle and Methods

The basic principle of the proposed method is illustrated in Fig. 5.7 (right). After evaluation of the population, the method constructs a model of the fitness function contour line in the form of an ellipse (a D-dimensional ellipsoid) that allows us to discriminate between the selected and discarded individuals. The decision boundary is of the form x^T A x + Bx + c = 0, where x is a D-dimensional column vector representing a population member, A is a positive definite D × D matrix, B is a row vector with D elements, and c is a scalar value.
This elliptic function is closely related to the Gaussian distribution: setting the mean of the distribution to the center of the ellipsoid and the covariance matrix to Σ = A^(−1), the elliptic decision boundary corresponds to a contour line of the Gaussian density function. The candidate members of the new population are then sampled from this distribution.

The proposed method uses a variation of the perceptron algorithm, which finds a linear decision function. To learn an ellipsoid, we map the points into a different space and then map the learned linear function back into the original space, where it forms the ellipsoid. This ellipsoid is then turned into a Gaussian used to sample new points. The following paragraphs introduce the methods used to accomplish this process.

Figure 5.7: An example of learning the covariance matrix using a method similar to CMA-ES (left) and the proposed method (right) based on estimation of the contour line between the selected (blue dots) and the discarded (red crosses) individuals. The proposed method gives more accurate estimates of the covariance matrix and the position of the Gaussian.

Quadratic Mapping

We need to learn a quadratic function which allows us to discriminate between the two classes of data points. The classifier is given as

C(x) = 1 iff x^T A x + Bx + c > 0,
C(x) = 2 iff x^T A x + Bx + c < 0.   (5.11)

The decision boundary x^T A x + Bx + c = 0 is required to be a hyperellipsoid, which is a special case of a quadratic function. As already stated, the hyperellipsoid is to be learned with a method designed to find a linear decision boundary. To be able to do that, we have to use a coordinate transform such that if we fit the linear decision boundary in the transformed space, we can transform it back and get a quadratic function.
This process is sometimes referred to as basis expansion (see Hastie et al. (2001)) or feature space straightening (see Schlesinger & Hlaváč (2002)).

The matrix A is symmetric, i.e. a_ij = a_ji, i, j ∈ ⟨1, D⟩. We can rewrite the decision boundary in the following form:

a_11 x_1 x_1 + 2 a_12 x_1 x_2 + ... + 2 a_1D x_1 x_D
+ a_22 x_2 x_2 + ... + 2 a_2D x_2 x_D
+ ...
+ a_DD x_D x_D
+ b_1 x_1 + b_2 x_2 + ... + b_D x_D + c = 0   (5.12)

This equation defines a quadratic mapping qmap which for each point x from the input space creates a new, quadratically mapped point z, where

z = qmap(x) = (x_1², 2 x_1 x_2, ..., 2 x_1 x_D, x_2², ..., 2 x_2 x_D, ..., x_D², x_1, ..., x_D, 1)^T.   (5.13)

Then, if we arrange the coefficients a_ij, b_i, and c into a vector w so that

w = (a_11, a_12, ..., a_1D, a_22, ..., a_2D, ..., a_DD, b_1, ..., b_D, c),   (5.14)

we can write the decision boundary as wz = 0 and the whole classifier as

C(x) = C(z) = 1 iff wz > 0,
C(x) = C(z) = 2 iff wz < 0.   (5.15)

The dimensionality of the feature space is easily computed as the number of terms in Eq. 5.12: we have D(D+1)/2 quadratic terms, D linear terms, and 1 constant term. This gives D(D+3)/2 + 1 dimensions.

The learning of a quadratic decision boundary can be carried out by the following process:
1. Transform the points x from the input space to points z in the quadratically mapped feature space using Eq. 5.13.
2. Find the vector w defining the linear decision boundary in the feature space.
3. Rearrange the elements of vector w into matrices A, B, and c using Eq. 5.14.

Separating Hyperplane

There are many ways to learn a separating hyperplane. In this method, the well-known perceptron algorithm is used.¹ The perceptron algorithm can be stated as follows. We have training vectors z_i ∈ Z of the form z_i = (z_i1, ..., z_iD, 1), each of them classified into one of two possible classes, C(z_i) ∈ {1, 2}.
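The three-step process above hinges on the identity w · qmap(x) = x^T A x + Bx + c. The following sketch (function names are mine) implements Eqs. 5.13 and 5.14 and can be used to check that identity:

```python
def qmap(x):
    """Eq. 5.13: map a point x into the quadratic feature space.
    Order: x_i*x_j terms for i <= j (off-diagonal terms doubled),
    then the linear terms, then the constant 1."""
    d = len(x)
    z = []
    for i in range(d):
        for j in range(i, d):
            z.append(x[i] * x[j] if i == j else 2.0 * x[i] * x[j])
    z.extend(x)
    z.append(1.0)
    return z

def unpack_w(w, d):
    """Eq. 5.14: rearrange the weight vector w into the symmetric
    matrix A, the row vector B, and the scalar c."""
    a = [[0.0] * d for _ in range(d)]
    k = 0
    for i in range(d):
        for j in range(i, d):
            a[i][j] = a[j][i] = w[k]
            k += 1
    return a, w[k:k + d], w[k + d]

def quadratic(a, b, c, x):
    """Evaluate x^T A x + B x + c directly in the input space."""
    d = len(x)
    val = c + sum(b[i] * x[i] for i in range(d))
    val += sum(x[i] * a[i][j] * x[j] for i in range(d) for j in range(d))
    return val

# Feature space dimensionality: D(D+1)/2 + D + 1 = D(D+3)/2 + 1.
assert len(qmap([0.0] * 3)) == 3 * (3 + 3) // 2 + 1
```

Because A is symmetric, the doubled off-diagonal features in qmap exactly absorb the pairs a_ij x_i x_j + a_ji x_j x_i, so the linear function wz and the quadratic function coincide for every x.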
We search for a weight vector w so that w z_i > 0 iff C(z_i) = 1 and w z_i < 0 iff C(z_i) = 2. In other words, we search for a hyperplane that separates the two classes and contains the coordinate origin. The algorithm is presented as Alg. 5.1. Of course, the algorithm will not stop if the two classes of qmap-ed vectors are not linearly separable, i.e. if the original vectors are not separable by a quadratic decision boundary.

Ensuring Ellipticity

The previous paragraphs showed a way of learning a quadratic decision boundary by mapping the training vectors into the quadratic space, finding a linear decision boundary, and rearranging the elements of the weight vector w into matrices A, B, and c. However, this quadratic decision function might not be elliptic, i.e. the matrix A might not be positive definite.

The perceptron algorithm described in Alg. 5.1 is basically an algorithm for the satisfaction of constraints given in the form of linear inequalities. The usual set of constraints that must be satisfied is w z_i > 0 for all i. If we found a way to describe the requirement of positive definiteness of the matrix A in the form of similar inequalities, and if we were able to find vectors that violate these inequalities, we could use an only slightly modified perceptron algorithm to learn an elliptic decision boundary. Such a way exists and is described in the following paragraphs.

¹ For the perceptron algorithm, a method was found that ensures that the matrix A computed from the weight vector w will be positive definite.
Algorithm 5.1: Perceptron Algorithm
1   begin
2       w = 0                                   /* initialize the weight vector */
3       z_i = −z_i for all i where C(z_i) = 2   /* invert points in class 2 */
4       z* = arg min_{z ∈ Z} (wz)               /* find the training vector with minimal projection onto w */
5       if wz* > 0 then
6           exit          /* the minimal projection is positive; the separating hyperplane is found */
7       else
8           w = w + (z*)^T   /* wz* ≤ 0; adapt the weight vector using the point with the greatest error */
9           goto step 4
10  end

As shown in Sec. 5.3.1, the quadratic form can be written using a linear function:

x^T A x + Bx + c = wz.   (5.16)

Matrix A is positive definite iff the condition x^T A x > 0 holds for all non-zero vectors x ∈ R^(D×1). In order to write the condition of positive definiteness x^T A x > 0 in terms of the weight vector w, let us define a 'pure' quadratic mapping pqmap for the vectors x in which only the quadratic elements are present, while the D linear elements and the 1 absolute element are substituted with zeros:

q = pqmap(x),   (5.17)

where

q_i = z_i  iff i ∈ ⟨1, D(D+1)/2⟩,
q_i = 0    iff i ∈ ⟨D(D+1)/2 + 1, D(D+3)/2 + 1⟩,   (5.18)

and z = qmap(x). Using any D-dimensional vector x and its transformed variant q = pqmap(x), the following two conditions are equivalent:

x^T A x > 0  ⇐⇒  wq > 0.   (5.19)

Furthermore, all eigenvalues of any positive definite matrix are positive. If we perform the eigendecomposition of matrix A and get negative eigenvalues, then the related eigenvectors v violate the condition for positive definiteness, i.e. v^T A v = wq ≤ 0, where q = pqmap(v). These pqmap-ed eigenvectors can thus be used to adapt the weight vector w in the same way as ordinary qmap-ed data vectors. A modified version of the perceptron algorithm that ensures the positive definiteness of the resulting matrix A is shown as Alg. 5.2.
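Alg. 5.1 translates almost line by line into code. The following is a sketch with my own naming; an iteration cap is added because, as noted above, the algorithm does not terminate on inseparable data:

```python
def perceptron(vectors, labels, max_iter=10000):
    """Alg. 5.1: find w with w.z > 0 for class 1 and w.z < 0 for class 2.
    `vectors` are assumed already augmented with a trailing 1 (or qmap-ed)."""
    # Step 3: invert the class-2 points, so the goal becomes w.z > 0 for all.
    zs = [[-v for v in z] if lab == 2 else list(z)
          for z, lab in zip(vectors, labels)]
    w = [0.0] * len(zs[0])                       # step 2: w = 0
    for _ in range(max_iter):
        # Step 4: training vector with minimal projection onto w.
        z_star = min(zs, key=lambda z: sum(wi * zi for wi, zi in zip(w, z)))
        if sum(wi * zi for wi, zi in zip(w, z_star)) > 0:
            return w                             # separating hyperplane found
        w = [wi + zi for wi, zi in zip(w, z_star)]   # adapt w with worst point
    raise RuntimeError("no separation within max_iter (data may be inseparable)")

# Tiny example: two point clouds separable by a hyperplane.
data = [[2.0, 1.0, 1.0], [3.0, 0.5, 1.0], [-2.0, -1.0, 1.0], [-3.0, -0.5, 1.0]]
labels = [1, 1, 2, 2]
w = perceptron(data, labels)
```

For the elliptic classifier of this section, `vectors` would be the qmap-ed population members, with discarded individuals in class 1 and selected ones in class 2.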
Again, the algorithm will not stop if the original vectors are not separable by an elliptic decision boundary.

Algorithm 5.2: Modified Perceptron Algorithm
1   begin
2       w = 0                                   /* initialize the weight vector */
3       z_i = −z_i for all i where C(z_i) = 2   /* invert points in class 2 */
4       z* = arg min_{z ∈ Z} (wz)               /* find the training vector with minimal projection onto w */
5       A ← use Eq. 5.14                        /* arrange the first D(D+1)/2 elements of w into matrix A */
6       (λ*, v*) ← MinimalEigenvalueAndEigenvectorOf(A)
7       if wz* > 0 and λ* > 0 then
8           exit          /* both the minimal projection and the minimal eigenvalue are positive; done */
9       else if wz* < λ* then
10          w = w + (z*)^T          /* adapt w using the data vector with the greatest error */
11      else
12          w = w + pqmap(v*)^T     /* adapt w using the pqmap-ed eigenvector for the sake of ellipticity */
13      goto step 4
14  end

From Quadratic Function to Gaussian Distribution

The quadratic function x^T A x + Bx + c learned by the modified perceptron algorithm is not defined uniquely; all functions of the type k(x^T A x + Bx + c), k ≠ 0, have the same decision boundary. One reasonable way of standardization (assuming matrix A is positive definite) is to fix the function value at the minimum of the function. The minimum lies at the point µ = −½ (B A^(−1))^T. The function value at the minimum is deliberately chosen to be −1, i.e. the following equations must hold:

k (µ^T A µ + B µ + c) = −1,   (5.20)
k = −1 / (µ^T A µ + B µ + c).   (5.21)

The matrices defining the standardized quadratic function are then given as A_S = kA, B_S = kB, and c_S = kc.
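The standardization of Eqs. 5.20 and 5.21 and the extraction of the Gaussian parameters can be sketched for the 2-D case as follows (names are mine; the 2×2 inverse is written out explicitly to keep the sketch dependency-free):

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[ m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det,  m[0][0] / det]]

def quadratic_to_gaussian(a, b, c):
    """Standardize k*(x^T A x + B x + c) so the minimum value is -1
    (Eq. 5.21) and return mu = -1/2 (B A^-1)^T and Sigma = A_S^-1."""
    a_inv = inv2(a)
    # A is symmetric, so (B A^-1)^T = A^-1 B^T.
    mu = [-0.5 * sum(a_inv[i][j] * b[j] for j in range(2)) for i in range(2)]
    # Function value at the minimum.
    f_min = c + sum(b[i] * mu[i] for i in range(2)) \
              + sum(mu[i] * a[i][j] * mu[j] for i in range(2) for j in range(2))
    k = -1.0 / f_min                                   # Eq. 5.21
    a_s = [[k * a[i][j] for j in range(2)] for i in range(2)]
    return mu, inv2(a_s)                               # Sigma = A_S^-1

# Example: x^2 + y^2 - 2x = 0 is a circle of radius 1 centered at (1, 0);
# the minimum of the quadratic is -1 at (1, 0), so k = 1 and Sigma = A^-1 = I.
mu, sigma = quadratic_to_gaussian([[1.0, 0.0], [0.0, 1.0]], [-2.0, 0.0], 0.0)
```

In the worked example the quadratic already has minimum value −1, so the standardization leaves it unchanged; for any other scaling of the same boundary, k rescales A so that the unit Mahalanobis contour of N(µ, Σ) coincides with the learned ellipse.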
The multivariate normal distribution N(µ, Σ) which will be used to sample new points is then given by µ = −½ (B A^(−1))^T and Σ = A_S^(−1).

Sampling from the Normal Distribution

Sampling from the Gaussian distribution with center µ and covariance matrix Σ is a rather standard task. The distribution, however, suffers from the curse of dimensionality in such a way that the proportion of generated vectors lying inside the separating ellipsoid drops with increasing dimensionality. Suppose we have a set of vectors generated from the D-dimensional standardized normal distribution. Each of the coordinates has a unidimensional standardized normal distribution, and their sum of squares has a χ² distribution with D degrees of freedom, χ²_D. Thus, if we want to specify the proportion p, p ∈ (0, 1), of vectors lying inside the separating ellipsoid, we can employ the inverse cumulative distribution function of the χ² distribution, CDF⁻¹_{χ²_D}, in the way described in step 3 of the sampling algorithm, Alg. 5.3.

Algorithm 5.3: Sampling from the normal distribution with a specified proportion of points lying inside the separating ellipsoid
1   begin
2       (Λ, R) ← Eig(Σ)   /* eigendecompose the covariance matrix so that Σ = R^T Λ² R, where R is a rotation matrix of eigenvectors and Λ² is a diagonal matrix of the eigenvalues, i.e. Λ is a diagonal matrix of standard deviations along the individual principal axes */
3       Λ ← Λ / sqrt(CDF⁻¹_{χ²_D}(p))   (5.22)   /* modify the standard deviations using the critical value of the χ² distribution */
4       x_S ∼ N(0, I)   /* generate the desired number of vectors x_S from the standardized multivariate Gaussian distribution */
5       x_C = R Λ x_S   /* rescale them using the standard deviations Λ and rotate them using R */
6       x = x_C + µ     /* decenter the vectors x_C using the center µ */
7   end

5.3.2 Empirical Comparison

To assess the very basic characteristics of the proposed algorithm, it is compared with the CMA-ES on two quadratic fitness functions, spherical and ellipsoidal (see Appendices A.1 and A.2, respectively).

Evolutionary Model

The evolutionary model used in the experiments is depicted as Alg. 5.4. CMA-ES uses its own standard evolutionary model.

Algorithm 5.4: Evolutionary model for the proposed method
1   begin
2       Initialize the population of size N and evaluate it
3       while not TerminationCondition do
4           Assign the discarded individuals to class 1 and the selected individuals to class 2
5           Map the population to the quadratic feature space
6           Find a linear decision boundary given by vector w using the modified perceptron algorithm
7           Rearrange the components of w into matrices A, B, and c
8           Turn the quadratic model into a Gaussian distribution
9           Modify the eigenvalues of the Gaussian so that the ellipsoid contains the desired proportion of new individuals
10          Sample N − 1 new individuals from the learned Gaussian
11          Evaluate them
12          Join the old and new populations using elitism and discard individuals so that the population is of size N again
13  end

The population is initialized in the area ⟨−10, −5⟩^D in order to test not only the ability to focus the search when it resides in the area of the optimum, but also the ability to efficiently shift the population toward the optimum. Elitism and sampling of N − 1 new individuals together ensure that the new population will contain at least 1 old individual and 1 new individual.
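The rescaling in step 3 of Alg. 5.3 can be checked empirically in two dimensions, where the inverse CDF of χ²₂ has the closed form −2 ln(1−p). This is a sketch with my own names; the covariance here is the identity, so the eigendecomposition is trivial (R = I):

```python
import math
import random

def sample_inside_fraction(p, n=20000, seed=3):
    """Sample from N(0, Sigma') with the standard deviations scaled per
    Eq. 5.22 and measure the fraction of samples falling inside the
    unit-Mahalanobis ellipsoid of the original Sigma = I."""
    rng = random.Random(seed)
    # Eq. 5.22 for D = 2: Lambda <- Lambda / sqrt(CDF^-1_{chi2_2}(p)),
    # using the closed-form inverse CDF of chi-squared with 2 dof.
    chi2_inv = -2.0 * math.log(1.0 - p)
    lam = 1.0 / math.sqrt(chi2_inv)
    inside = 0
    for _ in range(n):
        x = (lam * rng.gauss(0, 1), lam * rng.gauss(0, 1))
        if x[0] ** 2 + x[1] ** 2 <= 1.0:   # inside the separating ellipsoid
            inside += 1
    return inside / n

frac = sample_inside_fraction(0.9)  # empirically close to the requested 0.9
```

The scaled sample lies inside the ellipsoid exactly when the underlying χ²₂ variate falls below CDF⁻¹(p), which happens with probability p, independently of the dimension.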
This feature greatly prevents stagnation in situations when no new individual is better than the worst old individual. The algorithm was stopped when the best fitness value in the population dropped below 10^−8. The reported results are taken from 20 independent runs for each algorithm and configuration.

Where to Place the Gaussian Center?

After finding the decision ellipsoid and turning it into a Gaussian distribution, we can decide where we want to place it. There are basically two possible decisions:

1. Place it at the center of the learned quadratic function. This placement is reminiscent of the process used in conventional EDAs (and is actually depicted in Fig. 5.7, right). Also, if the ellipsoid were fit precisely, the algorithm could jump to the area of the global optimum in a few iterations.

2. Place it around the best individual of the population. Such an approach is similar to mutative ES, and the search is more local.

Since the competitor of the proposed algorithm is the CMA-ES, which uses the second option, the learned Gaussian distribution is centered around the best individual. This also makes the algorithm behave more like a local search and prevents possibly 'long jumps' from the current center to the new estimated center of the Gaussian.

Figure 5.8: Comparison of average evolution traces for the proposed algorithm (solid line) and the CMA-ES (dashed line) on the sphere function (left) and the ellipsoid function (right). The individual lines belong to the 2-, 4-, 6-, and 8-dimensional versions, respectively, from left to right.

Population Sizes

The CMA-ES uses a population sizing equation of the form N = 4 + ⌊3 ln(D)⌋.
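With the natural logarithm, this sizing equation reproduces the CMA-ES row of Table 5.3 below (a quick check; the function name is mine):

```python
import math

def cma_es_popsize(d):
    """Default CMA-ES population sizing: N = 4 + floor(3 * ln(D))."""
    return 4 + math.floor(3 * math.log(d))

sizes = {d: cma_es_popsize(d) for d in (2, 4, 6, 8)}
# -> {2: 6, 4: 8, 6: 9, 8: 10}
```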
For the proposed method, no population sizing model exists yet. However, the evaluation concentrates on the potential hidden in the proposed method, and thus I decided to tune the population size for the individual test problems and individual dimensionalities.² The best settings found for the proposed algorithm, along with the population size used by CMA-ES, are presented in Table 5.3.

Table 5.3: Best population sizes for both test problems

    Dimension                      2   4   6   8
    CMA-ES                         6   8   9  10
    Proposed method, Sphere        9   8   7   6
    Proposed method, Ellipsoidal  11  10   8   6

5.3.3 Results and Discussion

The comparison of the algorithms can be seen in Fig. 5.8. For the sphere function, the proposed approach is slightly better for dimensions 2 and 4, and slightly worse for dimensions 6 and 8. This could suggest that the algorithm will not scale up well and that its efficiency could drop with increasing dimensionality. However, the proposed algorithm used lower population sizes than the CMA-ES; it is possible that after a careful study of the algorithm's properties, it will be able to keep its superiority. For the ellipsoid function, the proposed algorithm clearly outperforms the CMA-ES for all tested dimensions (2, 4, 6, 8). But again, the difference between the proposed method and CMA-ES gets smaller with increasing dimensionality.

Coming back to Table 5.3, we can observe a very interesting phenomenon. The 'optimal' population size drops with increasing dimensionality, which is something really unexpected. For the time being, no sound explanation for this phenomenon is provided. I can only hypothesize that it is due to the learning algorithm, the modified perceptron. Fig. 5.9 shows an example of the ellipsoid found by the perceptron.

² This is not a good practice for production systems, but in this early stage of the research such tuning is acceptable for discovering the potential of the proposed method.
We can see that the ellipsoid is far from the ideal circular decision boundary. It may thus be profitable to use smaller populations, which would impose fewer constraints on the ellipsoid, which would in turn be less deformed. On the contrary, as can be seen in Fig. 5.9 (dashed line), the support vector machine (SVM) finds the maximum-margin separating hyperplane, which is a very useful feature. However, for the time being, a method of SVM training that would ensure the ellipticity of the learned decision boundary must still be invented.

Figure 5.9: The difference in decision boundaries found by the modified perceptron algorithm (solid line) and the support vector machine (dashed line).

5.3.4 Summary

This chapter presented a solution to the last aim of this thesis: to find a method usable in situations when the population members are placed in an area that does not contain any local optimum. From many points of view, the best algorithm able to accomplish this goal is CMA-ES.

In this thesis, a novel scheme for learning the Gaussian distribution in the context of EDAs was presented. The method is based on training a classifier with an elliptic decision boundary; after the training, it should correctly classify the selected and the discarded individuals if an elliptic decision boundary exists. The experiments (carried out on low-dimensional test problems only, for the time being) suggest that there is a huge potential in this method: for the sphere and ellipsoidal fitness functions, the proposed method outperformed the state-of-the-art CMA-ES optimizer.

The proposed method was developed independently, but it turned out that it could be described as an instance of the Learnable Evolution Model framework of Wojtusiak & Michalski (2006), which is also based on constructing classifiers of the selected and discarded population members. However, they use a different type of model; they use almost exclusively the AQ21 rules (see Wojtusiak (2004)).
With regard to the model type (Gaussian distribution) and the learning algorithm (elliptic classifier), the proposed method is an original contribution to the EC field.

However, as was already pointed out, the algorithm is very naive in its assumptions. As a result, in its present state it is able to optimize only convex quadratic functions. The development of a more efficient learning algorithm is an open research area.

The possibility of using the learned center of the quadratic function as the center of the Gaussian has not been investigated yet. This feature could further increase the efficiency of the algorithm. On the other hand, when using low population sizes, the estimate of the Gaussian center can become unstable and the search could be more erratic. There is also a promising possibility to look for the center of the search distribution on the line going through the BSF individual and the learned center of the separating ellipsoid, using a line search method.

All the advantages and limitations are not clear yet. The obvious ones are listed below.

Advantages of the proposed method

Resistance to premature convergence: The proposed method estimates neither the distribution of selected points nor the distribution of selected mutation steps; it estimates the contour lines of the fitness function. In principle, it is impossible to converge to a point which is not at least locally optimal.

Rotational invariance: The model is not sensitive to any rotation of the coordinate system.

Limitations of the proposed method

Local search: Similarly to CMA-ES, the proposed method belongs to the class of asymptotically complete methods, but it exhibits features of a local search in the neighborhood of one point.

Locally linear dependencies only: The multivariate normal distribution can describe only linear dependencies (the model has a similar decomposition ability to UMDA with linear coordinate transforms, described in Sec. 3.2).
It is not able to describe valleys in the form of curves (see the next limitation). It also cannot perform clustering (i.e. it cannot reproduce the distribution of population members if they form clusters in the search space).

Need for separability by a single ellipsoid The two classes of points (selected and discarded) must be separable by a single ellipsoid; this is a requirement of the modified perceptron algorithm (if this assumption is not satisfied, the algorithm will not stop). The proposed method in its present state is thus not applicable to multimodal and noisy functions, or to functions with nonlinear dependencies among variables. A learning algorithm that relaxes this assumption and allows for class overlap must be developed.

Non-optimality of the learned ellipsoid The ellipsoids learned by the modified perceptron algorithm are not optimal in many respects (see Fig. 5.9). Such non-optimalities become highly disturbing with increasing dimensionality; the algorithm's performance deteriorates. There are algorithms that fit better decision boundaries (e.g. support vector machines), but they must be altered to produce elliptic, not just quadratic, decision boundaries.

Chapter 6 Conclusions and Future Work

This thesis dealt with the applications of probabilistic models and coordinate transforms in EAs for real domains. The purpose of this chapter is to summarize the contributions of this thesis to the EC field and to suggest directions for future research.

6.1 The Main Contributions

The main ideas presented in this thesis are the following:

Importance of neighborhood structure learning After introducing the notions of optimization, Chapter 1 emphasized the fact that the neighborhood structure¹ is the crucial part of any optimization algorithm: if the neighborhood structure is not suitable for the particular optimization problem, the problem can hardly be solved efficiently, quickly, and reliably.
In population-based optimizers, the neighborhood structure is given by the variational operators. The population is usually the only means of adaptation; the operators remain static. This in turn motivates the development of methods capable of adapting the neighborhood structure by means of adapting the variational operators, i.e. there is an open space for inductive and transductive machine learning methods. One class of these methods is constituted by EDAs, which use probabilistic modeling of the population distribution (often complemented with coordinate transforms) to describe the particular neighborhoods. Population-based search algorithms are thus viewed as methods performing a search in the local neighborhood of the population. This view is not common in the EC field and provides a unifying perspective on all search algorithms.

Essential features of any optimization algorithm Chapter 1 also recognized the differences between discrete and real search spaces. For any real-valued optimization algorithm, it is crucial not only to address the design problems we face in discrete domains, i.e. (1) to learn the structure of the problem, and (2) to be able to focus the search on promising areas; the algorithm also has to be able (3) to shift the population by a particular distance in a particular direction (something that makes no sense in discrete domains), and (4) to make the first three requirements cooperate. Many researchers have tried to equip their algorithms with the above-mentioned features, but none of them expressed all these requirements explicitly. Chapters 3, 4, and 5 were dedicated to each of the first three requirements.

¹ As explained in footnote 8 on page 13, the neighborhood is given by the variational operator; in this context, the neighborhood is not considered to be given only by some static metric or topology.
Introduction and assessment of a new marginal probability histogram model In Sec. 3.1, a validation of the present knowledge about the use of univariate probabilistic models inside the marginal product model was presented and elaborated. Moreover, the max-diff histogram model was introduced and compared to equi-width and equi-height histograms and to the univariate mixture of Gaussians. The max-diff histogram can be constructed easily and proved to be comparable to the winner of this comparison, the equi-height histogram. The max-diff histogram has not yet been applied² in the context of EDAs; along with the comparison with the other univariate marginal models, it appears to be an original contribution of the author.

The use of linear coordinate transforms The flexibility of the marginal product probabilistic model can be improved by employing linear coordinate transforms (particularly PCA and ICA), which make it possible to describe a greater number of search-space constellations. Even though these preprocessing transforms were applied globally to the whole population, they are able to improve the efficiency of a search algorithm using the marginal product model on multimodal test functions. The use of these transforms is not new in EAs; however, their comparison presented in Sec. 3.2 of this thesis broadens the knowledge of their effect on EDA behavior. The study also revealed that the no-free-lunch theorem applies here as well; we do not know beforehand which of these transforms gives better results, or whether any transform should be applied at all.

The distribution tree model A new model of the joint probability distribution was developed in Sec. 4.1. It has a tree structure and is based on the CART framework. It can be described as a kind of multidimensional histogram. The most serious limitation of the model is its ability to create only axis-parallel splits of the search space.
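Returning to the max-diff histogram mentioned above: a minimal sketch of its construction and of sampling from it is given below. The boundary rule used here (cutting at the largest gaps between consecutive sorted values) is the standard max-diff rule from the database-histogram literature; function names and details are illustrative, not the thesis' exact algorithm:

```python
import numpy as np

def maxdiff_histogram(values, n_buckets):
    """Max-diff histogram: bucket boundaries are placed in the middles of the
    n_buckets - 1 largest gaps between consecutive sorted sample values."""
    xs = np.sort(np.asarray(values, dtype=float))
    gaps = np.diff(xs)
    cut_idx = np.sort(np.argsort(gaps)[-(n_buckets - 1):])
    edges = np.concatenate([[xs[0]],
                            (xs[cut_idx] + xs[cut_idx + 1]) / 2,
                            [xs[-1]]])
    counts, _ = np.histogram(xs, bins=edges)
    probs = counts / counts.sum()            # per-bucket probabilities
    return edges, probs

def sample(edges, probs, n, rng):
    """Sample by choosing a bucket by its probability, then uniformly inside it."""
    b = rng.choice(len(probs), size=n, p=probs)
    return rng.uniform(edges[b], edges[b + 1])
```

In a univariate marginal-product EDA, each coordinate of the selected individuals would get its own histogram of this kind, and new individuals would be sampled coordinate by coordinate.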
It is also possible to create a univariate probability model in the form of a distribution tree and use it in the UMDA framework; however, this possibility was not pursued in this thesis. The model is an original contribution of the author.

Application of kernel PCA The use of linear PCA and the need for a model of non-linear dependencies motivated the application of kernel PCA. Sec. 4.2 introduced the method and its embedding in an EDA. It requires solving the pre-image problem, which is itself an optimization problem. New individuals generated from the KPCA model reproduce the original distribution with high fidelity, but the creation of individuals is very time consuming. Due to these time demands, only the results for low-dimensional test problems are presented. Nevertheless, these results suggest that there is a huge potential in this method when applied in situations where only focusing of the search is needed. The model is able to describe curved valleys and clusters of high-quality solutions at the same time. On the other hand, it is highly prone to premature convergence if the training set (the selected individuals) does not surround the sought optimum.

Enhancement of the CMA-ES algorithm Chapter 5 dealt with methods which are very efficient in situations when a population shift is needed. The need to estimate not the distribution of the selected individuals, but rather the distribution of the selected mutation steps, is emphasized. These methods usually search in an adaptive neighborhood of one point, resembling the behavior of local optimizers although they use a population of individuals. The state-of-the-art CMA-ES algorithm was described and a method for its enhancement was presented (joint work of the author and Vojtěch Franc). The original CMA-ES adapts the covariance matrix and the step size on the basis of the selected mutation steps.

² To the best of the author's knowledge.
The proposed method estimates the covariance matrix on the basis of a binary classifier that is trained to discriminate between the selected and discarded individuals. The contour lines of the normal search distribution should coincide with the contour lines of the fitness function as much as possible. Although the proposed method still relies on some naive assumptions which are hardly fulfilled for real-world problems, it is very promising: on selected test problems it outperformed the CMA-ES algorithm.

6.2 Future Work

The thesis opened more questions than it answered. The algorithms presented in this thesis use rather sophisticated probability models, often in conjunction with coordinate transforms. All of them exhibit some negative features, which constitute the basis of the future work.

Methods for algorithm combinations Chapters 3, 4, and 5 presented algorithms with strong and weak structural assumptions, and methods for efficient population shift, respectively, which were identified in Sec. 1.4.2 as the first three essential features of any optimization algorithm. The fourth feature, allowing these methods to cooperate, was not investigated in this thesis. It is a very broad topic deserving its own dissertation. Remedy: Research into techniques allowing the presented algorithms to be combined so that they cooperate and do not harm each other. The first steps were already conducted in the works on competing heuristics by Tvrdík et al. (2001). However, this approach tends to select the more profitable local optimizers at the beginning of the search, which could prevent more global heuristics from finding better solutions in later stages of the evolution. Another possibility is to use parallel models of GAs (see e.g. Cantù-Paz (2000) or Pošík (2001)). They can easily be extended so that each subpopulation uses its own search heuristic. Migration then constitutes the means of information exchange between the individual search algorithms.
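The parallel-model idea above, each subpopulation running its own heuristic and exchanging individuals through migration, can be sketched as follows. The per-island heuristics (simple elitist Gaussian mutation at two different scales), the migration scheme, and all names are illustrative assumptions, not any particular published algorithm:

```python
import numpy as np

def sphere(x):
    return float(np.sum(x**2))

def evolve(pop, sigma, rng):
    """One elitist (mu+lambda) generation with Gaussian mutation of scale
    sigma, standing in for an island's own search heuristic."""
    children = pop + rng.normal(0.0, sigma, pop.shape)
    merged = np.vstack([pop, children])
    order = np.argsort([sphere(x) for x in merged])
    return merged[order[:len(pop)]]          # sorted: row 0 is the island's best

def island_model(pop_size=10, dim=5, gens=200, migrate_every=10):
    rng = np.random.default_rng(2)
    sigmas = [0.5, 0.05]                     # a coarse and a fine-grained island
    islands = [rng.uniform(-5, 5, (pop_size, dim)) for _ in sigmas]
    initial = min(sphere(x) for p in islands for x in p)
    for g in range(1, gens + 1):
        islands = [evolve(p, s, rng) for p, s in zip(islands, sigmas)]
        if g % migrate_every == 0:
            # ring migration: each island's best replaces the next island's worst
            bests = [p[0].copy() for p in islands]
            for i, b in enumerate(bests):
                islands[(i + 1) % len(islands)][-1] = b
    return initial, min(sphere(p[0]) for p in islands)
```

The point of the design is that migration lets the coarse island hand promising regions to the fine-grained island, which then refines them; either heuristic alone would be slower on at least one phase of the search.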
Meta-learning This thesis showed that EDAs can be successfully applied to many problems and many ‘time instants’ during the evolution. However, it is also obvious that particular models are most suitable for particular problems and particular ‘time instants’. Using one type of EDA to solve all kinds of problems is far from optimal; even using one type of probability model during the whole course of the evolution when solving one particular problem is not optimal. Remedy: Research and development of techniques that would allow us to analyze what kind of problem we are facing, and in what area of the search space the population is situated. This information would allow us to select an optimization technique that is appropriate with regard to the user requirements, the problem type, and the particular ‘time instant’.

Population memory The majority of algorithms and methods presented in this thesis still use the population as the main tool for implementing the memory of the algorithm (with CMA-ES being the exception). The probabilistic models used for creating new individuals are built from scratch in each generation. Thus, the population size still strongly influences the learning ability of the algorithm. Remedy: Research into the techniques used in incremental learning to adapt the probabilistic models. The model should be able to memorize the characteristics of past populations and adapt itself when new promising individuals are selected. This could also relax the need for large populations.

Reduced parallelization ability GAs are popular (among other reasons) because they allow us to parallelize virtually all operations. EDAs, on the contrary, have a lower parallelization ability, which is caused by the probabilistic model construction phase. Remedy: Research into probabilistic models that can be constructed by merging independently learned parts.
Such a property would allow us to distribute even the model learning phase among several processors.

Time and space demands of learning The learning procedures used to create the probabilistic models are often very time consuming and do not scale well with increasing dimensionality. They also often need a population of exponentially increasing size to encompass all the dependencies. Remedy: Selection and research of structurally simple models which can cover a sufficiently rich class of distributions and are easily learnable. The marginal product model is scalable, but structurally too simple. Equipped with PCA (the complexity of which is O(D³)), it encompasses a larger class of distributions at the expense of higher computational costs, etc. These two opposing criteria can possibly be balanced by using some meta-learning approaches. The incremental learning discussed above can help with the demands on population sizes.

Ignoring the solution feasibility None of the presented methods took into account the feasibility of the generated solutions, i.e. they were assumed to solve unconstrained optimization problems or optimization problems with box constraints, respectively. Remedy: There are many techniques for dealing with infeasible solutions: we can pretend they are all feasible, we can use penalty functions, or we can use repairing mechanisms. However, the influence of these techniques on the model used in the particular EDA is not clear. Moreover, models which would allow us to directly incorporate at least some of the constraints should be investigated.

Loss of paternity EDAs exhibit a feature which could be called the loss of paternity, i.e. we do not know which of the old individuals is the parent of a new individual, or, more precisely, all the old individuals are parents of all the new individuals. This prevents us from using various replacement schemes and other operators that take advantage of such a parent-child relation.
These operators are often used to preserve diversity in the population. Remedy: We have to use other means of diversity preservation. They are usually based on the notion of distance (the closer an old individual is, the more likely it is to be considered the parent of a new individual). Such substitutes, however, might be computationally expensive.

LEM approach As suggested in Sec. 5.3, building generative probabilistic models on the basis of classifiers trained to discriminate between selected and discarded individuals is a very promising direction of research. Kernel methods proved to be very flexible (as exemplified by KPCA in Sec. 4.2), so the expectations from SVMs are high. However, the toughest problem is that not all descriptions (models) of the selected individuals provide an easy way of instantiation, i.e. of creating new individuals belonging to the class the model describes.

Appendix A Test Functions

This appendix contains the descriptions of the various functions used throughout the thesis to test and compare individual algorithms. The test functions were carefully chosen not to clutter this thesis with an excessive number of results, yet to reveal the advantages and disadvantages of the individual optimization methods.

A.1 Sphere Function

The sphere function (a.k.a. quadratic function, or quadratic bowl) is the very basic optimization problem. It is given by the equation

f(x) = \sum_{d=1}^{D} x_d^2 .  (A.1)

The optimum: f(0) = 0. It is usually used with the search space unconstrained, or constrained by the hypercube x ∈ ⟨−100, 100⟩^D. The function is unimodal, order-1 separable, symmetric, and rotationally invariant. If an algorithm is not able to solve this easy task, it has no chance of successfully solving more complex functions.

A.2 Ellipsoidal Function

The ellipsoidal function (a.k.a. elliptical function) is a somewhat tougher version of the sphere function. The individual axes are scaled; the task is given as

f(x) = \sum_{d=1}^{D} (10^6)^{\frac{d-1}{D-1}} x_d^2 .
(A.2)

The optimum: f(0) = 0. It is usually used with the search space unconstrained, or constrained by the hypercube x ∈ ⟨−100, 100⟩^D. The function is unimodal and order-1 separable. The scaling of the individual axes makes this function more difficult: the individual object variables are forced to converge one after another. Algorithms prone to premature convergence have difficulties maintaining diversity in the dimensions with lower weights while optimizing the dimensions with higher weights.

A.3 Two-Peaks Function

The two-peaks function was originally introduced as a maximization problem (hence its name). For the purposes of this thesis, it was inverted to a minimization task; it is given by the equation

f(x) = \sum_{d=1}^{D} f_{two-peaks}(x_d) ,  (A.3)

where

f_{two-peaks}(x) =
  5 − 5x           iff x ∈ ⟨0, 1)
  5x − 5           iff x ∈ ⟨1, 2)
  5 − 0.8(x − 2)   iff x ∈ ⟨2, 7)
  5 − 0.8(12 − x)  iff x ∈ ⟨7, 12⟩   (A.4)

The optimum is f(1) = 0; the function is usually used with the box constraints x ∈ ⟨0, 12⟩^D. The function is order-1 separable, but multimodal. These two features together cause the number of local optima to grow as 2^D with the dimensionality. Moreover, as can be seen in Fig. A.1, the size of the basin of attraction of the global optimum relative to the size of the whole search space decreases as (1/6)^D. Without a proper decomposition, this function is almost impossible to solve. Methods that try to decompose the function based on the distribution of good solutions have serious trouble solving this function.

Figure A.1: The basic 1D two-peaks function (left) and the contour lines of the 2D two-peaks function (right).

A.4 Griewangk Function

The Griewangk function is given by the equation

f(x) = 1 + \frac{1}{4000} \sum_{d=1}^{D} x_d^2 − \prod_{d=1}^{D} \cos\left(\frac{x_d}{\sqrt{d}}\right) .  (A.5)

The optimum is f(0) = 0; the function is usually used with the box constraints x ∈ ⟨−5.12, 5.12⟩^D. The function is multimodal and non-separable.
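For reference, the test functions above can be written down directly; the following is a minimal sketch (the two-peaks pieces are written so that they are continuous and satisfy the stated optimum f(1) = 0; function names are the author's choice for illustration only):

```python
import numpy as np

def sphere(x):
    """Eq. (A.1): sum of squares, optimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x**2))

def ellipsoid(x):
    """Eq. (A.2): axes scaled from 1 up to 10^6, optimum f(0) = 0 (D >= 2)."""
    x = np.asarray(x, dtype=float)
    D = len(x)
    weights = (10.0**6) ** (np.arange(D) / (D - 1))
    return float(np.sum(weights * x**2))

def two_peaks_1d(x):
    """Eq. (A.4): 1-D two-peaks piece; global minimum 0 at x = 1,
    local minimum of depth 1 at x = 7."""
    if 0 <= x < 1:
        return 5 - 5 * x
    if 1 <= x < 2:
        return 5 * x - 5
    if 2 <= x < 7:
        return 5 - 0.8 * (x - 2)
    return 5 - 0.8 * (12 - x)        # x in <7, 12>

def two_peaks(x):
    """Eq. (A.3): separable sum of 1-D two-peaks pieces."""
    return float(sum(two_peaks_1d(xd) for xd in x))

def griewangk(x):
    """Eq. (A.5): multimodal and non-separable, optimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = np.arange(1, len(x) + 1)
    return float(1 + np.sum(x**2) / 4000 - np.prod(np.cos(x / np.sqrt(d))))
```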
In the search space, the local optima form a pattern resembling the number 5 on a die (see Fig. A.2, left). However, Whitley et al. (1996) mentioned that the influence of the interactions gets smaller when the Griewangk function is applied in more dimensions. This phenomenon is illustrated in Fig. A.2, right: the relative importance of the surrounding local optima gets lower, and already the diagonal cut of the 10-dimensional Griewangk function is pretty smooth.

Figure A.2: The 2D Griewangk function (left) and the fitness values along the search-space body diagonal for dimensions 1, 2, 5, and 10 (right).

A.5 Rosenbrock Function

The basic 2D Rosenbrock function (a.k.a. banana function) is given by the equation

f(x) = 100 · (x_1^2 − x_2)^2 + (1 − x_1)^2 .  (A.6)

The optimum is f(1) = 0. The function is unimodal, but non-separable, with a high degree of non-linear interaction between the variables. It is very hard to optimize using ordinary evolutionary algorithms. The basic form of this function is two-dimensional (see Fig. A.3).

Figure A.3: The 2D Rosenbrock function (left) and its contour lines (right).

There are several ways of extending this function to spaces of more dimensions:

1. summing up the scores of all non-overlapping pairs of variables, i.e. f_{multidim} = \sum_{d=1}^{D/2} f(x_{2d−1}, x_{2d}),

2. summing up the scores of all consecutive overlapping pairs of variables, i.e. f_{multidim} = \sum_{d=1}^{D−1} f(x_d, x_{d+1}).

Suppose we want to evaluate a 10-dimensional object. The first way would result in 5 calls to the basic f function, while the second way would call the f function 9 times. In this thesis, the first possibility is used.

Appendix B Vestibulo-Ocular Reflex Analysis

In this appendix, the optimization problem used to compare the Nelder-Mead simplex search and CMA-ES in Sec.
5.2 is introduced.

B.1 Introduction

The vestibulo-ocular reflex (VOR) is responsible for maintaining retinal image stabilization in the eyes during relatively brief periods of head movement (see Stockwell (1997)). By analyzing the VOR signal, physicians can recognize some pathologies of the vestibular organ which may result in, e.g., failures of the patient's balance. The recognition of the pathologies is usually done by examining the slow-phase eye velocity and several points of the frequency response (gain and phase shift) of the vestibular system.

The principle of the frequency-response measurement is relatively simple: the patient is seated in a chair which is then rotated in a defined way following a source signal, a sine wave or a mixture of sine (MOS) waves. The chair with the patient is situated in the dark; the patient has his eyes open and performs some mental tasks which should distract him from mental visualization that could prevent the eye movements which are subsequently measured. This is called the head rotation test. Since the resulting eye signal is distorted by fast eye movements, so-called saccades, these must be removed from the signal. This is usually done by computing the angular velocity; the segments with a velocity higher than a predefined threshold are simply omitted from the signal. A method for discovering the right threshold was presented, e.g., in Novák et al. (2002). The resulting signal consists of segments of slow-phase movements, which we are interested in. However, they are not aligned to form a smooth signal (see Fig. B.1).

Figure B.1: Simulated VOR signal with saccades removed. This is the input of the algorithm.

This VOR signal serves as a source for measuring the slow-phase velocity and the frequency response.
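The saccade-removal step described above (omit the samples whose angular velocity exceeds a threshold, keeping the remaining slow-phase segments) can be sketched as follows; the threshold value and the function name are illustrative assumptions:

```python
import numpy as np

def remove_saccades(signal, dt, threshold):
    """Split the eye-position signal into slow-phase segments by discarding
    samples whose finite-difference angular velocity exceeds the threshold."""
    velocity = np.abs(np.gradient(signal, dt))
    keep = velocity <= threshold
    segments, current = [], []
    for sample, ok in zip(signal, keep):
        if ok:
            current.append(sample)
        elif current:                 # a saccade ends the current segment
            segments.append(np.array(current))
            current = []
    if current:
        segments.append(np.array(current))
    return segments
```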
The measurement of the frequency response is usually done by interpolating these segments with some smooth curves and performing a Fourier transform of the resulting continuous signal. The frequency response created this way, however, contains some artifacts that come from the artificial interpolation curves and are not generated by the vestibular system. In this chapter, a method for the direct estimation of the gain and phase lag of the individual sine components of the underlying MOS signal is introduced. It allows for the measurement of several points of the frequency response at the same time. After the estimation, the VOR signal segments should match¹ the corresponding parts of the estimated MOS signal, as shown in Fig. B.2.

Figure B.2: VOR signal segments aligned with the estimated MOS signal. The parameters of the MOS signal are the output of the algorithm.

B.2 Problem Specification

It is assumed that the source signal (which controls the rotation of the chair with the patient) is formed as a mixture of sine waves:

y^S(t) = \sum_{i=1}^{n} a_i^S sin(2π f_i t + φ_i^S) ,  (B.1)

where y^S(t) is the source signal and a_i^S, f_i, and φ_i^S are the amplitude, the frequency, and the phase shift of the individual sine components, respectively. The superscript S indicates the relation to the source signal. Note that the frequencies f_i are not marked with this superscript. It is assumed that the output signal of the vestibular organ is of the same form as the input one, i.e. it contains only sine components with the same frequencies as the source signal, but possibly with different amplitudes and phase shifts. It should be of the form

y(t) = \sum_{i=1}^{n} a_i sin(2π f_i t + φ_i) .  (B.2)

If we knew the a_i and φ_i parameters of the output MOS signal components, we could calculate the amplification (a_i / a_i^S) and the phase lag (φ_i − φ_i^S) at the individual frequencies and deduce the state of the vestibular organ.
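The MOS model of Eqs. (B.1) and (B.2), together with the derived per-frequency amplification and phase lag, can be written down directly. The following is a small illustrative sketch; all parameter values are made up:

```python
import numpy as np

def mos(t, amps, freqs, phases):
    """Mixture-of-sines signal: y(t) = sum_i a_i * sin(2*pi*f_i*t + phi_i)."""
    t = np.asarray(t, dtype=float)
    return sum(a * np.sin(2 * np.pi * f * t + p)
               for a, f, p in zip(amps, freqs, phases))

# Illustrative source and output parameters; the frequencies are shared,
# as assumed in the text.
freqs = [0.1, 0.3, 0.7]
a_src, phi_src = [1.0, 0.8, 0.5], [0.0, 0.2, -0.1]
a_out, phi_out = [0.9, 0.6, 0.3], [-0.3, -0.1, -0.5]

t = np.linspace(0, 20, 2001)
y_src = mos(t, a_src, freqs, phi_src)      # source signal y^S(t), Eq. (B.1)
y_out = mos(t, a_out, freqs, phi_out)      # output signal y(t),  Eq. (B.2)

# Per-frequency amplification a_i / a_i^S and phase lag phi_i - phi_i^S.
gain = np.array(a_out) / np.array(a_src)
lag = np.array(phi_out) - np.array(phi_src)
```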
(Ideally, we should have an amplification² of 1 and a minimal phase lag at all frequencies, which is not possible. However, physicians can analyze the deviations and diagnose the states that are not OK.)

¹ In fact, the slow phases of the measured VOR signal should go in the direction opposite to the source MOS signal. When the chair rotates to the right, the eyes should move to the left and vice versa.
² Or, rather, of −1 with respect to the previous footnote.

Unfortunately, we do not have access to the output MOS signal described by Eq. B.2. We have only the measured VOR signal, i.e. the segments of the output MOS signal that are left after filtering out the saccades from the eye-tracking signal (see Fig. B.1)³. However, we can search for the unknown parameters a_i and φ_i of the output MOS signal by solving the optimization task described in the following text.

B.3 Minimizing the Loss Function

Let m be the number of segments of the VOR signal at hand, let v_j(t), j = 1 . . . m, be the j-th segment of the VOR signal, and let t_j^{ini} and t_j^{end} be the initial and final time instants of the j-th signal segment. As stated above, we can find the parameters of the output MOS signal by searching the 2n-dimensional space of points x, x = (a_1, φ_1, . . . , a_n, φ_n). Such a vector of parameters represents an estimate of the output MOS signal, and we can compute the degree of fidelity with which the MOS corresponds to the VOR signal segments by constructing a loss function as follows:

L(x) = \sum_{j=1}^{m} \int_{t_j^{ini}}^{t_j^{end}} ((v_j(t) − v̄_j) − (y(t) − ȳ_j))^2 dt ,  (B.3)

where v̄_j is the mean value of the j-th VOR signal segment, computed as

v̄_j = \frac{1}{t_j^{end} − t_j^{ini}} \int_{t_j^{ini}}^{t_j^{end}} v_j(t) dt ,  (B.4)

and ȳ_j is the mean value of the current estimate of the output MOS signal related to the j-th segment, computed as

ȳ_j = \frac{1}{t_j^{end} − t_j^{ini}} \int_{t_j^{ini}}^{t_j^{end}} y(t) dt .
(B.5)

Subtracting the means v̄_j and ȳ_j from the VOR signal segments v_j(t) and the MOS signal y(t), respectively, we try to match each VOR signal segment to the corresponding part of the MOS signal. If they match, their difference is zero; otherwise it is a positive number quadratically increasing with the difference. This operation is carried out for all m VOR signal segments. In practice we work with the discretized versions of the signals, so we usually substitute the integrals with sums. The equations are then⁴

L(x) = \sum_{j=1}^{m} \sum_{i=t_j^{ini}}^{t_j^{end}} ((v_j(i) − v̄_j) − (y(i) − ȳ_j))^2 ,  (B.6)

v̄_j = \frac{1}{t_j^{end} − t_j^{ini}} \sum_{i=t_j^{ini}}^{t_j^{end}} v_j(i) ,  (B.7)

ȳ_j = \frac{1}{t_j^{end} − t_j^{ini}} \sum_{i=t_j^{ini}}^{t_j^{end}} y(i) .  (B.8)

³ It is important to note that in this thesis only artificially generated (simulated) VOR signals were used. This allows for assessing the precision of the proposed method.
⁴ In Equations B.6, B.7, and B.8, the arrays v_j(i) are supposed to be indexed with i ranging from t_j^{ini} to t_j^{end}.

B.4 Nature of the Loss Function

Figures B.3, B.4, and B.5 show what the landscape of the loss function L(a_1, φ_1, a_2, φ_2) looks like if two of the parameters are kept fixed. For these figures, the optimal values of the parameters are set to x = (0.6, 1, 0.2, 0.2) (marked with a small cross in the figures). It seems that the loss function L(x) exhibits many features which are considered hard for any optimization algorithm, namely:

Non-separability. It is not sufficient to optimize the parameters one after another. The profile of the loss function along one variable changes significantly with a change in another variable. See Figures B.3, B.4, and B.5: the cross marking the optimum is not situated in the optimum of the cut if the other parameters are not optimal as well. The function cannot be decomposed into a set of simpler optimization tasks.

Long narrow valleys not aligned with the coordinate axes. See Fig. B.3.
Even gradient-based algorithms have problems finding the minimum of such a landscape. They have to perform many small steps along the bottom of the valley before they hit the optimum.

Multimodality. See Fig. B.5. There are several local minima. In this case they are caused by the periodicity of the sine function with respect to the phase shift. However, based on the experience gained when optimizing this function, I hypothesize that the influence of the local optima can be negligible: the optimizers exhibit behavior similar to that on unimodal functions, but with very narrow and perhaps tortuous valleys leading to the global optimum.

Figure B.3: Cuts through the landscape of the loss function L(a_1, φ_1, a_2, φ_2) with φ_1 and φ_2 fixed at their optimal values (left) and at values different from the optimal ones (right).

Figure B.4: Cuts through the landscape of the loss function L(a_1, φ_1, a_2, φ_2) with a_2 and φ_2 fixed at their optimal values (left) and at values different from the optimal ones (right).

Figure B.5: Cuts through the landscape of the loss function L(a_1, φ_1, a_2, φ_2) with a_1 and a_2 fixed at their optimal values (left) and at values different from the optimal ones (right).

Bibliography

Amari, S., Cichocki, A. & Yang, H. H. (1996), A new learning algorithm for blind source separation, in ‘Advances in Neural Information Processing 8’, MIT Press, Cambridge, MA, pp. 757–763.

Auger, A., Schoenauer, M. & Vanhaecke, N.
(2004), LS-CMA-ES: A second-order algorithm for covariance matrix adaptation, in X. Y. et al., ed., ‘Parallel Problem Solving from Nature VIII’, number 3242 in ‘LNCS’, Springer Verlag, pp. 182–191.

Bäck, T. (1992), Self-adaptation in genetic algorithms, in F. J. Varela & P. Bourgine, eds, ‘Towards a Practice of Autonomous Systems: 1st European Conference on Artificial Life’, MIT Press, pp. 263–271.

Baluja, S. (1994), Population based incremental learning: A method for integrating genetic search based function optimization and competitive learning, Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Baluja, S. & Davies, S. (1997), Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space, in D. Fisher, ed., ‘14th International Conference on Machine Learning’, Morgan Kaufmann, pp. 30–38.

Bosman, P. A. & Grahl, J. (2006), ‘Matching inductive bias and problem structure in continuous estimation-of-distribution algorithms’, European Journal of Operational Research.

Bosman, P. A. & Thierens, D. (1999), An algorithmic framework for density estimation based evolutionary algorithms, Technical Report UU-CS-1999-46, Utrecht University.

Bosman, P. A. & Thierens, D. (2000), Continuous iterated density estimation evolutionary algorithms within the IDEA framework, in ‘Workshop Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000)’, pp. 197–200.

Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984), Classification and Regression Trees, Kluwer Academic Publishers.

Cantù-Paz, E. (2000), Efficient and Accurate Parallel Genetic Algorithms, Vol. 1 of Genetic Algorithms and Evolutionary Computation, 1 edn, Kluwer Academic Publishers. ISBN 0792372212.

Comon, P. (1994), ‘Independent component analysis - a new concept?’, Signal Processing 36, 287–314.

Davis, L. (1991), Handbook of Genetic Algorithms, Van Nostrand Reinhold.

de Bonet, J. S., Isbell, C. L. & Viola, P.
(1997), ‘MIMIC: Finding optima by estimating probability densities’, Advances in Neural Information Processing Systems 9, 424–431.

Eiben, A. E., Hinterding, R. & Michalewicz, Z. (1999), ‘Parameter control in evolutionary algorithms’, IEEE Trans. on Evolutionary Computation 3(2), 124–141.

Etxeberria, R. & Larrañaga, P. (1999), Global optimization using Bayesian networks, in A. Rodriguez, M. Ortiz & R. Hermida, eds, ‘CIMAF 99, Second Symposium on Artificial Intelligence’, Adaptive Systems, La Habana, pp. 332–339.

Fogarty, T. C. (1989), Varying the probability of mutation in genetic algorithms, in J. D. Schaffer, ed., ‘Proc. of the 3rd International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 104–109.

Fogel, L. J., Angeline, P. J. & Fogel, D. B. (1995), An evolutionary programming approach to self-adaptation on finite state machines, in J. McDonnel, R. Reynolds & D. Fogel, eds, ‘Proc. of the 4th Annual Conference on Evolutionary Programming’, MIT Press, pp. 355–365.

Friedman, J. H. (1987), ‘Exploratory projection pursuit’, Journal of the American Statistical Association 82(397), 249–266.

Gallagher, M. R., Frean, M. & Downs, T. (1999), Real-valued evolutionary optimization using a flexible probability density estimator, in ‘Genetic and Evolutionary Computation Conference (GECCO-1999)’, Morgan Kaufmann, pp. 840–864.

Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley.

Goldberg, D. E., Deb, K. & Clark, J. H. (1992), ‘Genetic algorithms, noise and the sizing of populations’, Complex Systems 6, 333–362.

Goldberg, D. E. & Smith, R. E. (1987), Nonstationary function optimization using genetic algorithms with dominance and diploidy, in J. J. Grefenstette, ed., ‘Proc. of the 2nd International Conference on Genetic Algorithms and Their Applications’, Lawrence Erlbaum Associates, pp. 59–68.

Grefenstette, J. J. (1986), ‘Optimization of control parameters for genetic algorithms’, IEEE Trans.
on Systems, Man, and Cybernetics 16(1), 122–128.

Hansen, N. (2005), ‘References to CMA-ES applications’. URL: http://www.bionik.tu-berlin.de/user/niko/cmaapplications.pdf

Hansen, N. (2006), The CMA evolution strategy: A comparing review, in J. A. Lozano, P. Larrañaga, I. Inza & E. Bengoetxea, eds, ‘Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms’, number 192 in ‘Studies in Fuzziness and Soft Computing’, Springer, pp. 75–102.

Hansen, N. & Ostermeier, A. (2001), ‘Completely derandomized self-adaptation in evolution strategies’, Evolutionary Computation 9(2), 159–195.

Harik, G. (1999), Linkage learning via probabilistic modeling in the ECGA, Technical Report IlliGAL Report No. 99010, University of Illinois, Urbana-Champaign.

Harik, G., Cantú-Paz, E., Goldberg, D. E. & Miller, B. L. (1997), The gambler’s ruin problem, genetic algorithms, and the sizing of populations, in ‘Proc. of the 4th IEEE Conference on Evolutionary Computation’, pp. 7–12.

Harik, G., Lobo, F. & Goldberg, D. E. (1997), The compact genetic algorithm, Technical Report IlliGAL Report No. 97006, University of Illinois, Urbana-Champaign.

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer Series in Statistics, Springer Verlag.

Hotelling, H. (1933), ‘Analysis of a complex of statistical variables into principal components’, Journal of Educational Psychology 24, 417–441.

Hyvärinen, A. (1999), ‘Survey on independent component analysis’, Neural Computing Surveys 2, 94–128.

Hyvärinen, A. & Oja, E. (2000), ‘Independent component analysis: A tutorial’, Neural Networks 13(4–5), 411–430.

Janikow, C. Z. & Michalewicz, Z. (1991), An experimental comparison of binary and floating point representations in genetic algorithms, in R. Belew & L. Booker, eds, ‘Proc. of the 4th International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 151–157.

Joines, J. A. & Houck, C. R.
(1994), On the use of non-stationary penalty functions to solve nonlinear constrained optimization problems with GAs, in ‘Proc. of the 1st IEEE Conference on Evolutionary Computation’, IEEE Press, pp. 579–584.

Julstrom, B. A. (1995), What have you done for me lately? Adapting operator probabilities in a steady-state genetic algorithm, in L. Eshelman, ed., ‘Proc. of the 6th International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 81–87.

Larrañaga, P., Etxeberria, R., Lozano, J. A., Sierra, B., Inza, I. & Peña, J. M. (1999), A review of the cooperation between evolutionary computation and probabilistic graphical models, in A. Rodriguez, M. Ortiz & R. Hermida, eds, ‘CIMAF 99, Second Symposium on Artificial Intelligence’, Adaptive Systems, La Habana, pp. 314–324.

Larrañaga, P. & Lozano, J. A., eds (2002), Estimation of Distribution Algorithms, GENA, Kluwer Academic Publishers.

Lis, J. & Lis, M. (1996), Self-adapting parallel genetic algorithm with dynamic mutation probability, crossover rate and population size, in J. Arabas, ed., ‘Proceedings of the 1st Polish National Conference on Evolutionary Computation’, Oficyna Wydawnicza Politechniki Warszawskiej, pp. 324–329.

Michalewicz, Z. & Fogel, D. B. (1999), How To Solve It: Modern Heuristics, Springer Verlag. ISBN 3540660615.

Moerland, P. (2000), Mixture Models for Unsupervised and Supervised Learning, PhD thesis, Computer Science Department, Swiss Federal Institute of Technology, Lausanne.

Mühlenbein, H., Mahnig, T. & Rodriguez, A. (1999), ‘Schemata, distributions, and graphical models in evolutionary optimization’, Journal of Heuristics 5, 215–247.

Mühlenbein, H. & Paass, G. (1996), From recombination of genes to the estimation of distributions I. Binary parameters, in ‘Parallel Problem Solving from Nature’, pp. 178–187.

Nelder, J. & Mead, R. (1965), ‘A simplex method for function minimization’, The Computer Journal 7(4), 308–313.

Neumaier, A.
(2004), ‘Complete search in continuous global optimization and constraint satisfaction’, Acta Numerica 13, 271–369.

Novák, D., Cuesta-Frau, D., Brzezný, R., Černý, R., Lhotská, L. & Eck, V. (2002), Method for clinical analysis of eye movements induced by rotational test, in ‘Proc. of EMBEC’.

Očenášek, J. & Schwarz, J. (2002), Estimation of distribution algorithm for mixed continuous-discrete optimization problems, in ‘2nd Euro-International Symposium on Computational Intelligence’, IOS Press, Košice, Slovakia, pp. 227–232. ISBN 1-58603-256-9, ISSN 0922-6389.

Paredis, J. (1995), ‘Coevolutionary computation’, Artificial Life 2(4), 355–375.

Pelikan, M., Goldberg, D. E. & Cantú-Paz, E. (1998), Linkage problem, distribution estimation, and Bayesian networks, Technical Report IlliGAL Report No. 98013, University of Illinois, Urbana-Champaign.

Pelikan, M. & Mühlenbein, H. (1999), The bivariate marginal distribution algorithm, in ‘Advances in Soft Computing – Engineering Design and Manufacturing’, pp. 521–535.

Poosala, V., Ioannidis, Y., Haas, P. & Shekita, E. (1996), Improved histograms for selectivity estimation of range predicates, in ‘1996 ACM SIGMOD Intl. Conf. Management of Data’, ACM Press, pp. 294–305.

Pošík, P. (2001), Parallel genetic algorithms (in Czech), Master’s thesis, Czech Technical University, Prague.

Pošík, P. (2003), Comparing various marginal probability models in evolutionary algorithms, in P. Ošmera, ed., ‘MENDEL 2003’, Vol. 1, Brno University, Brno, pp. 59–64. ISBN 80-214-2411-7.

Pošík, P. (2004), Using kernel principal components analysis in evolutionary algorithms as an efficient multi-parent crossover operator, in ‘IEEE 4th International Conference on Intelligent Systems Design and Applications’, IEEE, Piscataway, pp. 25–30. ISBN 963-7154-29-9.

Pošík, P.
(2005a), On the utility of linear transformations for population-based optimization algorithms, in ‘Preprints of the 16th World Congress of the International Federation of Automatic Control’, IFAC, Prague. CD-ROM.

Pošík, P. (2005b), Real-parameter optimization using the mutation step coevolution, in ‘IEEE Congress on Evolutionary Computation’, IEEE, pp. 872–879. ISBN 0-7803-9364-3.

Pošík, P. & Franc, V. (2006), Using elliptic decision boundary as a probabilistic model in EAs. Personal communication.

Rudlof, S. & Köppen, M. (1996), Stochastic hill climbing by vectors of normal distributions, in ‘First Online Workshop on Soft Computing’, Nagoya, Japan.

Rudolph, G. (2001), ‘Self-adaptive mutations may lead to premature convergence’, IEEE Trans. on Evolutionary Computation 5(4), 410–413.

Schaffer, J. & Morishima, A. (1987), An adaptive crossover distribution mechanism for genetic algorithms, in J. Grefenstette, ed., ‘Proc. of the 2nd International Conference on Genetic Algorithms and Their Applications’, Lawrence Erlbaum Associates, pp. 36–40.

Schlesinger, M. I. & Hlaváč, V. (2002), Ten Lectures on Statistical and Structural Pattern Recognition, Kluwer Academic Publishers, Dordrecht, The Netherlands.

Schölkopf, B. & Smola, A. J. (2002), Learning with Kernels, MIT Press, Cambridge, Massachusetts.

Schölkopf, B., Smola, A. J. & Müller, K.-R. (1996), Nonlinear component analysis as a kernel eigenvalue problem, Technical report, Max-Planck-Institut für biologische Kybernetik.

Schwefel, H.-P. (1995), Evolution and Optimum Seeking, Wiley, New York.

Sebag, M. & Ducoulombier, A. (1998), Extending population-based incremental learning to continuous search spaces, in ‘Parallel Problem Solving from Nature V’, Springer Verlag, Berlin, pp. 418–427.

Servet, I., Trave-Massuyes, L. & Stern, D. (1997), Telephone network traffic overloading diagnosis and evolutionary techniques, in ‘Third European Conference on Artificial Evolution’, pp. 137–144.

Shaefer, C.
(1987), The ARGOT strategy: Adaptive representation genetic optimizer technique, in J. Grefenstette, ed., ‘Proc. of the 2nd International Conference on Genetic Algorithms and Their Applications’, Lawrence Erlbaum Associates, pp. 50–55.

Spears, W. (1995), Adapting crossover in evolutionary algorithms, in J. McDonnel, R. Reynolds & D. B. Fogel, eds, ‘Proc. of the 4th Annual Conference on Evolutionary Programming’, MIT Press, pp. 367–384.

Stockwell, C. (1997), ‘Vestibular testing: past, present, future’, British Journal of Audiology 31(6), 387–398.

Thierens, D. & Goldberg, D. E. (1991), Mixing in genetic algorithms, in R. Belew & L. Booker, eds, ‘Proc. of the 4th International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 31–37.

Tipping, M. E. & Bishop, C. M. (1999), ‘Probabilistic principal component analysis’, Journal of the Royal Statistical Society 61, 611–622.

Tsutsui, S., Goldberg, D. E. & Sastry, K. (2001), Linkage learning in real-valued GAs with simplex crossover, in ‘Proceedings of EA’01’, Le Creusot, France.

Tsutsui, S., Pelikan, M. & Goldberg, D. E. (2001), Evolutionary algorithm using marginal histogram models in continuous domain, Technical Report IlliGAL Report No. 2001019, University of Illinois, Urbana-Champaign.

Tvrdík, J., Křivý, I. & Mišík, L. (2001), Evolutionary algorithm with competing heuristics, in P. Ošmera, ed., ‘Proceedings of MENDEL 2001, 7th Int. Conference on Soft Computing’, Vol. 1, Brno University, pp. 58–64.

Whitley, D., Mathias, K., Rana, S. & Dzubera, J. (1996), ‘Evaluating evolutionary algorithms’, Artificial Intelligence 85, 245–276.

Wojtusiak, J. (2004), AQ21 user’s guide, Reports of the Machine Learning and Inference Laboratory MLI 04-5, George Mason University. http://www.mli.gmu.edu/papers/2003-2004/04-5.pdf.

Wojtusiak, J. & Michalski, R. S.
(2006), The LEM3 system for non-Darwinian evolutionary computation and its application to complex function optimization, Reports of the Machine Learning and Inference Laboratory MLI 04-1, George Mason University, Fairfax, VA. URL: http://www.mli.gmu.edu/projects/lem.html

Wolpert, D. H. & Macready, W. G. (1997), ‘No free lunch theorems for optimization’, IEEE Trans. on Evolutionary Computation 1(1), 67–82.

Zhang, Q., Allison, N. M. & Jin, H. (2000), ‘Population optimization algorithm based on ICA’. URL: http://citeseer.ist.psu.edu/388205.html