On the Use of Probabilistic Models and Coordinate Transforms in Real-Valued Evolutionary Algorithms

Petr Pošík
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Cybernetics
Technická 2, 166 27 Prague 6, Czech Republic
[email protected]

January, 2007

To my parents, Marcela and Vlastimil. Without you I would never have had the chance to start writing this thesis. To Evička and Vojtíšek. Without you I would never have finished it.

Acknowledgements

I would like to express my gratitude to the many people who supported me and helped me with the work on this thesis. First of all, I would like to thank Vojtěch Franc, whose expertise in kernel methods and machine learning allowed me to use them to such a great extent in this thesis. I would also like to thank Jiří Kubalík, who has supported me for many years on my quest to optimize evolutionary algorithms, and my supervisor Jiří Lažanský for giving me so much freedom in my research and for his generous guidance. I am also very grateful to the many people who inspired me in recent years: Nikolaus Hansen, Marcus Gallagher, Jiří Očenášek, Martin Pelikan, Peter Bosman, Jörn Grahl, Bo Yuan, and many others. I would also like to thank my colleague Filip Železný, who read the manuscript, provided many useful comments, suggested improvements, and pointed out many typos I made. Thanks also to my family and my friends. They were always ready to support me and to help me relax when I needed it.

Abstract

This thesis deals with black-box parametric optimization: it studies methods which allow us to find optimal (or near-optimal) solutions for optimization tasks which can be cast as a search for a minimum of a real function with real arguments. In recent decades, evolutionary algorithms (EAs) have gained much interest thanks to their unique features: they are less prone to getting stuck in local minima, they can find more than one optimum at a time, etc.
However, they are not very efficient when solving problems which contain interactions among variables. Recent years have shown that a subclass of EAs, estimation of distribution algorithms (EDAs), allows us to account for interactions among solution components in a unified way. It was shown that they are able to reliably solve hard optimization problems, at least in the discrete domain; their efficiency in the real domain was questionable, and there is still open space for research.

In this thesis, it is argued that a direct application of methods successful in the discrete domain does not lead to successful algorithms in the real domain. The requirements for a successful real-valued EDA are presented. Instead of learning the structure of dependencies among variables (as in the case of discrete EDAs), it is suggested to use coordinate transforms to make the solution components as independent as possible. Various types of linear and non-linear coordinate transforms are presented and compared to other EDAs and GAs. A special chapter is dedicated to methods for preserving diversity in the population.

The dependencies among variables in the real domain can take on many more forms than in the discrete domain. Although this thesis proposes a number of algorithms for various classes of objective functions, a control mechanism which would choose the right solver remains a topic for future research.

Contents

1 Introduction to EAs
  1.1 Global Black-Box Optimization
  1.2 Local Search and the Neighborhood Structure
    1.2.1 Discrete vs. Continuous Spaces
    1.2.2 Local Search in the Neighborhood of Several Points
  1.3 Genetic and Evolutionary Algorithms
    1.3.1 Conventional EAs
    1.3.2 EA: A Simple Problem Solver?
    1.3.3 EA: A Big Puzzle!
    1.3.4 Effect of the Genotype-Phenotype Mapping in EAs
    1.3.5 The Roles of Mutation and Crossover
    1.3.6 Two Ways of Linkage Learning
  1.4 The Topic of this Thesis
    1.4.1 Targeting Efforts
    1.4.2 Requirements for a Successful EA
    1.4.3 The Goals
    1.4.4 The Roadmap

2 State of the Art
  2.1 Estimation of Distribution Algorithms
    2.1.1 Basic Principles
    2.1.2 EDAs in Discrete Spaces
    2.1.3 EDAs in Continuous Spaces
  2.2 Coordinate Transforms

3 Envisioning the Structure
  3.1 Univariate Marginal Probability Models
    3.1.1 Histogram Formalization
    3.1.2 One-dimensional Probability Density Functions
    3.1.3 Empirical Comparison
    3.1.4 Comparison with Other Algorithms
    3.1.5 Summary
  3.2 UMDA with Linear Population Transforms
    3.2.1 Principal Components Analysis
    3.2.2 Toy Examples on PCA
    3.2.3 Independent Components Analysis
    3.2.4 Toy Examples of ICA
    3.2.5 Example Test of Independence
    3.2.6 Empirical Comparison
    3.2.7 Comparison with Other Algorithms
    3.2.8 Summary
4 Focusing the Search
  4.1 EDA with Distribution Tree Model
    4.1.1 Growing the Distribution Tree
    4.1.2 Sampling from the Distribution Tree
    4.1.3 Empirical Comparison
    4.1.4 Summary
  4.2 Kernel PCA
    4.2.1 KPCA Definition
    4.2.2 The Pre-Image Problem
    4.2.3 KPCA Model Usage
    4.2.4 Toy Examples of KPCA
    4.2.5 Empirical Comparison
    4.2.6 Summary

5 Population Shift
  5.1 The Danger of Premature Convergence
  5.2 Using a Model of Successful Mutation Steps
    5.2.1 Adaptation of the Covariance Matrix
    5.2.2 Empirical Comparison: CMA-ES vs. Nelder-Mead
    5.2.3 Summary
  5.3 Estimation of Contour Lines of the Fitness Function
    5.3.1 Principle and Methods
    5.3.2 Empirical Comparison
    5.3.3 Results and Discussion
    5.3.4 Summary

6 Conclusions and Future Work
  6.1 The Main Contributions
  6.2 Future Work

A Test Functions
  A.1 Sphere Function
  A.2 Ellipsoidal Function
  A.3 Two-Peaks Function
  A.4 Griewangk Function
  A.5 Rosenbrock Function
B Vestibulo-Ocular Reflex Analysis
  B.1 Introduction
  B.2 Problem Specification
  B.3 Minimizing the Loss Function
  B.4 Nature of the Loss Function

List of Tables

3.1 UMDA empirical comparison: factor settings
3.2 UMDA empirical comparison: number of bins
3.3 UMDA empirical comparison: results
3.4 PCA vs. ICA empirical comparison: factor settings
3.5 PCA vs. ICA: Griewangk function
3.6 PCA vs. ICA: Two Peaks function
4.1 DiT-EA empirical comparison: factor settings
4.2 DiT-EA empirical comparison: results
4.3 KPCA-EA empirical comparison: results
5.1 Settings for parameters of artificial VOR signal
5.2 Nelder-Mead vs. CMA-ES: success rates
5.3 CMA-ES and Separating Hyperellipsoid: population sizes

List of Figures

1.1 Example of binary and integer neighborhoods
1.2 The neighborhood of Nelder-Mead simplex
1.3 Neighborhoods induced by the bit-flip mutation
1.4 Neighborhoods induced by the 1-point crossover
1.5 Neighborhoods induced by the 2-point crossover
3.1 Equi-width histogram model
3.2 Equi-height histogram model
3.3 Max-diff histogram model
3.4 Mixture of Gaussians model
3.5 Bin boundaries evolution for the Two Peaks function
3.6 Bin boundaries evolution for the Griewangk function
3.7 Two linear principal components of the 2D toy data set
3.8 PCA and marginal equi-height histogram model
3.9 PCA and ICA components comparison
3.10 Principal and independent components of 2D Griewangk
3.11 Observed and expected frequency tables before PCA
3.12 Observed and expected frequency tables after PCA
3.13 Monitored statistics layout
4.1 DiT search space partitioning
4.2 Efficiency of the DiT-EA on the 2D Two Peaks
4.3 Efficiency of the DiT-EA on the 2D Griewangk
4.4 Efficiency of the DiT-EA on the 2D Rosenbrock
4.5 Efficiency of the DiT-EA on the 20D Two Peaks
4.6 Efficiency of the DiT-EA on the 10D Griewangk
4.7 Efficiency of the DiT-EA on the 10D Rosenbrock
4.8 First six nonlinear components of the toy data set
4.9 KPCA crossover for clustering and curve-modeling
4.10 Efficiency of the KPCA-EA on the 2D Two Peaks
4.11 Efficiency of the KPCA-EA on the 2D Griewangk
4.12 Efficiency of the KPCA-EA on the 2D Rosenbrock
5.1 Coefficients d and c in relation to the selection proportion τ
5.2 One iteration of EMNA and CMA-ES
5.3 Vestibulo-ocular reflex signal
5.4 Aligned VOR signal segments
5.5 Nelder-Mead vs. CMA-ES: number of needed evaluations
5.6 Nelder-Mead vs. CMA-ES: typical progress
5.7 CMA-ES vs. Separating Hyperellipsoid: covariance matrix
5.8 CMA-ES vs. Separating Hyperellipsoid: average progress
5.9 Modified Perceptron vs. SVM: separating hyperellipsoid
A.1 Two Peaks function
A.2 Griewangk function
A.3 Rosenbrock function
B.1 Simulated VOR signal
B.2 Aligned VOR signal segments
B.3 The fitness landscape cuts for the VOR analysis I.
B.4 The fitness landscape cuts for the VOR analysis II.
B.5 The fitness landscape cuts for the VOR analysis III.

List of Algorithms

1.1 Hill-Climbing
1.2 Steepest Descent
1.3 An iteration of Nelder-Mead downhill simplex algorithm
1.4 Genetic Algorithm
2.1 Estimation of Distribution Algorithm
2.2 Coordinate transform inside EA
3.1 Generic EDA
3.2 EDA with PCA or ICA preprocessing
4.1 Function SplitNode
4.2 Function FindBestSplit
4.3 KPCA model building and sampling
5.1 Perceptron Algorithm
5.2 Modified Perceptron Algorithm
5.3 Gaussian distribution sampling
5.4 Evolutionary model for the proposed method

Chapter 1

Introduction to Evolutionary Algorithms

This chapter presents an introduction to the field of optimization. It informally mentions the relevant conventional techniques used to solve optimization tasks. It gives a brief survey of ordinary genetic and evolutionary algorithms (GEAs), presents them in a unifying framework with the conventional optimization techniques, and points out some issues in their design. These issues directly led to the emergence of a class of evolutionary algorithms based on the creation of and sampling from probabilistic models, the so-called estimation of distribution algorithms (EDAs), which are described in the next chapter in more detail.

1.1 Global Black-Box Optimization

Optimization tasks can be found in almost all areas of human activity. A baker has to find the optimal composition of the dough to produce rolls which are popular among his customers. A company must create a manufacturing schedule which minimizes the cost of its production. An engineer has to find the settings of a controller which deliver the best response of the system under control.

The optimization task is usually defined as

    x* = arg min_{x ∈ S} f(x),    (1.1)

so that among the members x of the set of all possible objects S, x ∈ S, we have to find the optimal x* which minimizes the objective (cost[1]) function f.
The cost function usually tells us how good the individual objects are.[2] In the examples given above, the objects are the composition of the dough, the schedule, and the settings of the controller, respectively. Similarly, the objective functions are the satisfaction of the baker's customers, the cost of manufacturing, and, e.g., the speed and precision of the system being controlled.

[1] In this thesis, the terms cost function, objective function, and evaluation function are used rather interchangeably. If the distinction between them is important, it is pointed out.
[2] In many cases, it is sufficient to have a function which takes two objects and tells us which of them is better, so that we are able to produce a complete ordering of a set of objects.

There exist many algorithms which try to solve the optimization task. Following Neumaier (2004), optimization methods can be classified by the degree of rigor with which they approach the goal:

An incomplete method uses clever heuristics to guide the search, but may get stuck in a local minimum.

An asymptotically complete method reaches the global optimum with probability one if allowed to run indefinitely long, but has no means to verify whether the global optimum has already been found.

A complete method reaches the global optimum with certainty, assuming an indefinitely long run time, and knows after a finite time that an approximate global minimum has been found (within prescribed tolerances).

With an algorithm from the last of the three[3] categories, we can predict the amount of work needed to find the global minimum within some prescribed tolerances, but this estimate is usually very pessimistic (exponential in the problem size) and serves as a lower bound on the efficiency of an algorithm.

We speak of black-box optimization if the form of the particular objective function is completely hidden from us.
We do not know in advance whether the function is unimodal or multimodal, or whether it can be decomposed into subfunctions; we do not know its derivatives, etc. All we can do is sample some object from the space S and have it evaluated by the cost function. If the search space is infinite, then without any global information no algorithm is able to verify the global optimality of a solution in finite time. In such circumstances, the best algorithm we can construct for black-box optimization problems is some asymptotically complete method. In the rest of this thesis, an algorithm is said to perform a global search if the following condition holds:

    lim_{t→∞} P(x_BSF^t = x*) = 1,    (1.2)

where x_BSF^t is the best-so-far solution found by the algorithm from the start of the search to the time instant t (it is the current estimate the algorithm gives us for the global optimum), and x* is the true global optimum.

1.2 Local Search and the Neighborhood Structure

In this section, several incomplete local search methods are described. A plausible and broadly accepted assumption for any optimization task is that the objective function must be a 'reasonable' one, i.e. it must be able to guide the search, to give some clues where to find an optimum. If this assumption does not hold, e.g. if we are presented with an objective function with a very low signal-to-noise ratio, or if we face a needle-in-a-haystack type of problem, then we can expect that the best algorithm we can use is pure random search.

The most often used optimization procedures are various forms of local search, i.e. incomplete methods. The feature all local search procedures have in common is that they repeatedly look for increasingly better objects in the neighborhood of the current object, thus incrementally improving the solution. They start with an initial object from the search space, generated randomly or created as an initial guess of a solution.
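The pure random search just mentioned already satisfies the global-search condition (1.2), since every region of a bounded space S is sampled with positive probability. A minimal Python sketch, where the sphere function of Appendix A.1 stands in for the hidden black-box f (all names here are illustrative, not from the thesis):

```python
import random

def sphere(x):
    # Stand-in for the hidden black-box objective f.
    return sum(xi * xi for xi in x)

def random_search(f, dim, bounds, evaluations):
    """Sample S uniformly at random, keeping the best-so-far solution x_BSF."""
    lo, hi = bounds
    x_bsf, f_bsf = None, float("inf")
    for _ in range(evaluations):
        x = [random.uniform(lo, hi) for _ in range(dim)]
        fx = f(x)                      # the only access we have to f
        if fx < f_bsf:                 # keep the best-so-far solution
            x_bsf, f_bsf = x, fx
    return x_bsf, f_bsf

x_bsf, f_bsf = random_search(sphere, dim=2, bounds=(-5.0, 5.0), evaluations=10_000)
```

The best-so-far solution converges to the optimum in probability as the number of evaluations grows, but at a rate far too slow to be practical; the local search methods discussed next trade this guarantee for speed.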
The search is carried out until a termination criterion is met. The termination criterion can be, e.g., a limit on the number of iterations or a sufficient quality of the current object, or the search can be stopped when all possible perturbations result in objects worse than the current one. In that case, a local optimum of the cost function has been found.

[3] In the original paper, there were four categories, but the last one dealt with the presence of rounding errors, which is not relevant here, and thus was omitted.

Individual local search strategies differ

• in the definition of the neighborhood, and
• in the way they search the neighborhood and accept better solutions.

If the neighborhood is finite and precisely defined, it can be searched using strict rules. Very often, the neighbors are searched in random order; the local search algorithm can then accept the first improving neighbor, the best improving neighbor, etc. In one particular variant of local search, so-called hill-climbing, an iteration consists of a slight perturbation of the current object; if the result is better, it becomes the current object, i.e. the algorithm accepts the first improving neighbor (see Alg. 1.1).

Algorithm 1.1: Hill-Climbing
  begin
    x ← Initialize()
    while not TerminationCondition() do
      y ← Perturb(x)
      if BetterThan(y, x) then
        x ← y
  end

The cost function f(x) (which is hidden in the function BetterThan in Alg. 1.1) takes the object x and evaluates it. In order to do that, it must 'understand' the argument x, i.e. we have to choose the representation of the object. Many optimization tasks have their obvious 'natural' representations, e.g. amounts of individual ingredients, Gantt charts, vectors of real numbers, etc.
However, the natural representation is often not the most suitable one: some optimization algorithms require a specific representation because they are not able to work with others, and sometimes the task at hand can simply be solved more easily in a different representation (this feature is strongly related to the neighborhood structure of the search space, described below). In such cases, we need a mapping function which transforms the object description from the search space representation (the representation used by the optimization algorithm when searching for the best solution) to the natural representation (suitable for evaluation of the objects). We say that a point in the search space represents the object.

In all local search procedures, the concept of the neighborhood plays a fundamental role. The most straightforward and intuitive definition of the neighborhood of an object is the following:

    N(x) = { y | dist(x, y) < ε },    (1.3)

i.e. the neighborhood N(x) of an object x is the set of objects y whose distance from x is lower than some constant. The neighborhood and its structure, however, are very closely related to the chosen representation of an object and to the perturbation scheme used in the algorithm. From this point of view, a much better definition of the neighborhood is

    N(x) = { y | y = Perturb(x) },    (1.4)

i.e. the neighborhood is the set of objects y which are accessible by one application of the perturbation.

In Fig. 1.1, we can see an example of a small search space represented as binary and integer numbers (the objective values of the individual points are not given there). Now, suppose that the perturbation for the binary representation flips one randomly chosen bit of the point 101B (i.e.
the neighborhood is the set of points with Hamming distance equal to 1), while the perturbation for the integer representation randomly adds or subtracts 1 to/from the point 5D (i.e. the neighborhood is the set of points with Euclidean distance equal to 1). It is obvious that these two neighborhoods neither coincide, nor is one a subset of the other.

Figure 1.1: Example of binary and integer neighborhoods (the search space written as the binary strings 000–111 and as the integers 0–7).

Furthermore, imagine what happens if we enlarge the search space so that we need 4 or more bits to represent the points in the binary space. The number of binary neighbors increases, while the neighborhood in the integer space remains the same. It is worth mentioning that a larger neighborhood results in a lower number of local optima, but also in a greater amount of time spent searching the neighborhood. If we chose a perturbation operator able to generate all points of the search space, we would have only one local optimum (which would also be the global one), but we would have to search the whole space of candidate solutions. For the success of local search procedures, it is crucial to find a good compromise in the size of the neighborhood, i.e. to choose the right perturbation operators.

1.2.1 Discrete vs. Continuous Spaces

The optimization tasks defined in discrete spaces are fundamentally different from the tasks defined in continuous spaces. The number of neighbors in a discrete space is usually finite, and thus we are able, e.g., to test whether a local optimum was found: we can enumerate all neighbors of the current solution, and if none is better, we can finish. This usually cannot be done in a continuous space: there the neighborhood is infinite, and something like enumeration simply does not make sense.[4]
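In the discrete case, such an enumeration is a one-liner. The two neighborhoods of Fig. 1.1 can be listed directly; a small sketch (illustrative code, not from the thesis) for the point 101B = 5D:

```python
def binary_neighbors(x, n_bits):
    # Hamming-distance-1 neighborhood: flip each of the n_bits bits in turn.
    return {x ^ (1 << i) for i in range(n_bits)}

def integer_neighbors(x, lo=0, hi=7):
    # Euclidean-distance-1 neighborhood on the integers lo..hi.
    return {y for y in (x - 1, x + 1) if lo <= y <= hi}

b = binary_neighbors(0b101, n_bits=3)   # {001, 111, 100} = {1, 7, 4}
i = integer_neighbors(5)                # {4, 6}
```

As the text observes, the two sets neither coincide nor is one a subset of the other; and with 4 bits, binary_neighbors grows to four points while the integer neighborhood keeps its size.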
To ensure that we have found a local optimum in a continuous space, we would need to know the derivatives of the objective function.

These features can be observed in another type of local search algorithm, called steepest descent (see Alg. 1.2). This algorithm examines all points in the vicinity of the current point and jumps to the best of them (i.e. it accepts the best improving neighbor). In a discrete space, this can be done by enumeration. In continuous spaces, the neighborhood is given by a line going through the current point in the direction of the gradient. The points on this line are examined, and the best one is selected as the new current point.[5]

[4] Of course, even in continuous spaces we can define the neighborhood to be finite, but in that case (without any regularity assumptions) there is a high probability of missing better solutions.

Algorithm 1.2: Steepest Descent
  begin
    x ← Initialize()
    while not TerminationCondition() do
      Nx ← { y | y = Perturb(x) }
      x ← BestOf(Nx)
  end

It is interesting to note that the steepest descent algorithm is deterministic, while hill-climbing is not. If started from the same initial point over and over, steepest descent always gives the same solution (in both discrete and continuous spaces). In contrast, hill-climbing can give different results in each run, as it depends on the order of evaluation of the neighbors, which is usually random. Steepest descent also spends a lot of time on unnecessary evaluation of the whole neighborhood in the case of a discrete search space. We can expect that it will find a local optimum in fewer iterations, but very often with many more objective function evaluations than hill-climbing; this depends on the size of the neighborhood and on the complexity of the search problem.
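To make the contrast concrete, here is a minimal Python sketch of Alg. 1.1 and Alg. 1.2 on the integer search space of Fig. 1.1 (the function names and the toy objective are illustrative, not from the thesis):

```python
import random

def int_neighbors(x, lo=0, hi=7):
    # Euclidean-distance-1 neighborhood on the integers 0..7 (Fig. 1.1).
    return [y for y in (x - 1, x + 1) if lo <= y <= hi]

def hill_climbing(f, x0, neighbors, iterations=1000):
    """Alg. 1.1: accept the first improving neighbor (stochastic)."""
    x = x0
    for _ in range(iterations):
        y = random.choice(neighbors(x))   # Perturb(x)
        if f(y) < f(x):                   # BetterThan(y, x), minimization
            x = y
    return x

def steepest_descent(f, x0, neighbors):
    """Alg. 1.2: enumerate the whole neighborhood and jump to its best
    point; stop when no neighbor improves, i.e. at a local optimum."""
    x = x0
    while True:
        best = min(neighbors(x), key=f)   # BestOf(Nx)
        if f(best) >= f(x):               # no improving neighbor: finish
            return x
        x = best

def f(x):
    return (x - 3) ** 2                   # toy objective with optimum at 3

x_hc = hill_climbing(f, 0, int_neighbors)
x_sd = steepest_descent(f, 0, int_neighbors)
```

Restarted from the same point, steepest_descent always retraces the same path, whereas hill_climbing visits the neighbors in random order; here both end in the same optimum only because the toy objective is unimodal.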
1.2.2 Local Search in the Neighborhood of Several Points

The algorithms described above work with one current point and perform the search in its neighborhood. One possible generalization is to use a set of current points (instead of a single one) and search in their common neighborhood. Again, the neighborhood N(X) of a group of points X, X = {x1, x2, ..., xN}, can be defined using a 'combination' operator as

    N(X) = { y | y = Combine(X) },    (1.5)

i.e. as the set of all points y which can result from some combination of the points in X. Individual algorithms then differ in the definition of the Combine operation.

This approach is not very often pursued in conventional optimization techniques. However, one excellent and famous example of this technique can be mentioned here: the Nelder-Mead simplex search introduced in Nelder & Mead (1965). This algorithm is used to solve optimization problems in R^D. In D-dimensional space, the algorithm maintains a set of D + 1 points that define a simplex in that space (an interval in 1D, a triangle in 2D, a tetrahedron in 3D, etc.). Using this simplex, it creates a finite set of candidate points which constitute the neighborhood of the simplex (see Eqs. 1.7 to 1.11 and Fig. 1.2). This neighborhood is neither searched in random order, as in the case of hill-climbing, nor is the best point selected, as in the case of steepest descent. Instead, not all neighbors are necessarily evaluated; there is a predefined sequence for assessing the quality of the points, along with rules for how to incorporate them into the current set of points defining the simplex.

[5] The search for the best point on the line is usually done analytically. The gradient at the best point is perpendicular to the line.

Figure 1.2: Points forming the neighborhood in Nelder-Mead simplex search.
The defining simplex is plotted with a dashed line; the points in the neighborhood are marked with black dots.

Suppose now that the simplex is given by points x1, x2, ..., xD+1 ∈ R^D, ordered so that f(x1) ≤ f(x2) ≤ ... ≤ f(xD+1), where f is the cost function. During one iteration, the neighborhood of the simplex consists of points computed as follows:

    x̄   = (1/D) · (x1 + x2 + ... + xD),             (1.6)
    yr  = x̄ + ρ(x̄ − xD+1),                          (1.7)
    ye  = x̄ + χ(yr − x̄),                            (1.8)
    yoc = x̄ + γ(yr − x̄),                            (1.9)
    yic = x̄ − γ(x̄ − xD+1),                          (1.10)
    ysi = x1 + σ(xi − x1),   i ∈ {2, ..., D + 1},   (1.11)

where the coefficients should satisfy ρ > 0, χ > 1, χ > ρ, 0 < γ < 1, and 0 < σ < 1. The standard settings are ρ = 1, χ = 2, γ = 0.5, σ = 0.5.

One iteration of the algorithm is presented as Alg. 1.3. Although this algorithm maintains a set of current points and the neighborhood is defined dynamically during the run, it still has the characteristics of an incomplete method. Nelder-Mead search is applicable in black-box optimization (which does not hold for steepest descent). Hill-climbing is not able to adapt the size and shape of its neighborhood, which is the most important feature of the Nelder-Mead method. However, the author is not aware of any systematic empirical comparison of the above-mentioned methods.

1.3 Genetic and Evolutionary Algorithms

Genetic and evolutionary algorithms (GEAs) have been known for a few decades and have proved to be a powerful optimization and search tool in many research and application areas. They can be considered a stochastic generalization of the techniques described above which search in the neighborhood of several points. They maintain a set of potential solutions during the search and are able to search in many neighborhoods of various sizes and structures at the same time. They are inspired by processes that can be observed in nature, mainly by natural selection and variation.
Algorithm 1.3: An iteration of the Nelder-Mead downhill simplex algorithm
  Input: x1, x2, ..., xD+1 so that f(x1) ≤ f(x2) ≤ ... ≤ f(xD+1)
  begin
    /* Reflection */
    compute yr using Eq. 1.7
    if f(x1) ≤ f(yr) < f(xD) then Accept(yr); exit
    if f(yr) < f(x1) then
      /* Expansion */
      compute ye using Eq. 1.8
      if f(ye) < f(yr) then Accept(ye); exit
      else Accept(yr); exit
    if f(yr) ≥ f(xD) then
      /* Contraction */
      if f(yr) < f(xD+1) then
        /* Outside contraction */
        compute yoc using Eq. 1.9
        if f(yoc) ≤ f(yr) then Accept(yoc); exit
      else
        /* Inside contraction */
        compute yic using Eq. 1.10
        if f(yic) < f(xD+1) then Accept(yic); exit
    /* Shrinking */
    compute points ysi using Eq. 1.11
    MakeSimplex(x1, ys2, ..., ysD+1)
  end

The terminology used to describe GEAs is also borrowed from biology and genetics. A potential solution, or a point in the search space, is called a chromosome or an individual. Each component of the chromosome (each variable of the solution) is called a locus, and a particular value at a certain locus is called an allele. The set of potential solutions maintained by the algorithm is called a population. The cost function (which is usually minimized, as the name suggests) is translated into the fitness function, which describes the ability of an individual to survive in the current environment (and should thus be maximized). The search performed by GEAs is called an evolution. The evolution consists of many iterations called generations, and in each generation the algorithm performs operations like selection, crossover, mutation, and replacement, which are described below.

GEAs gained their popularity mainly for two reasons:

1. they are conceptually easy, i.e. it is easy to describe the processes taking place inside them, and thus they are easy to implement, and

2. they have shown better global search abilities[6] than conventional methods, i.e.
for many problems, GEAs are able to find even those local optima which are hard to find for local optimizers, and furthermore, the operators of GEAs are usually chosen in such a way that they ensure the asymptotic completeness of the algorithm.

^6 We cannot decide if a certain algorithm performs local or global search until we know the representation of solutions and the definition of the neighborhood induced by the variation operators. E.g. the hill-climbing algorithm is usually regarded as a local search method because its perturbation operator is usually local. However, it can perform global search as well if the neighborhood contained all possible solutions, i.e. if the probability of generating a solution is greater than 0 for all possible ones.

In subsequent sections, the above mentioned topics are discussed. First, the well-established types of GEAs are described, and finally the general structure of an evolutionary algorithm is presented and explained.

1.3.1 Conventional EAs

The EAs are sometimes classified by the evolutionary computation (EC) community into four groups:

• Evolutionary programming (EP), which works with finite automata and numeric representations and has been developed mainly in the United States since the 1960s,
• Evolution strategies (ES), which work with vectors of real (rational) numbers and were invented in Germany in the late 1960s,
• Genetic algorithms (GAs), which work with binary representations and have been known since the 1970s, and
• Genetic programming (GP), which works with programs represented in tree form and is the most recent member of the EA family.

There are other types of EAs which are not included in the above list, but these are quite recent and differ only in the operators used in them. In fact, there is no need to draw any hard boundaries between the above mentioned four types of EAs.
They influence each other and cooperate by exchanging ideas and various techniques. What all EAs have in common is the evolutionary cycle, in which some form of selection and variation must be present. A typical evolutionary scheme is presented as Alg. 1.4. Not all of the presented operations have to be included, nor must they follow this order.

Algorithm 1.4: Genetic Algorithm
 1 begin
 2    X^(0) ← Initialize()
 3    f^(0) ← Evaluate(X^(0))
 4    g ← 0
 5    while not TerminationCondition() do
 6       X_Par ← Select(X^(g), f^(g))
 7       X_Offs ← Crossover(X_Par)
 8       X_Offs ← Mutate(X_Offs)
 9       f_Offs ← Evaluate(X_Offs)
10       [X^(g+1), f^(g+1)] ← Replace(X^(g), X_Offs, f^(g), f_Offs)
11       g ← g + 1
12 end

The initialization phase creates the first generation of individuals. Usually, they are created randomly, but if the designer has some knowledge of the placement of good solutions in the search space, he can use it and initialize the population in a biased way. The population is then evaluated using the fitness function, and the main generational loop starts.

The selection phase designates the parents, i.e. it chooses some of the individuals in the current population to serve as a source of genetic material for the creation of new offspring. Selection should model the phenomenon that better individuals have higher odds of reproducing; it is thus driven by the fitness values. There are many selection schemes, based on the following three basic types:

• proportional, which takes the raw fitness values and from them calculates the probability of including each individual in the parental set;
• rank-based, which does not use the raw fitness values to compute the probabilities; instead, it sorts the individuals and (to simplify somewhat) the rank becomes the fitness value; and
• tournament, which takes small random groups of individuals from the current population and designates the best in each group as a parent.
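As an illustration of Alg. 1.4, the sketch below instantiates the template with tournament selection, 1-point crossover, bit-flip mutation, and full generational replacement on binary strings. All names and parameter values here are illustrative choices of one possible instantiation; the scheme itself prescribes none of them.

```python
import random

def evolve(fitness, n_bits=20, pop_size=30, generations=60,
           p_mut=0.05, tour_size=3, seed=1):
    """A toy instantiation of the generational loop in Alg. 1.4
    (fitness is maximized)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    fit = [fitness(x) for x in pop]
    for g in range(generations):
        offs = []
        while len(offs) < pop_size:
            # tournament selection: the best of `tour_size` random individuals
            p1 = max(rng.sample(range(pop_size), tour_size), key=lambda i: fit[i])
            p2 = max(rng.sample(range(pop_size), tour_size), key=lambda i: fit[i])
            cut = rng.randrange(1, n_bits)                       # 1-point crossover
            child = pop[p1][:cut] + pop[p2][cut:]
            child = [b ^ (rng.random() < p_mut) for b in child]  # bit-flip mutation
            offs.append(child)
        pop, fit = offs, [fitness(x) for x in offs]              # full replacement
    return max(pop, key=fitness)
```

On the OneMax function (fitness = number of ones), `evolve(sum)` typically returns a string close to all ones.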
After the set of parents is determined, the variational operators, crossover and mutation, are applied to create new, hopefully better, offspring individuals. Crossover performs the search in the neighborhood of several individuals: it usually takes two or more parents, combines them, and creates one or more offspring. The particular process used to combine the parents into new individuals is strongly dependent on the chosen representation and will not be described here (for more information see e.g. Goldberg (1989) or Michalewicz & Fogel (1999)). Mutation is equivalent to a local search strategy in the neighborhood of one point: it usually takes one individual, perturbs it a bit, and the resulting 'mutant' joins the other offspring.

After the creation of the offspring population is completed, all of its members are evaluated. Now we have the old population and the population of offspring solutions. Among all individuals of these two populations, the competition for natural resources takes place: not all parents survive to the next generation. Sometimes the principle of the survival of the fittest holds, at least in a probabilistic form, i.e. the fitter the individual, the higher the probability it will survive; sometimes the old population is replaced by the new one as a whole. The implementation of this procedure is called replacement. Although it is possible to use a dynamic population size, most often it is required to be constant, and the replacement then discards many generated solutions.

1.3.2 EA: A Simple Problem Solver?

The EA metaheuristic is very simple. For a great majority of problems, if one implements the fitness function and decides on the representation, he can simply use some of the 'standard' operators suitable for the chosen representation. The EA will then run and will produce some results. Using such an EA, one can easily evolve e.g.
matrices, graphs, nets, sets of rules, images, designs of industrial facilities, etc. Making EAs run without any special domain knowledge is thus a very easy job for a wide range of applications. This flexibility and wide applicability of EAs is their greatest strength and their greatest weakness at the same time. For all these reasons, it is very tempting to describe EAs as simple problem solvers. In fact, things are not that easy. If we apply this naive approach, our EA is very likely to suffer from several potential negative features described below.

Instantiation of EA

An EA constitutes a general framework, a template. To make it work, one must choose a representation of solutions and define some 'call-back' functions (initialization, selection and replacement methods, crossover and mutation operators, parameters of the methods, etc.) which are then plugged into the template. After that, an instance of the EA is created and can be used. The task of defining these functions can be called the instantiation of the EA. For some types of EAs, the amount of 'things to be set up' is really large, and this can be a source of frustration for many EA designers. Finding a really good instantiation is often a very hard problem in itself.

The problem of a good instantiation is closely related to the so-called No-Free-Lunch Theorem (Wolpert & Macready (1997)), which states: All search procedures based on sampling of the search space have, on average, the same efficiency across all possible problems. Something similar holds even for the representation and for the variational operators. When designing an EA with the aim of better efficiency, we have to apply domain-specific knowledge, which will help the algorithm to 'search in the right direction', but will deteriorate the algorithm's behavior in other domains. Similarly, if our aim is broader applicability, we have to develop an algorithm that will e.g.
learn the problem characteristics from its own evolution, at the expense of lower efficiency.^7 In that case we need very flexible variational operators that are able to adapt to changing living environments.

Linkage, Epistasis, Statistical Dependency

The basic versions of EAs usually treat the individual components of the solutions as statistically independent of each other. If this assumption does not hold, we speak of linkage, epistasis, or statistical dependency of the individual components. Linkage presents a severe issue for EAs. It is closely related to the issue of instantiation. Sometimes a problem at hand could be successfully solved using a basic version of an EA, but by choosing a bad representation we can accidentally introduce dependencies into the algorithm and greatly deteriorate the efficiency of the EA. Most often, however, it is not easy (or it is impossible) to come up with a representation which reduces or removes the dependencies, because we know nothing or very little about the fitness function.

Premature Convergence

During the evolution it can easily happen that one sub-optimal individual takes over the whole population, and the search almost stops. This phenomenon is called premature convergence. It is not an issue in itself; rather, it is a symptom of a bad instantiation of the EA. Some types of EAs are more resistant to this phenomenon, namely the algorithms that rely more on the mutation type of variational operator than on the crossover type. One of the EA parameters has a significant effect on premature convergence: the population size. If it is small, there is a much bigger chance that premature convergence emerges. A too large population, on the other hand, wastes computational resources.

^7 I hypothesize that the No-Free-Lunch theorem holds even for algorithms that learn problem characteristics during their run, i.e.
although they have the ability to adapt to one class of problems, they are misled by the 'complementary' class of problems. All the work on creating algorithms able to learn is carried out in the hope that the class of problems solvable by these algorithms corresponds to a great extent with the class of problems people try to solve in their real lives.

1.3.3 EA: A Big Puzzle!

To make an EA run is an easy task; to make it run successfully is a very hard one. If we are interested in creating an EA instance that is tuned (or is able to tune itself) to the problem being solved, we have to solve a big puzzle and carefully assemble the individual pieces so that they cooperate and do not harm each other, in order to overcome at least some of the above mentioned problems. This effort can be aimed at creating algorithms which have a lower number of parameters that must be set by the designer and are more robust, so that we can apply them to a broader range of problems without substantial changes. The effort can be aimed at automatic control of some EA parameters, so that the algorithm itself actively tries to fight premature convergence. The effort can be aimed at detecting the relations between variables and at incorporating this extra knowledge into the evolutionary scheme. The possibilities for optimizing the EA are countless; however, they can be classified into several groups:

Theory It would be great if we had a kind of EA model which would tell us how a particular instantiation of the EA would behave. Such general and practically applicable models, however, do not exist at present. The interactions between individual parts of an EA can be easily described, but hardly analyzed. In spite of that, there are some attempts to build theoretical foundations for EAs. Some attempts to build a theoretical apparatus for the optimal population sizing of GAs can be found in Goldberg (1989), Goldberg et al.
(1992), or Harik, Cantù-Paz, Goldberg & Miller (1997), and for optimal operator probabilities in Schaffer & Morishima (1987), or Thierens & Goldberg (1991). Eiben et al. (1999) present a judgment of these theoretical models: '. . . the complexities of evolutionary processes and characteristics of interesting problems allow theoretical analysis only after significant simplification in either the algorithm or the problem model. . . . the current EA theory is not seen as a useful basis for practitioners.'

Deterministic Heuristics The characteristics of the population change during the evolution, and thus a static setting of EA parameters is not optimal. Deterministic control changes some parameters using a fixed time schedule. The works listed in the next paragraph show that for many real-world problems, even a non-optimal choice of schedule often leads to better results than a near-optimal choice of a static value. A deterministic schedule for changing the mutation probability can be found in Fogarty (1989). Janikow & Michalewicz (1991) present a mutation operator which has the property of searching the space uniformly at first, and very locally at later stages. In Joines & Houck (1994), the penalties of solutions to a constrained optimization problem are transformed by a deterministic schedule, resulting in a dynamic evaluation function.

Feedback Heuristics With feedback heuristics, the changes of parameters are triggered by some event in the population. The typical feature of EAs using feedback heuristics is the monitoring of some population statistics (relative improvement of the population, diversity of the population, etc.). Julstrom (1995) presents an algorithm that uses one of the mutation or crossover operators more often than the other, based on their previous success. Davis (1991) used a similar principle for multiple crossover operators.
Lis & Lis (1996) present a parallel model where each subpopulation uses different settings of parameters. After a certain time period, the subpopulations are compared and the settings are shifted towards the values of the most successful subpopulation. Shaefer (1987) adapts the mapping of variables (resolution, range, position) based on convergence and variance of the genes. A co-evolutionary approach was used to adapt the evaluation function e.g. by Paredis (1995).

Self-Adaptation The roots of self-adaptation can be found in ES, where the mutation steps are part of the individuals. The parameters of the EA undergo the evolution together with the actual solution of the problem. Bäck (1992) self-adapts mutation rates in a GA. Fogel et al. (1995) use self-adaptation to control the relative probabilities of five mutation operators for the components of a finite state machine. Spears (1995) added one extra bit to the individuals to discriminate whether 2-point or uniform crossover should be used. Schaffer & Morishima (1987) present a self-adaptation of the number and locations of crossover points. Diploid chromosomes (each individual has one additional bit which determines whether one should use the chromosome itself or its inversion) are also a form of self-adaptation (see e.g. Goldberg & Smith (1987)). In the context of ES, Rudolph (2001) has proved that the self-adaptation of mutation steps can lead to premature convergence.

Meta Optimizers To adapt the settings of an EA, we can use another optimizer. This procedure usually involves letting the EA run for some time with certain settings and 'measuring' its performance. Similarly to self-adaptation, the optimization process usually takes a longer time, since we search a much larger space (object parameter space + EA parameter space). An example of this approach can be found e.g. in Grefenstette (1986).
New Types of EAs Although many researchers have reported promising results using adaptation in EAs, it is not clear whether adaptation can solve our problems completely. Eiben et al. (1999) state that one of the main obstacles to optimizing EA settings is formed by the interactions between these parameters. If we use some basic form of EA (which treats the variables independently of each other) to optimize the EA parameters (which usually interact with each other), we can hardly find the optimal settings. However, new types of EAs (or new types of variational operators and other EA components) can be of great help: algorithms that have fewer parameters to set (or optimize), and that address perhaps the most severe issue in EA design, the interactions among variables. Before describing methods that would allow us to construct such algorithms or operators accounting for dependencies among variables, we have to clarify the influence of the chosen representation on the operator-induced neighborhood structure, and the roles of crossover and mutation in EAs.

1.3.4 Effect of the Genotype-Phenotype Mapping in EAs

As stated earlier, optimization algorithms can work with two (or more) representations of possible solutions. The representation that can be directly interpreted as a solution and evaluated is called the phenotype; the other representation, on which the variational operators are applied, is called the genotype. This subsection briefly demonstrates the phenotype-level effect of this genotype-phenotype mapping when it is used with variational operators on the genotype level. The phenotype used in the examples below is a pair of integer numbers. Both are encoded as 5-bit binary numbers, i.e. the genotype is a bit string of length 10, and the mapping itself is of type B^10 → I^2. The first example in Fig. 1.3 shows all possible neighbors of a given individual induced by a single bit-flip mutation operator applied to the genotype.
The neighbors are located on lines parallel to the coordinate axes with the origin in the given point. The number of neighbors is independent of the given point. The second example in Fig. 1.4 shows the neighborhood structure of two parents induced by a 1-point crossover. With this kind of crossover, only one of the phenotype coordinates can differ from both parents. The number of neighbors varies depending on the pair of parents. The last example in Fig. 1.5 shows the neighborhood structure of two parents induced by a 2-point crossover. In this case the neighbors can differ from both parents in both phenotype coordinates, i.e. the neighborhood structure can be rather complex, and its size is variable as in the previous case.

These examples demonstrate that simple operations on the genotype level can induce rather complex patterns on the phenotype level. If some genotype-phenotype mapping is employed in the algorithm, then e.g. a local search procedure performed on the genotype level can hardly be described as a local search on the phenotype level; the genotype-level local search may seem very unstructured on the phenotype level. Using a genotype that induces a high number of neighbors with mutation and crossover operators (e.g. binary representation) enhances the global search abilities of the algorithm. However, the usefulness of the genotype-phenotype mapping (as used in the above examples) is questionable and more or less accidental, because the mapping is 'hard-wired' into the algorithm, i.e. for some instances of optimization tasks it can be advantageous, but for other ones of the same type and complexity it may fail completely. As designers of EAs, we should take control of this aspect and use adaptive genotype-phenotype mappings rather than the obvious hard-wired solutions.
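The mutation-induced neighborhood of Fig. 1.3 is easy to enumerate explicitly: flipping the k-th bit of a 5-bit encoded coordinate x replaces it with x XOR 2^k. A small sketch (the function name is illustrative, not from the thesis):

```python
def neighbors_bitflip(x, y, bits=5):
    """Phenotype neighbors of the integer pair (x, y) under a single bit-flip
    applied to the concatenated 5-bit genotypes (the B^10 -> I^2 mapping)."""
    return ([(x ^ (1 << k), y) for k in range(bits)] +   # flips inside x's bits
            [(x, y ^ (1 << k)) for k in range(bits)])    # flips inside y's bits
```

Each point always has exactly 10 neighbors, all lying on the two axis-parallel lines through (x, y), which matches the structure plotted in Fig. 1.3 and the observation that the neighborhood size does not depend on the point.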
1.3.5 The Roles of Mutation and Crossover

One of the EA's advantages compared to other techniques is its ability to escape from local optima and to use the population as a facility to learn the structure of the problem.^8 Is it the crossover, or the mutation, that ensures these features?^9

Mutation generates offspring in the local neighborhood of one point, i.e. it has only one point as its input. As explained in the previous subsection, if some genotype-phenotype mapping is present, mutation can generate new individuals that can hardly be described as neighbors in the sense of a phenotype distance. Without such a mapping, however, there is no chance for mutation to generate a non-neighbor; thus, it cannot escape from a local optimum. Based on one individual (with which the mutation works), we cannot learn anything important about the global structure of the problem.

^8 To learn the structure of the problem: create a variational operator which is able to modify itself based on the previous population members given as its input, with the aim of reaching a higher probability of creating good individuals compared to a non-adaptive operator.
^9 The ability to escape from local optima is also ensured by the stochastic selection process, i.e. by the ability to temporarily worsen the fitness of the population.

Figure 1.3: The neighborhoods induced by the binary mutation for points (18,12) and (16,16). [plots omitted]
Figure 1.4: The neighborhoods induced by the binary 1-point crossover for pairs (15,15)-(16,16) and (15,15)-(31,31). [plots omitted]
Figure 1.5: The neighborhoods induced by the binary 2-point crossover for pairs (15,15)-(16,16) and (18,17)-(12,10). [plots omitted]
With crossover, on the other hand, we can learn the problem structure. In the context of GAs, Harik (1999) described the role of crossover as follows:

'. . . The GA's population consists of chromosomes that have been favored by evolution and are thus in some sense good. The distribution that this population represents tells the algorithm where to find other good solutions. In that sense, the role of crossover is to generate new chromosomes that are very much like the ones found in the current population. . . . '

Harik emphasizes that the distribution of the selected individuals is the key element that should help us in creating new individuals. To take advantage of the distribution (to estimate it and use it in the creation process), we need an operator that takes more than one parent as input; we have to use crossover with more parents (even more than 2, perhaps the whole population). This enables us to account for the dependencies among variables. The exact meaning of '. . . very much like . . . ' depends on the particular operator, i.e. on the neighborhood structure that it induces. For such a crossover, some more complex models of the distribution of the selected individuals must be used. Each of them, however, has its own assumptions of varying strength. The two extremes are:

• A flexible model with weak structural assumptions is able to fit the data very well. The neighborhood of several points is then usually very similar to a union of the individual neighborhoods.

• An inflexible model with strong structural assumptions usually fits the data with less fidelity, but if the problem being solved satisfies the assumptions, this model can create offspring individuals even in promising areas far away from the individual neighborhoods of the parents.
Models from the whole range between the above mentioned extremes are able to learn the structure of the problem.^10 Models with strong assumptions, however, have a higher ability to escape from local optima if their assumptions fit the solved problem well. They can usually be considered parametric models, i.e. they enforce one particular type of model and we learn only the model parameters.

1.3.6 Two Ways of Linkage Learning

As stated above, linkage presents a serious issue for EAs, but there are some possibilities to overcome this limitation. Among other methods, I would like to point out the following two:

Probabilistic models In the framework of estimation of distribution algorithms (see Chap. 2.1), we can account for the statistical dependencies by using a probabilistic model that is flexible enough to cover these interactions. This approach originated in the field of GAs and was then transferred to real-valued spaces.

Coordinate-system transforms The individuals can be transformed from the search space into another one in which the linkage is greatly reduced. There we can apply ordinary variational operators. This method is much more usual in real-valued spaces; for ordinal (or even categorical) variables we could hardly find any profitable transform.

Probabilistic model + coordinate transform In real-valued spaces, many of the complex probabilistic models can be viewed as a combination of learning (1) a possibly non-linear coordinate transform and (2) a much simpler probabilistic model of the distribution of the transformed data points. Sampling from such a complex model then amounts to (1) sampling from the simple probabilistic model in the transformed space and (2) converting the new data points into the original space via the inverse transform.

^10 The meaning of the term problem structure is clarified in footnote 8 on page 13.
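The transform-plus-simple-model view just described can be sketched as follows: estimate a linear (PCA-like) decorrelating transform from the selected parents, fit an independent univariate Gaussian per rotated coordinate, sample, and map back. This is a minimal illustration of the principle under those assumptions, not one of the algorithms proposed later in the thesis.

```python
import numpy as np

def transform_sample(parents, n_new, rng=None):
    """Sample offspring by (1) rotating the parents into decorrelated
    coordinates, (2) fitting a univariate Gaussian per coordinate there,
    and (3) mapping new samples back via the inverse transform."""
    rng = rng or np.random.default_rng(0)
    mu = parents.mean(axis=0)
    cov = np.cov(parents, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # rotation that decorrelates
    z = (parents - mu) @ eigvecs               # transformed, ~independent coords
    std = z.std(axis=0)                        # simple univariate model per axis
    z_new = rng.standard_normal((n_new, parents.shape[1])) * std
    return z_new @ eigvecs.T + mu              # inverse transform to search space
```

Because the rotation is orthogonal, the inverse transform is just the transpose; the offspring inherit the linear correlation structure of the parents even though the model in the rotated space treats the coordinates independently.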
It is often difficult to distinguish which method an algorithm uses; they often cooperate. Both of the above mentioned methods of linkage learning can be used as parts of crossover and mutation operators, but with different purposes. In the case of mutation, they are used to estimate the structure of the local neighborhood of the current data point; in the case of crossover, some structural characteristics of the whole population are estimated.

1.4 The Topic of this Thesis

This section gives a bird's-eye view of the goals of this thesis. It describes the location of my efforts on the map of EA design.

1.4.1 Classification of EAs Targeting Efforts

EAs can be classified using many criteria. Here I present three that are relevant to the topic of this thesis:

• Representation. EAs can work with many different representations (binary strings, real vectors, trees, etc.).

• Linkage learning approach. The learning of the problem structure is usually based 1. on the analysis of the population distribution in the search space, or 2. on the analysis of the fitness landscape. The basic difference between these two approaches lies in the fact that the fitness landscape is usually fixed during the evolution, while the distribution of the population changes every generation.

• Presence of selection. Traditional EAs use selection in its original form, i.e. only a subset of the population is used to create new individuals; neither crossover nor mutation takes the fitness values into account. Recently, several algorithms able to work without selection (even though they may use it) have emerged. They use forms of variational operators that work even with bad individuals (which would otherwise be discarded by selection) and create offspring by weighting the population members.
Target area of this thesis

With regard to these criteria, the algorithms presented in this thesis can be described as follows:

• they use vectors of real numbers as the representation,
• they use distribution-based linkage learning, and
• they use the selection operator in its original sense.

If any presented algorithm does not match these specifications, it will be explicitly mentioned.

1.4.2 Requirements for a Successful EA

In the area of real-coded EAs, there are several 'empirical' features which are essential for a successful EA:

1. Envisioning the structure. The algorithm should be able to create individuals in previously unseen areas of the search space. This can be accomplished by a crossover operator with a less flexible model, if the population distribution suggests such a possibility. This is clearly the hardest part in real-valued spaces, as we do not know the type of the structure; we do not even know what complexity of the problem structure we should expect.

2. Focusing the search. The algorithm must be able to intensify the search in areas where several population members already reside. Such an operation is usually carried out by some interpolation or averaging. It requires several individuals to estimate this kind of structure and is thus accomplished by crossover with highly flexible models.

3. Population shift. The two previous requirements apply to discrete spaces as well. In real-valued spaces, the algorithm must additionally be able to efficiently shift the population members to better values. This usually amounts to searching the local neighborhoods of individual population members and is usually accomplished by mutation-like heuristics. In discrete spaces, 'shifting' the population does not make sense, since no individual can make several steps in the same direction, which is often a must for a real-valued EA.

4. Making it cooperate.
All the above mentioned features are essential, and they must work together to successfully solve the problem.

1.4.3 The Goals

The design of a simple algorithm which would satisfy all the above requirements is a very hard task and an overly ambitious goal. However, if we had efficient methods for the first three requirements, we could make them work together in several ways; the information interchange among them would be the crucial part. Making several heuristics cooperate is a very broad topic; it deserves its own dissertation (or even several of them). This thesis thus aims at the first three requirements, i.e. it aims at using probabilistic models and coordinate transforms in the following situations:

1. The problem has no (or weak) interactions among variables. Many empirical problems have been tackled by methods (even by local search methods) that assumed independence of the individual object variables. Population-based optimization algorithms should further improve the efficiency of such algorithms in terms of global search abilities, and should take advantage of the fact that by mixing good partial solutions a new, better solution is often obtained (even in previously unseen areas of the search space).

2. The problem has linear interactions among variables. Algorithms with the assumption of independence are too constrained. The use of coordinate transforms to decorrelate the problem variables should give more degrees of freedom to the model, which should in turn be useful for a broader range of problems.

3. The problem has higher-order interactions among variables. The solutions in promising areas form various clusters or curves in the search space. The models must be very flexible and data driven to be able to describe such distributions; this in turn limits their ability to shift the population.

4. The population members are placed in an area that does not contain any local optimum.
In this situation, many EAs have problems with premature convergence (their ability to search outside the area occupied by the population is limited) or with an insufficient speed of convergence. This situation does not usually arise when using conventional direct search methods (e.g. Nelder-Mead). A new type of EA should be able to traverse the search space efficiently.

1.4.4 The Roadmap

Chapter 1 introduced the notions of optimization, local search algorithms, and evolutionary algorithms (viewed as methods for local search in the neighborhood of several points), and thus showed the location of the topic of this thesis on the map of optimization methods. It emphasized the fact that the neighborhood structure is the crucial part of an optimization algorithm: if the neighborhood structure is not suitable for the particular optimization problem, the problem can hardly be solved efficiently, quickly, and reliably.

In Chapter 2, the estimation of distribution algorithms are introduced, along with the way of incorporating coordinate transforms into them. The chapter surveys the various kinds of probabilistic models used in EAs for continuous and discrete optimization problems, because the discrete models are often generalized for use in the continuous domain. A few explicit applications of coordinate transforms found in the literature are surveyed as well.

The univariate probabilistic models and the EAs that use them are elaborated in the first part of Chapter 3. Three types of histograms and a mixture of Gaussians are compared on a few test problems. In the second part of the chapter, the probabilistic models are coupled with linear coordinate transforms to decorrelate the object variables (or to make them as independent as possible). Principal component analysis and independent component analysis are compared and evaluated. Chapter 4 presents two flexible models able to describe higher-order interactions among variables.
The first one is a distribution tree model which builds on the classification and regression tree learning method, while the second model is built on kernel principal component analysis.

Chapter 5 presents a theoretical explanation of the premature convergence phenomenon for one particular instantiation of an EA. Then, it describes the evolutionary algorithm with covariance matrix adaptation—one of the most successful EAs that use a linear coordinate transform and successfully prevent premature convergence. A comparison of this method to the Nelder-Mead search strategy is presented. Finally, an improvement of the method is suggested and evaluated.

Chapter 6 concludes the thesis, lists its main contributions and provides guidelines for future work.

Chapters 3, 4, and 5 contain results of experimental comparisons of the presented methods with competitors which I consider to be relevant. The test problems are carefully chosen to reveal the main advantages and disadvantages of the individual optimizers.

Chapter 2: State of the Art

This chapter contains a survey of techniques for probabilistic modeling and coordinate transforms with respect to EAs. First, the general characteristics of estimation of distribution algorithms are described, along with some methods for probabilistic modeling in discrete spaces. Even though discrete probability distributions are not the main topic of this thesis, many of the continuous distributions are based on their discrete counterparts. Finally, a short section gives attention to the use of coordinate transforms in real domains.

2.1 Estimation of Distribution Algorithms

In Chapter 1, the basic functionality of evolutionary algorithms was briefly described.
The principle they use is very simple: an EA usually starts with a randomly initialized population of individuals, from which the more promising ones are selected to breed new individuals by means of (general or specialized) variation operators. This process is repeated until some termination condition is satisfied. Section 1.3.5 pointed out the necessity of using multi-parent variation operators (operators with two or more individuals as inputs), which offer us the possibility to learn the linkage information. At this point we can use many methods developed in statistics, machine learning, or data mining. We can, e.g., model the fitness landscape with a (possibly non-linear) regression model of a known type for which the place of the optimum is known. This estimate of the optimum can be taken as the offspring coming out of such a generalized crossover operator.

There is, however, another very general approach to the variation phase. Instead of the classic select-crossover-mutate strategy, we can use a select-model-sample approach (see Alg. 2.1). These algorithms are called Estimation of Distribution Algorithms (EDAs), Probabilistic Model Building Genetic Algorithms (PMBGAs), or Iterated Density Estimation Algorithms (IDEAs), and they form a substantial part of this thesis.

2.1.1 Basic Principles

During evolution, the EDAs build an explicit probabilistic model. After each selection phase, only the good-enough individuals serve as the basis for building the model, which (more or less) describes the distribution of good individuals in the search space. When the model is ready, the EDA produces new

Algorithm 2.1: Estimation of Distribution Algorithm
("Do not search for the optimum, search for the right distribution!")
begin
    X(0) ← Initialize()
    f(0) ← Evaluate(X(0))
    g ← 0
    while not TerminationCondition() do
        XPar ← Select(X(g), f(g))
        M ← CreateModel(XPar)
        XOffs ← Sample(M)
        fOffs ← Evaluate(XOffs)
        [X(g+1), f(g+1)] ← Replace(X(g), XOffs, f(g), fOffs)
        g ← g + 1
end

offspring individuals by sampling from this model, creating more individuals in the promising regions than elsewhere. The overall characteristics and features of an EDA as a whole depend mainly on the type of the probabilistic model employed. We can use a very simple model which can be built very quickly; the resulting EDA is very fast and accurate when solving easy problems, but it may not solve problems which involve interactions between variables (just like an ordinary EA). The main advantage of EDAs is the fact that we can use a model of arbitrary complexity. If we employ a sophisticated probabilistic model that can capture (possibly non-linear) dependencies in the population, the resulting EDA will be able to solve very hard problems. Although such an EDA will need large populations and a large number of evaluations, conventional EAs might not be able to solve such problems at all.

2.1.2 EDAs in Discrete Spaces

Probably the first algorithm for the binary representation which uses the principles of EDAs is the population-based incremental learning (PBIL) of Baluja (1994). PBIL works in a binary space and adapts a vector of probabilities p_i that the i-th component of the solution is zero. First, it generates a few individuals in accordance with the probability vector, then it selects the best one, and modifies the vector of probabilities with a kind of Hebbian learning rule, i.e. the model is built by exponentially forgetting the values of the old best vectors and moving towards the current best vector. A similar principle is used by the compact genetic algorithm (cGA) (Harik, Lobo & Goldberg 1997).
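The select-model-sample loop of Alg. 2.1 can be made concrete with a minimal runnable sketch: a continuous EDA with a per-dimension Gaussian model minimizing the Sphere function. All names, parameter values, and the generational replacement scheme are illustrative choices of this sketch, not taken from the thesis.

```python
import random
import statistics

def sphere(x):
    # Sphere function: global minimum 0 at the origin.
    return sum(xi * xi for xi in x)

def run_eda(dim=3, pop_size=100, generations=80, sel_frac=0.3, seed=0):
    rng = random.Random(seed)
    # Initialize(): uniform population in [-5, 5]^dim
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # Select(): truncation selection of the best sel_frac fraction
        pop.sort(key=sphere)
        parents = pop[: max(2, int(sel_frac * pop_size))]
        # CreateModel(): per-dimension mean and standard deviation (univariate Gaussian model)
        means = [statistics.fmean(p[d] for p in parents) for d in range(dim)]
        stds = [statistics.pstdev([p[d] for p in parents]) + 1e-12 for d in range(dim)]
        # Sample() + Replace(): here, simple generational replacement by a freshly sampled population
        pop = [[rng.gauss(means[d], stds[d]) for d in range(dim)] for _ in range(pop_size)]
    return min(pop, key=sphere)

best = run_eda()
```

Note that the sketch re-estimates the model from scratch each generation; incremental schemes such as PBIL (described above) update a persistent model instead.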
Similarly to PBIL, the cGA uses a vector of probabilities, but adapts it by a different learning rule which resembles tournament selection. The univariate marginal distribution algorithm (UMDA) of Mühlenbein & Paass (1996) is already a proper EDA—it selects the promising solutions and models the distribution of zeros and ones by computing a vector of probabilities. All the above mentioned algorithms treat the variables independently of each other, and their behavior is similar.

There are also models which detect the dependencies between pairs of variables. They are the first non-trivial EDAs, as they are able (to a certain extent) to solve the linkage problem. They differ mainly in the form of the model they create; the models include a chain, a tree, or a forest. The algorithm called mutual-information-maximizing input clustering (MIMIC) presented by de Bonet et al. (1997) builds a model of bivariate dependencies in the form of a chain. The sampling is then based on the chain rule, i.e. on the probability of the first variable, on the conditional probability of the second variable given the value of the first one, etc. Baluja & Davies (1997) used dependency trees to encode the relationships between the variables, while Pelikan & Mühlenbein's (1999) bivariate marginal distribution algorithm (BMDA) employed a forest distribution (a set of mutually independent dependency trees). These algorithms are able to identify and properly mix building blocks[2] of size two, but that is still insufficient to solve more complex problems.

EDAs with more sophisticated probabilistic models can describe even multivariate dependencies between the variables. The bottleneck of these methods is the fact that they need complex algorithms to learn the distribution. This results in a significant learning time overhead. Furthermore, in almost all cases we end up only with a sub-optimal model of the joint probability distribution.
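The chain-rule sampling used by MIMIC can be illustrated with a small sketch for binary variables. The fixed chain order, the Laplace correction, and the toy data are assumptions of this sketch, not details of the original algorithm (which also searches for a good variable ordering).

```python
import random

def fit_chain(data, order):
    """Estimate p(x_first) and p(x_child = 1 | x_parent) along a fixed chain of binary variables."""
    n = len(data)
    p_first = sum(row[order[0]] for row in data) / n
    cond = []  # cond[i] = (p(child=1 | parent=0), p(child=1 | parent=1))
    for parent, child in zip(order, order[1:]):
        probs = []
        for v in (0, 1):
            subset = [row for row in data if row[parent] == v]
            # Laplace correction keeps the conditional defined even for empty subsets
            probs.append((sum(r[child] for r in subset) + 1) / (len(subset) + 2))
        cond.append(tuple(probs))
    return p_first, cond

def sample_chain(p_first, cond, order, rng):
    """Sample one binary string via the chain rule: first variable, then each conditional in turn."""
    x = [0] * len(order)
    x[order[0]] = int(rng.random() < p_first)
    for (parent, child), probs in zip(zip(order, order[1:]), cond):
        x[child] = int(rng.random() < probs[x[parent]])
    return x

rng = random.Random(1)
# Toy data: bit 1 copies bit 0, bit 2 is noise; the chain should capture the copy dependency
data = [[b, b, rng.randrange(2)] for b in [rng.randrange(2) for _ in range(200)]]
p0, cond = fit_chain(data, order=[0, 1, 2])
sample = sample_chain(p0, cond, [0, 1, 2], rng)
```

After fitting, the first conditional table reflects the copy dependency: p(x_1 = 1 | x_0 = 1) is close to 1 and p(x_1 = 1 | x_0 = 0) close to 0.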
The factorized distribution algorithm (FDA) of Mühlenbein et al. (1999) can contain multivariate marginal and conditional probabilities. However, the factorization of the distribution must be known a priori, e.g. given by an expert. The extended compact genetic algorithm (ECGA) (Harik 1999) uses different principles. It divides the variables into several mutually independent groups, and each of these groups is described by a complete joint distribution. The division into groups is carried out in a greedy manner with respect to the Bayesian information criterion.

The most complex model for discrete variables that can be learned is probably a general Bayesian network (the chain, tree and forest are special cases of a Bayesian network). The first EDA that used this complex model is the Bayesian optimization algorithm (BOA) of Pelikan et al. (1998). The network for a population of good solutions is learned every generation by a greedy algorithm which tries to minimize the Bayesian-Dirichlet metric between the distribution encoded in the model and the observed distribution. This network is then sampled to create new individuals. BOA is able to solve problems of bounded difficulty in subquadratic time. The same model was independently used in an EDA by Etxeberria & Larrañaga (1999) under the name estimation of Bayesian network algorithm (EBNA). The FDA was later improved to be able to learn the factorization; the modified version is called the learning factorized distribution algorithm (LFDA).

2.1.3 EDAs in Continuous Spaces

As stated in Sec. 1.2.1, there are fundamental differences between discrete and continuous spaces. Another example of this difference:

• All possible conditional PDFs p(X2|X1) with two binary variables X1 and X2 can be described with only two numbers: p(X2 = 1|X1 = 0) and p(X2 = 1|X1 = 1).
• A model of a conditional PDF p(X2|X1) with two continuous variables X1 and X2 can take much more complex forms, with different parameters for each possible value of X1—a mere two numbers (as in the case of the binary PDF mentioned above) usually do not suffice.

Thus, searching for the right probabilistic model in the continuous domain involves not only the factorization of the PDF, but also the search for the right type of distributions along with their parameters. The search for the right model structure often has to be limited to a class of (rather simple) models (or is carried out only in the space of model parameters). Nevertheless, many continuous EDAs use direct generalizations of the models used in discrete EDAs.

[2] Building blocks are defined as short, over-average schemata of low size. For more information, see e.g. Goldberg (1989).

The easiest EDAs in continuous spaces take no dependencies among variables into account. The joint density function is factorized as a product of univariate independent densities. For example, the univariate marginal distribution algorithm for continuous domains (UMDA_C) by Larrañaga et al. (1999) belongs to this type of EDAs. This algorithm performs statistical tests in order to determine which of the theoretical density functions fits each particular variable best. Then, it computes the maximum likelihood estimates of the parameters of the chosen univariate marginal PDFs. Another example of an EDA using a model without dependencies is the stochastic hill-climbing with learning by vectors of normal distributions (SHCLVND) of Rudlof & Köppen (1996).
In this algorithm, the joint density is formed as a product of univariate independent normal densities, where the vector of means is updated using a kind of Hebbian learning rule and the vector of variances is adapted by a reduction policy (at each iteration it is multiplied by a factor lower than 1). Sebag & Ducoulombier (1998) propose an extension of PBIL for continuous domains (PBIL_C); however, their approach is very similar to an instance of an ES (see Schwefel (1995)). Servet et al. (1997) present a progressive approach: for each dimension, an interval in which the search takes place is maintained, along with a probability that the respective component of the solution comes from the right half of the interval. In case the probability gets close to 1 (or 0), the interval is updated to its right (or left) half. Finally, there are several attempts to use histogram models (see e.g. Bosman & Thierens (1999), Tsutsui, Pelikan & Goldberg (2001), Pošík (2003)).

The next step is to take advantage of models which cover bivariate or multivariate dependencies among variables. They include probabilistic models based on bivariate normal distributions (MIMIC_C^G), on multivariate normal distributions where the complete covariance matrix is estimated (EMNA), on Gaussian networks (EGNA), etc. (Larrañaga & Lozano (2002)). Bosman & Thierens (2000) also used a multivariate normal distribution with a full covariance matrix. MBOA of Očenášek & Schwarz (2002) learns a Bayesian network with decision trees and univariate normal-kernel leaves: the search domain is decomposed into axis-parallel partitions, and within them the variables are approximated by univariate kernel distributions.

Sometimes the good solutions form multiple 'clouds' in the search space. For this type of distribution, a mixture of normal distributions is a very useful model. An example of this approach was presented by Gallagher et al.
(1999); the algorithm proposed there (AMix) uses an adaptive mixture of multivariate gaussians which can change the number of components during evolution. Learning such a complex class of models, however, requires a large number of samples, and is usually carried out by some kind of iterative algorithm (e.g. the EM algorithm).

Another approach is pursued by Bosman & Thierens (2000): a multivariate gaussian with a relatively small variance is centered around each individual. The joint probability density is then a superposition of all those kernels. This model has the advantage of capturing various types of distribution; furthermore, we can regard it as a non-parametric model (when we keep the variance of the kernels constant). This way of estimating the true PDF is in fact equivalent to Parzen windows. The behavior of such an EDA is exactly the same as the behavior of a mutation-only ES where the mutation is carried out by adding a normally distributed random vector to each selected parent individual.

2.2 Coordinate Transforms

Algorithm 2.2: The Use of Coordinate Transform Inside EA

begin
    X(0) ← Initialize()
    f(0) ← Evaluate(X(0))
    g ← 0
    while not TerminationCondition() do
        XPar ← Select(X(g), f(g))
        YPar ← DirectTransform(XPar)
        YOffs ← ProduceOffspring(YPar)
        XOffs ← InverseTransform(YOffs)
        fOffs ← Evaluate(XOffs)
        [X(g+1), f(g+1)] ← Replace(X(g), XOffs, f(g), fOffs)
        g ← g + 1
end

Sec. 1.3.6 states that in the continuous domain, probabilistic modeling is often accompanied by a kind of coordinate transform. However, the transforms can be used even with the ordinary reproduction operators of conventional GEAs (see Alg. 2.2). After the selection, the parents are transformed into a different coordinate system where the creation of offspring individuals might be easier or more successful.
The offspring are created either using the crossover and mutation operators (in the case of conventional GEAs) or by probabilistic model building and sampling (in the case of EDAs). Then the new individuals are transformed back to the original space, are evaluated, and can compete for survival with the old population members. From this it immediately follows that if the coordinate transform is to be adapted using the selected parents, it has to be reversible.

The possible transforms can be classified into several types with various levels of complexity and usefulness:

• Linear coordinate-wise transform. This group contains e.g. a change of origin, a change of scale of individual coordinates, standardization, etc. These methods are far too simple to yield any important profit for the algorithm.

• Non-linear coordinate-wise transform. These transforms allow for adaptive focusing on promising regions of the search space; however, they are not able to reduce the dependencies among variables.

• Linear transform with rotation. Methods which reduce the amount of dependencies in the population by transforming the selected individuals as a whole, like principal component analysis or independent component analysis, belong to this category. These methods are able to greatly improve the performance of the algorithm if the problem corresponds to the model.

• General non-linear transform. This category contains transforms which employ some sort of clustering, mainly mixture models. They are able to perform different coordinate transforms in different areas of the search space. The evolution then resembles a parallel model of an EA where the population is divided into several demes which can evolve independently of the others.

The survey of literature dealing with coordinate transforms in the context of EAs is very short. Only a few papers dealt with coordinate transforms in the way just described.
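The DirectTransform/InverseTransform pair of Alg. 2.2 can be sketched for the "linear transform with rotation" case using principal component analysis. The function names and the toy data are illustrative; numpy is assumed.

```python
import numpy as np

def direct_transform(X):
    """Rotate the parents into PCA coordinates, decorrelating linear dependencies."""
    mean = X.mean(axis=0)
    # Columns of `vecs` are eigenvectors of the sample covariance matrix
    _, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return (X - mean) @ vecs, (mean, vecs)

def inverse_transform(Y, params):
    """Map offspring created in the rotated space back to the original coordinates."""
    mean, vecs = params
    return Y @ vecs.T + mean

rng = np.random.default_rng(0)
# Strongly correlated 2-D 'parents'
base = rng.normal(size=(500, 1))
X = np.hstack([base, 2.0 * base + 0.1 * rng.normal(size=(500, 1))])
Y, params = direct_transform(X)
# In the rotated coordinates the off-diagonal covariance (nearly) vanishes
off_diag = abs(np.cov(Y, rowvar=False)[0, 1])
round_trip_error = np.abs(inverse_transform(Y, params) - X).max()
```

Because the rotation matrix is orthogonal, the transform is trivially reversible, which is exactly the requirement stated above.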
Much more often, the transforms are part of the probabilistic model. Thus:

• All the EDAs that use a PDF factorized into univariate components perform a non-linear coordinate-wise transform (although they do not mention it explicitly).

• All the EDAs that use some form of principal component analysis or independent component analysis perform a linear rotational transform. Again, these are usually used as part of the PDF model, but they can be used independently of the model.

• All the EDAs that use a kind of probabilistic mixture model perform a non-linear transform. The transform could be used as a standalone component of an EA.

Independent component analysis was explicitly used as a part of an EA by Zhang et al. (2000). Pošík (2005b) used principal component analysis in a co-evolutionary scheme, and Pošík (2005a) compared the usefulness of principal and independent component analysis. Finally, Pošík (2004) combined a non-linear transformation in the form of kernel principal component analysis with a pure random search to construct a generalized crossover operator with interesting characteristics.

Chapter 3: Envisioning the Structure

This chapter aims at the most difficult part of the optimization task—envisioning the structure of the distribution of good solutions in the search space. It is only fair to say that there is nothing oracular about the methods presented here. Since the structure can generally be very complex, there is no easy, fast and reliable method of envisioning it from a finite sample (the selected population members). Instead, models with strong structural assumptions are used, and if the structure of the problem does not correspond to them, the model performance is low. Although we cannot assume that the optimization task at hand has a suitable structure, an empirical justification of this approach is provided by conventionally used methods (e.g.
line search) which have similar assumptions and still are often employed to solve real-world problems. First, univariate marginal probability models are described and their efficiency is demonstrated on selected optimization problems. Then, they are combined with coordinate transforms to relax the strong assumptions, at least to some extent.

3.1 Univariate Marginal Probability Models

This section builds on the work published in Pošík (2003) and concentrates on models that assume the statistical independence of the individual variables. This allows us to factorize the joint probability distribution as a product of univariate marginal distributions:

p(x) = \prod_{d=1}^{D} p_d(x_d)    (3.1)

where p(x) is the joint multivariate probability distribution and the p_d(x_d) are the 1-D marginals. This kind of model is actually the simplest reasonable one, and algorithms which use this type of model fall into the class of Univariate Marginal Distribution Algorithms (UMDAs).

One source of the differences among the individual UMDA-class algorithms is the choice of a suitable 1-dimensional PDF. In the literature we can find examples of histogram models, (unimodal) gaussian PDFs, mixtures of gaussians, and Parzen-windows-based density estimators. The individual models differ in flexibility: when less flexible models are used (e.g. a gaussian PDF or a histogram model with a low number of bins), sampling can take place between the training points, while in the case of more flexible models (a mixture of gaussians or Parzen windows) the sampling is more limited to local neighborhoods of the training points. Which of the two cases is better depends on the type of problem we are about to solve.

In the following subsections, several types of marginal densities are described. The purpose is to evaluate the behavior of the UMDA algorithm (without mutation) using 4 types of marginal probability models.
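Sampling from a model factorized as in Eq. (3.1) reduces to sampling each coordinate independently from its own 1-D marginal. A minimal sketch follows; the two marginal samplers used here are placeholders, not the concrete models discussed in this chapter.

```python
import random

def sample_factorized(marginal_samplers, rng):
    """Draw one point from p(x) = prod_d p_d(x_d): each coordinate is sampled independently."""
    return [draw(rng) for draw in marginal_samplers]

# Two illustrative 1-D marginals: a standard normal for x_1, a uniform for x_2
marginals = [
    lambda rng: rng.gauss(0.0, 1.0),
    lambda rng: rng.uniform(-5.0, 5.0),
]
rng = random.Random(42)
points = [sample_factorized(marginals, rng) for _ in range(1000)]
```

Any 1-D density estimator (histogram, gaussian, mixture) can be plugged in per dimension without changing the sampling loop, which is what makes the UMDA class so modular.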
The first two were considered in the work of Tsutsui, Pelikan & Goldberg (2001), namely the equi-width and equi-height histograms. Pošík (2003) expanded the set of models with the max-diff histogram and the univariate mixture of gaussians.

3.1.1 Histogram in the Continuous Domain

Before describing the particular types of histograms, let us formalize the notion of a 1-D histogram model. Further subsections will describe various instances of this formalization. A histogram is a non-parametric model of a univariate PDF. It consists of C bins, and each c-th bin, c ∈ {1, ..., C}, is described by three parameters: b_{c-1} and b_c are the boundaries of the bin, and p_c is the value of the PDF if b_{c-1} < x ≤ b_c.

When constructing a histogram, we have a set \{x_n\}_{n=1}^{N} of N data points. Let n_c be the number of data points falling between b_{c-1} and b_c. Then the values p_c are proportional to n_c:

p_c = \alpha n_c    (3.2)

Let d_c be the width of the c-th bin, d_c = b_c - b_{c-1}. The following condition must hold if p(x) is a PDF:

1 = \int_{-\infty}^{\infty} p(x) \, dx = \sum_{c=1}^{C} p_c d_c = \alpha \sum_{c=1}^{C} n_c d_c    (3.3)

From this equation we immediately get the normalizing constant:

\alpha = \frac{1}{\sum_{c=1}^{C} n_c d_c}    (3.4)

3.1.2 One-dimensional Probability Density Functions

The individual models of the PDF are described here in detail.

Equi-Width Histogram (HEW)

This type of histogram, in which all bins have the same width, is probably the best known and the most widely used one, although it exhibits several unpleasant features. The domain of the respective variable is divided into C equally large bins. Since the size of each bin can be written as d_c = d = (b_C - b_0)/C, the equation for the values of the PDF reduces to

p_c(x) = \frac{n_c}{dN}    (3.5)

When constructing the histogram, we first determine the boundaries of each bin, count the respective number of data points in each bin, and use Equation 3.5 to compute the values of the PDF for this histogram model. Figure 3.1 shows an example of the equi-width histogram.
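Equations (3.2) to (3.5) for the equi-width histogram can be sketched directly; this is an illustrative implementation, and clamping the last bin so that x = b_C is included is an assumption of this sketch.

```python
def equi_width_histogram(data, b0, bC, C):
    """Equi-width histogram on [b0, bC] with C bins; returns boundaries and PDF values (Eq. 3.5)."""
    d = (bC - b0) / C                        # common bin width
    counts = [0] * C
    for x in data:
        # clamp the index so that x == bC falls into the last bin
        c = min(int((x - b0) / d), C - 1)
        counts[c] += 1
    N = len(data)
    pdf = [n_c / (d * N) for n_c in counts]  # p_c = n_c / (d N)
    boundaries = [b0 + c * d for c in range(C + 1)]
    return boundaries, pdf

data = [0.5, 1.5, 1.6, 3.2, 3.3, 3.4, 9.9]
bounds, pdf = equi_width_histogram(data, 0.0, 10.0, 5)
# The model integrates to one: sum_c p_c * d = 1
total = sum(p * 2.0 for p in pdf)
```

Note how the empty middle bins get a PDF value of exactly zero; this is the feature whose consequences are discussed in the empirical comparison below.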
Equi-Height Histogram (HEH)

This type of histogram fixes the probabilities of the individual bins instead of fixing their widths; each bin then contains the same number of data points. This means that in areas where the data points are sparse, the bin boundaries are far from each other, while in regions where the data points are dense, the boundaries are close together. Since the probability of each bin is constant and equal to p(b_{c-1} < x ≤ b_c) = n_c/N = 1/C, we can write

p_c(x) = \frac{1}{d_c C}    (3.6)

Equation 3.6 holds only theoretically. In real cases, we can hardly expect that the number of data points is divisible by the number of bins without any remainder. One simple method to construct this type of histogram is to compute the number of data points each bin should contain (these values differ by at most 1 across the bins). Then we put the boundary between consecutive bins at half the distance between the two neighboring data points of which the first should fall into one bin and the second into the other. It is also possible to build an empirical CDF with some kind of interpolation between points and use this estimate to determine the quantiles of the distribution at levels 1/C, 2/C, ..., (C-1)/C, which then serve as the bin boundaries b_1, ..., b_{C-1}. Then the number of data points in each bin can be exactly the same (and possibly fractional) for all the bins. An example of the equi-height histogram is shown in Fig. 3.2. As can be seen, it is actually not the height of the bins that is fixed, but rather their area (i.e. the probability of generating a point belonging to the bin).

Max-Diff Histogram (HMD)

This type of histogram is reported to have several good features (very small errors, short construction time), see e.g. Poosala et al. (1996). The max-diff histogram does not put any 'equi-something' constraint on the model.
Instead, it is defined by a very simple rule: put the bin boundaries into the largest gaps between the data points. When constructing a max-diff histogram with C bins, we need to specify C - 1 boundaries (the first and the last boundaries are equal to the minimum and maximum of the respective variable domain). If we compute all the N - 1 distances between the pairs of neighboring data points, we can select the C - 1 largest of them and put the boundaries at the midpoints of the respective gaps. This procedure has a natural explanation: we try to find consistent clusters in the data, and all data points in such a cluster are put into the same bin. It is just an approximation, and we can argue about the results this procedure gives when the data do not show any structure, but it turned out to be very precise and useful in many situations. Figure 3.3 shows an example of the max-diff histogram.

Univariate Mixture of Gaussians (MOG)

The univariate mixture of gaussians (MOG) is the last marginal PDF model compared in this section. It is included in this comparison for several reasons. All the histogram models can be considered a certain kind of mixture model as well—a mixture of uniforms. From this point of view, the mixture of gaussians can give interesting results when compared to the histograms. Histograms (viewed as mixtures of uniform distributions) are usually limited in the 'fidelity' with which they can model the actual distribution. Hence,

Figure 3.1: Left: equi-width histogram. Right: bivariate PDF as a product of two marginal equi-width histograms.

Figure 3.2: Left: equi-height histogram. Right: bivariate PDF as a product of two marginal equi-height histograms.

Figure 3.3: Left: max-diff histogram.
Right: bivariate PDF as a product of two marginal max-diff histograms.

they are constructed with a pretty large number of bins (i.e. components, in the language of mixture models). On the contrary, a mixture model is usually capable of describing the actual distribution with a much higher degree of 'fidelity', and thus it allows us to drastically decrease the number of components. The MOG is given in the form

p(x) = \sum_{c=1}^{C} \alpha_c N(x; \mu_c, \sigma_c^2)    (3.7)

where the \alpha_c are the coefficients of the individual gaussian components and N(x; \mu_c, \sigma_c^2) is the density of generating the value x using the c-th component with mean \mu_c and standard deviation \sigma_c.

Fitting a MOG to the data sample requires much more effort. In a MOG, the individual components overlap each other. We therefore cannot decide exactly which component each data point belongs to (all data points belong to all components with certain probabilities). As a result, to create the model we have to use an iterative learning scheme—the Expectation-Maximization (EM) algorithm (see e.g. Schlesinger & Hlaváč (2002), Moerland (2000))—which searches for the maximum likelihood estimates of the parameters \alpha_c, \mu_c and \sigma_c. Figure 3.4 shows an example of the MOG model.

Figure 3.4: Left: univariate mixture of gaussians. Right: bivariate PDF as a product of two marginal MOG models.

Sampling

To sample new points from the models described above, a sampling method from Tsutsui, Pelikan & Goldberg (2001) called Baker's stochastic universal sampling (SUS) was used. It uses the same principles as the (in the EC community perhaps better known) remainder stochastic sampling (RSS). The random values are generated independently for each dimension. First, the expected number of data points which should be generated by each bin is computed according to the bin (or component) probabilities given by Equation 3.2.
Then, the integer part of the expected number of offspring is generated by sampling the uniform (in the case of histograms) or the normal (in the case of MOG) distribution with the proper parameters. The rest of the points are then sampled by stochastically selecting bins (or components) with probabilities given by the fractional parts of the expected numbers of offspring for each bin (component).

3.1.3 Empirical Comparison

The settings of the full-factorial experiment are surveyed in Table 3.1.

Table 3.1: Settings of individual factors of the experiment

  Factor                 Levels
  Test Function          20D Two Peaks, 10D Griewangk
  Probability Model      HEW, HEH, HMD, MOG
  Number of Components   Low, High
  Population Size        200, 400, 600, 800

Two reference functions, namely the 20-dimensional Two Peaks (see Sec. A.3) and the 10-dimensional Griewangk function (see Sec. A.4), were selected for the comparison. For both test problems, all four UMDAs described in Sec. 3.1.2 were compared. For each of the models, the number of bins (components) has two levels, High and Low; the exact numbers vary (see Table 3.2).

Table 3.2: Number of bins (components) used for individual algorithms and test functions.

  Function    Model           Low   High
  Two Peaks   HEW, HEH, HMD   60    120
              MOG             3     6
  Griewangk   HEW, HEH, HMD   50    100
              MOG             3     6

The High number of bins was computed as the size of the variable domain divided by the chosen precision (for the equi-width model), i.e. 12/0.1 = 120 for the Two Peaks function and 10/0.1 = 100 for Griewangk. The Low number of bins is simply one half of that. The mixture models are much more flexible, and thus dramatically fewer components were used than in the case of the histograms. The values for this kind of model were determined by hand after a few experiments; the levels are 6 and 3 for the High and Low number of components, respectively.
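The expected-count sampling described in the Sampling subsection above (deterministic integer parts, stochastically assigned remainders) can be sketched as follows. This is a simplified, illustrative variant of SUS/RSS, not the exact procedure of Tsutsui, Pelikan & Goldberg: the remaining slots are drawn with replacement from a roulette wheel over the fractional parts.

```python
import math
import random

def sus_counts(bin_probs, n_offspring, rng):
    """Integer parts of the expected counts are kept deterministically; the remaining
    slots are assigned stochastically, proportionally to the fractional parts."""
    expected = [p * n_offspring for p in bin_probs]
    counts = [math.floor(e) for e in expected]
    fractions = [e - c for e, c in zip(expected, counts)]
    remainder = n_offspring - sum(counts)
    total_frac = sum(fractions) or 1.0
    for _ in range(remainder):
        # roulette wheel over the fractional parts (with replacement: a simplification)
        r = rng.random() * total_frac
        acc = 0.0
        for i, f in enumerate(fractions):
            acc += f
            if r <= acc:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts

rng = random.Random(0)
counts_exact = sus_counts([0.5, 0.3, 0.2], 10, rng)    # all expected counts are whole numbers
counts_frac = sus_counts([0.5, 0.25, 0.25], 10, rng)   # one slot is decided stochastically
```

Once the per-bin counts are fixed, each bin generates its share of points from its own distribution (uniform for histogram bins, normal for MOG components).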
The initial values for the centers of the individual mixture components were constructed using the k-means clustering method.

In order to give some indication of how complex the test problems are, they were also optimized with the so-called line search (LS) method (see Whitley et al. (1996)). This method is not intended to be a universal optimization algorithm; rather, it is a heuristic which is quite effective for separable problems, and thus it can give us some insight into the complexity of the evaluation function.

Evolutionary Model

A very simple evolutionary model that is used for most of the experiments throughout this thesis is presented as Alg. 3.1. The differences compared with the general EDA algorithm (Alg. 2.1) are the following:

• selection of parents is not used at all (i.e. all population members become parents), and

• the replacement scheme simply takes the N best individuals from the joined parents and offspring.

Algorithm 3.1: Generic EDA

begin
    Initialize and evaluate the population
    while not TerminationCondition do
        Based on the current population, create N new offspring
        Evaluate them
        Join the old and the new population to get a data pool of size 2N
        Use truncation selection to select the better half of the data points
        (returning the population size back to N)
end

For each possible factor combination, the algorithm was run 20 times. Each run was allowed to continue until the maximum number of 50,000 evaluations was reached. We say that the algorithm found the global optimum if the condition |x_d^{best} - x_d^{opt}| < 0.1 holds for each variable d (i.e. if the difference between the best found solution x^{best} and the optimal solution x^{opt} is lower than 0.1 in each of the coordinates). In all experiments, three statistics were tracked:

1. The number of runs in which the algorithm succeeded in finding the global optimum (NoFoundOpts).

2.
The average number of evaluations needed to find the global optimum, computed from those runs in which the algorithm really found the optimum (AveNoEvals).

3. The fitness of the best solution the algorithm was able to find, averaged over all 20 runs (AveBest).

Results and Discussion

The results of our experiments are presented in Table 3.3. The results of the line search heuristic correspond to the fact that the Two Peaks function is separable and that one of the sampling points was set precisely in the optimum. The Griewangk function is more complex and the line search heuristic was not able to solve it in all experiments. This suggests that the test function does not lose all of its complexity when used in its higher-dimensional form (as mentioned in Sec. A.4). EDAs should be able to solve the easy (when the variables are known to be independent beforehand) Two Peaks problem.

Table 3.3: UMDA empirical comparison: results. For each combination of test function (Two Peaks, Griewangk), algorithm (LS, and UMDA with the HEW, HEH, HMD, and MOG models with Low and High numbers of bins/components) and population size (200, 400, 600, 800), the table reports the NoFoundOpts, AveNoEvals, and AveBest statistics.

Let us first consider the results for the HEW model. The results of the experiment are in accordance with the results reported by Tsutsui, Pelikan & Goldberg (2001): the HEW model is the least flexible one. Furthermore, it can be characterized by one important feature: if, at any stage of evolution, any bin of the histogram for any dimension happens to be empty, it is impossible to generate a value from this bin in the following evolution cycles. Generally, if the PDF is equal to zero somewhere in the search space, the algorithm has no way to overcome it and to start searching in these 'zero areas'.¹ This is also documented by the very weak results obtained by the UMDA with the HEW model. If the 'resolution' (the number of bins) of the histogram is lower than the precision required, the algorithm becomes very inaccurate. If the 'resolution' is sufficient, the efficiency of the algorithm is very dependent on the population size. This can be seen especially in the case of the Two Peaks evaluation function.

¹ This drawback can be avoided by the so-called Bernoulli's adjustment, i.e. by adding a small constant to all bin probabilities so that even the empty ones are not omitted from further use. However, other types of histograms do not suffer from this bottleneck, so we can use them instead of the HEW model.
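For concreteness, the way a quantile-based histogram avoids empty bins can be sketched as follows. This is only a minimal sketch under my own naming; the actual models of Sec. 3.1.2 may differ in details such as edge handling.

```python
import random

def fit_equiheight(xs, n_bins):
    """Equi-height marginal histogram: bin edges placed at empirical
    quantiles, so every bin holds roughly the same number of points
    and no bin is ever empty (unlike the equi-width HEW model)."""
    xs = sorted(xs)
    n = len(xs)
    edges = [xs[0]] + [xs[(k * n) // n_bins] for k in range(1, n_bins)] + [xs[-1]]
    return edges

def sample_equiheight(edges, rng=random.Random(0)):
    """All bins are equally probable: pick a bin uniformly, then
    sample uniformly inside it."""
    k = rng.randrange(len(edges) - 1)
    return rng.uniform(edges[k], edges[k + 1])
```

Because the edges follow the data, regions where the population concentrates automatically get narrower bins, i.e. higher probability density.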
The other three models, HEH, HMD, and MOG, seem to be much more robust with respect to the parameters. None of them exhibits the unpleasant feature of the HEW model (the 'zero areas') described in the previous paragraph. The models are constructed in such a way that they do not allow for a zero probability density (during the evolution, the value of the PDF can become 'very small' in some areas, but theoretically it never becomes zero).

The HEH and HMD models showed very similar behavior, although on these two test functions the HEH model seems to work slightly better. However, the differences are insignificant. Both are very successful in the number of runs in which they succeed in finding the global optimum. The average number of evaluations they need to find the optima is comparable for both models. However, looking at the results table, we can see that the average best achieved fitness scores are still highly dependent on the size of the population.

In Figures 3.5 and 3.6, the typical evolution of the bin boundaries (for histogram models) and component centers (for the MOG model) is presented.

Figure 3.5: Two Peaks function — Evolution of bin boundaries for HEH and HMD models and evolution of component centers for MOG model.

Figure 3.6: Griewangk function — Evolution of bin boundaries for HEH and HMD models and evolution of component centers for MOG model.

These typical representatives depict cases in which all the models converged to the right solution (for this particular dimension). However, to find the global optimum, the algorithm had to converge to the right solution in all the dimensions, and that is often not the case for the MOG model (as can be seen from the results table). For the MOG model, we can see that several components of the mixture simply stopped evolving. The reason for this is very simple: their mixture coefficients dropped to zero and the components simply vanished.² Sometimes, the component that resides in the area of the global optimum vanishes before it can take over the population. This can be thought of as an analogy of the premature convergence problem.

The UMDA with the equi-width histogram turned out to be a very unreliable and non-robust algorithm. The UMDA with the other two types of histogram models, equi-height and max-diff, seems to be effective and robust for separable functions and for functions with a small degree of interactions among variables. In the latter case (for the Griewangk function), they outperformed the simple line search heuristic. The mixture of gaussians model is a representative of a qualitatively different type of model. Although it exhibits a bit worse behavior than the two above-mentioned types of histograms, it is fair to say that it was used with a considerably smaller number of components and that it offers a few additional advantages. E.g., with almost no effort, it can be easily extended from a set of independent marginal mixtures of gaussians to a general multidimensional mixture of gaussians which will be able to cover some interactions among variables. (Extending the histogram models to more dimensions would require much more memory space than extending the mixture of gaussians.)
² This 'component vanishing' is an analogy to the zero areas of the HEW model, and the Bernoulli adjustment should work as a remedy.

3.1.4 Comparison with Other Algorithms

We have seen that the UMDA is able to solve order-1 decomposable problems (Two Peaks) reliably, and when used with a sufficiently large population, it can solve even non-decomposable functions (Griewangk). Of course, in that case its efficiency depends on the amount of dependencies. What should we expect from comparisons with other methods? Comparisons with several other algorithms are postponed to Sec. 4.1.3.

The Two Peaks function is very hard for any optimization algorithm which is not aware of the independence of individual variables. Local search strategies would find the global optimum in almost no run. GAs would have problems when performing crossover; nevertheless, if they knew the right crossover points (any multiple of the number of bits used to encode one variable), they should get close to the global optimum, but due to the discretization, their precision would be limited. The line search heuristic would, in the general case, have similar problems with insufficient precision when each line search is implemented as a grid search. When using e.g. Brent's method of line search, it would very often end up in a local optimum, not in the global one.

3.1.5 Summary

This section addressed the first goal of the thesis: to find an optimization method for situations when the problem has no or weak interactions among variables. The results of an empirical comparison of the UMDA type of EDA using several types of marginal probability models were described. All the UMDAs used (with the exception of the UMDA with the equi-width histogram) belong to the class of asymptotically complete search algorithms, provided the sought optimum lies inside the enclosing hypercube. The algorithms were viewed also from the robustness point of view, i.e. from the point of view of their (non-)dependency on the algorithm parameter settings. The efficiency of UMDAs is not expected to be very high for more complex functions, i.e. for functions with interactions among variables.

UMDA Advantages

Simplicity. The 1D PDFs used by UMDA are easy to learn and sample.

Scalability. The time and space demands of a probabilistic model factorized as a product of D univariate marginal distributions grow only linearly with the dimensionality of the problem.

UMDA Limitations

No interactions. The UMDAs are not able to deal with problems that exhibit dependencies among individual variables; they are not able to solve such problems reliably.

What's Next?

More flexible models have to be used if we want to solve problems with interactions. One possibility which preserves (more or less) the advantages of UMDAs and incorporates at least linear dependencies among variables is to preprocess the population of individuals with linear coordinate transforms.

3.2 UMDA with Linear Population Transforms

This section is based on the work of Pošík (2005a). The UMDA is enriched with linear coordinate transforms that should reduce or completely remove the linear dependencies found among the problem variables. The main requirement on such a transform is its reversibility. We need the 'forward' part of the transform to reduce the dependencies among variables so that the evolution can take place in the transformed space. The 'backward' part of the transform (or inverse transform) is needed to transform newly generated offspring back into the original space in order to be evaluated: the objective function is defined only in the original space. In this section, two possible coordinate transforms are shortly described. Both of them form the foundation for building more complex models and are used in some way in subsequent experiments.
3.2.1 Principal Components Analysis

Perhaps the best known linear coordinate transform is the so-called principal components analysis (PCA). It is used mainly for dimensionality reduction in multivariate analysis, and its applications range from data compression, through image processing, to visualisation, pattern recognition, or time series prediction.

The PCA is most commonly defined as a linear projection which maximizes the variance in the projected space (Hotelling (1933)). We have to find such an orthogonal coordinate system in which the variance of the data is maximized along the axes. This can be easily accomplished by performing the eigendecomposition of the data sample covariance matrix. One additional property of PCA is worth mentioning: among all orthogonal linear projections, the PCA projection minimizes the squared reconstruction error.

To formalize the computations, let us denote the set of centered data points at hand (the population) as X = (x_1, x_2, ..., x_N), where each x_n is a column vector. The population matrix X is of size D × N, where D is the dimension of the input space and N is the population size. The covariance matrix of size D × D is given by

    C = (1/N) X X^T.    (3.8)

If we find the eigendecomposition of this matrix, we find the linear transform which decorrelates the components of individuals, i.e. we need to compute a diagonal matrix λ of order D and a full-rank square matrix V of size D × D whose rows are the eigenvectors of C, i.e. such that the condition V C = λ V holds. Matrices λ and V contain the eigenvalues and eigenvectors of the covariance matrix C, respectively. Then, the transform

    Y = V × X    (3.9)

rotates the coordinate system of the population matrix X in such a way that the coordinates of individual data points in matrix Y are not correlated. The inverse transform can be done simply by inverting the eigenvectors matrix V, i.e.

    X = V^(-1) × Y = V^T × Y.
(3.10)

Moreover, a probabilistic formulation of PCA offers an alternative way of estimating the principal axes using the maximum-likelihood framework, in which we can use a computationally efficient, iterative expectation-maximization (EM) algorithm (Tipping & Bishop (1999)). This formulation is appealing also from other points of view: it allows us to compare the probabilistic PCA model with other probabilistic models, or to estimate a whole mixture of PCAs.

3.2.2 Toy Examples on PCA

As already stated, PCA is a linear coordinate transform which decorrelates the object variables. PCA has no means for recombining the data; if we first decorrelate the data and then transform them back using the inverse transform, we should get the same data points. That is why the PCA can be used only as a part of another variation technique (a crossover operator or a probabilistic model). In this section, the decorrelated data are modeled by a product of marginal equi-height histograms (see Sec. 3.1.2).

We can visualize the outcome of PCA by means of contour plots of the linear components extracted from the data (see Fig. 3.7 for a 2D example). The contours are lines with the same value of the respective linear component, i.e. the principal axes are perpendicular to the contour lines. We can see that individual contours are parallel to each other and the contours of the first principal component are perpendicular to the contour lines of the second principal component. PCA captures the high-level structure of the data set, but fails to describe the details.

Figure 3.7: Two linear principal components of the 2D toy data set

Nevertheless, PCA in conjunction with e.g. marginal histograms can be relatively successful in describing more details. Data sampled from such a kind of model can be very similar to those in the training data set (see Fig. 3.8).
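The decorrelation of Eqs. (3.8)–(3.10) can be sketched in a few lines of NumPy. This is a sketch only; note that numpy.linalg.eigh returns the eigenvectors as the columns of its output, so the transpose of that matrix plays the role of V in Eq. (3.9).

```python
import numpy as np

def pca_decorrelate(X):
    """Decorrelate a centered D x N population matrix X.
    Returns the rotated population Y and the matrix V whose
    columns are the eigenvectors of the covariance matrix C."""
    N = X.shape[1]
    C = X @ X.T / N                # covariance matrix, Eq. (3.8)
    lam, V = np.linalg.eigh(C)     # eigendecomposition of C
    Y = V.T @ X                    # forward transform: decorrelated coordinates
    return Y, V

# inverse transform: X = V @ Y, since V is orthogonal
```

A quick check on correlated toy data confirms that the off-diagonal entries of the covariance of Y vanish and that the round trip V @ Y reproduces X up to floating-point error.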
Results of some experiments with the PCA preprocessing are postponed to Sec. 3.2.6. It can be seen there that for some functions the PCA is useful and for other functions it is not. These results and their discussion suggest that we need a facility which would recognize if the PCA is useful in the particular situation. The discussion of this problem is presented in the next section.

Figure 3.8: An example of using the PCA with univariate marginal equi-height histograms

3.2.3 Independent Components Analysis

The PCA described in Sec. 3.2.1 can also be described in the following terms: it is a linear transform which minimizes the correlations³ (the 'first-order' dependencies) among variables. It would be very nice to have a similar algorithm minimizing a compound criterion of dependency which would take into account also the 'higher-order' interactions among variables.

It is very hard to test for the independence of continuous variables. The notion of independence is usually defined by the PDFs. Let p(X1, X2) be the joint PDF of variables X1 and X2. We say that X1 and X2 are independent if the joint PDF can be factorized as follows:

    p(X1, X2) = p1(X1) p2(X2),    (3.11)

where p1 and p2 are the marginal PDFs. This definition can be extended to any number of variables. The PCA makes the variables uncorrelated, but some part of the dependencies among variables is hidden also in higher-order moments, i.e. making the variables uncorrelated is not sufficient. If the variables are independent, they are also uncorrelated, but the opposite is not true.

³ In fact, PCA not only minimizes the correlations, but avoids them completely.

The independent components analysis (ICA) (see e.g. Hyvärinen (1999)) is a rather recent data analysis technique.
Its primary aim is to find such a linear transform which makes the transformed variables as independent of each other as possible. This goal makes the ICA a very appealing preprocessing technique for use in EAs.

ICA Basics

We can define the ICA in several ways. If we select the mutual information (MI) as the measure of dependency, we can define the ICA as a process of finding such a linear transform which minimizes the MI. Unfortunately, direct minimization of MI over all possible linear transforms would require estimating the density functions—very hard, very uncertain, and often very time-consuming work. Hyvärinen & Oja (2000) show that:

. . . ICA estimation by minimization of mutual information is equivalent to maximizing the sum of non-gaussianities of the estimates, when the estimates are constrained to be uncorrelated. . . .

From the above citation, it is clear that due to the constraint of uncorrelated projections, the ICA need not estimate the joint probability density—the problem is simplified to a great extent and can be solved just by searching for 1-dimensional subspaces with the greatest measures of non-gaussianity of the projections.

The general measure of non-gaussianity is usually the negentropy. If H(X) is the differential entropy of a random variable X, then the negentropy J(X) of a random variable X is defined as

    J(X) = H(X_Gauss) − H(X),    (3.12)

where X_Gauss is a random variable with normal distribution and the same variance as the variable X. It is a well-known fact that a gaussian variable has the largest entropy among all random variables with equal variance; thus the negentropy is always nonnegative and zero for the normal distribution, so that it can be used as a measure of non-gaussianity. Nevertheless, it remains only a theoretical measure because one still has to estimate the probability distributions.
In practice, we have to resort to some approximations of negentropy. Some of these approximations can be found in Hyvärinen & Oja (2000). They usually emphasize the differences from the normal distribution, i.e. they return greater values if the empirical distribution is spiky, multimodal, or heavy-tailed. Finding the most independent directions is thus judged only by the shape of the 1D projection distributions. During the EA run, it can easily happen that the distribution in some direction has a more non-gaussian shape than the distribution in a direction which is really independent. This can mislead the EA that uses ICA as a preprocessing step.

ICA Features

The fact that ICA can be performed by searching for the most non-gaussian directions is appealing also from an empirical point of view. The most non-gaussian projections (i.e. spiky, multimodal, clustered, etc.) are the most interesting ones. This principle was also used in a statistical method for visualization of the most interesting views of the data—in projection pursuit by Friedman (1987).

In the ICA model, only one of the independent components can have a normal distribution. A greater number of gaussian variables would make the ICA model unidentifiable because all rotations of a D-dimensional gaussian random cloud of data points are in fact equivalent.

3.2.4 Toy Examples of ICA

The first toy example emphasizes the fundamental difference between PCA and ICA. As can be seen in Fig. 3.9, the first principal component (PC) discovered by PCA (upper left picture) is the direction with the greatest variance. Only the second PC (upper right picture) describes the structure of the data set, i.e. the presence of two data clusters. In ICA, the situation is completely different. The first independent component (IC) describes the main information hidden in the data set—the cluster structure (lower left picture).
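To make the negentropy approximations mentioned above concrete, the following toy sketch uses the simplest log-cosh contrast from Hyvärinen & Oja (2000), J(y) ≈ (E[G(y)] − E[G(ν)])², with G(u) = log cosh u and ν standard normal. The function name is mine, and E[G(ν)] (≈ 0.37) is estimated here by a Monte Carlo draw rather than taken as a tabulated constant.

```python
import math
import random

def negentropy_logcosh(sample, rng=random.Random(1), n_ref=100_000):
    """Rough 1D negentropy estimate via the log-cosh approximation:
    J(y) ~ (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh u."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n
    y = [(x - mean) / math.sqrt(var) for x in sample]   # zero mean, unit variance

    def g(u):
        return math.log(math.cosh(u))

    e_y = sum(g(v) for v in y) / n
    # Monte Carlo estimate of E[G(nu)] for a standard normal nu
    e_nu = sum(g(rng.gauss(0.0, 1.0)) for _ in range(n_ref)) / n_ref
    return (e_y - e_nu) ** 2
```

A two-cluster (bimodal) sample scores visibly higher than a gaussian sample of the same size, which is exactly the property the ICA search for non-gaussian directions exploits.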
Although it is interesting to see which projections are of the main interest for PCA and ICA, when modeling this particular toy data set by a probabilistic model with marginal histograms, it does not matter if we preprocess the data set with PCA or ICA. We use the full set of the discovered components, and their order does not matter. Both models should behave almost equally.

Figure 3.9: Comparison of the two principal components (PCA, first row) and the two independent components (ICA, second row) for a toy data set.

However, the situation is different for the second toy data set. These data resemble the population that can be seen in certain phases of evolution when optimizing the 2D Griewangk function (see Sec. A.4). The effects of using PCA and ICA can be seen in Fig. 3.10. In this case, the components are completely different and a bit unexpected. It is the PCA that actually discovers the most independent components, while the ICA (operating on the basis of maximizing the non-gaussianity of the projections) barely rotates the data at all. This is an example of a data set where the ICA fails to find the independent components and PCA gives the better result. This phenomenon can also be observed in the results of the experiments in Sec. 3.2.6 and is described in more detail in the next section.

Although it is not presented here, imagine the same data set without the cloud in the middle. Both PCA and ICA would return almost the same components as in the case when the middle cloud is present. In this case, however, the ICA would return really independent directions, while PCA would make the estimation of the probability density using marginal histograms very inaccurate.

3.2.5 Example Test of Independence

Let us use a data pattern similar to that seen in Fig. 3.10, i.e.
the population observed during the evolution of the 2D Griewangk function (the data points form five clusters in a pattern which evokes the number 5 on a die). One way of assessing the independence of the individual coordinates is to discretize them, form a kind of frequency table, and use the χ²-test of independence. For the sake of simplicity, each of the two variables is divided into 3 equal intervals. The 100 data points are uniformly divided into the 5 clusters. The observed frequency table is then depicted in Fig. 3.11 (left).

Figure 3.10: Principal and independent components for a data set similar to a population when evolving the 2D Griewangk function.

The frequency tables are matrices and their cells contain the numbers of data points which belong to the respective D-dimensional interval (2-dimensional in this example). Let the observed frequency table be O with the entries O_ij, i, j = 1, 2, 3. Let us further describe the marginal sums of rows as O_i,: = Σ_j O_ij (the last column of the tables in Fig. 3.11) and the marginal sums of columns as O_:,j = Σ_i O_ij (the last row of the tables in Fig. 3.11). The lower right entry of the table is N, the overall number of data points, i.e. the sum of all table entries.

    Observed O           Expected E
    20   0  20 |  40     16   8  16 |  40
     0  20   0 |  20      8   4   8 |  20
    20   0  20 |  40     16   8  16 |  40
    -----------+----     -----------+----
    40  20  40 | 100     40  20  40 | 100

Figure 3.11: Observed (left) and expected (right) frequency table

The χ²-test only compares the observed frequencies with the expected ones. In order to use this test as the test of independence, we have to construct the frequency table expected when the assumption of independence holds. Let the expected frequency table be E. Its entries (see Fig. 3.11, right) can be easily computed as
    E_ij = O_i,: O_:,j / N.    (3.13)

The test statistic Chi2 will then be a measure of how much the observed frequency table differs from the expected one, and we can define it as follows:

    Chi2 = Σ_ij (O_ij − E_ij)² / E_ij.    (3.14)

The Chi2 random variable has the χ² distribution. If the number of rows is I and the number of columns is J, the number of degrees of freedom for the χ² distribution is (I − 1)(J − 1), i.e. for this example it is equal to 4. The p-value of this test (computed as p = 1 − CDF_χ²(dof, Chi2), where CDF_χ²(dof, Chi2) is the value of the cumulative distribution function of the χ² distribution with dof degrees of freedom computed at the point Chi2) is the probability of observing this or a greater Chi2 if the assumption of independence holds. Thus, the p-value can be interpreted as a measure of independence of the two variables. If the p-value is close to zero, we can be pretty sure that the variables are not independent. If the p-value is close to 1, the variables can be considered independent (strictly speaking, we do not have enough evidence to prove their dependence).

Coming back to our example, computing the test statistic and the p-value results in Chi2 = 4·(20−16)²/16 + 4·(0−8)²/8 + (20−4)²/4 = 4 + 32 + 64 = 100 and p = 1 − CDF_χ²(4, 100) ≈ 0; thus we can say that, based on our finite sample from the distribution, there is almost no chance that the variables are independent.

Now, we try to use the same test for the data rotated by the outcome of the PCA analysis. In that case, the frequency tables are depicted in Fig. 3.12.

    Observed O           Expected E
     0  20   0 |  20      4  12   4 |  20
    20  20  20 |  60     12  36  12 |  60
     0  20   0 |  20      4  12   4 |  20
    -----------+----     -----------+----
    20  60  20 | 100     20  60  20 | 100

Figure 3.12: Observed (left) and expected (right) frequency table for data rotated by PCA.

In this case, the results of the test are Chi2 = 4·(0−4)²/4 + 4·(20−12)²/12 + (20−36)²/36 = 16 + 21.33 + 7.11 = 44.44 and p = 1 − CDF_χ²(4, 44.44) = 5.2 × 10⁻⁹.
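Both worked computations above can be reproduced with a short script. This is a sketch with my own function names; the closed-form survival function used here is valid only for even numbers of degrees of freedom, which suffices for the 3 × 3 tables of this example.

```python
import math

def chi2_sf(x, dof):
    """Survival function 1 - CDF of the chi-square distribution;
    closed form exp(-x/2) * sum_{i < dof/2} (x/2)^i / i! for even dof."""
    assert dof % 2 == 0
    h = x / 2.0
    return math.exp(-h) * sum(h ** i / math.factorial(i) for i in range(dof // 2))

def chi2_independence(obs):
    """Chi-square test of independence on an observed frequency table,
    following Eqs. (3.13) and (3.14)."""
    rows = [sum(r) for r in obs]                   # marginal sums O_i,:
    cols = [sum(c) for c in zip(*obs)]             # marginal sums O_:,j
    n = sum(rows)
    chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(obs)) for j in range(len(obs[0])))
    dof = (len(obs) - 1) * (len(obs[0]) - 1)
    return chi2, chi2_sf(chi2, dof)
```

For the unrotated table this yields Chi2 = 100 with p ≈ 0; for the PCA-rotated table, Chi2 ≈ 44.44 with p ≈ 5.2 × 10⁻⁹, matching the values computed above.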
Even in this case, we can hardly describe the variables as independent in an absolute sense, but we can state that they are 'more independent' than in the previous case (unrotated, or rotated by ICA).

3.2.6 Empirical Comparison

The settings of the whole full-factorial experiment are surveyed in Table 3.4.

Table 3.4: Settings of individual factors of the experiment comparing PCA and ICA population preprocessing

  Factor                 Levels
  ---------------------  -------------------------------------------
  Test Function          2D and 20D Two Peaks, 2D and 10D Griewangk
  Probability Model      HEH
  Transform              None, PCA, ICA
  Number of Components   120 for Two Peaks, 100 for Griewangk
  Population Size        20, 50, 100, 200, 400, 600, 800

This experiment demonstrates the influence of population preprocessing using PCA and ICA. The 2- and 10-dimensional Griewangk function together with the 2- and 20-dimensional Two Peaks function were selected for this demonstration. For each factorial setting, 20 runs were carried out; each was allowed to run for 50,000 evaluations.

Monitored Statistics

During the experiment, several measures of the algorithm efficiency were tracked:

• BSF (Best-so-far fitness). The average fitness of the best individual after 50,000 evaluations over all 20 runs.

• StdevBSF. The standard deviation of the best fitness after 50,000 evaluations over all 20 runs.

• Found0.1 (Found0.01, Found0.001). As the evolution progresses, it is checked how many times (out of the 20 runs) the best solution is in the '0.1 neighborhood' (0.01, 0.001, respectively) of the global optimum. E.g., for the 0.1 neighborhood, the condition |x_d^BSF − x_d^OPT| < 0.1 must hold for all d.

• WhenFound0.1 (WhenFound0.01, WhenFound0.001). The average number of evaluations needed to get to the 0.1 neighborhood (0.01, 0.001, respectively), computed only from the runs in which the algorithm succeeded in getting that close to the global optimum.
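The neighborhood condition used by the FoundX statistics is just a per-coordinate check, which can be expressed as a one-line helper (illustrative naming):

```python
def in_neighborhood(x_best, x_opt, eps):
    """True iff every coordinate of the best solution lies within
    eps of the corresponding coordinate of the optimum."""
    return all(abs(b - o) < eps for b, o in zip(x_best, x_opt))
```

Note that this is a box-shaped (max-norm) neighborhood, not a Euclidean ball, so the condition gets harder to satisfy as the dimensionality grows.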
In the results tables, the statistics are organized in cells with the following layout:

    BSF (StdevBSF)
    Found0.1       WhenFound0.1
    Found0.01      WhenFound0.01
    Found0.001     WhenFound0.001

Figure 3.13: Layout of monitored statistics in the results tables.

Evolutionary Model

For the comparison, the UMDA with marginal histogram models was used. As stated in Sec. 3.1, this kind of algorithm assumes the individual variables to be statistically independent of each other. The first set of experiments was carried out using the evolutionary model described in Sec. 3.1.3. The second and third sets of experiments use the PCA and ICA preprocessing, respectively (the evolutionary model is shown as Alg. 3.2). The only difference is in steps 4 and 7, which perform the forward and inverse transforms; otherwise both evolutionary models are identical.

Results and Discussion

Table 3.5 shows the effect of the PCA preprocessing when optimizing the Griewangk function. We can see that for the 2D version of the function, the algorithm with PCA is better in the quality of the final solution, in the speed of finding the solution, and even in the reliability of finding the global solution. The reason for this superior behavior is that the PCA makes the individual features of the chromosomes uncorrelated (and, in this case, the most independent) so that the marginal models make smaller errors when estimating the joint probability density. The ICA, on the other hand, is often misled by the population (see Fig. 3.10) and gives only slightly better solutions (but more often) than the UMDA without any preprocessing. In the case of the 10D Griewangk function, the population has a very different shape and the advantages of ICA become apparent. UMDA with ICA preprocessing is more precise and more reliable than UMDA with PCA.
Algorithm 3.2: EDA with PCA or ICA preprocessing
1   begin
2     Initialize and evaluate the population
3     while not TerminationCondition do
4       Based on the current population, perform the PCA or ICA of the parents
5       Model the transformed parents by marginal histograms
6       Sample N new offspring from the model
7       Transform the new offspring to the original space using the inverse PCA or ICA transform
8       Evaluate them
9       Join the old and the new population to get a data pool of size 2N
10      Use the truncation selection to select the better half of the data points (returning the population size back to N)
11  end

Table 3.5: Influence of PCA and ICA preprocessing when optimizing the Griewangk function. The shaded cells are the best performers. For the 2D and 10D Griewangk functions and each of the Hist, PCA-Hist, and ICA-Hist algorithms, the table reports the statistics of Fig. 3.13 for population sizes 20, 50, 100, 200, 400, 600, and 800.
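Algorithm 3.2 can be sketched end-to-end as follows. This is a toy NumPy sketch, not the thesis implementation: only the PCA variant is shown, and the marginal model is a simple equi-height histogram per rotated coordinate.

```python
import numpy as np

def eda_pca(f, D, N, n_gen, n_bins=20, seed=0):
    """Toy UMDA with PCA preprocessing (cf. Alg. 3.2): rotate the
    population, model each rotated coordinate by an equi-height
    histogram, sample, rotate back, and apply truncation selection."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5.0, 5.0, size=(N, D))      # rows are individuals
    fit = np.apply_along_axis(f, 1, pop)
    for _ in range(n_gen):
        mu = pop.mean(axis=0)
        C = np.atleast_2d(np.cov((pop - mu).T))
        _, V = np.linalg.eigh(C)
        Y = (pop - mu) @ V                         # step 4: forward PCA transform
        new_y = np.empty_like(Y)
        for d in range(D):                         # steps 5-6: model and sample
            edges = np.quantile(Y[:, d], np.linspace(0.0, 1.0, n_bins + 1))
            bins = rng.integers(n_bins, size=N)    # equi-height: bins equally likely
            new_y[:, d] = rng.uniform(edges[bins], edges[bins + 1])
        off = new_y @ V.T + mu                     # step 7: inverse transform
        off_fit = np.apply_along_axis(f, 1, off)   # step 8: evaluate
        pool = np.vstack([pop, off])               # step 9: data pool of size 2N
        pool_fit = np.concatenate([fit, off_fit])
        keep = np.argsort(pool_fit)[:N]            # step 10: truncation selection
        pop, fit = pool[keep], pool_fit[keep]
    return pop[0], fit[0]
```

Thanks to the elitist truncation replacement, the best fitness in this sketch decreases monotonically; on a simple 2D sphere function the population quickly collapses around the optimum.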
For other functions, the preprocessing can have different influences. As an example, the results of the same algorithms optimizing the 2D and 20D Two Peaks function are presented in Tab. 3.6. In this case, UMDA without any preprocessing is the best solver because it assumes independence, which actually holds here. PCA and ICA rotate the coordinate system, and in this case they can only make the problem harder for UMDA to solve. UMDA using the PCA is thus much worse overall than UMDA without it. The PCA preprocessing has no chance to be successful in this case—in fact, the PCA actually prevents the algorithm from finding good enough solutions. The 20D Two Peaks function is very hard to optimize if we do not assume the independence of the individual variables. Even for ICA, it is very hard to estimate the right structure of the problem from a finite sample (the population). The UMDA with ICA preprocessing does not achieve results as good as UMDA without it, but it outperforms the UMDA with the PCA. This suggests that the rotational structure discovered by ICA is a better fit for the UMDA than the one discovered by PCA.
[Table 3.6: Influence of PCA and ICA preprocessing when optimizing the Two Peaks function (Hist, PCA-Hist, and ICA-Hist on the 2D and 20D Two Peaks function for population sizes 50 to 800).]

3.2.7 Comparison with Other Algorithms

We can expect that the efficiency of the local optimizers will not be affected by any rotational transform. In fact, many local optimizers for the real domain incorporate their own on-line version of PCA. The Two Peaks function remains hard for them, but they should be faster than the UMDAs when solving the Griewangk function.
The line search heuristic should be influenced by population preprocessing in a similar, or even more severe, way than the UMDAs. When used with PCA or ICA, it would only seldom be able to successfully solve either of the two problems. The use of rotational transforms on the phenotype level would make GAs rotationally invariant (similarly to any other EA). Their performance would probably be worse than in the unrotated case—an inaccurately estimated rotation would break even the separability of groups of bits.

3.2.8 Summary

The second part of this chapter suggests a solution to the second aim of this thesis—it provides a way of adapting the search algorithms to situations when there are linear dependencies among the object variables. Generally speaking, it is the independence of the individual variables which plays the major role in the EA efficiency if we use the UMDA. From the experimental results, it is obvious that we need a tool which evaluates the quality of the transforms suggested by the PCA and ICA and enables us to choose the better one (or to choose neither of them). This role might be played by tests of independence. If we wanted to test the independence on the current data set directly using Definition 3.11, it would require building some approximations of the PDFs. There would be many problems related to this approach, e.g. choosing the right form of a PDF model is an intricate issue (selecting a wrong model can make the results unusable). The mutual information can also be used as a measure of independence: the lower the mutual information, the weaker the dependence among the variables. Similar features characterize the Kullback-Leibler (KL) divergence, which is very often used as a measure of difference between two distributions. It can be used to measure the dependency among variables if we measure the KL divergence of the joint density and the product of marginal densities.
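The mutual information just described, i.e. the KL divergence between the joint density and the product of the marginals, can be approximated from a finite sample by discretizing the data. A minimal numpy sketch (the bin count and sample sizes are arbitrary illustrative choices, not values from the thesis):

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of I(X;Y) = KL(p(x,y) || p(x)p(y)) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()               # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)     # marginal of X
    py = pxy.sum(axis=0, keepdims=True)     # marginal of Y
    nz = pxy > 0                            # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = rng.normal(size=5000)               # independent of a
c = a + 0.1 * rng.normal(size=5000)     # strongly dependent on a
assert mutual_information(a, b) < mutual_information(a, c)
```

For independent variables the estimate is close to zero (up to a positive bias of roughly (bins − 1)²/2N nats), while for dependent variables it is large, which is exactly the ordering a transform-selection mechanism would need.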
However, the main problem still remains—we need to estimate the densities. Some authors have used polynomial density expansions (Comon (1994), Amari et al. (1996)), but the estimates are not very precise if the distributions differ greatly from the normal one. It is also possible to use a method similar to that presented in Sec. 3.2.5: discretize the domain of each variable, treat the data points as measured on a categorical scale, and then use the χ2-test of independence for frequency tables. Of course, this test is only an approximation, but its value lies mainly in the fact that it is relatively fast, uncomplicated, and easily generalized to more dimensions. However, this approach has several practical limitations. First, one of the assumptions of the χ2-test is that all entries of the expected frequency table should be at least 5. An increasing number of dimensions (possibly together with the number of bins in each dimension) very quickly leads to a situation in which we simply do not have enough data to perform the test. Second, the multivariate test of independence answers the question 'Are all the variables independent of each other?', i.e. it gives no indication of situations in which the variables form two or more groups such that the variables within a group are dependent while the groups themselves are mutually independent. This information would be of great value because it would allow for the factorization of the joint PDF, i.e. it would allow us to efficiently fight the curse of dimensionality. For the time being, I do not offer any precise and efficient method for choosing the best way of population preprocessing—it remains an open issue for further research. An empirical rule of thumb: if time demands are not an issue, try all variants—without preprocessing, with PCA, and with ICA. If a scalable optimizer is needed, do not preprocess the population.
If the optimization algorithm has to be rotationally invariant, use the ICA preprocessing—in the general case, it has better chances of success than PCA (however, it is also more time demanding).

Advantages of UMDA with Linear Population Preprocessing

Simplicity Still a relatively simple approach.

Higher flexibility The model can encompass a larger class of probability distributions; it reduces linear dependencies.

Rotational invariance All the rotated instances of the same problem should be solved, on average, in similar time.

Limitations of UMDA with Linear Population Preprocessing

Higher time and space demands The transforms induce an additional computational burden.

Worse scalability The demands of the transforms increase more rapidly with the dimensionality of the problem than those of the pure UMDA.

The practical applicability of UMDA with population preprocessing depends on the complexity of the fitness evaluation: if the individuals are evaluated quickly, we do not want to wait for the preprocessing; if the evaluation of one individual takes several minutes, we can afford to spend some time on it.

Chapter 4

Focusing the Search

The previous chapter, Envisioning the Structure, described methods with strong structural assumptions—they assumed that the individual variables are independent, or that they can be made independent by removing the dependencies using linear coordinate transforms. If this assumption was satisfied, the majority of the probability models were also able to focus the search to the area where promising solutions lay (the only exception being the model based on equi-width histograms). This chapter presents two models with weaker structural assumptions, i.e. two EAs that are able to find and use some (even non-linear) dependencies among variables. This also implies that they generally need larger populations to estimate the dependencies correctly.
4.1 EDA with Distribution Tree Model

The distribution tree (DiT¹) is an attempt to construct a model which would be

• flexible enough to cover some non-linear interactions between variables,
• generative, so that new population members can be created in accordance with the distribution encoded by the model, and
• still simple and fast to build.

It is based mainly on the Classification and Regression Trees (CART) framework introduced by Breiman et al. (1984), but it differs from CART in several important aspects. The primary objective of DiT is not to provide a model for classifying or predicting new, previously unseen data points, but rather to generate new individuals so that their distribution in the search space is very similar to that of the original data points. When presented with the training data set (the population members selected for mating), the distribution tree is built by recursively searching for the 'best' axis-parallel split. The leaf nodes of the DiT represent a partitioning of the search space. Each partition is represented by a multidimensional hyper-rectangle whose edges are aligned with the coordinate axes. Each leaf node contains a certain number of individuals; based on it, the number of offspring for the respective node is calculated. The model can thus be considered a kind of multidimensional histogram. Two examples of the DiT partitioning of the search space can be seen in Fig. 4.1.

¹ The abbreviation 'DiT' is used here for the distribution tree model in order to prevent confusing it with the abbreviation 'DT', commonly used for decision trees.

[Figure 4.1: DiT partitioning of the search space for the Griewangk function (left) and for the Rosenbrock function (right)]

When compared to other structure-building algorithms, e.g.
the MBOA of Očenášek & Schwarz (2002), the DiT is a much simpler model. The tree here is a kind of temporary structure; after the model is created, we can forget the tree. Only the leaf nodes are important because they form a mixture of uniform distributions. The structure in MBOA plays a much more important role, as it captures the dependency structure of the individual variables. The DiT algorithm is able to cover some kinds of interactions, but this information is hidden solely in the leaves.

4.1.1 Growing the Distribution Tree

This subsection gives a more detailed description of the algorithm used to grow the DiT. As already stated, it is basically the CART algorithm with different split criteria. Let us denote the population as X = {x1, x2, ..., xN}, where N is the population size, xn = (xn1, xn2, ..., xnD) is the n-th individual, n = 1, ..., N, and D is the dimensionality of the search space. Further, there are three parameters of the algorithm: minToSplit, the minimal node size² allowed to be split; minInNode, the minimal number of data points that must remain in the left and right parts of the node after splitting; and maxPValue, the maximal p-value for which we consider a split statistically significant. Furthermore, we have a matrix consisting of the minimal and maximal values for each dimension (the box constraints of the search space). The tree building algorithm is very simple (see Alg. 4.1). First, the set of individuals belonging to the node being processed is determined (in the beginning, all individuals belong to the root node). Then, it is checked whether the node has enough data points to be split (by comparing the node size with the minToSplit parameter). If there are not enough individuals, the node processing is stopped. Otherwise, the best split point is found.
If the split fulfills all the conditions (it is statistically significant enough, and the sizes of the potential left and right nodes are going to be at least minInNode), the node data points are divided into two sets belonging to the left and the right node, respectively. Finally, based on these two subsets of individuals, the procedure is recursively applied to create the left and right subtrees.

² To prevent misunderstanding, let me point out that 'size of node' or 'node size' refers to the number of data points which belong to the respective node. When describing the 'physical size' of a node in the DiT, I use the term 'node volume'.

Algorithm 4.1: Function SplitNode
1   begin
2       if NumberOfPointsInCurrentNode() < minToSplit then
3           exit
4       split ← FindBestSplit(minInNode)
5       if GoodEnoughSplit(split.Chi2, maxPValue) then
6           node.leftSubtree ← SplitNode(split.left)
7           node.rightSubtree ← SplitNode(split.right)
8       return node
9   end

Searching for the Best Split

The heart of this procedure is the way the search for the best split is performed. A split can be placed between each pair of successive data points in each dimension; if we have N D-dimensional data points in the node, we have D × (N − 1) possible splits. Each of these candidate splits is evaluated via hypothesis testing. The null hypothesis is the assumption that in each leaf node the data points have an approximately uniform distribution. This assumption is tested against the actually observed situation via the χ2-test. Eventually, the best split is selected, provided that for this split the test says 'there is only a negligibly small probability of no difference between the uniform and the observed distribution'. Otherwise, we have no reason to split the node—even if its size is large.
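The evaluation of a single candidate split can be sketched as follows. This is an illustrative reimplementation under the uniform-null assumption described above, with `scipy.stats.chi2` supplying the survival function; the example data are made up:

```python
import numpy as np
from scipy.stats import chi2

def split_p_value(values, cut, lo, hi):
    """p-value of one candidate split of a node along one dimension.

    values: coordinates of the node's points in that dimension,
    cut:    candidate split position, lo/hi: node limits in that dimension.
    Null hypothesis: the points are uniformly distributed in the node.
    """
    n = len(values)
    n_left_obs = int(np.sum(values < cut))
    n_right_obs = n - n_left_obs
    n_left_exp = n * (cut - lo) / (hi - lo)   # proportional to subnode volume
    n_right_exp = n - n_left_exp
    stat = ((n_left_exp - n_left_obs) ** 2 / n_left_exp
            + (n_right_exp - n_right_obs) ** 2 / n_right_exp)
    return chi2.sf(stat, df=1)                # chi-square test, 1 degree of freedom

# Points crowded into the left half of [0, 1]: the middle split is significant.
pts = np.concatenate([np.full(40, 0.2), np.full(10, 0.8)])
print(split_p_value(pts, 0.5, 0.0, 1.0))   # far below a maxPValue of 0.001
```

FindBestSplit would call this for every candidate cut in every dimension and keep the split with the lowest p-value, realizing it only if that p-value is below maxPValue.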
Algorithm 4.2: Function FindBestSplit
1   begin
2       for all dimensions do
3           Build a list of candidate split points
4           for all candidate split points do
5               Determine expected frequencies in the left and right subnode
6               Determine observed frequencies in the left and right subnode
7               Compute the test statistic Chi2
8               Perform the χ2-test, determine the p-value
9               if the current split is better than the best split found so far then
10                  Remember it
11      return the best split found
12  end

Performing the χ2-test

For each particular candidate split, the observed sizes of the left and the right subnode are simply the numbers of points which belong to them; let us denote them NLobs and NRobs. For both subnodes we can also directly compute the expected sizes, NLexp and NRexp, knowing the size of the node as a whole and the relative volumes of the left and right subnodes. The χ2-test allows us to compare the observed sizes with the expected ones. First, we compute the test statistic

Chi2 = (NLexp − NLobs)² / NLexp + (NRexp − NRobs)² / NRexp.  (4.1)

The random variable Chi2 has the χ2 distribution with 1 degree of freedom. It describes to what extent the observed and expected frequencies differ from each other. Thus, we can compute the p-value of the test (the probability of observing this or a greater Chi2, assuming the uniform distribution in the node) as 1 minus the value of the cumulative distribution function of the χ2 distribution with 1 degree of freedom at the point Chi2:

p = 1 − CDFχ2(1, Chi2).  (4.2)

This way we can select the split that gives us the highest discrimination from the uniform distribution with a sufficiently high probability, i.e. among all candidate splits, the one with the lowest p-value is selected, and it is realized if its p-value is lower than maxPValue.

4.1.2 Sampling from the Distribution Tree

The process of creating new individuals, i.e.
sampling from the distribution tree, is straightforward. The leaf nodes of the tree fully cover the whole search space. For each node, its limits are known, as well as its number of offspring. We can thus apply the SUS described in Sec. 3.1.2. Simply stated, in each node the number of offspring generated using a D-dimensional uniform number generator is the same as the number of parents belonging to the node. Since an empty leaf node is not possible, this ensures that the search will not stop in any area of the search space.

4.1.3 Empirical Comparison

The experiments carried out compare the DiT evolutionary algorithm to several other evolutionary techniques. The same evolutionary model was used for all experiments; it is described in Sec. 3.1.3. For the performance evaluation of the EA using the DiT, several reference optimization problems were used: the 2- and 20-dimensional Two Peaks function (Sec. A.3), the 2- and 10-dimensional Griewangk function (Sec. A.4), and finally the 2- and 10-dimensional Rosenbrock function (Sec. A.5).

Involved Evolutionary Techniques

The distribution tree evolutionary algorithm (DiT-EA) is compared to the genetic algorithm (with low and high resolution), to the histogram UMDA, to random search, and to the line search heuristic. The settings of all algorithms are summarized in Tab. 4.1. A short description of each of the compared algorithms follows:

Random Search Sampling individuals one by one with uniform distribution over the whole search space—that is simple random search. It is the most naive way to solve a given task; in other words, it is the 'trial and error' approach without any learning. Although it is the most naive way, it is not always the worst one, as the experiments showed.

Table 4.1: Summary of settings of individual algorithms involved in the study

Algorithm   Setting             Two Peaks     Griewangk     Rosenbrock
All EA      Model               Steady-state
            Selection           Truncation
            No. of Evaluations  50,000
LS          Step                0.01
GA-low      Crossover           Uniform
            Crossover Rate      100%
            Resolution          14 bits       14 bits       12 bits
GA-high     Crossover           Uniform
            Crossover Rate      100%
            Resolution          56 bits       56 bits       55 bits
Hist        Histogram type      Equi-height
            No. of Bins         120           100           41
DiT         Min to Split        5
            Min in Node         3
            Max p-value         0.001

Line Search In order to give some indication of how complex the test problems are, they were also optimized with the so-called line search (LS) method (see Whitley et al. (1996)). This method is not intended to be a universal optimization algorithm; rather, it is a heuristic which is very effective for separable problems, and thus it can give us some insight into the complexity of the evaluation function. Its principle is very simple: start from a randomly generated data point, randomly select a dimension, discretize and enumerate it, move to the best data point in the one-dimensional enumerated space, and repeat for another, yet unused dimension. The discretization step was set to 0.01. This means that for the Two Peaks function the LS algorithm evaluates 1201 data points in each dimension. If the LS gets stuck (no further improvement is possible), it is restarted from another randomly generated point.

Genetic Algorithms It is very hard to compare the GA to an EDA in the continuous domain. While the EDA searches the space of real numbers (or rational numbers when evolving in silico), the GA evolves binary strings and searches a discretized space. The discretization, i.e. the resolution, must be specified by the user in advance. This way the user can deteriorate the GA performance with an insufficient precision setting.
In the experiments, two GAs were used—one with low resolution (allowing the GA to find a solution whose distance from the global optimum is not larger than 0.001), and one with high resolution (allowing the GA to find a solution whose distance from the global optimum is not larger than 2.22e−16, which is the distance from 1.0 to the next larger floating point number in MATLAB). Similarly to the other algorithms, the truncation selection was used. It is not the best setting for a GA because this selection type has a relatively high selection pressure and can result in premature convergence. Uniform crossover was used for the GA in all experiments. In EDAs, the model building and sampling phase of the evolution can be described as a generalized crossover operation. Since mutation was not used in the other types of algorithms, I did not use it for the GAs either.

UMDA with Equi-height Histogram The univariate marginal distribution algorithm (UMDA) with the equi-height histogram model was described in Sec. 3.1. Similarly to the line search heuristic, the histogram UMDA should be able to solve separable problems very efficiently, but it should run into trouble when solving problems with a higher degree of dependency between variables. The number of bins for each histogram is set in such a way that if all the bins had equal width, none of them would be larger than 0.1.

Distribution Tree Evolutionary Algorithm The distribution tree, its construction and sampling, is described in Section 4.1. The only parameters to be set are the minimal number of individuals that must remain in each node (minInNode, set to 3), the minimal node size allowing a split (minToSplit, set to 5), and the maximal p-value for which we consider a split statistically significant (maxPValue, set to 0.001).

Results and Discussion

The statistics tracked during the evolution are described in Sec. 3.2.6.
The overall results for all problems are shown in Table 4.2. For the population based algorithms, experiments with population sizes of 20, 50, 100, 200, 400, 600, and 800 were carried out. Each experiment was repeated 20 times, and the average best fitness and its standard deviation over all 20 runs were measured. The population size which resulted in the best average score after 50,000 evaluations was selected to be included in the results table. The results for the 2D Two Peaks function are presented in Fig. 4.2. We can see that the GA with the low resolution is not able to overcome the limitation of the insufficient bit string length. It quickly finds the optimum, but only within its own small search space. The GA with the high resolution solves the problem much better. The histogram UMDA is very efficient at solving this task (already for small populations)—the Two Peaks function is separable. The behavior of the DiT-EA is comparable to the GA with the high resolution, but it can reach more precise results. The results for the 2D Griewangk function are shown in Fig. 4.3. We can clearly see the difference between the Two Peaks and the Griewangk function. While the histogram UMDA was the best solver in the case of the separable Two Peaks function, it completely fails to solve the Griewangk function, which is non-separable. Neither of the GAs solves this function particularly well. The DiT-EA is clearly the winner for this function. In Fig. 4.4, the behavior of the tested algorithms on the Rosenbrock function is depicted. The histogram UMDA failed to solve this task because it is non-separable. The GA with the high resolution is the best algorithm for this function. The DiT-EA and the GA with the low resolution obtained similar results. For this function, the DiT-EA suffers from the fact that the valley of the Rosenbrock function is not aligned with the coordinate axes. Axis-parallel splits made by the DiT cannot cover this kind of interaction sufficiently well.
[Table 4.2: Results of experiments on all test functions (Random, Line Search, GA low res., GA high res., Hist UMDA, and DiT EA on the 2D and 20D Two Peaks, 2D and 10D Griewangk, and 2D and 10D Rosenbrock functions). The layout of the individual statistics in each table cell is described in Fig. 3.13; the number presented in the last line of each table cell is the population size which produced the best average results.]

However, let me point out that
even the GA with the high resolution did not find a solution of such quality when used with lower population sizes. For population sizes lower than 800, it found solutions comparable to those of the DiT-EA and the GA with the lower resolution.

[Figure 4.2: 2D Two Peaks function—Comparison of DiT-EA against other algorithms]

[Figure 4.3: 2D Griewangk function—Comparison of DiT-EA against other algorithms]

In Fig. 4.5 we can see the evolution on the 20-dimensional Two Peaks function. Both GAs exhibit almost the same behavior independently of the resolution. Again, since this is a separable function, the histogram UMDA is able to solve it very quickly and precisely. The DiT-EA failed to solve this problem and its performance is only slightly better than that of random search. Figure 4.6 shows the efficiency of all algorithms when solving the 10-dimensional Griewangk function. Both GAs behave similarly (as in the case of the 20D Two Peaks function) and did not discover a good solution. Overcoming the curse of dimensionality with its 'univariate look on the world', the histogram UMDA was able to find solutions similar to those of the DiT-EA; they are the winners in this case. In Fig. 4.7 we can see that the way in which the 10-dimensional Rosenbrock function is constructed still ensures a great amount of separability—it is separable by pairs. The histogram UMDA can take advantage of this fact and exhibits the best behavior among all algorithms. The DiT-EA is only slightly better than both GAs, which (again) behave almost identically independently of the resolution.
None of the presented algorithms, however, was able to adapt the distribution to the curved valley of the Rosenbrock function. As expected, the Hist-UMDA was very efficient when solving the separable problems, i.e. the 2D and 20D Two Peaks functions, as well as the 10D Rosenbrock function. Although the variables in the basic 2D Rosenbrock function are not independent, due to the way the 10D version was constructed, each pair of variables in the 10D Rosenbrock function is independent of the rest.

[Figure 4.4: 2D Rosenbrock function—Comparison of DiT-EA against other algorithms]

[Figure 4.5: 20D Two Peaks function—Comparison of DiT-EA against other algorithms]

[Figure 4.6: 10D Griewangk function—Comparison of DiT-EA against other algorithms]

[Figure 4.7: 10D Rosenbrock function—Comparison of DiT-EA against other algorithms]

The DiT-EA was the best competitor when solving both variants of the Griewangk function (see Fig.
4.3 for an illustration), although the amount of dependencies among the variables is greatly reduced for the 10D version when compared to the 2D version. In both cases it was able to find a solution of significantly better quality. In the majority of the remaining test problems, its efficiency was better than or comparable to that of the GAs. The 20D Two Peaks function is the only exception (see Fig. 4.5): the DiT-EA was able to find solutions only slightly better than those obtained by the random search. I hypothesize that this poor behavior is caused by the great number of local optima of this function (to be precise, the 20D Two Peaks function has 2^20 local optima). To capture the problem structure, the DiT-EA would need a huge population; hundreds of individuals allow the algorithm to make only a limited number of leaf nodes with 'almost the same' density. For the 2D functions, the GA-high found better solutions than the GA-low; the GA with the low resolution suffered from the insufficient bit string length. For the multidimensional functions, however, the difference between the two GAs vanished and they performed almost identically along the whole evolution when compared by a human eye.

4.1.4 Summary

This section of the thesis described and evaluated a new and original³ model of probability distribution—the distribution tree. The model is an attempt to solve the third aim of the thesis—to create a probabilistic model able to focus the search to promising areas of the search space in situations when the problem has higher-order interactions among variables. The results of the EA empowered with the DiT model are of mixed quality, yet they are very promising for certain kinds of optimization problems. The way the DiT is constructed suggests that the DiT-EA should be very efficient when solving problems with several well-distinguished local optima of a 'symmetric' shape (rectangular, spherical, . . . ), given that the population size is sufficient to reveal them.
The population members after selection then form clusters in the search space and the DiT-EA is able to identify them and to use this information during the evolution (the DiT model tries to identify rectangular areas of the search space with significantly different densities). This hypothesis is supported by the results observed for the Griewangk function. Furthermore, the DiT is a model built on the basis of rigorous statistical testing, which is rather uncommon in the field of EAs. The experiments also showed that the members of the EDA class of algorithms (the histogram UMDA and the DiT-EA) outperformed the GAs in almost all cases: they were able to find more precise solutions in less time. An open area for future work stems from the fact that the model-building algorithm generalizes very simply to discrete variables; only the procedure for creating the set of candidate splits would have to be changed.

Advantages of DiT-EA

Simplicity: The building procedure is based on the well-known CART algorithm.

Non-parametric model: The DiT is non-parametric in the sense that it does not estimate parameters of any theoretical distribution—it is driven mainly by the data.

Clustering capable: The DiT model can identify and use a certain kind of clusters in the population—clusters which are recognizable by univariate views of the population. Thus, the interactions represented by such clusters can be covered by this model.

Limitations of DiT-EA

Axis-parallel splits only: Creating only axis-parallel splits results in a limited ability to search the promising areas of the space if they look e.g. like valleys which are not parallel to the coordinate axes. This drawback can be reduced by employing some coordinate transform in each node. The PCA or ICA presented in Sec. 3.2 seem to be natural candidates for making the DiT model rotationally invariant and capable of oblique splits.
This possibility, however, remains an open topic for future work and is not pursued in this thesis.

Lack of decomposition ability: The DiT model is not able to decompose the given problem into several independent subproblems, which is exemplified by the results for the Two Peaks function.

What's Next? The next section presents a model able to describe even elongated clusters not aligned with the coordinate axes, i.e. a situation which the DiT model cannot describe successfully.

4.2 Kernel PCA

In this section, another model capable of covering a very general kind of interactions is presented. As the name suggests, kernel principal component analysis (KPCA) builds on the PCA method (described in Sec. 3.2.1) and preserves several of its important characteristics. Furthermore, it uses techniques from the kernel methods which have become very popular in recent years.

4.2.1 PCA in terms of dot products

As stated earlier, the PCA usually amounts to performing the eigendecomposition of the data sample covariance matrix C. However, it can be shown (see e.g. Schölkopf et al. (1996), Moerland (2000)) that after performing the eigendecomposition of the dot-product matrix of size N × N computed as

K = (1/N) X^T X,   (4.3)

we arrive at the same non-zero eigenvalues and corresponding eigenvectors. Thus, we can express the whole PCA in terms of dot products.

The kernel methods gained their popularity mainly due to the simplicity with which they allow a broad range of originally linear methods to become non-linear: we can make non-linear any linear algorithm which is expressed in terms of dot products. Suppose we have a function Φ : R^D → F which represents a non-linear mapping from the input space to the so-called feature space.
Then, we can define a function k : R^D × R^D → R as a dot product of two data points transformed to the feature space:

k(x^i, x^j) = ⟨Φ(x^i), Φ(x^j)⟩   (4.4)

If we compute a so-called kernel matrix K so that the individual elements of the matrix are K_ij = k(x^i, x^j), we have a dot-product matrix of the data points transformed to the feature space. We can now perform the linear PCA in the feature space via the eigendecomposition of the kernel matrix; this whole process is called kernel PCA.

Suppose now we want to use the KPCA for feature extraction, i.e. for an input point x we want to find the projection of the image Φ(x) onto the principal axes of the feature space F. The i-th element y_i of the image y (the i-th non-linear feature of x) can be computed as a projection of the image Φ(x) onto the i-th eigenvector of the kernel matrix K, i.e.

f_i(x) = y_i = ⟨v^i, Φ(x)⟩ = Σ_{j=1}^{N} v^i_j k(x^j, x),   (4.5)

where v^i = (v^i_1, v^i_2, . . . , v^i_N) is the i-th normalized eigenvector of the kernel matrix K.

The best thing about the kernel methods is that we need not know the mapping function Φ. It can be shown that if the function k satisfies certain conditions (Mercer's conditions, see e.g. Schölkopf & Smola (2002)), it corresponds to a dot product in some non-linearly mapped feature space F. Thus, we can compute the KPCA just by selecting any valid kernel, without explicitly prescribing the mapping Φ. In the literature, the most often used kernels are the polynomial kernel, the radial basis function (RBF) kernel and the sigmoidal kernel. In the rest of this section, the RBF kernel of the form

k(x^i, x^j) = exp( −‖x^i − x^j‖² / (2σ²) )   (4.6)

is used.

4.2.2 The Pre-Image Problem

Equation 4.5 provides a way to carry out the projection of input data points onto the non-linear principal components of the feature space.
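The feature extraction of Eq. 4.5 is straightforward to prototype. Below is a minimal numpy sketch (function names are mine; centering of the data in the feature space is omitted for brevity, and the eigenvector normalization follows the common convention of dividing a unit-norm eigenvector of K by √λ_i):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)); rows are data points (Eq. 4.6)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_features(X, x_new, n_comp=2, sigma=1.0):
    """Project the rows of x_new onto the first n_comp non-linear components (Eq. 4.5)."""
    K = rbf_kernel(X, X, sigma)
    lam, A = np.linalg.eigh(K)               # eigenvalues in ascending order
    lam, A = lam[::-1][:n_comp], A[:, ::-1][:, :n_comp]
    V = A / np.sqrt(lam)                     # normalized eigenvectors v^i
    return rbf_kernel(x_new, X, sigma) @ V   # y_i = sum_j v^i_j k(x^j, x)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))             # training points
Y = kpca_features(X, X[:5], n_comp=2)
print(Y.shape)                               # (5, 2)
```

For a Mercer kernel such as the RBF, the matrix K is symmetric positive semi-definite, so the retained eigenvalues are non-negative and the square root above is well defined.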
To be able to use the KPCA as a generative model inside an EDA (or as a generalized crossover operator), we need to address the inverse transform as well, i.e. we should be able to find an unknown pre-image x, x ∈ R^D, of a given image y, y ∈ F, so that y = Φ(x). In general, this is a rather intricate issue. The mapping Φ induced by the selected kernel k usually maps the low-dimensional input data points to a high-dimensional (possibly even infinite-dimensional) feature space F. The feature space is usually of much higher dimension, so that a one-to-one correspondence between the two spaces almost never exists. This means that for an input point x there always exists its image Φ(x) in the feature space, but the opposite is not true: for a randomly chosen image in the feature space, a precise pre-image in the input space only seldom exists. However, we can try to look for at least approximate pre-images. The idea is really simple: if we have an image y given as an expansion of the images of the training points, y = Σ_{j=1}^{N} α_j Φ(x^j), we can find its approximate pre-image z in such a way that the image of z, Φ(z), minimizes the Euclidean distance to y, i.e. we solve the following optimization problem:

z = arg min_{x ∈ R^D} ‖y − Φ(x)‖   (4.7)

Schölkopf et al. (1996) provide a solution to this optimization problem for the RBF kernel in the form of the following fixed-point iterative algorithm:

z_{n+1} = ( Σ_{j=1}^{N} α_j exp(−‖x^j − z_n‖² / (2σ²)) · x^j ) / ( Σ_{j=1}^{N} α_j exp(−‖x^j − z_n‖² / (2σ²)) )   (4.8)

4.2.3 KPCA Model Usage

The KPCA differs from the linear transforms (PCA, ICA) in a substantial way: it is a much more flexible model and is able to cover very complex interactions. Thus, in the case of the KPCA, it almost does not matter what kind of 'variation' we use in the feature space, because the final distribution of the offspring depends mainly on the non-linear transform learned by the KPCA from the training data (i.e. from the selected parents).
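The fixed-point iteration of Eq. 4.8 takes only a few lines; here is a minimal sketch (function and variable names are mine). As a sanity check, for y = Φ(x^1), i.e. α = (1, 0, . . . , 0), the iteration recovers x^1 exactly:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """RBF kernel values between the rows of a and the point b (Eq. 4.6)."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma ** 2))

def preimage(alpha, X, sigma=1.0, n_iter=100):
    """Approximate pre-image of y = sum_j alpha_j Phi(x^j) via Eq. 4.8."""
    z = X.mean(axis=0)                   # z_0: start at the data centroid
    for _ in range(n_iter):
        w = alpha * rbf(X, z, sigma)     # alpha_j k(x^j, z_n)
        z = (w @ X) / w.sum()            # z_{n+1}: weighted mean of the x^j
    return z

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))          # five 2-D training points
alpha = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # y is exactly Phi(x^1)
z = preimage(alpha, X)
print(np.allclose(z, X[0]))              # True: the exact pre-image is recovered
```

For a general expansion the denominator can become small, so a practical implementation should guard against division by (near) zero and restart from a different z_0 when the iteration stalls.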
In my implementation of the KPCA model, simple uniform random sampling in the feature space is carried out. The principle of the KPCA model building and sampling (the KPCA generalized crossover) as used in this thesis is described in Alg. 4.3.

Algorithm 4.3: KPCA model building and sampling
1. Standardize the population.
2. Train the KPCA model on the whole population.
3. Transform the population to the feature space F via Eq. 4.5.
4. Determine the bounding hypercube in the feature space.
5. Randomly sample N images from the hypercube.
6. Find the pre-images of the new individuals using Eq. 4.8 iteratively.
7. Destandardize the offspring.

4.2.4 Toy Examples of KPCA

An EDA which uses the KPCA model in the way shown in Alg. 4.3 can be (a bit simplistically) described as a random search in the feature space. The KPCA amounts to performing the linear PCA in the non-linearly transformed input space; thus, in the feature space it preserves all the characteristics of the PCA. If we wanted to visualize the non-linear components extracted from the data, we could do it e.g. via the contour plots of the components. In the case of PCA we would see a set of parallel contour lines perpendicular to the respective principal axis (as can be seen in Fig. 3.7). In the case of KPCA we see a rather complex set of curves which describe the data with much greater fidelity (see Fig. 4.8). The majority of kernels require a few parameters; however, the precise form of the model is mainly data-driven—the distribution of the training data set is the main factor influencing the final form of the KPCA model. The KPCA is thus very efficient in modeling various types of dependencies—clusters, curves, etc. In Fig.
4.9 we can see that the KPCA is able to perform a kind of clustering (modeling two or more clusters of data points) and a kind of 'curve modeling' at the same time.

Figure 4.8: First six non-linear components of the toy data set (each cluster is described as a ring)

Figure 4.9 is an example of the ability of the KPCA model to generate new data points which are 'very similar' to the training data points. This makes it an ideal crossover operator for variables which are closely dependent on each other; at the same time, the KPCA crossover is not suitable for variables which are not closely related. In that case, the KPCA is very limited in introducing new genetic material to help the evolution (as can be seen in Fig. 4.9, the KPCA crossover does not generate any data points in areas where no training points are present). The search based solely on the KPCA is then strongly biased by the individuals from the initialization phase of the evolution.

4.2.5 Empirical Comparison

The experiments carried out were aimed at comparing the KPCA model used as a generalized crossover operator against several other algorithms with different means of generating new individuals. No mutation was used in any of the experiments. The evolutionary model described in Sec. 3.1.3 was used and the population was allowed to evolve for 50,000 evaluations. All experiments were repeated 20 times. For the comparison, the 2-dimensional versions of the Two Peaks, Griewangk and Rosenbrock functions were chosen. Again, population sizes of 20, 50, 100, 200, 400, 600, and 800 were tested for all the algorithms and the best achieved result is reported.
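As a concrete reference for the KPCA crossover used in these experiments, the whole loop of Alg. 4.3 can be sketched in plain numpy. This is a simplified reconstruction under stated assumptions: all names are mine, feature-space centering and the '99.99% of variance, but not less than 10 components' heuristic are omitted, and a plain guarded fixed-point loop implements Eq. 4.8:

```python
import numpy as np

def rbf_mat(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B (Eq. 4.6)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kpca_crossover(pop, n_offspring, sigma=1.0, n_iter=50, seed=0):
    """KPCA model building and sampling (Alg. 4.3), without feature-space centering."""
    rng = np.random.default_rng(seed)
    mu, sd = pop.mean(axis=0), pop.std(axis=0) + 1e-12
    Z = (pop - mu) / sd                          # 1. standardize the population
    N, D = Z.shape
    K = rbf_mat(Z, Z, sigma)                     # 2. train the KPCA model:
    lam, A = np.linalg.eigh(K)                   #    eigendecompose the kernel matrix
    lam, A = lam[::-1], A[:, ::-1]               #    (descending eigenvalues)
    keep = lam > 1e-10                           #    drop numerically zero components
    V = A[:, keep] / np.sqrt(lam[keep])          #    normalized eigenvectors of Eq. 4.5
    Y = K @ V                                    # 3. transform the population (Eq. 4.5)
    low, high = Y.min(axis=0), Y.max(axis=0)     # 4. bounding hypercube in F
    Ynew = rng.uniform(low, high, (n_offspring, V.shape[1]))  # 5. sample images
    offspring = np.empty((n_offspring, D))
    for i, y in enumerate(Ynew):                 # 6. pre-images via Eq. 4.8
        alpha = V @ y                            #    expansion coefficients of the image
        z = Z[rng.integers(N)]                   #    start from a random parent
        for _ in range(n_iter):
            w = alpha * rbf_mat(Z, z[None, :], sigma).ravel()
            s = w.sum()
            if abs(s) < 1e-12:
                break                            # degenerate denominator: keep current z
            z_new = (w @ Z) / s
            if not np.all(np.isfinite(z_new)):
                break
            z = z_new
        offspring[i] = z
    return offspring * sd + mu                   # 7. destandardize the offspring

rng = np.random.default_rng(3)
pop = np.vstack([rng.normal([0, 0], 0.3, (15, 2)),   # two well-separated clusters
                 rng.normal([5, 5], 0.3, (15, 2))])
kids = kpca_crossover(pop, n_offspring=10)
print(kids.shape)                                # (10, 2)
```

On clustered training data like the above, the sampled offspring tend to fall near the clusters of the parents, which is exactly the behavior illustrated in Fig. 4.9.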
Figure 4.9: Example of using the KPCA crossover (training data points and data points sampled from the KPCA). The KPCA is able to perform clustering and 'curve fitting' at the same time.

Involved Crossover Operators

Experiments were carried out on EAs with the following crossover operators:

GA: A genetic algorithm with the uniform crossover. Binary representation with a resolution allowing the GA to find a solution whose distance from the global optimum is not larger than 2.22 · 10^−16.

UMDA: The univariate marginal distribution algorithm, described in more detail in Sec. 3.1. Empirical equi-height histograms were used as the 1D PDF models. The number of histogram bins was chosen so that if all the bins had equal width, none of them would be wider than 0.1.

DiT-EA: The EDA with the DiT model (the distribution tree model, described in more detail in Sec. 4.1). The parameters of the model used for the comparison were minToSplit = 5, minInNode = 3, and maxPValue = 0.001.

Real-valued EA with SPX: An EA with the simplex crossover operator (described in Tsutsui, Goldberg & Sastry (2001)). Each new individual is created by randomly sampling D + 1 points from the population and generating one point uniformly from the simplex formed by the selected D + 1 points.

EDA with KPCA: An EA with the KPCA crossover presented in this section. The kernel parameter σ was set to 1. When performing the inverse transform, the number of principal components used was set to the minimal number of components able to describe at least 99.99% of the variance in the feature space, but not less than 10.

Results and Discussion

The statistics tracked along the course of evolution are described in Sec. 3.2.6. The overall results are shown in Tab. 4.3.

Table 4.3: Comparison of the EA with the KPCA model against algorithms with different crossover operators (KPCA, SPX, DiT, UMDA, GA) on the 2D Two Peaks, Griewangk, and Rosenbrock functions.
The Two Peaks function is separable (having no interactions between the variables) and thus the UMDA was expected to solve such a problem very efficiently. Surprisingly, the EA using the KPCA crossover was able to solve this problem with efficiency and reliability comparable to those of the UMDA (see Fig. 4.10). This is, however, caused by the low dimensionality of the problem and by the ability of the KPCA to implicitly perform clustering. With increasing dimensionality of this problem the UMDA scales well (see Sec. 3.1.3), but the KPCA is not expected to preserve its superior results in that case (larger population sizes would be needed). The other algorithms solved this problem much worse than the UMDA and the KPCA; the SPX crossover failed at this task, ending in a deceptive attractor of the Two Peaks function in half of all the runs.

The Griewangk function is not separable and the UMDA is not able to solve this function (see Fig. 4.11). The SPX crossover solves this problem best; however, this is due to the fact that the global optimum lies in the middle of the other local optima. SPX creates many offspring in between the parents and thus in the neighborhood of the global optimum. As the Two Peaks function showed in the previous paragraph, the SPX is not very efficient when solving a general multimodal function.
The EDA using the DiT model is very efficient at solving this task as well, although it needs a larger population than the SPX and the KPCA. The KPCA solves this task reliably with a relatively small population; however, it needs a long time while it preserves a 'subpopulation' on each of the local optima. After the 'subpopulations' vanish due to the selection, the global optimum is reached very quickly.

The Rosenbrock function has the strongest dependency between variables of all three test problems. The SPX and the KPCA were the only algorithms able to solve this task. The KPCA-EA needs a much smaller population and is able to solve this 'EA-hard' problem very quickly. It needs only hundreds of fitness function evaluations, a number that is still worse than, but already comparable to, the numbers achieved by more classic, gradient-based optimization techniques.

Figure 4.10: 2D Two Peaks function—comparison of the KPCA-EA against other algorithms (GA, pop. size 600; UMDA, pop. size 200; DiT-EA, pop. size 600; SPX, pop. size 100; KPCA-EA, pop. size 100)

Figure 4.11: 2D Griewangk function—comparison of the KPCA-EA against other algorithms (GA, pop. size 600; UMDA, pop. size 800; DiT-EA, pop. size 400; SPX, pop. size 100; KPCA-EA, pop. size 100)
Figure 4.12: 2D Rosenbrock function—comparison of the KPCA-EA against other algorithms (GA, pop. size 800; UMDA, pop. size 800; DiT-EA, pop. size 600; SPX, pop. size 200; KPCA-EA, pop. size 50)

4.2.6 Summary

The KPCA model introduced in this section provides an alternative solution to the third aim of this thesis: to create a probabilistic model able to focus the search on promising areas of the search space in situations when the problem has higher-order interactions among variables. Its advantages are most distinguishable in situations when the distribution of a small group of tightly linked variables should be described (that is also the reason for using only low-dimensional test problems). When solving practical problems of higher dimensionality, the KPCA should be accompanied by a 'device' for detecting the dependencies between variables (i.e. for decomposing the problem at hand). I hypothesize that the KPCA will not scale well in the general case, and that without problem-structure learning and decomposition its superior efficiency and reliability will vanish.

No special parameter tuning for the KPCA model was performed, and the KPCA crossover solved all the low-dimensional problems with the same parameter settings. There is an open space for research in the field of creating good heuristics to set the parameters, or of adapting the parameters during the EA run. The experiments presented in this section suggest that the KPCA crossover has a great potential and that, when used appropriately, it belongs among the best operators for competent evolutionary algorithms.

Advantages of KPCA-EA

Flexibility: The model is mainly data-driven, and as such it can describe a huge class of dependencies among variables.

Clustering capable: The KPCA model can identify and reproduce the distribution of promising individuals in the form of clusters.

Curve modeling capable: The KPCA model can learn and reproduce the distribution of promising individuals in the form of curves.
Rotational invariance: The model is not sensitive to any rotation of the coordinate system.

Limitations of KPCA-EA

Need for proper initialization: The KPCA model is very precise and it usually does not generate new individuals in areas where no training points were present. It is thus critical to supply in the training data set such individuals which form a good representation of the promising solutions.

Bad scalability: The KPCA model generally requires larger populations when used in higher dimensions, i.e. when modeling the distribution of a larger group of tightly linked variables.

Lack of decomposition ability: The model is not able to decompose the problem at hand into several simple subproblems.

High time and space demands: The model stores a portion of the training data and performs the eigendecomposition of the kernel matrix, which is of size N × N. Most of the time, however, is demanded by the generation of new individuals: the creation of each new individual is itself an optimization task.

Chapter 5

Population Shift

Chapters 3 and 4 presented probabilistic models with strong and weaker structural assumptions, respectively. EDAs using these models (when properly set up) are able to solve optimization tasks precisely. However, the crucial assumption of all these models is that the initial population and all subsequent populations must lie in the vicinity of the target solution, i.e. the population must surround the solution. If this requirement is broken during the evolution, it is usually very hard for these algorithms to recover; they usually converge prematurely to locally optimal solutions (or even to solutions which are not locally optimal).

When using the histogram models or the distribution tree model, we need to set up the box constraints, i.e. the limits of the hypercube in which the search should be carried out. They can be set up before the evolution, or they can be set up adaptively during the evolution.
In both cases, we can make an error: we can set the hypercube in such a way that it will not surround the best solution. These algorithms do not have any chance to search beyond the box boundaries. If we decide to use Gaussian distributions, their mixtures, or the KPCA model, the situation is better, because we do not have any hard boundaries of the search space. However, even with these models, the chance of generating new individuals outside the areas in which the parent individuals lie is very low. This chapter describes the reasons for these phenomena, and presents methods able not only to focus the search on promising areas but also to shift the population to such areas. These methods, however, have the form of local optimizers, i.e. they are not able to cope with several local optima at the same time.

5.1 The Danger of Premature Convergence

Premature convergence is a phenomenon that has been observed in EAs for a long time. It arises if the search algorithm somehow misses the true global optimum, another suboptimal solution takes over the population, the population loses its diversity, and the search stops. Rudolph (2001) shows that self-adaptive mutations may lead to premature convergence. In that article, the mutations are created by sampling random mutation vectors from the Gaussian distribution whose parameters are self-adapted. The conclusions of that article hold for the EDAs which use the Gaussian distribution as well.

Bosman & Grahl (2006) present a theoretical model of the progress of a simple EDA solving a 1D problem. The algorithm they analyze is an ordinary EDA (see Alg. 2.1) with the following instantiation:

• the truncation selection with parameter τ is used, i.e. the best τ × 100% of the individuals are used to create the model,
• the Gaussian distribution N(µ, σ²) is used with maximum likelihood estimates of µ and σ.
They analyze the behavior of this algorithm when searching for an optimum of a linear (or any monotonic) fitness function defined on the interval (−∞, ∞). What would we expect from any search algorithm applied to such an ever increasing (or ever decreasing) function? The algorithm should never stop searching and its best-so-far solution x should go to infinity. Bosman et al., however, proved that such an algorithm is not able to move to infinity, because the distance it can 'travel' is limited. They derived the following difference equations describing the evolution of the parameters of the Gaussian distribution:

µ^(t+1) = µ^(t) + σ^(t) · d(τ),   (5.1)
(σ^(t+1))² = (σ^(t))² · c(τ),   (5.2)

where µ^(0) and (σ^(0))² are the parameters of the Gaussian distribution used to generate the first population, and d(τ) and c(τ) are coefficients describing the change of the parameters µ and σ², given by the following equations:

d(τ) = φ(Φ⁻¹(τ)) / τ,   (5.3)
c(τ) = 1 + Φ⁻¹(1 − τ) · φ(Φ⁻¹(τ)) / τ − ( φ(Φ⁻¹(τ)) / τ )²,   (5.4)

where Φ and φ are the CDF and PDF of the standard normal distribution, respectively.

Figure 5.1: Coefficients d (left) and c (right) in relation to the selection proportion τ.

The dependence of the coefficients d(τ) and c(τ) on τ is depicted in Fig. 5.1. We can see that the only way to prevent the variance σ² from diminishing is to select the whole population (τ = 1), but then the movement of the center µ would be zero (d(1) = 0). Or, we can force the algorithm to take large steps by selecting only a few best individuals (τ is small), but then the variance σ² drops to a very small number, which limits the exploration ability. If these facts are considered altogether, the following formulas can be derived:

lim_{t→∞} µ^(t) = µ^(0) + σ^(0) · d(τ) · 1 / (1 − √c(τ)),   (5.5)
lim_{t→∞} (σ^(t))² = 0.   (5.6)
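These difference equations are easy to check by simulation. The sketch below (all names are mine) iterates the analyzed EDA instantiation on the monotonic fitness f(x) = x and compares the point at which the mean stalls with the prediction of Eq. 5.5:

```python
import math
from statistics import NormalDist
import numpy as np

SN = NormalDist()

def d_coef(tau):
    """Eq. 5.3: expected standardized step of the mean under truncation selection."""
    return SN.pdf(SN.inv_cdf(tau)) / tau

def c_coef(tau):
    """Eq. 5.4: multiplicative change of the variance under truncation selection."""
    r = SN.pdf(SN.inv_cdf(tau)) / tau
    return 1.0 + SN.inv_cdf(1.0 - tau) * r - r ** 2

tau, n = 0.3, 100_000
mu, var = 0.0, 1.0                      # mu^(0) = 0, (sigma^(0))^2 = 1
rng = np.random.default_rng(0)
for _ in range(200):                    # the EDA of Bosman & Grahl on f(x) = x
    x = rng.normal(mu, math.sqrt(var), n)
    sel = np.sort(x)[-int(tau * n):]    # truncation selection (maximize f)
    mu, var = sel.mean(), sel.var()     # maximum likelihood re-estimation

predicted = d_coef(tau) / (1.0 - math.sqrt(c_coef(tau)))  # Eq. 5.5 with mu0=0, sigma0=1
print(round(predicted, 2))              # approx. 2.39; the simulated mean stalls nearby
```

Even though the fitness grows without bound, the simulated mean stops within a small constant multiple of the initial standard deviation of the search distribution, while the variance collapses toward zero, exactly as Eqs. 5.5 and 5.6 predict.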
This means that the population of the studied EDA eventually loses its diversity, so that the search stops. The distance to which the Gaussian distribution can travel is limited and is given only by the initial variance (σ^(0))² and the selection ratio τ.

The analyzed EDA is a typical example of a badly instantiated EDA. Very often, if the truncation selection is used with a model estimated by maximum likelihood, premature convergence occurs. Although we do not have such theoretical analyses for other instantiations, there is strong empirical evidence that premature convergence can arise even for other learning methods and other selection schemes. Researchers use a variety of remedies for the premature convergence phenomenon, but they usually do not provide any analysis of the algorithm behavior, or any proof that their remedy is generally applicable. The most often used remedies are the following:

• Use truncation selection and maximum likelihood estimates, but set a lower bound on the distribution variance. This often ensures that the algorithm does not stop searching, but the speed of progress and the precision of the solution are limited.

• Use maximum likelihood estimates but a different selection scheme ensuring higher variability of the selected individuals. It is unclear under what circumstances this remedy works well.

• Use truncation selection and maximum likelihood estimation, but increase the variance by a selected factor. If the factor is not selected properly, the algorithm can diverge, i.e. the variance can increase in each step even if the population is situated around the optimum.

• Use a special kind of selection and a carefully chosen procedure for updating the model (at least one parameter of the model is not set up using maximum likelihood estimation).
• Do not estimate the distribution of the selected individuals, but rather the distribution of successful mutation steps.

The next section describes two methods which use the fifth remedy and a combination of the fourth and fifth remedies, respectively.

5.2 Using a Model of Successful Mutation Steps

Instead of building a model of the selected individuals, the methods presented in this section use the successful mutation steps to generate new ones. This approach is somewhat similar to evolutionary strategies, where the mutation vectors are attached to each individual. On the contrary, in the methods presented here, the mutation vectors are detached from the individuals to which they were applied. This implies that the algorithm performs a local search (if the mutation steps are applied to one prototype individual in each iteration), or a kind of parallel local search (if the mutation vectors are applied to more individuals in the population; in that case, the shapes of the individual basins of attraction should be similar for the distribution to adapt successfully).

One of the most straightforward ways to utilize the successful mutation steps was presented by Pošík (2005b), who used a cooperative co-evolution approach. His EA has two populations: one with the solution vectors and one with the successful mutation steps. The solution vectors are evaluated by the fitness function and are created by randomly pairing them with the mutation steps that were useful in previous generations. The mutation steps are evaluated by the fitness increase they caused and are created by mutating the old mutation vectors. This approach proved to be flexible in such a way that it can adapt to the changing local neighborhoods of the fitness landscape. However, its convergence is slow and it is not able to efficiently reintroduce diversity into its population if needed.
A different and more successful approach is constituted by the evolutionary strategy with covariance matrix adaptation.

5.2.1 Adaptation of the Covariance Matrix

Many algorithms from the class of continuous EDAs use the normal distribution as the model used to generate new individuals. A few of them use the multivariate normal distribution with a full covariance matrix; the two most typical examples are EMNA (the Estimation of Multivariate Normal Algorithm, see e.g. Larrañaga & Lozano (2002)) and CMA-ES (the Evolutionary Strategy with Covariance Matrix Adaptation, see e.g. Hansen & Ostermeier (2001)). Both algorithms actually use the same formula to generate new individuals:

x^(g+1) ∼ N(µ^(g), Σ^(g)).   (5.7)

They differ in the way the parameters of the normal distribution are estimated.

The main feature of EMNA is that it uses the multivariate normal distribution to encode the distribution of the selected individuals. Its parameters are estimated via the maximum likelihood method (see Fig. 5.2, left). As can be seen, the variance in the direction of the gradient has dropped; EMNA is very susceptible to premature convergence, as discussed in Sec. 5.1. In practical applications it must be equipped with some of the above-mentioned premature convergence remedies.

Figure 5.2: An example of one iteration of EMNA (left) and CMA-ES (right). From the initial distribution (dashed line) new points are sampled, evaluated, and divided into selected points (blue dots) and discarded points (red crosses). EMNA then fits a Gaussian model to the selected points and places it at the mean of the selected points. CMA-ES creates a Gaussian model of the selected mutation steps and places it at the best individual.

The evolutionary strategy with covariance matrix adaptation tries to view the neighborhood of the current point as a convex quadratic function.
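A single EMNA iteration as depicted in Fig. 5.2 (left) takes only a few lines; the sketch below (names mine) uses a linear fitness, where the maximum likelihood re-fit visibly shrinks the variance in the direction of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, C = np.zeros(2), np.eye(2)                  # current search distribution (Eq. 5.7)

x = rng.multivariate_normal(mu, C, size=500)    # sample new individuals
fitness = x[:, 0]                               # linear fitness: maximize x_1
sel = x[np.argsort(fitness)[-150:]]             # truncation selection, tau = 0.3

mu_new = sel.mean(axis=0)                       # EMNA: ML estimate of the mean ...
C_new = np.cov(sel.T, bias=True)                # ... and of the full covariance matrix

print(C_new[0, 0] < C[0, 0])                    # True: variance along the gradient drops
```

Repeating this step drives the variance along the direction of progress toward zero, which is precisely the premature-convergence mechanism analyzed in Sec. 5.1.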
The aim of updating the covariance matrix is to estimate on-line the inverse Hessian matrix of the quadratic model of the neighborhood. CMA-ES uses a rather intricate update scheme for the parameters of the normal distribution N(µ^(g), σ^(g) · C^(g)) which is used to sample new individuals. CMA-ES updates all three parameters separately:

The center of the distribution, µ: The basic update scheme is based on the maximum likelihood method, i.e.

µ^(g+1) = arg max_µ P( x_selected^(g) | µ^(g+1) ).   (5.8)

Individual variants of CMA-ES thus place the center at the best selected individual, or at the average of several best individuals (sometimes weighted by their relative fitness).

The covariance matrix, C: Again, the update scheme is based on maximum likelihood estimation, i.e.

C^(g+1) = arg max_C P( (x_selected^(g) − µ^(g)) / σ^(g) | C^(g+1) ).   (5.9)

The covariance matrix is computed for the successful mutation steps x_selected^(g) − µ^(g). Again, CMA-ES can use a weighted covariance matrix, as in the case of the distribution center.

The global step size, σ: The step size is conceptually adapted in such a way that the two most recent consecutive steps of µ are conjugate, i.e. perpendicular with respect to the metric given by (C^(g))^−1:

( (µ^(g+2) − µ^(g+1))^T × (C^(g))^−1 × (µ^(g+1) − µ^(g)) ) / (σ^(g+1))² → 0.   (5.10)

Of course, this equation is only conceptual and cannot be used to set σ^(g+1), because µ^(g+2) is not known. However, it reveals that CMA-ES performs a kind of principal components analysis over the steps of the distribution center. Moreover, CMA-ES also takes into account the previous values of the strategy parameters µ, C, and σ (taking a weighted average of the old and new values), and uses cumulation to filter out possibly erratic moves of the distribution center. For the implementation details, see e.g. Hansen (2006).

5.2.2 Empirical Comparison: CMA-ES vs. Nelder-Mead

The CMA-ES is considered the cutting-edge method for real-valued optimization.
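The only difference between Eq. 5.9 and the plain maximum likelihood fit of EMNA is the reference point: the covariance is taken over the successful steps x_selected^(g) − µ^(g), i.e. around the old mean, not around the new one. A small sketch makes the contrast of Fig. 5.2 concrete (names mine; weighting, cumulation and the step-size adaptation are omitted, σ = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_old = np.zeros(2)                            # mu^(g)
x = rng.multivariate_normal(mu_old, np.eye(2), size=500)
sel = x[np.argsort(x[:, 0])[-150:]]             # selection on a linear fitness

steps = sel - mu_old                            # successful mutation steps (Eq. 5.9)
C_steps = steps.T @ steps / len(steps)          # covariance around the OLD mean
C_emna = np.cov(sel.T, bias=True)               # EMNA: covariance around the NEW mean

# Measuring from the old mean preserves variance in the direction of progress:
print(C_steps[0, 0] > C_emna[0, 0])             # True
```

This is why the right-hand panel of Fig. 5.2 keeps an elongated distribution along the gradient while the left-hand (EMNA) panel collapses in that direction.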
Its superiority over other evolutionary optimization methods was shown in many comparative studies (among others, see e.g. the IEEE CEC 2005 Special Session on Real-Parameter Optimization). For a list of successful applications, see Hansen (2005).

The CMA-ES has features of local optimizers. Thus, the superiority of CMA-ES in all the above comparisons may only indicate that the functions used in the comparisons are better optimized with local optimizers. Although this is not entirely true, let me present here a comparison of CMA-ES with a well-known local optimization technique, the Nelder-Mead simplex search described in Sec. 1.2.2.

Vestibulo-ocular Reflex Analysis

The following results were reported by Pošík (2005a). He used the Nelder-Mead simplex method and CMA-ES for the analysis of the vestibulo-ocular reflex (VOR) signal. The details of the task can be found in Appendix B. Here, it suffices to state the fundamental aim of the optimization task. We have a system (the vestibulo-ocular system) that is fed with an input signal, a mixture of sine waves (MOS) of different frequencies, amplitudes, and phase shifts. On the system output we observe a signal depicted in Fig. 5.3: many signal segments with gaps between them.

Figure 5.3: Vestibulo-ocular reflex signal (simulated slow-phase segments), the input of the analysis.

These segments are assumed to come from another MOS signal which has the same frequency components as the input MOS signal; however, these components can have different amplitudes and phase shifts. The aim of the analysis is to estimate the parameters (amplitudes and phase shifts) of the underlying MOS signal, and to align the VOR signal segments with the estimated MOS signal, as can be seen in Fig. 5.4.

Figure 5.4: The VOR signal segments aligned with the underlying MOS signal, the output of the analysis.

After that, we can compare the amplitudes and phase shifts of the individual sine components of the input MOS and the estimated output MOS signals, i.e. we can construct several points of the frequency response of the vestibulo-ocular system.

The fitness function is described in Appendix B. It is a quadratic loss function, computed along the time domain after the individual VOR signal segments were moved 'vertically' to coincide with the estimated MOS signal.

Experimental Setup

The two methods were tested on artificially generated VOR signals to assess their success and precision and to decide which of the optimization algorithms is more suitable for this task. The tests were carried out for MOS signals consisting of 1 to 5 sine components, i.e. the search was carried out in 2-, 4-, 6-, 8-, and 10-dimensional parameter spaces (the amplitudes and phase shifts of the individual sine components are sought).

First, for each sine component of the signal, the values of frequency, amplitude, and phase shift were randomly generated. The ranges for the individual parameters can be found in Table 5.1. Using these randomly generated values, a continuous MOS signal (which is to be estimated) is created. This signal then undergoes a disruption process which cuts it into individual segments with gaps between them. The segments are then placed approximately at the same level (see Fig. 5.3).

Table 5.1: Settings for the parameters of the artificial VOR signal

    Parameter         Value (Range)
    f_i               ⟨0.05, 2⟩
    a_i               ⟨0.2, 2⟩
    φ_i               ⟨0, π/2⟩
    Sampling Freq.    500 Hz
    Signal Duration   20 s
    Saccade Duration  0.05 s

For each number of components, 9 different VOR signals were generated. For each of them, the parameters of the underlying MOS were estimated by minimizing the loss function using both the Nelder-Mead simplex search and the CMA-ES. In each run, the algorithms were allowed to perform 10,000 evaluations of the loss function.
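For illustration, the artificial input signal can be generated as a mixture of sine waves with parameters drawn from the ranges of Table 5.1. The following is a minimal sketch only; the disruption process that cuts the signal into segments is omitted, and all function and variable names are mine:

```python
import math
import random

def generate_mos(n_components, fs=500.0, duration=20.0, rng=random):
    """Sample MOS parameters from the ranges in Table 5.1 and return
    both the parameters and the sampled mixture-of-sines signal."""
    freqs  = [rng.uniform(0.05, 2.0)        for _ in range(n_components)]
    amps   = [rng.uniform(0.2, 2.0)         for _ in range(n_components)]
    phases = [rng.uniform(0.0, math.pi / 2) for _ in range(n_components)]
    n_samples = int(fs * duration)
    signal = [
        sum(a * math.sin(2 * math.pi * f * (k / fs) + p)
            for f, a, p in zip(freqs, amps, phases))
        for k in range(n_samples)
    ]
    return (freqs, amps, phases), signal

random.seed(0)
params, signal = generate_mos(3)  # 3 components -> 6-dimensional search space
```

The optimizers then search only over the amplitudes and phase shifts; the frequencies of the underlying signal are assumed known.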
A particular run was considered successful if the algorithm found a parameter set with a loss function value lower than 10^−8.

Results and Discussion

First, let us review the success rates of both algorithms when estimating the parameters of the MOS signal with the number of components ranging from 1 to 5 (see Tab. 5.2).

Table 5.2: Success rates (in percent) of the Simplex and CMA-ES algorithms

    Components    1      2      3      4      5
    Simplex     100.0  100.0  100.0   44.4    0.0
    CMA-ES      100.0  100.0  100.0  100.0  100.0

As we can see, the simplex algorithm has difficulties finding the optimum of the loss function in less than 10,000 evaluations when the underlying MOS signal has 4 or more components.

The comparison of speed is based on the number of evaluations needed to find a solution with loss lower than 10^−8, i.e. only successful runs are considered. The results are summarized in Fig. 5.5. The two graphs reveal that the number of needed evaluations increases with the number of components (i.e. with the dimensionality of the search space) much faster for the simplex search method than for the CMA-ES, for which the increase is almost linear (at least subquadratic). CMA-ES is a clearly more scalable solution than the simplex search.

The progress of evolution is depicted in Fig. 5.6. It presents the loss function value of the best solution found so far, averaged over all successful runs. Again, there is no line for the simplex method searching for parameters of 5 components.

Figure 5.5: Number of evaluations needed to find a solution with quality better than 10^−8 as a function of the number of components of the underlying MOS signal (left: Nelder-Mead simplex search, right: CMA-ES).
Middle line: median, box: interquartile range, whiskers: minimum and maximum.

Figure 5.6: Typical progress of successful search runs as performed by the simplex method and by the CMA-ES. Both of the leftmost lines (solid and dashed) belong to 1 component; the rightmost dashed line belongs to the simplex searching for parameters of 4 components, while the rightmost solid line belongs to CMA-ES searching for parameters of 5 components.

Based on these results, a recommendation can be given: do not use the simplex method when searching for parameters of a MOS signal with more than 2 components. The CMA-ES solves such tasks faster and more reliably.

5.2.3 Summary

So far, this chapter has discussed the role of adaptivity in situations when a population shift is needed, i.e. in situations when the population does not surround the target optimal solution. It was shown that the most straightforward type of model adaptation using the maximum likelihood method often leads to premature convergence, so that some additional 'tweaks' are needed. Then, the CMA-ES was conceptually introduced: a clever algorithm on the boundary between evolutionary strategies and EDAs that uses the normal distribution as the model of successful mutation steps and adapts the covariance matrix in a way that prevents premature convergence. The CMA-ES was then compared with the Nelder-Mead simplex search method; when increasing the dimensionality of the search space, the CMA-ES quickly outperforms the simplex search and is a more scalable solution.

Advantages of CMA-ES

Scalability: The population size grows only logarithmically with the search space dimension; the time scale for adapting the covariance matrix is approximately quadratic.
Flexibility: Due to the adaptation mechanism, CMA-ES is able to solve unimodal non-separable problems with non-linear interactions among variables.

Low population sizes: Compared to other EAs, CMA-ES uses rather low population sizes, which enables quick adaptation to local neighborhoods and an effective population shift.

Resistance to premature convergence: CMA-ES uses the step size control to prevent premature convergence.

Rotational invariance: The model is not sensitive to any rotation of the coordinate system.

Stationarity: The model parameters (µ, C) are unbiased under random selection. The step size σ is the exception: E(σ^(g+1) | σ^(g)) > σ^(g).

Limitations of CMA-ES

Local search: Although, with regard to the taxonomy of optimization methods presented in Sec. 1.1, the CMA-ES belongs to the class of asymptotically complete methods, it exhibits features of a local search in the neighborhood of one point. If not properly initialized (the center of the distribution, µ, and the step size, σ), it will not be able to escape from the basin of attraction of some local optimum.

Locally linear dependencies only: The multivariate normal distribution can describe only linear dependencies (the model has a similar decomposition ability to UMDA with linear coordinate transforms, described in Sec. 3.2). It is not able to describe valleys in the form of curves (however, it is able to traverse them), and it cannot perform clustering (i.e. it cannot reproduce the distribution of the population members if they form clusters in the search space).

What's Next?

The technique the CMA-ES uses for the covariance matrix adaptation is not the only one possible, and it is not optimal from several points of view. Higher-order methods can be used to estimate the covariance matrix of the search distribution; this is the topic of the next section.

5.3 Estimation of Contour Lines of the Fitness Function
The previous section described the CMA-ES, the state of the art among algorithms that use the normal distribution to sample new points. It adapts the step size separately from the 'directions' of the multivariate normal distribution. The adaptation is based on accumulation of the previous steps that the algorithm made.

Auger et al. (2004) proposed a method improving the CMA-ES covariance matrix adaptation using a quadratic regression model of the fitness function in the local neighborhood. Their approach, however, required at least D(D+3)/2 + 1 data vectors in D-dimensional space to learn the quadratic function. Moreover, it assumed that each point has its fitness value, i.e. it cannot use selection schemes based on pure comparisons of two individuals. The method is also not invariant with respect to order-preserving transformations of the fitness function.

This section describes an initial study of a novel algorithm of Pošík & Franc (2006) for learning the Gaussian distribution by modeling the fitness landscape contour line between the selected and discarded individuals. It uses a modified perceptron algorithm that finds an elliptic decision boundary if one exists. If it does not exist, the algorithm will not stop. From this, the main limitation of the algorithm immediately follows: in its present state, the algorithm is able to optimize only convex quadratic functions without noise.

5.3.1 Principle and Methods

The basic principle of the proposed method is illustrated in Fig. 5.7 (right). After evaluation of the population, the method constructs a model of the fitness function contour line in the form of an ellipse (a D-dimensional ellipsoid) that allows us to discriminate between the selected and discarded individuals. The decision boundary is of the form x^T A x + Bx + c = 0, where x is a D-dimensional column vector representing a population member, A is a positive definite D × D matrix, B is a row vector with D elements, and c is a scalar value.
This elliptic function is closely related to the Gaussian distribution: setting the mean of the distribution to the center of the ellipsoid and the covariance matrix to Σ = A^(−1), the elliptic decision boundary corresponds to a contour line of the Gaussian density function. The candidate members of the new population are then sampled from this distribution.

The proposed method uses a variation of the perceptron algorithm, which finds a linear decision function. To learn an ellipsoid, we map the points into a different space and then map the learned linear function back into the original space, where it forms the ellipsoid. This ellipsoid is then turned into a Gaussian used to sample new points. The following paragraphs introduce the methods used to accomplish this process.

Figure 5.7: An example of learning the covariance matrix using a method similar to CMA-ES (left) and the proposed method (right) based on estimation of the contour line between the selected (blue dots) and the discarded (red crosses) individuals. The proposed method gives more accurate estimates of the covariance matrix and the position of the Gaussian.

Quadratic Mapping

We need to learn a quadratic function which allows us to discriminate between the two classes of data points. The classifier is given as

C(x) = 1 iff x^T A x + Bx + c > 0,
C(x) = 2 iff x^T A x + Bx + c < 0.   (5.11)

The decision boundary x^T A x + Bx + c = 0 is required to be a hyperellipsoid, which is a special case of a quadratic function. As already stated, the hyperellipsoid is to be learned with a method designed to find a linear decision boundary. To be able to do that, we have to use a coordinate transform such that if we fit the linear decision boundary in the transformed space, we can transform it back and get a quadratic function.
This process is sometimes referred to as basis expansion (see Hastie et al. (2001)) or feature space straightening (see Schlesinger & Hlaváč (2002)).

The matrix A is symmetric, i.e. a_ij = a_ji, i, j ∈ ⟨1, D⟩. We can rewrite the decision boundary in the following form:

a_11 x_1 x_1 + 2 a_12 x_1 x_2 + ... + 2 a_1D x_1 x_D
+ a_22 x_2 x_2 + ... + 2 a_2D x_2 x_D
+ ...
+ a_DD x_D x_D
+ b_1 x_1 + b_2 x_2 + ... + b_D x_D + c = 0   (5.12)

This equation defines a quadratic mapping qmap which for each point x from the input space creates a new, quadratically mapped point z, where

z = qmap(x) = (x_1², 2 x_1 x_2, ..., 2 x_1 x_D, x_2², ..., 2 x_2 x_D, ..., x_D², x_1, ..., x_D, 1)^T.   (5.13)

Then, if we arrange the coefficients a_ij, b_i, and c into a vector w so that

w = (a_11, a_12, ..., a_1D, a_22, ..., a_2D, ..., a_DD, b_1, ..., b_D, c),   (5.14)

we can write the decision boundary as wz = 0 and the whole classifier as

C(x) = C(z) = 1 iff wz > 0,
C(x) = C(z) = 2 iff wz < 0.   (5.15)

The dimensionality of the feature space is easily computed as the number of terms in Eq. 5.12: we have D(D+1)/2 quadratic terms, D linear terms, and 1 constant term. This gives D(D+3)/2 + 1 dimensions.

The learning of a quadratic decision boundary can be carried out by the following process:
1. Transform the points x from the input space to points z in the quadratically mapped feature space using Eq. 5.13.
2. Find the vector w defining the linear decision boundary in the feature space.
3. Rearrange the elements of vector w into matrices A, B, and c using Eq. 5.14.

Separating Hyperplane

There are many ways to learn a separating hyperplane. In this method, the well-known perceptron algorithm is used.¹ The perceptron algorithm can be stated as follows. We have training vectors z_i ∈ Z of the form z_i = (z_i1, ..., z_iD, 1), each of them classified into one of two possible classes, C(z_i) ∈ {1, 2}.
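The three-step process above hinges on the identity w · qmap(x) = x^T A x + Bx + c. The following sketch (function names are mine) implements Eqs. 5.13 and 5.14 and can be used to check that identity:

```python
def qmap(x):
    """Eq. 5.13: map a point x into the quadratic feature space.
    Order: x_i*x_j terms for i <= j (off-diagonal terms doubled),
    then the linear terms, then the constant 1."""
    d = len(x)
    z = []
    for i in range(d):
        for j in range(i, d):
            z.append(x[i] * x[j] if i == j else 2.0 * x[i] * x[j])
    z.extend(x)
    z.append(1.0)
    return z

def unpack_w(w, d):
    """Eq. 5.14: rearrange the weight vector w into the symmetric
    matrix A, the row vector B, and the scalar c."""
    a = [[0.0] * d for _ in range(d)]
    k = 0
    for i in range(d):
        for j in range(i, d):
            a[i][j] = a[j][i] = w[k]
            k += 1
    return a, w[k:k + d], w[k + d]

def quadratic(a, b, c, x):
    """Evaluate x^T A x + B x + c directly in the input space."""
    d = len(x)
    val = c + sum(b[i] * x[i] for i in range(d))
    val += sum(x[i] * a[i][j] * x[j] for i in range(d) for j in range(d))
    return val

# Feature space dimensionality: D(D+1)/2 + D + 1 = D(D+3)/2 + 1.
assert len(qmap([0.0] * 3)) == 3 * (3 + 3) // 2 + 1
```

Because A is symmetric, the doubled off-diagonal features in qmap exactly absorb the pairs a_ij x_i x_j + a_ji x_j x_i, so the linear function wz and the quadratic function coincide for every x.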
We search for a weight vector w so that w z_i > 0 iff C(z_i) = 1 and w z_i < 0 iff C(z_i) = 2. In other words, we search for a hyperplane that separates the two classes and contains the coordinate origin. The algorithm is presented as Alg. 5.1. Of course, the algorithm will not stop if the two classes of qmap-ed vectors are not linearly separable, i.e. if the original vectors are not separable by a quadratic decision boundary.

Ensuring Ellipticity

The previous paragraphs showed a way of learning a quadratic decision boundary by mapping the training vectors into the quadratic space, finding a linear decision boundary, and rearranging the elements of the weight vector w into matrices A, B, and c. However, this quadratic decision function might not be elliptic, i.e. the matrix A might not be positive definite.

The perceptron algorithm described in Alg. 5.1 is basically an algorithm for the satisfaction of constraints given in the form of linear inequalities. The usual set of constraints that must be satisfied is w z_i > 0 for all i. If we found a way to describe the requirement of positive definiteness of the matrix A in the form of similar inequalities, and if we were able to find vectors that violate these inequalities, we could use an only slightly modified perceptron algorithm to learn an elliptic decision boundary. Such a way exists and is described in the following paragraphs.

¹ For the perceptron algorithm, a method was found that ensures that the matrix A computed from the weight vector w will be positive definite.
Algorithm 5.1: Perceptron Algorithm
1   begin
2       w = 0                                   /* initialize the weight vector */
3       z_i = −z_i for all i where C(z_i) = 2   /* invert points in class 2 */
4       z* = arg min_{z ∈ Z} (wz)               /* find the training vector with minimal projection onto w */
5       if wz* > 0 then
6           exit          /* the minimal projection is positive; the separating hyperplane is found */
7       else
8           w = w + (z*)^T   /* wz* ≤ 0; adapt the weight vector using the point with the greatest error */
9           goto step 4
10  end

As shown in Sec. 5.3.1, the quadratic form can be written using a linear function:

x^T A x + Bx + c = wz.   (5.16)

Matrix A is positive definite iff the condition x^T A x > 0 holds for all non-zero vectors x ∈ R^(D×1). In order to write the condition of positive definiteness x^T A x > 0 in terms of the weight vector w, let us define a 'pure' quadratic mapping pqmap for the vectors x in which only the quadratic elements are present, while the D linear elements and the 1 absolute element are substituted with zeros:

q = pqmap(x),   (5.17)

where

q_i = z_i  iff i ∈ ⟨1, D(D+1)/2⟩,
q_i = 0    iff i ∈ ⟨D(D+1)/2 + 1, D(D+3)/2 + 1⟩,   (5.18)

and z = qmap(x). Using any D-dimensional vector x and its transformed variant q = pqmap(x), the following two conditions are equivalent:

x^T A x > 0  ⇐⇒  wq > 0.   (5.19)

Furthermore, all eigenvalues of any positive definite matrix are positive. If we perform the eigendecomposition of matrix A and get negative eigenvalues, then the related eigenvectors v violate the condition for positive definiteness, i.e. v^T A v = wq ≤ 0, where q = pqmap(v). These pqmap-ed eigenvectors can thus be used to adapt the weight vector w in the same way as ordinary qmap-ed data vectors. A modified version of the perceptron algorithm that ensures the positive definiteness of the resulting matrix A is shown as Alg. 5.2.
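Alg. 5.1 translates almost line by line into code. The following is a sketch with my own naming; an iteration cap is added because, as noted above, the algorithm does not terminate on inseparable data:

```python
def perceptron(vectors, labels, max_iter=10000):
    """Alg. 5.1: find w with w.z > 0 for class 1 and w.z < 0 for class 2.
    `vectors` are assumed already augmented with a trailing 1 (or qmap-ed)."""
    # Step 3: invert the class-2 points, so the goal becomes w.z > 0 for all.
    zs = [[-v for v in z] if lab == 2 else list(z)
          for z, lab in zip(vectors, labels)]
    w = [0.0] * len(zs[0])                       # step 2: w = 0
    for _ in range(max_iter):
        # Step 4: training vector with minimal projection onto w.
        z_star = min(zs, key=lambda z: sum(wi * zi for wi, zi in zip(w, z)))
        if sum(wi * zi for wi, zi in zip(w, z_star)) > 0:
            return w                             # separating hyperplane found
        w = [wi + zi for wi, zi in zip(w, z_star)]   # adapt w with worst point
    raise RuntimeError("no separation within max_iter (data may be inseparable)")

# Tiny example: two point clouds separable by a hyperplane.
data = [[2.0, 1.0, 1.0], [3.0, 0.5, 1.0], [-2.0, -1.0, 1.0], [-3.0, -0.5, 1.0]]
labels = [1, 1, 2, 2]
w = perceptron(data, labels)
```

For the elliptic classifier of this section, `vectors` would be the qmap-ed population members, with discarded individuals in class 1 and selected ones in class 2.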
Again, the algorithm will not stop if the original vectors are not separable by an elliptic decision boundary.

Algorithm 5.2: Modified Perceptron Algorithm
1   begin
2       w = 0                                   /* initialize the weight vector */
3       z_i = −z_i for all i where C(z_i) = 2   /* invert points in class 2 */
4       z* = arg min_{z ∈ Z} (wz)               /* find the training vector with minimal projection onto w */
5       A ← use Eq. 5.14                        /* arrange the first D(D+1)/2 elements of w into matrix A */
6       (λ*, v*) ← MinimalEigenvalueAndEigenvectorOf(A)
7       if wz* > 0 and λ* > 0 then
8           exit          /* both the minimal projection and the minimal eigenvalue are positive; done */
9       else if wz* < λ* then
10          w = w + (z*)^T          /* adapt w using the data vector with the greatest error */
11      else
12          w = w + pqmap(v*)^T     /* adapt w using the pqmap-ed eigenvector for the sake of ellipticity */
13      goto step 4
14  end

From Quadratic Function to Gaussian Distribution

The quadratic function x^T A x + Bx + c learned by the modified perceptron algorithm is not defined uniquely; all functions of the type k(x^T A x + Bx + c), k ≠ 0, have the same decision boundary. One reasonable way of standardization (assuming matrix A is positive definite) is to fix the function value at the minimum of the function. The minimum lies at the point µ = −½ (B A^(−1))^T. The function value at the minimum is deliberately chosen to be −1, i.e. the following equations must hold:

k (µ^T A µ + B µ + c) = −1,   (5.20)
k = −1 / (µ^T A µ + B µ + c).   (5.21)

The matrices defining the standardized quadratic function are then given as A_S = kA, B_S = kB, and c_S = kc.
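The standardization of Eqs. 5.20 and 5.21 and the extraction of the Gaussian parameters can be sketched for the 2-D case as follows (names are mine; the 2×2 inverse is written out explicitly to keep the sketch dependency-free):

```python
def inv2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[ m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det,  m[0][0] / det]]

def quadratic_to_gaussian(a, b, c):
    """Standardize k*(x^T A x + B x + c) so the minimum value is -1
    (Eq. 5.21) and return mu = -1/2 (B A^-1)^T and Sigma = A_S^-1."""
    a_inv = inv2(a)
    # A is symmetric, so (B A^-1)^T = A^-1 B^T.
    mu = [-0.5 * sum(a_inv[i][j] * b[j] for j in range(2)) for i in range(2)]
    # Function value at the minimum.
    f_min = c + sum(b[i] * mu[i] for i in range(2)) \
              + sum(mu[i] * a[i][j] * mu[j] for i in range(2) for j in range(2))
    k = -1.0 / f_min                                   # Eq. 5.21
    a_s = [[k * a[i][j] for j in range(2)] for i in range(2)]
    return mu, inv2(a_s)                               # Sigma = A_S^-1

# Example: x^2 + y^2 - 2x = 0 is a circle of radius 1 centered at (1, 0);
# the minimum of the quadratic is -1 at (1, 0), so k = 1 and Sigma = A^-1 = I.
mu, sigma = quadratic_to_gaussian([[1.0, 0.0], [0.0, 1.0]], [-2.0, 0.0], 0.0)
```

In the worked example the quadratic already has minimum value −1, so the standardization leaves it unchanged; for any other scaling of the same boundary, k rescales A so that the unit Mahalanobis contour of N(µ, Σ) coincides with the learned ellipse.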
The multivariate normal distribution N(µ, Σ) which will be used to sample new points is then given by µ = −½ (B A^(−1))^T and Σ = A_S^(−1).

Sampling from the Normal Distribution

Sampling from the Gaussian distribution with center µ and covariance matrix Σ is a rather standard task. The distribution, however, suffers from the curse of dimensionality in such a way that the proportion of generated vectors lying inside the separating ellipsoid drops with increasing dimensionality. Suppose we have a set of vectors generated from the D-dimensional standardized normal distribution. Each of the coordinates has a unidimensional standardized normal distribution, and their sum of squares has a χ² distribution with D degrees of freedom, χ²_D. Thus, if we want to specify the proportion p, p ∈ (0, 1), of vectors lying inside the separating ellipsoid, we can employ the inverse cumulative distribution function of the χ² distribution, CDF⁻¹_{χ²_D}, in the way described in step 3 of the sampling algorithm, Alg. 5.3.

Algorithm 5.3: Sampling from the normal distribution with a specified proportion of points lying inside the separating ellipsoid
1   begin
2       (Λ, R) ← Eig(Σ)   /* eigendecompose the covariance matrix so that Σ = R^T Λ² R, where R is a rotation matrix of eigenvectors and Λ² is a diagonal matrix of the eigenvalues, i.e. Λ is a diagonal matrix of standard deviations along the individual principal axes */
3       Λ ← Λ / sqrt(CDF⁻¹_{χ²_D}(p))   (5.22)   /* modify the standard deviations using the critical value of the χ² distribution */
4       x_S ∼ N(0, I)   /* generate the desired number of vectors x_S from the standardized multivariate Gaussian distribution */
5       x_C = R Λ x_S   /* rescale them using the standard deviations Λ and rotate them using R */
6       x = x_C + µ     /* decenter the vectors x_C using the center µ */
7   end

5.3.2 Empirical Comparison

To assess the very basic characteristics of the proposed algorithm, it is compared with the CMA-ES on two quadratic fitness functions, spherical and ellipsoidal (see Appendices A.1 and A.2, respectively).

Evolutionary Model

The evolutionary model used in the experiments is depicted as Alg. 5.4. CMA-ES uses its own standard evolutionary model.

Algorithm 5.4: Evolutionary model for the proposed method
1   begin
2       Initialize the population of size N and evaluate it
3       while not TerminationCondition do
4           Assign the discarded individuals to class 1 and the selected individuals to class 2
5           Map the population to the quadratic feature space
6           Find a linear decision boundary given by vector w using the modified perceptron algorithm
7           Rearrange the components of w into matrices A, B, and c
8           Turn the quadratic model into a Gaussian distribution
9           Modify the eigenvalues of the Gaussian so that the ellipsoid contains the desired proportion of new individuals
10          Sample N − 1 new individuals from the learned Gaussian
11          Evaluate them
12          Join the old and new populations using elitism and discard individuals so that the population is of size N again
13  end

The population is initialized in the area ⟨−10, −5⟩^D in order to test not only the ability to focus the search when it resides in the area of the optimum, but also the ability to efficiently shift the population toward the optimum. Elitism and sampling of N − 1 new individuals together ensure that the new population will contain at least 1 old individual and 1 new individual.
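The rescaling in step 3 of Alg. 5.3 can be checked empirically in two dimensions, where the inverse CDF of χ²₂ has the closed form −2 ln(1−p). This is a sketch with my own names; the covariance here is the identity, so the eigendecomposition is trivial (R = I):

```python
import math
import random

def sample_inside_fraction(p, n=20000, seed=3):
    """Sample from N(0, Sigma') with the standard deviations scaled per
    Eq. 5.22 and measure the fraction of samples falling inside the
    unit-Mahalanobis ellipsoid of the original Sigma = I."""
    rng = random.Random(seed)
    # Eq. 5.22 for D = 2: Lambda <- Lambda / sqrt(CDF^-1_{chi2_2}(p)),
    # using the closed-form inverse CDF of chi-squared with 2 dof.
    chi2_inv = -2.0 * math.log(1.0 - p)
    lam = 1.0 / math.sqrt(chi2_inv)
    inside = 0
    for _ in range(n):
        x = (lam * rng.gauss(0, 1), lam * rng.gauss(0, 1))
        if x[0] ** 2 + x[1] ** 2 <= 1.0:   # inside the separating ellipsoid
            inside += 1
    return inside / n

frac = sample_inside_fraction(0.9)  # empirically close to the requested 0.9
```

The scaled sample lies inside the ellipsoid exactly when the underlying χ²₂ variate falls below CDF⁻¹(p), which happens with probability p, independently of the dimension.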
This feature greatly prevents stagnation in situations when no new individual is better than the worst old individual. The algorithm was stopped when the best fitness value in the population dropped below 10^−8. The reported results are taken from 20 independent runs for each algorithm and configuration.

Where to Place the Gaussian Center?

After finding the decision ellipsoid and turning it into a Gaussian distribution, we can decide where we want to place it. There are basically two possible decisions:

1. Place it at the center of the learned quadratic function. This placement is reminiscent of the process used in conventional EDAs (and is actually depicted in Fig. 5.7, right). Also, if the ellipsoid were fit precisely, the algorithm could jump to the area of the global optimum in a few iterations.

2. Place it around the best individual of the population. Such an approach is similar to mutative ES, and the search is more local.

Since the competitor of the proposed algorithm is the CMA-ES, which uses the second option, the learned Gaussian distribution is centered around the best individual. This also makes the algorithm behave more like a local search and prevents possibly 'long jumps' from the current center to the new estimated center of the Gaussian.

Figure 5.8: Comparison of average evolution traces for the proposed algorithm (solid line) and the CMA-ES (dashed line) on the sphere function (left) and the ellipsoid function (right). The individual lines belong to the 2-, 4-, 6-, and 8-dimensional versions, respectively, from left to right.

Population Sizes

The CMA-ES uses a population sizing equation of the form N = 4 + ⌊3 ln(D)⌋.
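With the natural logarithm, this sizing equation reproduces the CMA-ES row of Table 5.3 below (a quick check; the function name is mine):

```python
import math

def cma_es_popsize(d):
    """Default CMA-ES population sizing: N = 4 + floor(3 * ln(D))."""
    return 4 + math.floor(3 * math.log(d))

sizes = {d: cma_es_popsize(d) for d in (2, 4, 6, 8)}
# -> {2: 6, 4: 8, 6: 9, 8: 10}
```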
For the proposed method, no population sizing model exists yet. However, the evaluation concentrates on the potential hidden in the proposed method, and thus I decided to tune the population size for the individual test problems and individual dimensionalities.² The best settings found for the proposed algorithm, along with the population size used by CMA-ES, are presented in Table 5.3.

Table 5.3: Best population sizes for both test problems

    Dimension                      2   4   6   8
    CMA-ES                         6   8   9  10
    Proposed method, Sphere        9   8   7   6
    Proposed method, Ellipsoidal  11  10   8   6

5.3.3 Results and Discussion

The comparison of the algorithms can be seen in Fig. 5.8. For the sphere function, the proposed approach is slightly better for dimensions 2 and 4, and slightly worse for dimensions 6 and 8. This could suggest that the algorithm will not scale up well and that its efficiency could drop with increasing dimensionality. However, the proposed algorithm used lower population sizes than the CMA-ES; it is possible that after a careful study of the algorithm's properties, it will be able to keep its superiority. For the ellipsoid function, the proposed algorithm clearly outperforms the CMA-ES for all tested dimensions (2, 4, 6, 8). But again, the difference between the proposed method and CMA-ES gets smaller with increasing dimensionality.

Coming back to Table 5.3, we can observe a very interesting phenomenon. The 'optimal' population size drops with increasing dimensionality, which is something really unexpected. For the time being, no sound explanation for this phenomenon is provided. I can only hypothesize that it is due to the learning algorithm, the modified perceptron. Fig. 5.9 shows an example of the ellipsoid found by the perceptron.

² This is not a good practice for production systems, but in this early stage of the research such tuning is acceptable for discovering the potential of the proposed method.
We can see that the ellipsoid is far from the ideal circular decision boundary. It may thus be profitable to use smaller populations, which would impose fewer constraints on the ellipsoid, which would in turn be less deformed. On the contrary, as can be seen in Fig. 5.9 (dashed line), the support vector machine (SVM) finds the maximum-margin separating hyperplane, which is a very useful feature. However, for the time being, a method of SVM training that would ensure the ellipticity of the learned decision boundary must still be invented.

Figure 5.9: The difference in decision boundaries found by the modified perceptron algorithm (solid line) and the support vector machine (dashed line).

5.3.4 Summary

This chapter presented a solution to the last aim of this thesis: to find a method usable in situations when the population members are placed in an area that does not contain any local optimum. From many points of view, the best algorithm able to accomplish this goal is CMA-ES.

In this thesis, a novel scheme for learning the Gaussian distribution in the context of EDAs was presented. The method is based on training a classifier with an elliptic decision boundary; after the training, it should correctly classify the selected and the discarded individuals if an elliptic decision boundary exists. The experiments (carried out on low-dimensional test problems only, for the time being) suggest that there is a huge potential in this method: for the sphere and ellipsoidal fitness functions, the proposed method outperformed the state-of-the-art CMA-ES optimizer.

The proposed method was developed independently, but it turned out that it could be described as an instance of the Learnable Evolution Model framework of Wojtusiak & Michalski (2006), which is also based on constructing classifiers of the selected and discarded population members. However, they use a different type of model; they use almost exclusively the AQ21 rules (see Wojtusiak (2004)).
With regard to the model type (Gaussian distribution) and the learning algorithm (elliptic classifier), the proposed method is an original contribution to the EC field.

However, as was already pointed out, the algorithm is very naive in its assumptions. As a result, in its present state it is able to optimize only convex quadratic functions. The development of a more efficient learning algorithm is an open research area.

The possibility of using the learned center of the quadratic function as the center of the Gaussian has not been investigated yet. This feature could further increase the efficiency of the algorithm. On the other hand, when using low population sizes, the estimate of the Gaussian center can become unstable and the search could be more erratic. There is also a promising possibility to look for the center of the search distribution on the line going through the BSF individual and the learned center of the separating ellipsoid, using a line search method.

All the advantages and limitations are not clear yet. The obvious ones are listed below.

Advantages of the proposed method

Resistance to premature convergence: The proposed method estimates neither the distribution of selected points nor the distribution of selected mutation steps; it estimates the contour lines of the fitness function. In principle, it is impossible to converge to a point which is not at least locally optimal.

Rotational invariance: The model is not sensitive to any rotation of the coordinate system.

Limitations of the proposed method

Local search: Similarly to CMA-ES, the proposed method belongs to the class of asymptotically complete methods, but it exhibits features of a local search in the neighborhood of one point.

Locally linear dependencies only: The multivariate normal distribution can describe only linear dependencies (the model has a similar decomposition ability to UMDA with linear coordinate transforms, described in Sec. 3.2).
It is not able to describe valleys in the form of curves (see the next limitation). It also cannot perform clustering (i.e. it cannot reproduce the distribution of population members if they form clusters in the search space).

Need for separability by a single ellipsoid The two classes of points (selected and discarded) must be separable by a single ellipsoid; this is a requirement of the modified perceptron algorithm (if this assumption is not satisfied, the algorithm will not stop). The proposed method in its present state is thus not applicable to multimodal and noisy functions, or to functions with nonlinear dependencies among variables. A learning algorithm that relaxes this assumption and allows for class overlap must be developed.

Non-optimality of the learned ellipsoid The ellipsoids learned by the modified perceptron algorithm are not optimal in many respects (see Fig. 5.9). Such non-optimalities become highly disturbing with increasing dimensionality; the algorithm's performance deteriorates. There are algorithms that fit better decision boundaries (e.g. support vector machines), but they must be altered to produce elliptic, not just quadratic, decision boundaries.

Chapter 6 Conclusions and Future Work

This thesis dealt with the applications of probabilistic models and coordinate transforms in EAs for real domains. The purpose of this chapter is to summarize the contributions of this thesis to the EC field and to suggest directions for future research.

6.1 The Main Contributions

The main ideas presented in this thesis are the following:

Importance of neighborhood structure learning After introducing the notions of optimization, Chapter 1 emphasized the fact that the neighborhood structure¹ is the crucial part of any optimization algorithm: if the neighborhood structure is not suitable for the particular optimization problem, the problem can hardly be solved efficiently, quickly, and reliably.
In population-based optimizers, the neighborhood structure is given by the variational operators. The population is usually the only means of adaptation; the operators remain static. This in turn motivates the development of methods capable of adapting the neighborhood structure by means of adapting the variational operators, i.e. there is an open space for inductive and transductive machine learning methods. One class of these methods is constituted by EDAs, which use probabilistic modeling of the population distribution (often complemented with coordinate transforms) to describe the particular neighborhoods. Population-based search algorithms are thus viewed as methods performing a search in the local neighborhood of the population. This view is not common in the EC field and provides a unifying perspective on all search algorithms.

Essential features of any optimization algorithm Chapter 1 also recognized the differences between discrete and real search spaces. For any real-valued optimization algorithm, it is crucial not only to address the design problems we face in discrete domains, i.e. (1) to learn the structure of the problem, and (2) to be able to focus the search on promising areas; the algorithm also has to be able (3) to shift the population by a particular distance in a particular direction (something that makes no sense in discrete domains), and (4) to make the first three requirements cooperate. Many researchers have tried to equip their algorithms with the above-mentioned features, but none of them expressed all these requirements explicitly. Chapters 3, 4, and 5 were dedicated to each of the first three requirements.

¹ As explained in footnote 8 on page 13, the neighborhood is given by the variational operator; in this context, the neighborhood is not considered to be given only by some static metric or topology.
Introduction and assessment of a new marginal probability histogram model In Sec. 3.1, a validation of the present knowledge about the use of univariate probabilistic models inside the marginal product model was presented and elaborated. Moreover, the max-diff histogram model was introduced and compared to equi-width and equi-height histograms and to the univariate mixture of Gaussians. The max-diff histogram can be constructed easily and proved to be comparable to the winner of this comparison, the equi-height histogram. The max-diff histogram has not yet been applied² in the context of EDAs; along with the comparison with the other univariate marginal models, it appears to be an original contribution of the author.

The use of linear coordinate transforms The flexibility of the marginal product probabilistic model can be improved by employing linear coordinate transforms (particularly PCA and ICA), which make it possible to describe a greater number of search-space constellations. Even though these preprocessing transforms were applied globally to the whole population, they are able to improve the efficiency of a search algorithm using the marginal product model on multimodal test functions. The use of these transforms is not new in EAs; however, their comparison presented in Sec. 3.2 of this thesis broadens the knowledge of their effect on EDA behavior. The study also revealed that the no-free-lunch theorem applies here as well; we do not know beforehand which of these transforms gives better results, or whether any transform should be applied at all.

The distribution tree model A new model of the joint probability distribution was developed in Sec. 4.1. It has a tree structure and is based on the CART framework. It can be described as a kind of multidimensional histogram. The most serious limitation of the model is its ability to create only axis-parallel splits of the search space.
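Returning to the max-diff histogram mentioned above: a minimal sketch of its construction and of sampling from it is given below. The boundary rule used here (cutting at the largest gaps between consecutive sorted values) is the standard max-diff rule from the database-histogram literature; function names and details are illustrative, not the thesis' exact algorithm:

```python
import numpy as np

def maxdiff_histogram(values, n_buckets):
    """Max-diff histogram: bucket boundaries are placed in the middles of the
    n_buckets - 1 largest gaps between consecutive sorted sample values."""
    xs = np.sort(np.asarray(values, dtype=float))
    gaps = np.diff(xs)
    cut_idx = np.sort(np.argsort(gaps)[-(n_buckets - 1):])
    edges = np.concatenate([[xs[0]],
                            (xs[cut_idx] + xs[cut_idx + 1]) / 2,
                            [xs[-1]]])
    counts, _ = np.histogram(xs, bins=edges)
    probs = counts / counts.sum()            # per-bucket probabilities
    return edges, probs

def sample(edges, probs, n, rng):
    """Sample by choosing a bucket by its probability, then uniformly inside it."""
    b = rng.choice(len(probs), size=n, p=probs)
    return rng.uniform(edges[b], edges[b + 1])
```

In a univariate marginal-product EDA, each coordinate of the selected individuals would get its own histogram of this kind, and new individuals would be sampled coordinate by coordinate.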
It is also possible to create a univariate probability model in the form of a distribution tree and use it in the UMDA framework; however, this possibility was not pursued in this thesis. The model is an original contribution of the author.

Application of kernel PCA The use of linear PCA and the need for a model of non-linear dependencies motivated the application of kernel PCA. Sec. 4.2 introduced the method and its embedding in an EDA. It requires solving the pre-image problem, which is itself an optimization problem. New individuals generated from the KPCA model reproduce the original distribution with high fidelity, but the creation of individuals is very time consuming. Due to these time demands, only the results for low-dimensional test problems are presented. Nevertheless, these results suggest that there is a huge potential in this method when applied in situations where only focusing of the search is needed. The model is able to describe curved valleys and clusters of high-quality solutions at the same time. On the other hand, it is highly prone to premature convergence if the training set (the selected individuals) does not surround the sought optimum.

Enhancement of the CMA-ES algorithm Chapter 5 dealt with methods which are very efficient in situations when a population shift is needed. The need to estimate not the distribution of the selected individuals, but rather the distribution of the selected mutation steps, is emphasized. These methods usually search in an adaptive neighborhood of one point, resembling the behavior of local optimizers although they use a population of individuals. The state-of-the-art CMA-ES algorithm was described and a method for its enhancement was presented (joint work of the author and Vojtěch Franc). The original CMA-ES adapts the covariance matrix and the step size on the basis of the selected mutation steps.

² To the best of the author's knowledge.
The proposed method estimates the covariance matrix on the basis of a binary classifier that is trained to discriminate between the selected and discarded individuals. The contour lines of the normal search distribution should coincide with the contour lines of the fitness function as much as possible. Although the proposed method still relies on some naive assumptions which are hardly fulfilled for real-world problems, it is very promising: on selected test problems it outperformed the CMA-ES algorithm.

6.2 Future Work

The thesis opened more questions than it answered. The algorithms presented in this thesis use rather sophisticated probability models, often in conjunction with coordinate transforms. All of them exhibit some negative features, which constitute the basis of the future work.

Methods for algorithm combinations Chapters 3, 4, and 5 presented algorithms with strong and weak structural assumptions, and methods for efficient population shift, respectively, which were identified in Sec. 1.4.2 as the first three essential features of any optimization algorithm. The fourth feature, allowing these methods to cooperate, was not investigated in this thesis. It is a very broad topic deserving its own dissertation. Remedy: Research into techniques allowing the presented algorithms to be combined so that they cooperate and do not harm each other. The first steps were already conducted in the works on competing heuristics by Tvrdík et al. (2001). However, this approach tends to select the more profitable local optimizers at the beginning of the search, which could prevent more global heuristics from finding better solutions in later stages of the evolution. Another possibility is to use parallel models of GAs (see e.g. Cantù-Paz (2000) or Pošík (2001)). They can easily be extended so that each subpopulation uses its own search heuristic. Migration then constitutes the means of information exchange between the individual search algorithms.
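The parallel-model idea above, each subpopulation running its own heuristic and exchanging individuals through migration, can be sketched as follows. The per-island heuristics (simple elitist Gaussian mutation at two different scales), the migration scheme, and all names are illustrative assumptions, not any particular published algorithm:

```python
import numpy as np

def sphere(x):
    return float(np.sum(x**2))

def evolve(pop, sigma, rng):
    """One elitist (mu+lambda) generation with Gaussian mutation of scale
    sigma, standing in for an island's own search heuristic."""
    children = pop + rng.normal(0.0, sigma, pop.shape)
    merged = np.vstack([pop, children])
    order = np.argsort([sphere(x) for x in merged])
    return merged[order[:len(pop)]]          # sorted: row 0 is the island's best

def island_model(pop_size=10, dim=5, gens=200, migrate_every=10):
    rng = np.random.default_rng(2)
    sigmas = [0.5, 0.05]                     # a coarse and a fine-grained island
    islands = [rng.uniform(-5, 5, (pop_size, dim)) for _ in sigmas]
    initial = min(sphere(x) for p in islands for x in p)
    for g in range(1, gens + 1):
        islands = [evolve(p, s, rng) for p, s in zip(islands, sigmas)]
        if g % migrate_every == 0:
            # ring migration: each island's best replaces the next island's worst
            bests = [p[0].copy() for p in islands]
            for i, b in enumerate(bests):
                islands[(i + 1) % len(islands)][-1] = b
    return initial, min(sphere(p[0]) for p in islands)
```

The point of the design is that migration lets the coarse island hand promising regions to the fine-grained island, which then refines them; either heuristic alone would be slower on at least one phase of the search.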
Meta-learning This thesis showed that EDAs can be successfully applied to many problems and many ‘time instants’ during the evolution. However, it is also obvious that particular models are most suitable for particular problems and particular ‘time instants’. Using one type of EDA to solve all kinds of problems is far from optimal; even using one type of probability model during the whole course of the evolution when solving one particular problem is not optimal. Remedy: Research and development of techniques that would allow us to analyze what kind of problem we are facing, and in what area of the search space the population is situated. This information would allow us to select an optimization technique that is appropriate with regard to the user requirements, the problem type, and the particular ‘time instant’.

Population memory The majority of algorithms and methods presented in this thesis still use the population as the main tool for implementing the memory of the algorithm (with CMA-ES being the exception). The probabilistic models used for creating new individuals are built from scratch in each generation. Thus, the population size still strongly influences the learning ability of the algorithm. Remedy: Research into the techniques used in incremental learning to adapt the probabilistic models. The model should be able to memorize the characteristics of past populations and adapt itself when new promising individuals are selected. This could also relax the need for large populations.

Reduced parallelization ability GAs are popular (among other reasons) because they allow us to parallelize virtually all operations. EDAs, on the contrary, have a lower parallelization ability, which is caused by the probabilistic model construction phase. Remedy: Research into probabilistic models that can be constructed by merging independently learned parts.
Such a property would allow us to distribute even the model learning phase among several processors.

Time and space demands of learning The learning procedures used to create the probabilistic models are often very time consuming and do not scale well with increasing dimensionality. They also often need a population of exponentially increasing size to encompass all the dependencies. Remedy: Selection and research of structurally simple models which can cover a sufficiently rich class of distributions and are easily learnable. The marginal product model is scalable, but structurally too simple. Equipped with PCA (the complexity of which is O(D³)), it encompasses a larger class of distributions at the expense of higher computational costs, etc. These two opposing criteria can possibly be balanced by using some meta-learning approaches. The incremental learning discussed above can help with the demands on population sizes.

Ignoring the solution feasibility None of the presented methods took into account the feasibility of the generated solutions, i.e. they were assumed to solve unconstrained optimization problems or optimization problems with box constraints, respectively. Remedy: There are many techniques for dealing with infeasible solutions: we can pretend they are all feasible, we can use penalty functions, or we can use repairing mechanisms. However, the influence of these techniques on the model used in the particular EDA is not clear. Moreover, models which would allow us to directly incorporate at least some of the constraints should be investigated.

Loss of paternity EDAs exhibit a feature which could be called the loss of paternity, i.e. we do not know which of the old individuals is the parent of a new individual, or, more precisely, all the old individuals are parents of all the new individuals. This prevents us from using various replacement schemes and other operators that take advantage of such a parent-child relation.
These operators are often used to preserve diversity in the population. Remedy: We have to use other means of diversity preservation. They are usually based on the notion of distance (the closer an old individual is, the more likely it is to be considered the parent of a new individual). Such substitutes, however, might be computationally expensive.

LEM approach As suggested in Sec. 5.3, building generative probabilistic models on the basis of classifiers trained to discriminate between selected and discarded individuals is a very promising direction of research. Kernel methods proved to be very flexible (as exemplified by KPCA in Sec. 4.2), so the expectations from SVMs are high. However, the toughest problem is that not all descriptions (models) of the selected individuals provide an easy way of instantiation, i.e. of creating new individuals belonging to the class the model describes.

Appendix A Test Functions

This appendix contains the descriptions of the various functions used throughout the thesis to test and compare individual algorithms. The test functions were carefully chosen not to clutter this thesis with an excessive number of results, yet to reveal the advantages and disadvantages of the individual optimization methods.

A.1 Sphere Function

The sphere function (a.k.a. quadratic function, or quadratic bowl) is the very basic optimization problem. It is given by the equation

f(x) = \sum_{d=1}^{D} x_d^2 .  (A.1)

The optimum: f(0) = 0. It is usually used with the search space unconstrained, or constrained by the hypercube x ∈ ⟨−100, 100⟩^D. The function is unimodal, order-1 separable, symmetric, and rotationally invariant. If an algorithm is not able to solve this easy task, it has no chance of successfully solving more complex functions.

A.2 Ellipsoidal Function

The ellipsoidal function (a.k.a. elliptical function) is a somewhat tougher version of the sphere function. The individual axes are scaled; the task is given as

f(x) = \sum_{d=1}^{D} (10^6)^{\frac{d-1}{D-1}} x_d^2 .
(A.2)

The optimum: f(0) = 0. It is usually used with the search space unconstrained, or constrained by the hypercube x ∈ ⟨−100, 100⟩^D. The function is unimodal and order-1 separable. The scaling of the individual axes makes this function more difficult: the individual object variables are forced to converge one after another. Algorithms prone to premature convergence have difficulties maintaining diversity in the dimensions with lower weights while optimizing the dimensions with higher weights.

A.3 Two-Peaks Function

The two-peaks function was originally introduced as a maximization problem (hence its name). For the purposes of this thesis, it was inverted to a minimization task; it is given by the equation

f(x) = \sum_{d=1}^{D} f_{two-peaks}(x_d) ,  (A.3)

where

f_{two-peaks}(x) =
  5 − 5x           iff x ∈ ⟨0, 1)
  5x − 5           iff x ∈ ⟨1, 2)
  5 − 0.8(x − 2)   iff x ∈ ⟨2, 7)
  5 − 0.8(12 − x)  iff x ∈ ⟨7, 12⟩   (A.4)

The optimum is f(1) = 0; the function is usually used with the box constraints x ∈ ⟨0, 12⟩^D. The function is order-1 separable, but multimodal. These two features together cause the number of local optima to grow as 2^D with the dimensionality. Moreover, as can be seen in Fig. A.1, the size of the basin of attraction of the global optimum relative to the size of the whole search space decreases as (1/6)^D. Without a proper decomposition, this function is almost impossible to solve. Methods that try to decompose the function based on the distribution of good solutions have serious trouble solving this function.

Figure A.1: The basic 1D two-peaks function (left) and the contour lines of the 2D two-peaks function (right).

A.4 Griewangk Function

The Griewangk function is given by the equation

f(x) = 1 + \frac{1}{4000} \sum_{d=1}^{D} x_d^2 − \prod_{d=1}^{D} \cos\left(\frac{x_d}{\sqrt{d}}\right) .  (A.5)

The optimum is f(0) = 0; the function is usually used with the box constraints x ∈ ⟨−5.12, 5.12⟩^D. The function is multimodal and non-separable.
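For reference, the test functions above can be written down directly; the following is a minimal sketch (the two-peaks pieces are written so that they are continuous and satisfy the stated optimum f(1) = 0; function names are the author's choice for illustration only):

```python
import numpy as np

def sphere(x):
    """Eq. (A.1): sum of squares, optimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x**2))

def ellipsoid(x):
    """Eq. (A.2): axes scaled from 1 up to 10^6, optimum f(0) = 0 (D >= 2)."""
    x = np.asarray(x, dtype=float)
    D = len(x)
    weights = (10.0**6) ** (np.arange(D) / (D - 1))
    return float(np.sum(weights * x**2))

def two_peaks_1d(x):
    """Eq. (A.4): 1-D two-peaks piece; global minimum 0 at x = 1,
    local minimum of depth 1 at x = 7."""
    if 0 <= x < 1:
        return 5 - 5 * x
    if 1 <= x < 2:
        return 5 * x - 5
    if 2 <= x < 7:
        return 5 - 0.8 * (x - 2)
    return 5 - 0.8 * (12 - x)        # x in <7, 12>

def two_peaks(x):
    """Eq. (A.3): separable sum of 1-D two-peaks pieces."""
    return float(sum(two_peaks_1d(xd) for xd in x))

def griewangk(x):
    """Eq. (A.5): multimodal and non-separable, optimum f(0) = 0."""
    x = np.asarray(x, dtype=float)
    d = np.arange(1, len(x) + 1)
    return float(1 + np.sum(x**2) / 4000 - np.prod(np.cos(x / np.sqrt(d))))
```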
In the search space, the local optima form a pattern resembling the number 5 on a die (see Fig. A.2, left). However, Whitley et al. (1996) mentioned that the influence of the interactions gets smaller when the Griewangk function is applied in more dimensions. This phenomenon is illustrated in Fig. A.2, right: the relative importance of the surrounding local optima gets lower, and already the diagonal cut of the 10-dimensional Griewangk function is pretty smooth.

Figure A.2: The 2D Griewangk function (left) and the fitness values along the search-space body diagonal for dimensions 1, 2, 5, and 10 (right).

A.5 Rosenbrock Function

The basic 2D Rosenbrock function (a.k.a. banana function) is given by the equation

f(x) = 100 · (x_1^2 − x_2)^2 + (1 − x_1)^2 .  (A.6)

The optimum is f(1) = 0. The function is unimodal, but non-separable, with a high degree of non-linear interaction between the variables. It is very hard to optimize using ordinary evolutionary algorithms. The basic form of this function is two-dimensional (see Fig. A.3).

Figure A.3: The 2D Rosenbrock function (left) and its contour lines (right).

There are several ways of extending this function to spaces of more dimensions:

1. summing up the scores of all non-overlapping pairs of variables, i.e. f_{multidim} = \sum_{d=1}^{D/2} f(x_{2d−1}, x_{2d}),

2. summing up the scores of all consecutive overlapping pairs of variables, i.e. f_{multidim} = \sum_{d=1}^{D−1} f(x_d, x_{d+1}).

Suppose we want to evaluate a 10-dimensional object. The first way would result in 5 calls to the basic f function, while the second way would call the f function 9 times. In this thesis, the first possibility is used.

Appendix B Vestibulo-Ocular Reflex Analysis

In this appendix, the optimization problem used to compare the Nelder-Mead simplex search and CMA-ES in Sec.
5.2 is introduced.

B.1 Introduction

The vestibulo-ocular reflex (VOR) is responsible for maintaining retinal image stabilization in the eyes during relatively brief periods of head movement (see Stockwell (1997)). By analyzing the VOR signal, physicians can recognize some pathologies of the vestibular organ which may result in, e.g., failures of the patient's balance. The recognition of the pathologies is usually done by examining the slow-phase eye velocity and several points of the frequency response (gain and phase shift) of the vestibular system.

The principle of the frequency-response measurement is relatively simple: the patient is seated in a chair which is then rotated in a defined way following a source signal, a sine wave or a mixture of sine (MOS) waves. The chair with the patient is situated in the dark; the patient has his eyes open and performs some mental tasks which should distract him from mental visualization that could prevent the eye movements which are subsequently measured. This is called the head rotation test. Since the resulting eye signal is distorted by fast eye movements, so-called saccades, these must be removed from the signal. This is usually done by computing the angular velocity; the segments with a velocity higher than a predefined threshold are simply omitted from the signal. A method for discovering the right threshold was presented, e.g., in Novák et al. (2002). The resulting signal consists of segments of slow-phase movements, which we are interested in. However, they are not aligned to form a smooth signal (see Fig. B.1).

Figure B.1: Simulated VOR signal with saccades removed. This is the input of the algorithm.

This VOR signal serves as a source for measuring the slow-phase velocity and the frequency response.
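The saccade-removal step described above (omit the samples whose angular velocity exceeds a threshold, keeping the remaining slow-phase segments) can be sketched as follows; the threshold value and the function name are illustrative assumptions:

```python
import numpy as np

def remove_saccades(signal, dt, threshold):
    """Split the eye-position signal into slow-phase segments by discarding
    samples whose finite-difference angular velocity exceeds the threshold."""
    velocity = np.abs(np.gradient(signal, dt))
    keep = velocity <= threshold
    segments, current = [], []
    for sample, ok in zip(signal, keep):
        if ok:
            current.append(sample)
        elif current:                 # a saccade ends the current segment
            segments.append(np.array(current))
            current = []
    if current:
        segments.append(np.array(current))
    return segments
```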
The measurement of the frequency response is usually done by interpolating these segments with some smooth curves and performing a Fourier transform of the resulting continuous signal. The frequency response created this way, however, contains some artifacts that come from the artificial interpolation curves and are not generated by the vestibular system. In this chapter, a method for the direct estimation of the gain and phase lag of the individual sine components of the underlying MOS signal is introduced. It allows for the measurement of several points of the frequency response at the same time. After the estimation, the VOR signal segments should match¹ the corresponding parts of the estimated MOS signal, as shown in Fig. B.2.

Figure B.2: VOR signal segments aligned with the estimated MOS signal. The parameters of the MOS signal are the output of the algorithm.

B.2 Problem Specification

It is assumed that the source signal (which controls the rotation of the chair with the patient) is formed as a mixture of sine waves:

y^S(t) = \sum_{i=1}^{n} a_i^S sin(2π f_i t + φ_i^S) ,  (B.1)

where y^S(t) is the source signal and a_i^S, f_i, and φ_i^S are the amplitude, the frequency, and the phase shift of the individual sine components, respectively. The superscript S indicates the relation to the source signal. Note that the frequencies f_i are not marked with this superscript. It is assumed that the output signal of the vestibular organ is of the same form as the input one, i.e. it contains only sine components with the same frequencies as the source signal, but possibly with different amplitudes and phase shifts. It should be of the form

y(t) = \sum_{i=1}^{n} a_i sin(2π f_i t + φ_i) .  (B.2)

If we knew the a_i and φ_i parameters of the output MOS signal components, we could calculate the amplification (a_i / a_i^S) and the phase lag (φ_i − φ_i^S) at the individual frequencies and deduce the state of the vestibular organ.
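The MOS model of Eqs. (B.1) and (B.2), together with the derived per-frequency amplification and phase lag, can be written down directly. The following is a small illustrative sketch; all parameter values are made up:

```python
import numpy as np

def mos(t, amps, freqs, phases):
    """Mixture-of-sines signal: y(t) = sum_i a_i * sin(2*pi*f_i*t + phi_i)."""
    t = np.asarray(t, dtype=float)
    return sum(a * np.sin(2 * np.pi * f * t + p)
               for a, f, p in zip(amps, freqs, phases))

# Illustrative source and output parameters; the frequencies are shared,
# as assumed in the text.
freqs = [0.1, 0.3, 0.7]
a_src, phi_src = [1.0, 0.8, 0.5], [0.0, 0.2, -0.1]
a_out, phi_out = [0.9, 0.6, 0.3], [-0.3, -0.1, -0.5]

t = np.linspace(0, 20, 2001)
y_src = mos(t, a_src, freqs, phi_src)      # source signal y^S(t), Eq. (B.1)
y_out = mos(t, a_out, freqs, phi_out)      # output signal y(t),  Eq. (B.2)

# Per-frequency amplification a_i / a_i^S and phase lag phi_i - phi_i^S.
gain = np.array(a_out) / np.array(a_src)
lag = np.array(phi_out) - np.array(phi_src)
```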
(Ideally, we should have an amplification² of 1 and a minimal phase lag at all frequencies, which is not possible. However, physicians can analyze the deviations and diagnose the states that are not OK.)

¹ In fact, the slow phases of the measured VOR signal should go in the direction opposite to the source MOS signal. When the chair rotates to the right, the eyes should move to the left and vice versa.
² Or, rather, of −1 with respect to the previous footnote.

Unfortunately, we do not have access to the output MOS signal described by Eq. B.2. We have only the measured VOR signal, i.e. the segments of the output MOS signal that are left after filtering out the saccades from the eye-tracking signal (see Fig. B.1)³. However, we can search for the unknown parameters a_i and φ_i of the output MOS signal by solving the optimization task described in the following text.

B.3 Minimizing the Loss Function

Let m be the number of segments of the VOR signal at hand, let v_j(t), j = 1 . . . m, be the j-th segment of the VOR signal, and let t_j^{ini} and t_j^{end} be the initial and final time instants of the j-th signal segment. As stated above, we can find the parameters of the output MOS signal by searching the 2n-dimensional space of points x, x = (a_1, φ_1, . . . , a_n, φ_n). Such a vector of parameters represents an estimate of the output MOS signal, and we can compute the degree of fidelity with which the MOS corresponds to the VOR signal segments by constructing a loss function as follows:

L(x) = \sum_{j=1}^{m} \int_{t_j^{ini}}^{t_j^{end}} ((v_j(t) − v̄_j) − (y(t) − ȳ_j))^2 dt ,  (B.3)

where v̄_j is the mean value of the j-th VOR signal segment, computed as

v̄_j = \frac{1}{t_j^{end} − t_j^{ini}} \int_{t_j^{ini}}^{t_j^{end}} v_j(t) dt ,  (B.4)

and ȳ_j is the mean value of the current estimate of the output MOS signal related to the j-th segment, computed as

ȳ_j = \frac{1}{t_j^{end} − t_j^{ini}} \int_{t_j^{ini}}^{t_j^{end}} y(t) dt .
(B.5)

Subtracting the means v̄_j and ȳ_j from the VOR signal segments v_j(t) and the MOS signal y(t), respectively, we try to match each VOR signal segment to the corresponding part of the MOS signal. If they match, their difference is zero; otherwise it is a positive number quadratically increasing with the difference. This operation is carried out for all m VOR signal segments. In practice we work with the discretized versions of the signals, so we usually substitute the integrals with sums. The equations are then⁴

L(x) = \sum_{j=1}^{m} \sum_{i=t_j^{ini}}^{t_j^{end}} ((v_j(i) − v̄_j) − (y(i) − ȳ_j))^2 ,  (B.6)

v̄_j = \frac{1}{t_j^{end} − t_j^{ini}} \sum_{i=t_j^{ini}}^{t_j^{end}} v_j(i) ,  (B.7)

ȳ_j = \frac{1}{t_j^{end} − t_j^{ini}} \sum_{i=t_j^{ini}}^{t_j^{end}} y(i) .  (B.8)

³ It is important to note that in this thesis only artificially generated (simulated) VOR signals were used. This allows for assessing the precision of the proposed method.
⁴ In Equations B.6, B.7, and B.8, the arrays v_j(i) are supposed to be indexed with i ranging from t_j^{ini} to t_j^{end}.

B.4 Nature of the Loss Function

Figures B.3, B.4, and B.5 show what the landscape of the loss function L(a_1, φ_1, a_2, φ_2) looks like if two of the parameters are kept fixed. For these figures, the optimal values of the parameters are set to x = (0.6, 1, 0.2, 0.2) (marked with a small cross in the figures). It seems that the loss function L(x) exhibits many features which are considered hard for any optimization algorithm, namely:

Non-separability. It is not sufficient to optimize the parameters one after another. The profile of the loss function along one variable changes significantly with a change in another variable. See Figures B.3, B.4, and B.5: the cross marking the optimum is not situated in the optimum of the cut if the other parameters are not optimal as well. The function cannot be decomposed into a set of simpler optimization tasks.

Long narrow valleys not aligned with the coordinate axes. See Fig. B.3.
Even gradient-based algorithms have problems finding the minimum of such a landscape. They have to perform many small steps along the bottom of the valley before they hit the optimum.

Multimodality. See Fig. B.5. There are several local minima. In this case they are caused by the periodicity of the sine function with respect to the phase shift. However, based on the experience gained when optimizing this function, I hypothesize that the influence of the local optima can be negligible: the optimizers exhibit behavior similar to that on unimodal functions, but with very narrow and perhaps tortuous valleys leading to the global optimum.

Figure B.3: Cuts through the landscape of the loss function L(a_1, φ_1, a_2, φ_2) with φ_1 and φ_2 fixed at their optimal values (left) and at values different from the optimal ones (right).

Figure B.4: Cuts through the landscape of the loss function L(a_1, φ_1, a_2, φ_2) with a_2 and φ_2 fixed at their optimal values (left) and at values different from the optimal ones (right).

Figure B.5: Cuts through the landscape of the loss function L(a_1, φ_1, a_2, φ_2) with a_1 and a_2 fixed at their optimal values (left) and at values different from the optimal ones (right).

Bibliography

Amari, S., Cichocki, A. & Yang, H. H. (1996), A new learning algorithm for blind source separation, in ‘Advances in Neural Information Processing 8’, MIT Press, Cambridge, MA, pp. 757–763.

Auger, A., Schoenauer, M. & Vanhaecke, N.
(2004), LS-CMA-ES: A second-order algorithm for covariance matrix adaptation, in X. Y. et al., ed., ‘Parallel Problem Solving from Nature VIII’, number 3242 in ‘LNCS’, Springer Verlag, pp. 182–191.

Bäck, T. (1992), Self-adaptation in genetic algorithms, in F. J. Varela & P. Bourgine, eds, ‘Towards a Practice of Autonomous Systems: 1st European Conference on Artificial Life’, MIT Press, pp. 263–271.

Baluja, S. (1994), Population based incremental learning: A method for integrating genetic search based function optimization and competitive learning, Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Baluja, S. & Davies, S. (1997), Using optimal dependency-trees for combinatorial optimization: Learning the structure of the search space, in D. Fisher, ed., ‘14th International Conference on Machine Learning’, Morgan Kaufmann, pp. 30–38.

Bosman, P. A. & Grahl, J. (2006), ‘Matching inductive bias and problem structure in continuous estimation-of-distribution algorithms’, European Journal of Operational Research.

Bosman, P. A. & Thierens, D. (1999), An algorithmic framework for density estimation based evolutionary algorithms, Technical Report UU-CS-1999-46, Utrecht University.

Bosman, P. A. & Thierens, D. (2000), Continuous iterated density estimation evolutionary algorithms within the IDEA framework, in ‘Workshop Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000)’, pp. 197–200.

Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984), Classification and Regression Trees, Kluwer Academic Publishers.

Cantù-Paz, E. (2000), Efficient and Accurate Parallel Genetic Algorithms, Vol. 1 of Genetic Algorithms and Evolutionary Computation, 1 edn, Kluwer Academic Publishers. ISBN 0792372212.

Comon, P. (1994), ‘Independent component analysis - a new concept?’, Signal Processing 36, 287–314.

Davis, L. (1991), Handbook of Genetic Algorithms, Van Nostrand Reinhold.

de Bonet, J. S., Isbell, C. L. & Viola, P.
(1997), ‘MIMIC: Finding optima by estimating probability densities’, Advances in Neural Information Processing Systems 9, 424–431.

Eiben, A. E., Hinterding, R. & Michalewicz, Z. (1999), ‘Parameter control in evolutionary algorithms’, IEEE Trans. on Evolutionary Computation 3(2), 124–141.

Etxeberria, R. & Larrañaga, P. (1999), Global optimization using Bayesian networks, in A. Rodriguez, M. Ortiz & R. Hermida, eds, ‘CIMAF 99, Second Symposium on Artificial Intelligence’, Adaptive Systems, La Habana, pp. 332–339.

Fogarty, T. C. (1989), Varying the probability of mutation in genetic algorithms, in J. D. Schaffer, ed., ‘Proc. of the 3rd International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 104–109.

Fogel, L. J., Angeline, P. J. & Fogel, D. B. (1995), An evolutionary programming approach to self-adaptation on finite state machines, in J. McDonnel, R. Reynolds & D. Fogel, eds, ‘Proc. of the 4th Annual Conference on Evolutionary Programming’, MIT Press, pp. 355–365.

Friedman, J. H. (1987), ‘Exploratory projection pursuit’, Journal of the American Statistical Association 82(397), 249–266.

Gallagher, M. R., Frean, M. & Downs, T. (1999), Real-valued evolutionary optimization using a flexible probability density estimator, in ‘Genetic and Evolutionary Computation Conference (GECCO-1999)’, Morgan Kaufmann, pp. 840–864.

Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley.

Goldberg, D. E., Deb, K. & Clark, J. H. (1992), ‘Genetic algorithms, noise and the sizing of populations’, Complex Systems 6, 333–362.

Goldberg, D. E. & Smith, R. E. (1987), Nonstationary function optimization using genetic algorithms with dominance and diploidy, in J. J. Grefenstette, ed., ‘Proc. of the 2nd International Conference on Genetic Algorithms and Their Applications’, Lawrence Erlbaum Associates, pp. 59–68.

Grefenstette, J. J. (1986), ‘Optimization of control parameters for genetic algorithms’, IEEE Trans.
on Systems, Man, and Cybernetics 16(1), 122–128.

Hansen, N. (2005), ‘References to CMA-ES applications’. URL: http://www.bionik.tu-berlin.de/user/niko/cmaapplications.pdf

Hansen, N. (2006), The CMA evolution strategy: A comparing review, in J. A. Lozano, P. Larrañaga, I. Inza & E. Bengoetxea, eds, ‘Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms’, number 192 in ‘Studies in Fuzziness and Soft Computing’, Springer, pp. 75–102.

Hansen, N. & Ostermeier, A. (2001), ‘Completely derandomized self-adaptation in evolution strategies’, Evolutionary Computation 9(2), 159–195.

Harik, G. (1999), Linkage learning via probabilistic modeling in the ECGA, Technical Report IlliGAL Report No. 99010, University of Illinois, Urbana-Champaign.

Harik, G., Cantú-Paz, E., Goldberg, D. E. & Miller, B. L. (1997), The gambler’s ruin problem, genetic algorithms, and the sizing of populations, in ‘Proc. of the 4th IEEE Conference on Evolutionary Computation’, pp. 7–12.

Harik, G., Lobo, F. & Goldberg, D. E. (1997), The compact genetic algorithm, Technical Report IlliGAL Report No. 97006, University of Illinois, Urbana-Champaign.

Hastie, T., Tibshirani, R. & Friedman, J. (2001), The Elements of Statistical Learning, Springer Series in Statistics, Springer Verlag.

Hotelling, H. (1933), ‘Analysis of a complex of statistical variables into principal components’, Journal of Educational Psychology 24, 417–441.

Hyvärinen, A. (1999), ‘Survey on independent component analysis’, Neural Computing Surveys 2, 94–128.

Hyvärinen, A. & Oja, E. (2000), ‘Independent component analysis: A tutorial’, Neural Networks 13(4–5), 411–430.

Janikow, C. Z. & Michalewicz, Z. (1991), An experimental comparison of binary and floating point representations in genetic algorithms, in R. Belew & L. Booker, eds, ‘Proc. of the 4th International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 151–157.

Joines, J. A. & Houck, C. R.
(1994), On the use of non-stationary penalty functions to solve nonlinear constrained optimization problems with GAs, in ‘Proc. of the 1st IEEE Conference on Evolutionary Computation’, IEEE Press, pp. 579–584.

Julstrom, B. A. (1995), What have you done for me lately? Adapting operator probabilities in a steady-state genetic algorithm, in L. Eshelman, ed., ‘Proc. of the 6th International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 81–87.

Larrañaga, P., Etxeberria, R., Lozano, J. A., Sierra, B., Inza, I. & Peña, J. M. (1999), A review of the cooperation between evolutionary computation and probabilistic graphical models, in A. Rodriguez, M. Ortiz & R. Hermida, eds, ‘CIMAF 99, Second Symposium on Artificial Intelligence’, Adaptive Systems, La Habana, pp. 314–324.

Larrañaga, P. & Lozano, J. A., eds (2002), Estimation of Distribution Algorithms, GENA, Kluwer Academic Publishers.

Lis, J. & Lis, M. (1996), Self-adapting parallel genetic algorithm with dynamic mutation probability, crossover rate and population size, in J. Arabas, ed., ‘Proceedings of the 1st Polish National Conference on Evolutionary Computation’, Oficyna Wydawnicza Politechniki Warszawskiej, pp. 324–329.

Michalewicz, Z. & Fogel, D. B. (1999), How To Solve It: Modern Heuristics, Springer Verlag. ISBN 3540660615.

Moerland, P. (2000), Mixture Models for Unsupervised and Supervised Learning, PhD thesis, Computer Science Department, Swiss Federal Institute of Technology, Lausanne.

Mühlenbein, H., Mahnig, T. & Rodriguez, A. (1999), ‘Schemata, distributions, and graphical models in evolutionary optimization’, Journal of Heuristics 5, 215–247.

Mühlenbein, H. & Paass, G. (1996), From recombination of genes to the estimation of distributions I. Binary parameters, in ‘Parallel Problem Solving from Nature’, pp. 178–187.

Nelder, J. & Mead, R. (1965), ‘A simplex method for function minimization’, The Computer Journal 7(4), 308–313.

Neumaier, A.
(2004), ‘Complete search in continuous global optimization and constraint satisfaction’, Acta Numerica 13, 271–369.

Novák, D., Cuesta-Frau, D., Brzezný, R., Černý, R., Lhotská, L. & Eck, V. (2002), Method for clinical analysis of eye movements induced by rotational test, in ‘Proc. of EMBEC’.

Očenášek, J. & Schwarz, J. (2002), Estimation of distribution algorithm for mixed continuous-discrete optimization problems, in ‘2nd Euro-International Symposium on Computational Intelligence’, IOS Press, Košice, Slovakia, pp. 227–232. ISBN 1-58603-256-9, ISSN 0922-6389.

Paredis, J. (1995), ‘Coevolutionary computation’, Artificial Life 2(4), 355–375.

Pelikan, M., Goldberg, D. E. & Cantú-Paz, E. (1998), Linkage problem, distribution estimation, and Bayesian networks, Technical Report IlliGAL Report No. 98013, University of Illinois, Urbana-Champaign.

Pelikan, M. & Mühlenbein, H. (1999), The bivariate marginal distribution algorithm, in ‘Advances in Soft Computing – Engineering Design and Manufacturing’, pp. 521–535.

Poosala, V., Ioannidis, Y., Haas, P. & Shekita, E. (1996), Improved histograms for selectivity estimation of range predicates, in ‘1996 ACM SIGMOD Intl. Conf. Management of Data’, ACM Press, pp. 294–305.

Pošík, P. (2001), Parallel genetic algorithms (in Czech), Master’s thesis, Czech Technical University, Prague.

Pošík, P. (2003), Comparing various marginal probability models in evolutionary algorithms, in P. Ošmera, ed., ‘MENDEL 2003’, Vol. 1, Brno University, Brno, pp. 59–64. ISBN 80-214-2411-7.

Pošík, P. (2004), Using kernel principal components analysis in evolutionary algorithms as an efficient multi-parent crossover operator, in ‘IEEE 4th International Conference on Intelligent Systems Design and Applications’, IEEE, Piscataway, pp. 25–30. ISBN 963-7154-29-9.

Pošík, P.
(2005a), On the utility of linear transformations for population-based optimization algorithms, in ‘Preprints of the 16th World Congress of the International Federation of Automatic Control’, IFAC, Prague. CD-ROM.

Pošík, P. (2005b), Real-parameter optimization using the mutation step coevolution, in ‘IEEE Congress on Evolutionary Computation’, IEEE, pp. 872–879. ISBN 0-7803-9364-3.

Pošík, P. & Franc, V. (2006), Using elliptic decision boundary as a probabilistic model in EAs. Personal communication.

Rudlof, S. & Köppen, M. (1996), Stochastic hill climbing by vectors of normal distributions, in ‘First Online Workshop on Soft Computing’, Nagoya, Japan.

Rudolph, G. (2001), ‘Self-adaptive mutations may lead to premature convergence’, IEEE Trans. on Evolutionary Computation 5(4), 410–413.

Schaffer, J. & Morishima, A. (1987), An adaptive crossover distribution mechanism for genetic algorithms, in J. Grefenstette, ed., ‘Proc. of the 2nd International Conference on Genetic Algorithms and Their Applications’, Lawrence Erlbaum Associates, pp. 36–40.

Schlesinger, M. I. & Hlaváč, V. (2002), Ten Lectures on Statistical and Structural Pattern Recognition, Kluwer Academic Publishers, Dordrecht, The Netherlands.

Schölkopf, B. & Smola, A. J. (2002), Learning with Kernels, MIT Press, Cambridge, Massachusetts.

Schölkopf, B., Smola, A. J. & Müller, K.-R. (1996), Nonlinear component analysis as a kernel eigenvalue problem, Technical report, Max-Planck-Institut für biologische Kybernetik.

Schwefel, H.-P. (1995), Evolution and Optimum Seeking, Wiley, New York.

Sebag, M. & Ducoulombier, A. (1998), Extending population-based incremental learning to continuous search spaces, in ‘Parallel Problem Solving from Nature V’, Springer Verlag, Berlin, pp. 418–427.

Servet, I., Trave-Massuyes, L. & Stern, D. (1997), Telephone network traffic overloading diagnosis and evolutionary techniques, in ‘Third European Conference on Artificial Evolution’, pp. 137–144.

Shaefer, C.
(1987), The ARGOT strategy: Adaptive representation genetic optimizer technique, in J. Grefenstette, ed., ‘Proc. of the 2nd International Conference on Genetic Algorithms and Their Applications’, Lawrence Erlbaum Associates, pp. 50–55.

Spears, W. (1995), Adapting crossover in evolutionary algorithms, in J. McDonnel, R. Reynolds & D. B. Fogel, eds, ‘Proc. of the 4th Annual Conference on Evolutionary Programming’, MIT Press, pp. 367–384.

Stockwell, C. (1997), ‘Vestibular testing: past, present, future’, British Journal of Audiology 31(6), 387–398.

Thierens, D. & Goldberg, D. E. (1991), Mixing in genetic algorithms, in R. Belew & L. Booker, eds, ‘Proc. of the 4th International Conference on Genetic Algorithms’, Morgan Kaufmann, pp. 31–37.

Tipping, M. E. & Bishop, C. M. (1999), ‘Probabilistic principal component analysis’, Journal of the Royal Statistical Society 61, 611–622.

Tsutsui, S., Goldberg, D. E. & Sastry, K. (2001), Linkage learning in real-valued GAs with simplex crossover, in ‘Proceedings of EA’01’, Le Creusot, France.

Tsutsui, S., Pelikan, M. & Goldberg, D. E. (2001), Evolutionary algorithm using marginal histogram models in continuous domain, Technical Report IlliGAL Report No. 2001019, University of Illinois, Urbana-Champaign.

Tvrdík, J., Křivý, I. & Mišík, L. (2001), Evolutionary algorithm with competing heuristics, in P. Ošmera, ed., ‘Proceedings of MENDEL 2001, 7th Int. Conference on Soft Computing’, Vol. 1, Brno University, pp. 58–64.

Whitley, D., Mathias, K., Rana, S. & Dzubera, J. (1996), ‘Evaluating evolutionary algorithms’, Artificial Intelligence 85, 245–276.

Wojtusiak, J. (2004), AQ21 user’s guide, Reports of the Machine Learning and Inference Laboratory MLI 04-5, George Mason University. http://www.mli.gmu.edu/papers/2003-2004/04-5.pdf.

Wojtusiak, J. & Michalski, R. S.
(2006), The LEM3 system for non-Darwinian evolutionary computation and its application to complex function optimization, Reports of the Machine Learning and Inference Laboratory MLI 04-1, George Mason University, Fairfax, VA. URL: http://www.mli.gmu.edu/projects/lem.html

Wolpert, D. H. & Macready, W. G. (1997), ‘No free lunch theorems for optimization’, IEEE Trans. on Evolutionary Computation 1(1), 67–82.

Zhang, Q., Allison, N. M. & Jin, H. (2000), ‘Population optimization algorithm based on ICA’. URL: http://citeseer.ist.psu.edu/388205.html