The Impact of Changing the Way the Fitness Function Is Implemented in an Evolutionary Algorithm for the Design of Shapes

Andrés Gómez de Silva Garza
Computer Engineering Department, Instituto Tecnológico Autónomo de México (ITAM), Río Hondo #1, Col. Progreso-Tizapán, 01080 México, D.F., México
[email protected]

Abstract. Evolutionary algorithms (EAs) have been used in many ways for design and other creative tasks. One of the main elements of these algorithms is the fitness function used by the algorithm to evaluate the quality of the potential solutions it proposes. The fitness function guides, constrains, and biases the algorithm's search for an acceptable solution. In this paper we explore the degree to which the fitness function and its implementation affect the search process in an evolutionary algorithm. To do this, the reliability and speed of the algorithm, as well as the quality of the designs it produces, are measured for different fitness function implementations.

Keywords: Evolutionary algorithms, fitness function, evolutionary design

1 Introduction

Evolutionary algorithms (EAs) are general-purpose search methods that operate on the individuals in a population of potential solutions to a problem [4]. An EA evolves these potential solutions through a series of generations until some convergence criterion is reached. Because EAs embody such a general-purpose search method, it is no surprise that they have often been applied to the support of design and other creative tasks. Two compendia of example systems and applications can be found in [1] and [2].

In many of the examples included in these compendia, the fitness function is not automated. In some of the approaches a user is asked to determine the fitness function by tweaking the values of parameters provided through system interface controls. In others, votes are gathered from a large set of users to rank the solutions proposed by the systems, eliminating the need for a fitness function to be explicitly programmed. In our work, by contrast, we are concerned with the traditional EA method of having a completely preprogrammed, autonomous fitness function integrated into the EA's operation.

The fitness function analyzes the characteristics of the solutions generated by the EA according to how well they match the desired characteristics. These desired characteristics are usually expressed in general terms, and the fitness of a particular solution is determined by measuring the solution against the general pattern. Like any computational algorithm, the fitness evaluation function can be implemented in multiple ways. All of them are equivalent in the sense of describing the same general type of algorithmic solution. However, the specific way in which the fitness function is expressed or implemented could affect the performance of the EA. This paper describes a set of experiments designed to measure this phenomenon.

It is assumed that the reader has a passing knowledge of how a generic EA operates. If this is not the case, the reader is pointed to [4] as a possible source of background knowledge on EAs.
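For concreteness, the skeleton of such a generic EA with a preprogrammed, autonomous fitness function can be sketched as follows. This is a minimal illustrative example on bit strings, not ShEvolver's actual code; all class names, operators, and parameter values here are our own assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal generational EA sketch (illustrative only): evolves bit strings
// toward all-ones, with a fitness function that returns a value in [0, 1].
public class SimpleEA {
    static final Random RNG = new Random(42);
    static final int POP_SIZE = 50, GENOME_LEN = 20, MAX_GENS = 200;
    static final double MUTATION_RATE = 0.01, TARGET_FITNESS = 0.95;

    // Preprogrammed, autonomous fitness function: fraction of ones.
    static double fitness(boolean[] g) {
        int ones = 0;
        for (boolean b : g) if (b) ones++;
        return (double) ones / g.length;
    }

    static boolean[] randomGenome() {
        boolean[] g = new boolean[GENOME_LEN];
        for (int i = 0; i < GENOME_LEN; i++) g[i] = RNG.nextBoolean();
        return g;
    }

    // Binary tournament selection: the fitter of two random individuals.
    static boolean[] select(List<boolean[]> pop) {
        boolean[] a = pop.get(RNG.nextInt(pop.size()));
        boolean[] b = pop.get(RNG.nextInt(pop.size()));
        return fitness(a) >= fitness(b) ? a : b;
    }

    // One-point crossover followed by per-bit mutation.
    static boolean[] offspring(boolean[] p1, boolean[] p2) {
        boolean[] child = new boolean[GENOME_LEN];
        int cut = RNG.nextInt(GENOME_LEN);
        for (int i = 0; i < GENOME_LEN; i++) {
            child[i] = (i < cut) ? p1[i] : p2[i];
            if (RNG.nextDouble() < MUTATION_RATE) child[i] = !child[i];
        }
        return child;
    }

    public static void main(String[] args) {
        List<boolean[]> pop = new ArrayList<>();
        for (int i = 0; i < POP_SIZE; i++) pop.add(randomGenome());
        for (int gen = 0; gen < MAX_GENS; gen++) {
            double best = 0;
            for (boolean[] g : pop) best = Math.max(best, fitness(g));
            if (best >= TARGET_FITNESS) {              // quality convergence
                System.out.println("Converged in generation " + gen);
                return;
            }
            List<boolean[]> next = new ArrayList<>();  // next generation
            for (int i = 0; i < POP_SIZE; i++)
                next.add(offspring(select(pop), select(pop)));
            pop = next;
        }
        System.out.println("Generation limit reached"); // time convergence
    }
}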
The EA used in the experiments described in this paper is called ShEvolver (a contraction of "shape evolver"). Here we describe ShEvolver briefly to set the context for the experiments we report. A more complete description of ShEvolver can be found in [3], from which a few sections of this paper were taken or adapted (with the permission of the Association for Computing Machinery, ACM, the copyright holder).

The domain of ShEvolver is the evolution of shapes that consist of configurations of colored unit squares. This kind of domain and algorithm can be applied, for example, to the design of bathroom/kitchen tile patterns. The genetic alphabet used by ShEvolver is such that each gene represents the execution of a "move-and-place" action: moving from an initial location, by a distance of one unit square, to a destination location in any one of eight possible directions, and placing a colored unit square there (in one of four possible colors). A genotype is a sequence of such moves that, when followed in succession, produces a configuration of colored unit squares on the "canvas" (initially empty) that the system is working on.
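To make the genotype-to-shape mapping concrete, decoding might look like the following sketch. The class and method names are hypothetical illustrations of the scheme just described, not ShEvolver's actual source; in particular, starting the cursor at the origin and letting later placements overwrite earlier ones are assumptions of this sketch.

import java.util.HashMap;
import java.util.Map;

// Decodes a sequence of "move-and-place" genes into a shape: each gene moves
// the cursor one unit square in one of eight directions and places a unit
// square of one of four colors at the new location.
public class GenotypeDecoder {
    // Offsets for the eight compass directions N, NE, E, SE, S, SW, W, NW.
    static final int[] DX = { 0, 1, 1, 1, 0, -1, -1, -1 };
    static final int[] DY = { 1, 1, 0, -1, -1, -1, 0, 1 };

    public static class Gene {
        final int direction; // 0..7
        final int color;     // 0..3
        public Gene(int direction, int color) {
            this.direction = direction;
            this.color = color;
        }
    }

    // Returns the canvas as a map from grid coordinates to color. Later
    // placements overwrite earlier ones at the same cell (an assumption).
    public static Map<String, Integer> decode(Gene[] genotype) {
        Map<String, Integer> canvas = new HashMap<>();
        int x = 0, y = 0;                     // cursor starts at the origin
        for (Gene g : genotype) {
            x += DX[g.direction];             // move one unit square...
            y += DY[g.direction];
            canvas.put(x + "," + y, g.color); // ...and place a colored square
        }
        return canvas;
    }

    public static void main(String[] args) {
        Gene[] genome = { new Gene(2, 0), new Gene(2, 1), new Gene(0, 2) };
        System.out.println(decode(genome)); // e.g. {1,0=0, 2,0=1, 2,1=2}
    }
}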
Fig. 1. Some shapes that cannot (shapes 1-3), and some that can (shapes 4-7), be produced using ShEvolver

Because there are only eight possible directions of movement available to ShEvolver, some shapes cannot be described in it: shapes in which the sides of adjacent unit squares only partially overlap, shapes whose unit squares do not fit into a regular grid, and shapes that have diagonal edges/borders. Fig. 1 illustrates this, as well as giving examples of shapes that can be produced by ShEvolver. As can be seen in shape 6 in the figure, the shapes that ShEvolver produces may include holes (which could be taken to represent a fifth color, black).

2 Fitness Evaluation

Thirty-six fitness evaluation functions (criteria) have been implemented in ShEvolver, labeled c1-c36. Some of them measure geometric characteristics of shapes, some measure color-related features, and others combine aspects of both. An example of the first type is c4, which measures the bumpiness of a shape, where bumpiness means that the outer edges of the shape are jagged (caused by diagonal adjacencies between the unit squares forming the edges) rather than smooth (caused by horizontal or vertical adjacencies between those unit squares). An example of the second type is c12, which measures the greenness of a shape: the percentage of its unit squares that are green. An example of the third type is c17, which measures the degree to which x-shaped sub-shapes of a given color occur within a shape.

As an example, Fig. 2 shows the algorithm, written in Java, used to measure the bumpiness of a shape (evaluation function c4) and shows how it applies to three small shapes, a, b, and c. In all three cases, a fitness value of 1 is achieved only if the shape has the maximum possible bumpiness given its size (the number of unit squares it is composed of). Otherwise a fitness value between 0 and 1 is assigned to the shape, representing its degree of bumpiness. The degree of bumpiness is calculated by counting the number of angles along the edges of the shape and dividing by its perimeter.

// Calculates the degree of bumpiness of a shape s:
public double c4(Shape s) {
    return (double) numberOfAngles(s) / (double) perimeter(s);
}

a) Bumpiness: 4/4 = 1    b) Bumpiness: 4/6 = 2/3 ≈ 0.67    c) Bumpiness: 8/8 = 1

Fig. 2. The algorithm for calculating the bumpiness of a shape (evaluation criterion c4) and three examples

Each of the evaluation functions in ShEvolver produces a value between 0 (meaning the absence of whatever is being measured) and 1 (meaning the maximum possible amount of whatever is being measured). ShEvolver is designed so that multiple fitness evaluation functions can be applied to the shapes produced by the system before a global fitness value is assigned. When the user decides to use multiple evaluation functions, each one is given the same weight/importance, i.e., the global fitness function consists of a linear combination of the individual evaluation functions.

For example, assume that in a given scenario the fitness function consists of a linear combination of two evaluation criteria, c1 and c2, and that each of these criteria was used to evaluate a shape s. If c1(s) is the fitness value of s when evaluated using c1, and c2(s) is its fitness value when evaluated using c2, then the global, normalized fitness value F for s is calculated as F(s) = (c1(s) + c2(s)) / 2. If three evaluation criteria (c1, c2, and c3) had been used to evaluate s, then F(s) = (c1(s) + c2(s) + c3(s)) / 3, and so on. The result is always a value of F between 0 and 1, inclusive, calculated in such a way that each of the components (individual evaluation criteria) that make up F is given the same importance as the rest.

ShEvolver is set up so that its EA tries to maximize the global fitness value of the shapes it generates (though it could easily be adapted to minimize said value). If a shape ends up with a global fitness value of 0.75, this can be interpreted as meaning that its quality, according to whatever evaluation criteria were used to measure it, is 75%.

Given this basic framework, two versions of ShEvolver were produced, A and B. In ShEvolver A, each of the individual fitness evaluation functions was enhanced by incorporating a verification procedure that checks that the shape meets minimum and maximum width and height requirements. The enhanced version of the algorithm shown in Fig. 2 is the one shown in Fig. 3. The value returned by the algorithm is identical to the value the original version would return, unless the constraints on the dimensions of the shape are not met, in which case the algorithm returns a fitness value of 0.

// Calculates the degree of bumpiness of a shape s:
public double c4(Shape s) {
    int w = getWidth(s), h = getHeight(s);
    double result;
    result = (double) numberOfAngles(s) / (double) perimeter(s);
    if (w < MIN_WIDTH || w > MAX_WIDTH || h < MIN_HEIGHT || h > MAX_HEIGHT)
        result = 0;
    return result;
}

Fig. 3. The modified version in ShEvolver A of the algorithm from Fig. 2

In ShEvolver B, the fitness evaluation functions were left unmodified, but two new functions, c37 and c38, were written and added into the mix. c37 assigns a 1 or a 0 to a shape depending on whether or not its width falls between a minimum and a maximum acceptable value; no values between 0 and 1 are allowed. c38 follows an analogous procedure, but analyzes the height instead of the width.
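As an illustration, c37 and c38 might be written as follows, together with the equal-weight combination described above. These fragments are our reconstruction from the description in the text, in the same style as Figs. 2 and 3 (whose Shape type and getWidth, getHeight, MIN_WIDTH, MAX_WIDTH, MIN_HEIGHT, and MAX_HEIGHT names are assumed here); ShEvolver's actual code may differ.

// Binary width criterion (ShEvolver B): 1 if the width is acceptable, else 0.
public double c37(Shape s) {
    int w = getWidth(s);
    return (w >= MIN_WIDTH && w <= MAX_WIDTH) ? 1.0 : 0.0;
}

// Binary height criterion (ShEvolver B): 1 if the height is acceptable, else 0.
public double c38(Shape s) {
    int h = getHeight(s);
    return (h >= MIN_HEIGHT && h <= MAX_HEIGHT) ? 1.0 : 0.0;
}

// Global fitness: equal-weight linear combination of the active criteria,
// e.g. F(s) = (ci(s) + c37(s) + c38(s)) / 3 when running ShEvolver B.
public double globalFitness(double[] criterionValues) {
    double sum = 0;
    for (double v : criterionValues) sum += v;
    return sum / criterionValues.length;  // always in [0, 1]
}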
When running ShEvolver B, three evaluation functions were used and combined linearly, as described above, to assign a fitness value to each shape produced by the EA: one of c1-c36, plus c37 and c38.

From the above descriptions it is apparent that ShEvolver A and ShEvolver B are not very different from each other. In ShEvolver A, satisfying the constraints on the dimensions of a shape can be considered to contribute 50% of the global fitness value of that shape (the other 50% being due to whichever original evaluation criterion is being used at a given moment). In ShEvolver B, satisfying the constraints on the dimensions of a shape can be considered to contribute 66.7% of the global fitness value (33.3% for width and 33.3% for height), with the final 33.3% being due to whichever original evaluation criterion is being used at a given moment.

3 Experimental Design and Results

In this section we describe an experiment that was designed to compare the individual evaluation functions in ShEvolver. The experiment was repeated for each of two versions, one using ShEvolver A and one using ShEvolver B. Each version was run through two stages.

Stage 1 consisted of several runs. In each run the initial population of the EA was generated at random. The convergence criterion used in each run was a combination of two factors: the EA would stop as soon as it produced the first individual with a fitness value of at least 0.95 (convergence due to achieving a minimum acceptable quality) or as soon as it had reached 5000 evolutionary cycles (convergence due to reaching a maximum acceptable time limit), whichever came first. The limit of 5000 generations was designed to halt the EA when no individual with a fitness of at least 95% was found within a reasonable amount of time, instead of permitting the EA to continue potentially forever, with the possibility of never converging. All other EA parameters, such as population size and crossover and mutation rates, remained fixed across all runs. The minimum and maximum dimension requirements were also fixed across all runs and were set at 5 for both width and height; that is, the EA was being forced to produce shapes measuring exactly 5x5 unit squares, in addition to satisfying whatever other evaluation criterion was active (see below).

In Stage 1 of the experiment we performed 50 runs of ShEvolver using each of the 36 individual evaluation functions, one at a time, for a total of 50x36 = 1800 runs. The variables measured for each run were the number of generations (G) that the EA went through before convergence and the average fitness (AF) of the individuals in the population at the time of convergence. The best design produced at the end of each run was also recorded.

The reason for performing 50 runs for each experimental condition was to filter out any unusual results that may have been due to the random nature of many of the decision points in an EA, such as the initial makeup of its population or the selection of crossover points. By obtaining the mean behavior over 50 runs, general trends in the overall set of results can be observed irrespective of any quirks present in individual runs. For each set of 50 runs corresponding to each evaluation function, the general trends that were focused on were the mean values of G and AF, as well as the reliability of the EA. The reliability is the percentage of runs in which convergence occurred because a shape of sufficient quality had been produced (quality convergence), as opposed to reaching the maximum allowable number of EA generations (time convergence). Given that the experiment consists of comparing different fitness functions, the different reliability values obtained for the different sets of 50 runs can also be used to measure the relative strictness of the fitness functions: fitness functions that are easier to satisfy, for whatever reason, will result in higher reliability values than fitness functions for which it is more difficult to produce high-quality shapes.
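A sketch of how one such set of runs might be tallied is shown below. The per-run data here are illustrative placeholders, not experimental results; AF would be averaged over the runs in the same way as G.

// Tallies one experimental condition (one evaluation function, 50 runs).
// g[i] is the generation at which run i stopped; qualityConv[i] is true if
// run i stopped because an individual reached the 0.95 fitness threshold.
public class Stage1Tally {
    public static void main(String[] args) {
        int[] g = { 12, 5000, 73, /* ...one entry per run... */ 41 };
        boolean[] qualityConv = { true, false, true, /* ... */ true };

        double sumG = 0;
        int hits = 0;
        for (int i = 0; i < g.length; i++) {
            sumG += g[i];
            if (qualityConv[i]) hits++;
        }
        System.out.println("mean G      = " + sumG / g.length);
        System.out.println("reliability = " + (double) hits / g.length);
        // e.g. 48 quality convergences out of 50 runs gives 48/50 = 0.96,
        // which is how Table 1 below reports reliability.
    }
}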
The experimental results given in this section show the results for evaluation functions c4, c12, and c17, although the experiments took into account all 36 evaluation functions, as mentioned above. Tables 1 (reliability), 2 (mean EA generation at which convergence occurred), and 3 (mean fitness of the individuals in the population when convergence occurred) show the results of performing Stage 1 of the experiment both for ShEvolver A and for ShEvolver B. Focusing on the middle column of any table shows how the different evaluation functions compare with each other in ShEvolver A; focusing on the right-most column permits a similar comparison in ShEvolver B. Focusing on any one of the three lower rows of any table shows how the influence of a given evaluation function differs between ShEvolver A and ShEvolver B.

Table 1. Comparison of EA reliability when using evaluation functions c4, c12, and c17 in ShEvolver A and ShEvolver B.

Reliability    ShEvolver A     ShEvolver B
c4             48/50 = 0.96    50/50 = 1
c12            50/50 = 1       50/50 = 1
c17            0               0

Table 2. Comparison of mean EA generation in which convergence occurred when using evaluation functions c4, c12, and c17 in ShEvolver A and ShEvolver B.

Mean convergence    ShEvolver A    ShEvolver B
c4                  272.42         7.9
c12                 186.96         53.08
c17                 5000           5000

Table 3. Comparison of mean fitness of the individuals in the EA population at the time convergence occurred when using evaluation functions c4, c12, and c17 in ShEvolver A and ShEvolver B.

Mean fitness at convergence    ShEvolver A    ShEvolver B
c4                             0.896          0.638
c12                            0.900          0.922
c17                            0.002          0.681

An analysis of the results shown in Tables 1-3 leads to the following observations:

- Table 1: The intuition that the constraints imposed by some evaluation functions are easier to satisfy than others is confirmed by the fact that c12 had a reliability of 1 in both ShEvolver A and ShEvolver B (very easy to satisfy), c17 had a reliability of 0 in both (very difficult to satisfy), and c4 had a reliability of 0.96 in ShEvolver A but 1 in ShEvolver B (somewhat easy to satisfy).

- Table 2: Comparing the ShEvolver A column with the ShEvolver B column indicates that, when convergence is generally due to quality (i.e., under high-reliability conditions), ShEvolver B converges much more quickly than ShEvolver A (272.42/7.9 ≈ 34 times quicker for c4 and 186.96/53.08 ≈ 3.5 times quicker for c12), whereas under low-reliability conditions (c17) there does not seem to be any difference between ShEvolver A and ShEvolver B in when convergence occurs.
- Analyzing the ShEvolver A column of Table 2 seems to confirm the conclusion reached in the Table 1 observation: c12 is an evaluation function that is very easy to satisfy, c17 is very difficult to satisfy, and c4 is somewhat easy to satisfy. However, the ShEvolver B column would suggest that c4 is easier to satisfy than c12. Thus, even tiny differences between comparative experimental scenarios (ShEvolver A vs. ShEvolver B) can produce sets of results that lead to different interpretations.

- Table 3: The mean fitness of the population for c17 in the ShEvolver A column, 0.002, shows that at least a few individuals in the population had non-zero fitness values, even though none of them were good enough to make the EA converge due to quality in any of the runs (as seen from the mean convergence value of 5000 for c17 in Table 2). The results in the ShEvolver A column make it difficult to distinguish c4 from c12 (the mean population fitness values they produced were very similar: 0.896 and 0.900, respectively). However, from the ShEvolver B column one could conclude that c4 is much more difficult to satisfy than c12 (the mean fitness at convergence was only 0.638 for c4 but 0.922 for c12), and in fact that c4 is slightly more difficult to satisfy than c17 (which produced a mean fitness value of 0.681), in direct contradiction with all other indications that c17 is very difficult to satisfy (both in absolute terms and relative to the other evaluation functions analyzed).

In Stage 2 of the experiment, each of the 1800 best designs produced in the runs of Stage 1 was evaluated using each of the 36 individual fitness functions in ShEvolver. The purpose of this stage was to get a better feel for how the evaluation functions compare when used to analyze a set of shapes most of which were not "attuned" to the functions: only 50/1800 = 2.8% of the shapes evaluated by a given function were produced by an EA using that same function as its fitness function.
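This cross-evaluation amounts to averaging a 36 x 1800 matrix of fitness values along its rows. A minimal sketch follows, in the style of the earlier fragments; the method name and argument types are hypothetical stand-ins for ShEvolver's criteria and its set of 1800 designs.

// Stage 2 cross-evaluation sketch: apply every criterion to every design and
// average each criterion's scores over the whole design set.
public static double[] crossEvaluate(
        java.util.List<java.util.function.ToDoubleFunction<Shape>> criteria,
        java.util.List<Shape> designs) {
    double[] meanFitness = new double[criteria.size()];
    for (int c = 0; c < criteria.size(); c++) {
        double sum = 0;
        for (Shape s : designs)
            sum += criteria.get(c).applyAsDouble(s);
        meanFitness[c] = sum / designs.size();  // one bar in Fig. 4 / Fig. 5
    }
    return meanFitness;
}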
Fig. 4 shows the mean fitness values for the 1800 shapes when evaluated by each of the fitness evaluation functions c1-c36 in ShEvolver A, and Fig. 5 shows the corresponding results for ShEvolver B. In both figures the horizontal axis corresponds to the different evaluation functions used and the vertical axis is the mean fitness value awarded by the corresponding function to the entire set of 1800 designs.

Fig. 4. Comparison of applying c1-c36 to 1800 designs in ShEvolver A

Fig. 5. Comparison of applying c1-c36 to 1800 designs in ShEvolver B

As can be observed from comparing Fig. 4 with Fig. 5, the overall behavior in the two graphs is the same as far as which evaluation functions can be interpreted to be very easy, very difficult, or somewhat easy to satisfy. This lends further credence to the earlier classification of c4, c12, and c17 into these three categories. However, the mean fitness over the entire set of 1800 designs is clearly higher (approximately twice as high) in ShEvolver B than in ShEvolver A, for all evaluation functions. This shows very clearly that even a slight difference in the way that fitness evaluation is implemented can cause a great difference in the performance of an EA. We saw this before, in Tables 1-3, when measuring EA search parameters such as reliability and speed of convergence, and we see it again here, in Fig. 4 and Fig. 5, when measuring the quality of the designs produced by the EA.

In fact, we created a third variation of the system, ShEvolver C. In ShEvolver C we combined the functionality of c37 and c38 from ShEvolver B into one new fitness evaluation function, c39, which assigns a fitness value of 1 to a shape only if it satisfies the minimum and maximum width and height requirements at the same time. We then ran the same experiment as described above using ShEvolver C, always using as a global fitness function a linear combination of two evaluation functions: one of c1-c36, plus c39.
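Our reconstruction of c39, in the style of the earlier fragments, is shown below; only its described behavior is taken from the text, and the helper and constant names are again those assumed from Fig. 3.

// Combined dimension criterion (ShEvolver C): 1 only if both the width and
// the height of the shape fall within the acceptable ranges, else 0.
public double c39(Shape s) {
    int w = getWidth(s), h = getHeight(s);
    boolean ok = w >= MIN_WIDTH && w <= MAX_WIDTH
              && h >= MIN_HEIGHT && h <= MAX_HEIGHT;
    return ok ? 1.0 : 0.0;
}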
In theory the results should have been similar to those for ShEvolver A, because this scenario, like ShEvolver A, is one in which satisfying the constraints on the dimensions of a shape (through c39) contributes 50% of the global fitness value, the other 50% being due to whichever original evaluation criterion is in use at a given moment. However, the experiment could not be completed: the computer always ran out of heap space (memory) while running it, even though several attempts were made. We tried again after setting the time convergence criterion to a maximum of 500 generations instead of 5000, assuming that this would use less memory, but the results were the same. For some reason, in ShEvolver C the shapes being created by the EA always grew in size instead of being constrained to the 5x5 dimensions imposed by the size-related fitness criterion, whereas in ShEvolver A and ShEvolver B the size constraints were met by the shapes and the system never ran out of memory.

4 Summary and Discussion

In this paper we described ShEvolver, an EA used for evolving shapes that consist of configurations of colored unit squares. In ShEvolver, several fitness evaluation functions have been implemented. Some of these functions focus on geometric features of the designs they analyze, some on color-related features, and some on a combination of both.

An experiment was designed and performed in order to evaluate ShEvolver's evaluation functions in several ways. First, evaluation functions were compared with each other under the same experimental conditions, and it was found that they can be classified according to whether they impose constraints that are very easy, somewhat easy, or very difficult to satisfy. Then some of the evaluation functions were combined in slightly different ways, and the new combinations were again run under the same experimental conditions. It was found in this way that even small differences in how an EA evaluates the solutions it proposes can cause large differences in the EA's performance. This effect was observed when measuring several EA parameters: its reliability, its mean convergence time, the mean fitness value of its population at convergence, and the quality of the solutions it produced. In one of the experimental scenarios tested, a slight difference in how the evaluation functions were combined was also the difference between the EA being able to complete its task or not (because the computer ran out of memory in the middle of the experiment).

In a typical paper describing a given EA, the way that fitness evaluation is performed is usually merely described, without any explanation of why it is the "best" way of doing it or whether alternative algorithms/implementations might be feasible or were tested. This leads us to suspect that in most cases alternatives may not even have been considered, and that the implementation used was chosen merely because it was the first one that came to mind for the researcher involved in the project, or because it has worked in the past for the same or other researchers (whether the application domain was similar or not), or because the researcher is comfortable with it, or for other such reasons. However, as we have seen in our experiments, even minor variations in the way that fitness evaluation is implemented can cause large differences in the performance of an EA.

What this indicates is that, when using an EA, making a few preliminary trials that evaluate alternative fitness evaluation implementations might enable the construction of more robust, more efficient, and more reliable EAs, capable of producing better solutions. If, instead, an EA is simply endowed with the first implementation that comes to mind, without considering alternatives, there is no guarantee that one has achieved the "best" possible EA for that domain. Taking these lessons into account can have a major impact on whether a particular EA will be able to successfully provide support for creativity in a given domain. Further work would be needed to determine whether general principles or guidelines could be proposed to suggest how to implement the fitness evaluation function under different application domains and operating conditions when using EAs for creativity support.

Acknowledgements. This work has been supported by Asociación Mexicana de Cultura, A.C.

References

1. Bentley, P. (ed.): Evolutionary Design by Computers. Morgan Kaufmann, San Francisco, CA (1999)
2. Bentley, P., Corne, D.W. (eds.): Creative Evolutionary Systems. Morgan Kaufmann, San Francisco, CA (2002)
3. Gómez de Silva Garza, A.: Exploring the Sensitivity to Representation of an Evolutionary Algorithm for the Design of Shapes. In: Proceedings of the Eighth ACM International Conference on Creativity and Cognition (C&C '11), pp. 259-267, Atlanta, GA (2011). http://dl.acm.org/citation.cfm?doid=2069618.2069661
4. Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA (1998)