An Emergence of Game Strategy in Multiagent Systems

Peter LACKO*
Slovak University of Technology
Faculty of Informatics and Information Technologies
Ilkovičova 3, 842 16 Bratislava, Slovakia
[email protected]

Abstract. In this paper we study the emergence of game strategy in multiagent systems. Symbolic and subsymbolic approaches are compared. The symbolic approach is represented by a backtrack algorithm with a specified search depth, whereas the subsymbolic approach is represented by feed-forward neural networks adapted by the reinforcement temporal difference technique TD(λ). Simplified checkers was used as the test game. Three different strategies were used. The first strategy corresponds to a single agent that repeatedly plays games against a MinMax version of a backtrack search method. The second strategy corresponds to a population of agents that repeatedly play a megatournament. The third strategy is an evolutionary modification of the second one. It is demonstrated that all these approaches led to a population of agents that play checkers very successfully against a backtrack algorithm with search depth 3.

1 Introduction

Applications of TD(λ) reinforcement learning [2, 3] to computational studies of the emergence of game strategies were initiated by Gerald Tesauro [4, 5] in 1992. He let a machine-learning program endowed with a feed-forward neural network play backgammon against itself. Tesauro observed that a neural network emerged which was able to play backgammon at a supreme champion level.

The purpose of this paper is to use the TD(λ) reinforcement learning method and evolutionary optimization for the adaptation of feed-forward neural networks that are used as evaluators of the possible next positions created from a given position by the permitted moves. The neural network evaluates each position by a score, a number from the open interval (0,1). The position with the largest score is selected as the forthcoming position, while the other moves are ignored. The method is tested on a simplified game of checkers, in which the player whose piece first reaches any square on the opposite end of the game board wins. Three different experiments were done. The first experiment uses a single neural network playing against a player simulated by a MinMax backtrack algorithm with search depth 3. The second experiment corresponds to a population of neural networks which repeatedly play a megatournament (each network against all the others). After each game both neural networks are adapted by TD(λ) according to the result of the game. Finally, the third experiment uses evolutionary adaptation of the neural networks, i.e. reinforcement learning is substituted by random mutations and natural selection. In all three experiments, the emerged neural networks won about 60% of the games of simplified checkers against the MinMax algorithm with search depth 3.

* Supervisor: prof. Ing. Vladimír Kvasnička, DrSc., Institute of Applied Informatics, Faculty of Informatics and Information Technologies STU in Bratislava
M. Bieliková (Ed.), IIT.SRC 2005, April 27, 2005, pp. 41-48.

2 Simplified game of checkers

The game of checkers is played on a square board with sixty-four smaller squares arranged in an 8×8 grid of alternating colors (like a chess board). In the starting position each player has 8 pieces (black and white, respectively) on the 8 squares of the same color closest to his edge of the board. Each player must make one move per turn. The pieces move one square, diagonally, forward.
A piece can only move to a vacant square. A piece captures an opponent's piece by jumping over it, diagonally, to the adjacent vacant square. If a player can jump, he must. A player wins if one of his pieces reaches a square on the opponent's edge of the board, or captures the last opponent's piece, or blocks all of the opponent's moves.

2.1 Formalization of the game

In this section we deal with a formalization of the game of checkers which can be applied to all symmetric two-player games (chess, go, backgammon, etc.). Let the current position of the game be described by a variable P; this position can be changed by permitted actions, i.e. moves constituting a set A(P). Using a move a' ∈ A(P), the position P is transformed into a new position P', written $P \xrightarrow{a'} P'$. The inverse position $\bar{P}$ is obtained from a position P by switching the color of all black pieces to white and of all white pieces to black. We use a multiagent approach and presume that the game is played by two agents G1 and G2, which are endowed with cognitive devices by which they are able to evaluate the next positions.

Algorithm
Step 1. The game is started by the first player, G ← G1, from the starting position, P ← Pini.
Step 2. The player G generates from the position P the set of permitted next positions A(P) = {P1, P2, ..., Pn}. Each position Pi from this set is evaluated by a coefficient 0 < zi < 1. The player selects as his next position the P' ∈ A(P) evaluated by the maximum coefficient, P ← P'. If the position P satisfies the condition for victory, then the player G wins and the game continues with step 4.
Step 3. The other player takes his turn, G ← G2, the position P is replaced by the inverse of the current position, P ← $\bar{P}$, and the game continues with step 2.
Step 4. End of the game.

The key role in the algorithm is played by the calculation of the coefficients z = z(P') for the positions P' ∈ A(P). These calculations can be done either by methods of classical artificial intelligence, based on a combination of depth-first search and various heuristics, or by soft-computing methods. We use the second approach and turn our attention to the modern approach of multiagent systems. It is based on the presumption that the behavior of an agent in his environment and the actions he performs are fully determined by his cognitive device, which demonstrates a certain plasticity (i.e. it is capable of learning).

Fig. 1. Feed-forward neural network with one layer of hidden neurons. The input activities are equivalent to the 32-dimensional vector x(P), which codes a position of the game. The output activity equals the real number z(P) from the open interval (0,1); this number is an evaluation of the input position.
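To make the play loop of the algorithm in Section 2.1 concrete, the following Python sketch shows one possible organization of a single game. It is an illustration only: the helper callables successor_positions, is_win and invert are hypothetical placeholders for the rules of simplified checkers, the two evaluators stand for the cognitive devices of the agents G1 and G2, and the max_moves safeguard is our addition.

def play_game(P_ini, evaluate_1, evaluate_2, successor_positions, is_win, invert, max_moves=200):
    """One game following the algorithm of Section 2.1 (a sketch, not the original code).

    evaluate_1, evaluate_2 : callables returning a coefficient z(P) in (0, 1)
    successor_positions(P) : returns the set A(P) of permitted next positions
    is_win(P)              : tests the victory condition for the player who just moved
    invert(P)              : switches the colors of all pieces (inverse position)
    Returns +1 if the first player wins, -1 if the second player wins, 0 otherwise.
    """
    P = P_ini
    evaluators = (evaluate_1, evaluate_2)
    for half_move in range(max_moves):
        evaluate = evaluators[half_move % 2]
        candidates = successor_positions(P)       # step 2: the set A(P)
        if not candidates:                        # blocked player has no move and loses
            return -1 if half_move % 2 == 0 else +1
        P = max(candidates, key=evaluate)         # greedy choice of the best-scored position
        if is_win(P):                             # step 2: victory condition
            return +1 if half_move % 2 == 0 else -1
        P = invert(P)                             # step 3: hand the inverse position to the other player
    return 0

In the experiments below, the role of the evaluators is played either by a neural network (Sections 3 and 4) or by a MinMax search with the heuristic of Section 5.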
3 The structure of the cognitive device – neural network

Before we specify the cognitive device of the agents, we have to introduce the so-called numerical representation of positions. A position is represented by a 32-dimensional vector

$x(P) = (x_1, x_2, \ldots, x_{32}) \in \{0, 1, -1\}^{32}$    (1a)

where the single entries specify the single squares of the position P of the game:

$x_i = 0$ (the ith square is free), $x_i = \pm 1$ (the ith square is occupied by our/the opponent's piece)    (1b)

The neural network used has the architecture of a feed-forward network with one layer of hidden neurons. The activities of the input neurons are determined by the numerical representation x(P) of the given position P; the output activity evaluates the position x(P) (see Fig. 1). The number of parameters of the neural network is 34p+1, where p is the number of hidden neurons.

4 Adaptation of the cognitive device of an agent by the temporal difference TD(λ) method with reward and punishment [2]

In this section we give the basic principles of the reinforcement learning method used, which in its temporal difference TD(λ) version currently belongs among the effective algorithmic tools for the adaptation of cognitive devices of multiagent systems. The basic principles of reinforcement learning are the following: the agent observes the mapping of his input pattern to the output signal of his cognitive device (the output signal is often called an "action" or "control signal"), and he evaluates the quality of the output signal on the basis of an external scalar "reward" signal. The aim of the learning is such an adaptation of the cognitive device of the agent that modifies the output signals toward maximization of the external "reward" signals. In many cases the "reward" signal is delayed; it arrives only at the end of a long sequence of actions and can be understood as an evaluation of the whole sequence of actions, i.e. of whether the sequence achieved the desired goal or not.

We now outline the construction of the TD(λ) method as a certain generalization of the standard gradient-descent learning of neural networks. Let us presume that we know the sequence of positions of the given agent (player) and its evaluation by a real number z:

$P_1, P_2, \ldots, P_m, \; z_{\mathrm{reward}}$    (2)

where z_reward is an external evaluation of the sequence and expresses whether the last position P_m means that the given agent won or lost:

$z_{\mathrm{reward}} = 1$ (the sequence of positions won), $z_{\mathrm{reward}} = -1$ (the sequence of positions lost)    (3)

From the sequence (2) we create m couples of positions and their evaluations by z_reward, which are used as a training set for the following objective function

$E(\mathbf{w}) = \frac{1}{2} \sum_{t=1}^{m} \left( z_{\mathrm{reward}} - G(\mathbf{x}_t; \mathbf{w}) \right)^2$    (4)

We look for such weight coefficients of the neural network (the cognitive device) that minimize the objective function. If we found weight coefficients for which the function is zero, then each position from the sequence (2) would be evaluated by the number z_reward. The recurrent formula for the adaptation of the weight coefficients is as follows

$\mathbf{w} := \mathbf{w} - \alpha \frac{\partial E}{\partial \mathbf{w}} = \mathbf{w} + \Delta\mathbf{w}$    (5)

$\Delta\mathbf{w} = \alpha \sum_{t=1}^{m} \left( z_{\mathrm{reward}} - z_t \right) \frac{\partial z_t}{\partial \mathbf{w}}$    (6)

where $z_t = G(\mathbf{x}_t; \mathbf{w})$ is the evaluation of the tth position P_t by the neural network working as the cognitive device. Our goal is that all positions from the sequence (2) are evaluated by the same number z_reward, which specifies whether the outcome of the game consisting of the sequence (2) was a win, a draw, or a loss for the given player. This approach can be generalized to a formula which creates the basis of the TD(λ) class of learning methods [2]

$\Delta\mathbf{w} = \sum_{t=1}^{m} \Delta\mathbf{w}_t$    (7)

$\Delta\mathbf{w}_t = \alpha \left( z_{t+1} - z_t \right) \sum_{k=1}^{t} \lambda^{t-k} \frac{\partial z_k}{\partial \mathbf{w}}$    (8)

where the parameter 0 ≤ λ ≤ 1. Formulas (7) and (8) enable a recurrent calculation of the increment Δw. We introduce a new symbol e_t(λ), which can be easily calculated recurrently

$e_t(\lambda) = \sum_{k=1}^{t} \lambda^{t-k} \frac{\partial z_k}{\partial \mathbf{w}} \;\Rightarrow\; e_{t+1}(\lambda) = \lambda e_t(\lambda) + \frac{\partial z_{t+1}}{\partial \mathbf{w}}$    (9)

where $e_1(\lambda) = \partial z_1 / \partial \mathbf{w}$. Then the single "partial" increments Δw_t are determined as follows

$\Delta\mathbf{w}_t = \alpha \left( z_{t+1} - z_t \right) e_t(\lambda)$    (10)
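The following numpy sketch illustrates how the cognitive device of Section 3 and the recurrent updates (7)-(10) can fit together. It is our illustration, not the original implementation: the class name, the weight initialization, and the identification of z_{m+1} with z_reward at the end of the game are assumptions.

import numpy as np

class TDNet:
    """Single-hidden-layer evaluator (Section 3) adapted by TD(lambda), eqs. (7)-(10). A sketch."""

    def __init__(self, p=64, alpha=0.01, lam=0.9, seed=0):
        rng = np.random.default_rng(seed)
        # 32*p + p + p + 1 = 34p + 1 parameters, as stated in Section 3
        self.W1 = rng.normal(0.0, 0.1, (p, 32))   # input -> hidden weights
        self.b1 = np.zeros(p)                     # hidden biases
        self.W2 = rng.normal(0.0, 0.1, p)         # hidden -> output weights
        self.b2 = 0.0                             # output bias
        self.alpha, self.lam = alpha, lam

    @staticmethod
    def _sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def evaluate(self, x):
        """Return the score z(P) in (0,1) for a 32-dimensional position vector x(P)."""
        z, _ = self._forward(np.asarray(x, dtype=float))
        return z

    def _forward(self, x):
        h = self._sigmoid(self.W1 @ x + self.b1)
        z = self._sigmoid(self.W2 @ h + self.b2)
        return z, h

    def _grad(self, x, z, h):
        """Gradient of the output z with respect to all weight groups."""
        dz = z * (1.0 - z)
        dh = dz * self.W2 * h * (1.0 - h)
        return [np.outer(dh, x), dh, dz * h, np.asarray(dz)]

    def train_on_game(self, positions, z_reward):
        """Accumulate Delta w = sum_t Delta w_t (eq. 7) along one game and apply it.

        positions : the vectors x(P_1), ..., x(P_m) seen by this player
        z_reward  : external evaluation of the sequence (e.g. 1 for a win, 0 for a loss)
        """
        xs = [np.asarray(x, dtype=float) for x in positions]
        zs, grads = [], []
        for x in xs:
            z, h = self._forward(x)
            zs.append(z)
            grads.append(self._grad(x, z, h))
        zs.append(z_reward)                            # z_{m+1} is identified with z_reward
        traces = [np.zeros_like(g) for g in grads[0]]  # eligibility traces e_t(lambda)
        total = [np.zeros_like(g) for g in grads[0]]
        for t in range(len(xs)):
            traces = [self.lam * e + g for e, g in zip(traces, grads[t])]   # eq. (9)
            delta = self.alpha * (zs[t + 1] - zs[t])                        # eq. (10)
            total = [acc + delta * e for acc, e in zip(total, traces)]
        self.W1 += total[0]
        self.b1 += total[1]
        self.W2 += total[2]
        self.b2 += float(total[3])

With z_reward set to 1 for a win and 0 for a loss, and with the learning rate 0.01, λ = 0.9 and 64 hidden neurons, this corresponds to the setting used in Section 5.1.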
5 Results

To measure the success of the emergence of a game strategy we used the MinMax algorithm [1]. In our implementation we used the following heuristic function:

$\mathrm{evaluation} = \sum_{i=1}^{l} y_n[i] - \sum_{i=1}^{m} \left( 8 - y_s[i] \right)$    (11)

If we denote the number of our pieces on the game board by l and the number of the opponent's pieces by m, then y_n[i] denotes the position of our ith piece along the y axis (the axis towards the opponent's starting part of the game board), and y_s[i] denotes the equivalent value for the opponent's pieces. This evaluation ensures that MinMax tries to get its pieces toward the opponent's part of the game board and tries to prevent the opponent's progress. The play of this player is quite offensive.
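A minimal sketch of the heuristic (11), assuming the y coordinate of a piece is counted along the axis from the player's own edge toward the opponent's edge; the function and argument names are ours.

def minmax_evaluation(own_rows, opponent_rows):
    """Leaf evaluation (11) for the MinMax player (a sketch).

    own_rows      : y_n[i], the y coordinates of our l pieces
    opponent_rows : y_s[i], the y coordinates of the opponent's m pieces
    """
    advance = sum(own_rows)                       # our progress toward the opponent's edge
    threat = sum(8 - y for y in opponent_rows)    # the opponent's progress toward ours
    return advance - threat

In the experiments this evaluation is applied at the leaves of a MinMax search with depth 3.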
5.1 The result of supervised learning of a neural network

In our first, simplest approach, we considered only a 1-agent system. Our agent learns by playing against another agent whose decisions are based on a backtrack search to a maximum depth of 3. The game is repeated, with the players alternating who goes first. After each game ends, the agent whose cognitive device is represented by a neural network adapts this network by the reinforcement learning method, using the reward/punishment signal.

For training and testing we used a two-layered feed-forward network with 64 hidden neurons, the learning rate 0.01 and the coefficient λ = 0.9. The network learned after each game; it was evaluated by 1 if it won and by 0 if it lost. After every 100 games, the ratio of won to lost games was marked on the graph. In this trial, the network played against the MinMax algorithm searching to depth 3. The progress of learning is shown in Fig. 2. It is evident that the network learned slowly, and even after 450000 matches it achieved victory in only 65% of the matches. Nevertheless, this is still an excellent result, since if the network played only as well as our MinMax algorithm searching to depth 3, it would win only 50% of the matches.

Fig. 2. The progress of learning of the neural network playing against the MinMax algorithm with search depth 3.

5.2 The result of adaptation of a population of neural networks

In the second, more complicated case, we consider a multiagent system. Its agents repeatedly play a megatournament against each other, and after each game the neural networks of both agents are adapted by the reinforcement learning method. For this trial we used 20 neural networks. These networks played a megatournament against each other, and their level of development was measured by a tournament against MinMax with search depth 3. The learning curve is shown in Fig. 3. The figure shows that even though the neural networks were not trained by TD(λ) learning against the MinMax algorithm, they did learn to play checkers. This means that we did not need an expert with good knowledge of the game to teach the networks how to play.

5.3 The result of evolutionary adaptation of a population of neural networks

In the third, most complex approach, we also used Darwinian evolution in the multiagent system: after the end of each megatournament, the agents are evaluated by a fitness according to their success rate in the game, and they then reproduce quasirandomly with a probability proportional to their fitness. In this case we use asexual reproduction, in which we create a copy of the parental agent. This copy is mutated with a certain probability and replaces some weak agent in the population. To assess fitness we used the MinMax with search depth 3. Fig. 4 shows the learning curve as the average result of the agents from the population against MinMax with search depth 3. The population consisted of 55 agents, from which in each epoch a subpopulation of 10 individuals was created. The subpopulation was generated by a quasirandom selection of agents from the population, which were mutated with a probability 0.5. The mutation added, with a probability of 0.001, a random number from a logistic distribution with parameter 0.1 to each weight of the neural network. This subpopulation then replaced the weak individuals in the population. The figure shows that in this approach a game strategy also emerged, and the resulting neural networks were able to play at the same level as the MinMax algorithm with search depth 3.
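The reproduction step described above might be sketched as follows. The population and agent data structures, the fitness-proportional selection of parents, the identification of the "weak individuals" with the lowest-fitness agents, and the reading of the logistic parameter 0.1 as the scale of the noise are all our assumptions.

import copy
import numpy as np

def evolution_step(population, fitness, rng, n_offspring=10,
                   p_mutate_agent=0.5, p_mutate_weight=0.001, noise_scale=0.1):
    """One epoch of the evolutionary adaptation of Section 5.3 (a sketch).

    population : list of agents; each agent is assumed to expose its weight arrays in agent.weights
    fitness    : success rates of the agents against MinMax with search depth 3
    rng        : a numpy Generator, e.g. np.random.default_rng()
    """
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()                          # reproduction proportional to fitness
    parents = rng.choice(len(population), size=n_offspring, p=probs)
    offspring = []
    for idx in parents:
        child = copy.deepcopy(population[idx])               # asexual reproduction: copy the parent
        if rng.random() < p_mutate_agent:                    # mutate the copy with probability 0.5
            for w in child.weights:
                mask = rng.random(w.shape) < p_mutate_weight # each weight with probability 0.001
                w += mask * rng.logistic(0.0, noise_scale, size=w.shape)
        offspring.append(child)
    weakest = np.argsort(fitness)[:n_offspring]              # the offspring replace the weakest agents
    for pos, child in zip(weakest, offspring):
        population[pos] = child
    return population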
Fig. 3. The progress of learning in a population of 20 neural networks trained by TD(λ). The curve shows the average percentage of wins to losses for the whole population against the MinMax with search depth 3.

Fig. 4. The progress of learning of a population of 55 neural networks adapted by the evolutionary algorithm. The curve shows the average percentage of wins for the whole population against the MinMax with search depth 3.

6 Conclusions

The purpose of this paper is a computational study of the emergence of game strategy for a simplified game of checkers. It was studied at three different levels. At the first, simplest level we studied a simple 1-agent system, where an agent (represented by a neural network) plays against another agent, which is simulated by the MinMax backtrack method with a specified search depth of 3. At the second level, we used a genuine multiagent system, where the agents repeatedly play a megatournament, each agent against all the other agents. After each single game is finished, both participating agents modify their neural networks by TD(λ) reinforcement learning. At the third level, Darwinian evolution is used for all agents of a population (a multiagent system). In a similar way as at the second level, the agents also play a megatournament; its results are used for the fitness evaluation of the agents. The fitness values are used in the evolutionary approach in the reproduction process of the agents, where fitter agents are reproduced with a higher probability than weaker agents. In the reproduction process the weight coefficients are softly randomly modified (mutated), and natural selection ensures that only better neural networks survive.

At all three levels we observed an emergence of a checkers game strategy, and at the second and third level we did not use any user-defined agents endowed with an ability to predict a correct strategy and therefore able to play the game perfectly. This is a very important point of our computer simulations: in the multiagent systems used, the emergence of a game strategy is spontaneous and not biased by predefined opponents that are able to play the game perfectly. The neural networks are able to learn a strategy, which gives rise to agents with the capability to play checkers at a very good level.

Acknowledgment: This work was supported by the Scientific Grant Agency of the Slovak Republic under grants #1/0062/03 and #1/1047/04.

References
1. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice Hall, 1995.
2. Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learning, Vol. 3 (1988), 9-44.
3. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
4. Tesauro, G.J.: Practical issues in temporal difference learning. Machine Learning, Vol. 8 (1992), 257-277.
5. Tesauro, G.J.: TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, Vol. 6, No. 2 (1994), 215-219.