
An Emergence of Game Strategy
in Multiagent Systems
Peter LACKO∗
Slovak University of Technology
Faculty of Informatics and Information Technologies
Ilkovičova 3, 842 16 Bratislava, Slovakia
[email protected]
Abstract. In this paper, we study the emergence of game strategy in
multiagent systems. Symbolic and subsymbolic approaches are
compared. The symbolic approach is represented by a backtrack algorithm
with a specified search depth, whereas the subsymbolic approach is
represented by feedforward neural networks that are adapted by
the reinforcement temporal difference TD(λ) technique. As a test
game, we used simplified checkers. Three different strategies were
used. The first strategy corresponds to a single agent that repeatedly
plays games against a MinMax version of a backtrack search method.
The second strategy corresponds to single agents that repeatedly
play a megatournament. The third strategy is an evolutionary
modification of the second one. It is demonstrated that all these
approaches led to a population of agents that play checkers very
successfully against a backtrack algorithm with search depth 3.
1 Introduction
Applications of TD(λ) reinforcement learning [2, 3] to computational studies of
the emergence of game strategies were initiated by Gerald Tesauro [4, 5] in 1992. He let a
machine-learning program endowed with a feed-forward neural network play
backgammon against itself. Tesauro observed that a neural network emerged
which is able to play backgammon at a supreme champion level.
The purpose of this paper is to use the TD(λ) reinforcement learning method and
evolutionary optimization for the adaptation of feed-forward neural networks that are used
as evaluators of the possible next positions created from a given position by permitted
moves. The neural network evaluates each position by a score, a number from the open
interval (0,1).
∗ Supervisor: prof. Ing. Vladimír Kvasnička, DrSc., Institute of Applied Informatics, Faculty of Informatics and Information Technologies STU in Bratislava
M. Bieliková (Ed.), IIT.SRC 2005, April 27, 2005, pp. 41-48.
A position with the largest score is selected as the forthcoming position,
while the other moves are ignored. The method is tested on a simplified game of checkers,
in which the player whose piece first reaches any square on the opposite
end of the game board wins. Three different experiments are performed. The first experiment uses
only one neural network playing against a player simulated by the MinMax
backtrack algorithm with search depth 3. The second experiment corresponds to a
population of neural networks that repeatedly play a megatournament (each
network against all others). After each game both neural networks are adapted by
TD(λ) according to the result of the game. Finally, the third experiment uses
evolutionary adaptation of neural networks, i.e. reinforcement learning is substituted
by random mutations and natural selection. In all three experiments, the emerged
neural networks won about 60% of games of simplified checkers against the MinMax
algorithm with search depth 3.
2 Simplified game of checkers
The game of checkers is played on a square board with sixty-four smaller squares
arranged in an 8×8 grid of alternating colors (like a chess board). In the starting position
each player has 8 pieces (black and white, respectively) on the 8 squares of the
same color closest to his edge of the board. Each player must make one move per turn. The
pieces move one square diagonally forward. A piece can only move to a vacant
square. A player captures an opponent's piece by jumping over it, diagonally, to the
adjacent vacant square. If a player can jump, he must. A player wins if one of his/her
pieces reaches a square on the opponent's edge of the board, or if he/she captures the last
opponent's piece, or blocks all opponent's moves.
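To make these rules concrete, the following Python sketch illustrates one possible move generator and victory test for this simplified variant. The 8×8 array encoding (+1 our piece, −1 opponent's piece, 0 empty), the helper names, and the omission of the "all moves blocked" case are our illustrative assumptions, not details taken from the paper.

# Illustrative sketch of the simplified-checkers rules described above.
# Board: 8x8 list of lists, +1 = our piece, -1 = opponent's piece, 0 = empty.
# "Forward" for the player to move is towards increasing row index.

def apply_move(board, src, dst, capture=None):
    new = [row[:] for row in board]
    new[dst[0]][dst[1]] = new[src[0]][src[1]]
    new[src[0]][src[1]] = 0
    if capture:
        new[capture[0]][capture[1]] = 0          # remove the jumped-over piece
    return new

def moves(board):
    """Return the positions reachable in one move (jumps are mandatory)."""
    plain, jumps = [], []
    for r in range(8):
        for c in range(8):
            if board[r][c] != 1:
                continue
            for dc in (-1, 1):
                nr, nc = r + 1, c + dc
                if not (0 <= nr < 8 and 0 <= nc < 8):
                    continue
                if board[nr][nc] == 0:           # simple diagonal step forward
                    plain.append(apply_move(board, (r, c), (nr, nc)))
                elif board[nr][nc] == -1:        # possible capture by jumping
                    jr, jc = r + 2, c + 2 * dc
                    if 0 <= jr < 8 and 0 <= jc < 8 and board[jr][jc] == 0:
                        jumps.append(apply_move(board, (r, c), (jr, jc), capture=(nr, nc)))
    return jumps if jumps else plain             # if a jump exists, it must be taken

def wins(board):
    """Our piece reached the opponent's edge, or the opponent has no piece left."""
    return any(board[7][c] == 1 for c in range(8)) or all(-1 not in row for row in board)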
2.1 Formalization of the game
In this section, we shall deal with a formalization of the game of checkers, which can
be applied to all symmetric two-player games (chess, go, backgammon, etc.). Let the
current position of the game be described by a variable P; this position can be changed
by permitted actions, i.e. moves constituting a set A(P). Using a move a′∈A(P), the
position P is transformed into a new position P′, $P \xrightarrow{a'} P'$. An inverse
position $\bar{P}$ is obtained from a position P by switching the color of all black pieces to
white and of all white pieces to black. We shall use a multiagent approach and we
shall presume that the game is played by two agents G1 and G2, which are endowed
with cognitive devices by which they are able to evaluate next positions.
Algorithm
1. step. The game is started by the first player, G←G1, from a starting position,
P←Pini.
2. step. The player G generates from the position P the set of permitted next
positions $A(P) = \{P_1, P_2, \ldots, P_n\}$. Each position $P_i$ from this set
is evaluated by a coefficient $0 < z_i < 1$. The player selects as
his next position the P′∈A(P) that is evaluated by the maximum
coefficient, P←P′. If the position P satisfies the condition for victory, then the
player G wins and the game continues with step 4.
3. step. The other player takes turn in the game, G←G2; the position P is
replaced by the inverse of the current position, $P \leftarrow \bar{P}$, and the game
continues with step 2.
4. step. End of the game.
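A minimal Python sketch of the game loop described by the algorithm above follows. The helpers moves and wins are the ones sketched in Section 2, invert produces the inverse position, and the evaluator objects G1, G2 with an evaluate method stand in for the agents' cognitive devices; all of these names are our assumptions for illustration only.

def invert(board):
    """Inverse position: swap colours and flip the board so 'forward' again means increasing rows."""
    return [[-x for x in row] for row in reversed(board)]

def play_game(P_ini, G1, G2, max_moves=200):
    """Greedy play following the algorithm above: each agent picks the candidate with maximal score."""
    P, players = P_ini, (G1, G2)
    for turn in range(max_moves):
        G = players[turn % 2]                 # step 1 / step 3: the player to move
        candidates = moves(P)                 # step 2: the permitted next positions A(P)
        if not candidates:                    # no legal move: the player to move is blocked
            return players[(turn + 1) % 2]
        P = max(candidates, key=G.evaluate)   # select the position with the maximal coefficient z
        if wins(P):                           # step 2: the victory condition
            return G                          # step 4: end of the game
        P = invert(P)                         # step 3: the opponent continues from the inverse position
    return None                               # overly long games are treated as a draw (assumption)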
The key role in the algorithm is played by the calculation of the coefficients z=z(P′) for
positions P′∈A(P). These calculations can be done either by methods of classical
artificial intelligence, based on a combination of depth-first search and various
heuristics, or by soft-computing methods. We shall use the second approach, where we
turn our attention to the modern approach of multiagent systems. It is based on the
presumption that the behavior of an agent in his/her environment and/or certain
actions he/she performs are fully determined by his/her cognitive device, which
demonstrates a certain plasticity (i.e. it is capable of learning).
Fig. 1. Feed-forward neural network with one layer of hidden neurons (32 input neurons,
p hidden neurons, one output neuron). The input activities are equivalent to the
32-dimensional vector x(P), which codes a position of the game. The output activity
equals the real number z(P) from the open interval (0,1); this number is an evaluation
of the input position.
3 The structure of the cognitive device – neural network
Firstly, before we specify the cognitive device of the agents, we have to introduce the
so-called numerical representation of positions. A position is represented by a 32-dimensional vector

$$x(P) = (x_1, x_2, \ldots, x_{32}) \in \{0, 1, -1\}^{32} \qquad (1a)$$

where the single entries specify the single squares of the position P of the game
$$x_i = \{0 \ (\text{the } i\text{-th square is free}), \ \pm 1 \ (\text{our/opponent's piece is on the } i\text{-th square})\} \qquad (1b)$$
The neural network used has the architecture of a feed-forward network with one
layer of hidden neurons. The activities of the input neurons are determined by the numerical
representation x(P) of the given position P, and the output activity evaluates the position
x(P) (see Fig. 1). The number of parameters of the neural network is 34p+1, where p is
the number of hidden neurons.
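A possible numpy sketch of this cognitive device, together with the encoding (1a)-(1b), is given below. The choice of the dark squares as the 32 playable ones, the weight initialisation, and the use of a logistic sigmoid in both layers are our assumptions; the parameter count 34p+1 follows from 32p hidden weights, p hidden biases, p output weights and one output bias.

import numpy as np

def encode(board):
    """Numerical representation (1a)-(1b): the 32 playable squares as a vector in {0, 1, -1}."""
    # Assumption: the playable squares are the dark ones, i.e. those with (row + col) odd.
    return np.array([board[r][c] for r in range(8) for c in range(8) if (r + c) % 2 == 1],
                    dtype=float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class Evaluator:
    """Feed-forward network 32 -> p -> 1 with 34p + 1 adjustable parameters."""

    def __init__(self, p, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (p, 32))   # 32p hidden weights
        self.b1 = np.zeros(p)                     #   p hidden biases
        self.W2 = rng.normal(0.0, 0.1, p)         #   p output weights
        self.b2 = 0.0                             #   1 output bias  -> 34p + 1 in total

    def evaluate(self, board):
        """Score z(P) of a position, a number from the open interval (0, 1)."""
        h = sigmoid(self.W1 @ encode(board) + self.b1)
        return float(sigmoid(self.W2 @ h + self.b2))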
4 Adaptation of cognitive device of agent with temporal difference
TD(λ)-method with reward and punishment [2]
In this section we give the basic principles of the reinforcement learning method
used, which in its temporal difference TD(λ) version currently belongs to the effective
algorithmic tools for the adaptation of cognitive devices of multiagent systems. The basic
principles of reinforcement learning are the following: The agent observes the mapping of
his/her input pattern onto the output signal of his/her cognitive device (the output
signal is often called an "action" or "control signal"). The agent evaluates the quality of the
output signal on the basis of an external scalar "reward" signal. The aim of the
learning is such an adaptation of the cognitive device of the agent that modifies
the output signals toward maximization of the external "reward" signals. In many cases the
"reward" signal is delayed; it arrives only at the end of a long sequence of actions and
can be understood as the evaluation of the whole sequence of actions, i.e. whether the
sequence achieved the desired goal or not.
In this section we outline the construction of the TD(λ) method as a certain
generalization of the standard gradient-descent learning of neural networks.
Let us presume that we know the sequence of positions of the given agent (player) and
its evaluation by a real number

$$P_1, P_2, \ldots, P_m, \ z_{reward} \qquad (2)$$

where $z_{reward}$ is an external evaluation of the sequence and corresponds to the fact
that the last position $P_m$ means that the given agent won or lost,

$$z_{reward} = \{1 \ (\text{the sequence of positions won}), \ -1 \ (\text{the sequence of positions lost})\} \qquad (3)$$
From the sequence (2) we create m couples of positions and their
evaluations by $z_{reward}$, which are used as a training set for the following objective
function

$$E(w) = \frac{1}{2} \sum_{t=1}^{m} \left( z_{reward} - G(x_t; w) \right)^2 \qquad (4)$$
We shall look for such weight coefficients of the neural network (the cognitive
device) that minimize the objective function. If we found weight coefficients of the
network for which the function is zero, then each position from the sequence (2) would
be evaluated by the number $z_{reward}$. The recurrent formula for the adaptation of the
weight coefficients is as follows

$$w := w - \alpha \frac{\partial E}{\partial w} = w + \Delta w \qquad (5)$$
$$\Delta w = \alpha \sum_{t=1}^{m} \left( z_{reward} - z_t \right) \frac{\partial z_t}{\partial w} \qquad (6)$$

where $z_t = G(x_t; w)$ is the evaluation of the t-th position $P_t$ by the neural network working as
a cognitive device. Our goal is that all the positions from the sequence (2) be
evaluated by the same number $z_{reward}$, which specifies whether the outcome of the game
consisting of the sequence (2) was a win, draw, or loss for the given player. This
approach can be generalized to a formula which creates the basis of the TD(λ) class of
learning methods [2]

$$\Delta w = \sum_{t=1}^{m} \Delta w_t \qquad (7)$$

$$\Delta w_t = \alpha \left( z_{t+1} - z_t \right) \sum_{k=1}^{t} \lambda^{t-k} \frac{\partial z_k}{\partial w} \qquad (8)$$
where the parameter 0≤λ≤1.
Formulas (7) and (8) enable a recurrent calculation of the increment Δw. We
introduce a new symbol $e_t(\lambda)$, which can be easily calculated recurrently

$$e_t(\lambda) = \sum_{k=1}^{t} \lambda^{t-k} \frac{\partial z_k}{\partial w} \ \Rightarrow \ e_{t+1}(\lambda) = \lambda e_t(\lambda) + \frac{\partial z_{t+1}}{\partial w} \qquad (9)$$

where $e_1(\lambda) = \partial z_1 / \partial w$. Then the single "partial" increments $\Delta w_t$ are determined
as follows

$$\Delta w_t = \alpha \left( z_{t+1} - z_t \right) e_t(\lambda) \qquad (10)$$
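The recurrences (9) and (10) translate directly into the following sketch of one game's worth of weight updates. Representing the gradients ∂z_t/∂w as precomputed flat vectors and taking the terminal target z_{m+1} to be z_reward are our assumptions about how the end-of-game reward enters the update; they are not prescribed by the formulas above.

import numpy as np

def td_lambda_update(weights, grads, values, z_reward, alpha=0.01, lam=0.9):
    """One game's worth of TD(lambda) updates following eqs. (9)-(10).

    values[t] = z_t, the network's evaluation of the t-th position;
    grads[t]  = dz_t/dw, a flat vector of the same shape as `weights`;
    the terminal target z_{m+1} is taken to be z_reward (our assumption)."""
    e = np.zeros_like(weights)                 # eligibility trace e_t(lambda)
    targets = list(values[1:]) + [z_reward]    # z_{t+1} for t = 1, ..., m
    for z_t, z_next, g in zip(values, targets, grads):
        e = lam * e + g                        # recurrence of eq. (9): e_t = lambda*e_{t-1} + dz_t/dw
        weights = weights + alpha * (z_next - z_t) * e   # partial increment of eq. (10)
    return weights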
5 Results
To measure the success of the emergence of game strategy, we used the MinMax algorithm [1]. In
our implementation, we used the following heuristic function:

$$evaluation = \sum_{i=1}^{l} y_n[i] - \sum_{i=1}^{m} \left( 8 - y_s[i] \right) \qquad (11)$$
If we denote the number of our pieces on the game board by l and the number of the
opponent's pieces by m, then y_n[i] denotes the position of our ith piece along the
y axis (the axis pointing towards the opponent's starting part of the game board), and y_s[i]
denotes the equivalent value for the opponent's pieces.
This evaluation ensures that the MinMax player will try to move its pieces toward the
opponent's part of the game board and will try to prevent the opponent's progress.
The play of this player is quite offensive.
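Under the board encoding assumed in the earlier sketches (+1 our piece, −1 the opponent's, rows counted along the y axis from our edge towards the opponent's), the heuristic (11) could be written as follows; the exact row offset is an illustrative choice.

def minmax_heuristic(board):
    """Heuristic (11): sum of our pieces' advance minus the opponent's advance."""
    score = 0
    for r in range(8):
        for c in range(8):
            if board[r][c] == 1:        # y_n[i]: how far our piece has advanced
                score += r
            elif board[r][c] == -1:     # 8 - y_s[i]: how far the opponent's piece has advanced
                score -= (8 - r)
    return score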
5.1 The result of supervised learning of a neural network
In our first, simplest approach, we considered only a 1-agent system. Our agent learns
by playing against another agent whose decisions are based on a backtrack search to
a maximum depth of 3. The game is repeated, with the players alternating who goes first. After
each end of a game, the agent with a cognitive device represented by a neural
network adapts his/her neural network by the reinforcement learning method using the
reward/punishment signal.
For training and testing we used a two-layered feed-forward network with 64
hidden neurons, the learning rate 0.01 and the coefficient λ=0.9. The network learned
after each game; it was evaluated by 1 if it won and by 0 if it lost. After
every 100 games, the ratio of won to lost games was recorded on the graph. In this trial, the
network played against the MinMax algorithm searching to depth 3. The progress of
learning is shown in Fig. 2. It is evident that the network learned
slowly and even after 450000 matches achieved victory in only 65% of the matches.
Nevertheless, this is still an excellent result, since if the network played only as well as
our MinMax algorithm searching to depth 3, it would win only 50% of the matches.
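A sketch of this training protocol is given below. The helpers play_recorded_game (which plays one game and returns the learner's position sequence and the result), net.gradient (returning ∂z/∂w as a flat vector) and the flat net.weights view are hypothetical names introduced only for this illustration; td_lambda_update is the sketch from Section 4.

def train_against_minmax(net, minmax_player, games=450_000, alpha=0.01, lam=0.9):
    """Experiment 1: a single TD(lambda) learner repeatedly playing the depth-3 MinMax opponent."""
    wins_in_window = 0
    for g in range(games):
        # The players alternate who moves first; the assumed helper returns the sequence
        # of positions seen by the learner and whether the learner won.
        positions, net_won = play_recorded_game(net, minmax_player,
                                                net_moves_first=(g % 2 == 0))
        z_reward = 1.0 if net_won else 0.0             # reward/punishment signal (here 1 / 0)
        values = [net.evaluate(P) for P in positions]  # z_t for every position of the game
        grads = [net.gradient(P) for P in positions]   # dz_t/dw as flat vectors (assumed helper)
        net.weights = td_lambda_update(net.weights, grads, values, z_reward,
                                       alpha=alpha, lam=lam)
        wins_in_window += int(net_won)
        if (g + 1) % 100 == 0:                         # the win ratio is recorded every 100 games
            print(g + 1, wins_in_window / 100.0)
            wins_in_window = 0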
Fig. 2. The progress of learning of neural network playing against the MinMax
algorithm with the search depth 3.
5.2 The result of adaptation of a population of neural networks
In the second, more complicated case, we consider a multiagent system. Its agents
repeatedly play a megatournament against each other, and after each game the neural
networks of both agents are adapted by the reinforcement learning method. For this trial,
we used 20 neural networks. These networks played a megatournament against each
other, and their level of development was measured by a tournament against the MinMax
with search depth 3. The learning curve is shown in Fig. 3. The figure shows that
even though the neural networks were not trained by TD(λ) learning against the
MinMax algorithm, they did learn to play checkers. This means that we did not need an
expert with good knowledge of the game to teach the networks how to play.
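One megatournament round of this experiment could be sketched as follows. The helper play_recorded_game_pair, returning each agent's own position sequence and the winner, is again hypothetical, and draws are ignored for simplicity; td_lambda_update and the network attributes are those assumed in the earlier sketches.

from itertools import combinations

def megatournament(population, alpha=0.01, lam=0.9):
    """One megatournament: every agent plays every other agent; both adapt after each game."""
    for net_a, net_b in combinations(population, 2):
        # Assumed helper: returns each agent's own sequence of positions and whether A won.
        positions_a, positions_b, a_won = play_recorded_game_pair(net_a, net_b)
        for net, positions, won in ((net_a, positions_a, a_won),
                                    (net_b, positions_b, not a_won)):
            z_reward = 1.0 if won else 0.0
            values = [net.evaluate(P) for P in positions]
            grads = [net.gradient(P) for P in positions]
            net.weights = td_lambda_update(net.weights, grads, values, z_reward,
                                           alpha=alpha, lam=lam)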
5.3 The result of evolutionary adaptation of a population of neural networks
In the third, most complex approach, we also used a Darwinian evolution in the
multiagent system; after the end of each megatournament the agents are evaluated by a fitness
according to their success rate in the game and they further quasirandomly reproduce
with a probability proportional to their fitness. In this case, we use asexual
reproduction, where we create a copy of the parental agent. This copy is mutated with a certain
probability and replaces some weak agent in the population. To assess the fitness
we used the MinMax with search depth 3. Fig. 4 shows the learning curve as the
average result of the agents from the population against the MinMax with search depth 3.
The population consisted of 55 agents, from which in each epoch a subpopulation of 10
individuals was created. The subpopulation was generated by a quasirandom selection
of agents from the population, which were mutated with probability 0.5. The
mutation added to each weight of the neural network, with probability 0.001, a random
number from a logistic distribution with parameter 0.1. This subpopulation then
replaced the weak individuals in the population. The figure shows that even with this
approach a strategy of the game emerged, and the resulting neural networks were able to
play at the same level as the MinMax algorithm with search depth 3.
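One epoch of this evolutionary adaptation could be sketched as follows. The subpopulation size, the mutation probabilities and the logistic noise scale follow the text above; the fitness-proportional selection, the replacement of the weakest individuals, and the reuse of the Evaluator fields from the earlier sketch are our illustrative reading of the quasirandom reproduction scheme.

import copy
import numpy as np

def evolution_epoch(population, fitness, rng=np.random.default_rng(),
                    offspring=10, p_mutate_agent=0.5, p_mutate_weight=0.001, scale=0.1):
    """One epoch: fitness-proportional selection, asexual reproduction, mutation,
    and replacement of the weakest individuals (a sketch, not the paper's exact scheme)."""
    scores = np.array([fitness(net) for net in population], dtype=float)
    probs = (scores + 1e-9) / (scores + 1e-9).sum()     # assumes non-negative fitness, e.g. wins
    children = []
    for _ in range(offspring):
        parent = population[rng.choice(len(population), p=probs)]
        child = copy.deepcopy(parent)                   # asexual reproduction: copy the parent
        if rng.random() < p_mutate_agent:               # mutate the copy with probability 0.5
            for W in (child.W1, child.b1, child.W2):    # the scalar output bias is left unmutated
                mask = rng.random(W.shape) < p_mutate_weight
                W += mask * rng.logistic(0.0, scale, W.shape)
        children.append(child)
    weakest = np.argsort(scores)[:offspring]            # replace the weakest agents
    for idx, child in zip(weakest, children):
        population[idx] = child
    return population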
Fig. 3. The progress of learning in a population of 20 neural networks trained by
TD(λ). The curve shows the average percentage of wins for the whole population
against the MinMax with search depth 3.
Fig. 4. The progress of learning of a population of 55 neural networks adapted by
evolutionary algorithm. The curve shows an average percentage of wins
for the whole population against the MinMax with search depth 3.
6 Conclusions
The purpose of this paper is a computational study of game-strategy emergence for a
simplified game of checkers. It was studied at three different levels. At the first,
simplest level we studied a simple 1-agent system, where an agent (represented
by a neural network) plays against another agent, which is simulated by the MinMax
backtrack method with a specified search depth of 3. At the second level, we used a
genuine multiagent system, where the agents repeatedly play a megatournament, each
agent against all other agents. After each single game, both participating
agents modify their neural networks by TD(λ) reinforcement learning. At the third
level, a Darwinian evolution is used for all agents of a population (a multiagent
system). In a similar way as at the second level, the agents also play a
megatournament; its results are used for the fitness evaluation of the agents. The fitness values
are used in the evolutionary approach for the reproduction process of the agents, where fitter
agents are reproduced with a higher probability than weaker agents. In the reproduction
process the weight coefficients are softly randomly modified (mutated), and natural
selection ensures that only better neural networks survive. At all three levels we
observed an emergence of a checkers game strategy, and at the second and third levels
we did not use any user-defined agents endowed with an ability to predict a
correct strategy and therefore able to play the game perfectly. This is a very
important point of our computer simulations; in the multiagent systems used, the
emergence of game strategy is spontaneous and not biased by predefined opponents
that are able to play the game perfectly. The neural networks are able to learn a strategy,
which gives rise to agents with the capability to play checkers at a very good level.
Acknowledgment: This work was supported by the Scientific Grant Agency of the Slovak
Republic under grants #1/0062/03 and #1/1047/04.
References
1. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach. Upper Saddle
River, NJ: Prentice Hall, 1995.
2. Sutton, R.S.: Learning to predict by the method of temporal differences. Machine
Learning, Vol. 3 (1988), 9-44.
3. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. Cambridge,
MA: MIT Press, 1998.
4. Tesauro, G.J.: Practical issues in temporal difference learning. Machine Learning,
Vol. 8 (1992), 257-277.
5. Tesauro, G.J.: TD-gammon, a self-teaching backgammon program, achieves
master-level play. Neural Computation, Vol. 6, No. 2 (1994), 215-219.