1. INTRODUCTION
There has been interest in designing computer algorithms to play common games
since the early advent of the modern digital computer. With a few exceptions, these
efforts have relied on domain-specific information programmed into an algorithm in the
form of weighted features that were believed to be important for assessing the relative
worth of alternative positions in the game. That is, the programs relied on human
expertise to defeat human expertise. Every “item” of knowledge was preprogrammed. In
some cases, the programs were even tuned to defeat particular human opponents,
indicating their brittle nature. The requirement of a priori human expertise is a testament
to the limitations of this approach.
In contrast, Fogel [1] offered experiments where evolution was used to design neural
networks that were capable of playing tic-tac-toe without incorporating features
prescribed by experts. The neural networks competed against an expert system, but their
overall quality of play was judged solely on the basis of their win, loss, and draw
performance over a series of 32 games. No effort was made to assign credit to the
evolving networks for any specific move or board feature. The results indicated that
successful strategies for this simple game could be developed even without prescribing
this information.
In our project, we demonstrate a neural evolutionary strategy capable of playing tic-tac-toe
along the lines recommended by Fogel, and extend these results by employing a
partitioned neural network architecture, the mixture-of-experts paradigm.
The remainder of this report is organized as follows. Section 2 introduces Fogel's neural
evolutionary strategy for playing tic-tac-toe. Section 3 introduces the basic concept of a
modular network. Section 4 presents the algorithm and experimental results of combining
the evolving neural network with the mixture-of-experts paradigm, and Section 5 concludes.
2. EVOLVING NEURAL STRATEGIES IN TIC-TAC-TOE
In tic-tac-toe, there are two players and a three-by-three grid. Initially the grid is
empty. Each player moves in turn by placing a marker in an open square. By convention,
the first player's marker is "X" and the second player's marker is "O". The object of the
game is to place three markers in a row; this results in a win for that player and a loss
for the opponent. Failing a win, a draw may be earned by preventing the opponent from
placing three markers in a row. It can be shown by enumerating the game tree that the
second player can always force at least a draw.
Fogel [1] devoted attention to evolving a strategy for the first player (an equivalent
procedure could be used for the second player). A suitable coding structure must be
selected. It must receive a board pattern as input and yield a corresponding move as
output. The coding structure utilized in these experiments was an MLP (Fig. 2.1). Each
hidden or output node performed a sum of the weighted input strengths, subtracted off an
adaptable bias term, and passed the result through a sigmoid filter. Only a single hidden
layer was incorporated. This architecture was selected because:
• Variations can be shown to be universal function approximators;
• It was believed to be adequate for the task;
• The response to any stimulus could be evaluated rapidly;
• The extension to multiple hidden layers is obvious.
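To make the node computation concrete, here is a minimal C sketch of one such node (the function and variable names are ours, purely illustrative):

#include <math.h>

/* One hidden or output node: sum the weighted input strengths,
   subtract off the adaptable bias term, and pass the result
   through a sigmoid filter. */
double node_output(const double *inputs, const double *weights,
                   double bias, int n_inputs)
{
    double sum = -bias;
    for (int i = 0; i < n_inputs; i++)
        sum += weights[i] * inputs[i];
    return 1.0 / (1.0 + exp(-sum));
}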
Fig 2.1 The MLP used in tic-tac-toe, with an input layer, a single hidden layer, and an output layer.
There were nine input units and nine output units, each corresponding to a square in the grid. An
“X” was denoted by the value 1.0, an “O” was denoted by the value -1.0, and an open
space was denoted by the value 0.0. A move was determined by presenting the current
board pattern to the network and examining the relative strengths of the nine output nodes.
A marker was placed in the empty square with the maximum output strength. This
procedure guaranteed legal moves. The output from nodes associated with squares in
which a marker had already been placed was ignored. No selection pressure was applied
to drive the output from such nodes to zero.
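This selection rule can be sketched in C as follows (our own illustrative code; the nine squares hold 1.0, -1.0, or 0.0 as described above):

/* Pick the empty square whose output node is strongest. Outputs of
   occupied squares are simply ignored, which guarantees a legal move.
   Returns a square index 0..8, or -1 if the board is full. */
int select_move(const double board[9], const double outputs[9])
{
    int best = -1;
    for (int i = 0; i < 9; i++) {
        if (board[i] != 0.0)          /* occupied square: skip */
            continue;
        if (best < 0 || outputs[i] > outputs[best])
            best = i;
    }
    return best;
}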
The initial population consisted of 50 parent networks. The number of nodes in the
hidden layer of each network was chosen at random in accordance with a uniform
distribution over the integers from 1 to 10. The initial weighted connection strengths and
bias terms were randomly distributed according to a uniform distribution ranging over
[-0.5, 0.5]. A single offspring was copied from each parent and modified by two modes
of mutation:
• All weight and bias terms were perturbed by adding a Gaussian random variable
with zero mean and a standard deviation of 0.05.
• With a probability of 0.5, the number of nodes in the hidden layer was allowed to
vary. If a change was indicated, there was an equal likelihood that a node would
be added or deleted, subject to the constraints on the maximum and minimum
number of nodes (10 and one, respectively). Nodes to be added were initialized
with all weights and the bias term being set equal to 0.0.
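A minimal C sketch of the two mutation modes follows; uniform01() and gauss() are assumed helpers (a possible gauss() is sketched in Section 4.3), not code from [1]:

extern double uniform01(void);                   /* assumed: uniform deviate in (0,1) */
extern double gauss(double mean, double sigma);  /* assumed: Gaussian deviate */

/* Mode 1: perturb every weight and bias term by N(0, 0.05). */
void perturb_weights(double *params, int n_params)
{
    for (int i = 0; i < n_params; i++)
        params[i] += gauss(0.0, 0.05);
}

/* Mode 2: with probability 0.5 the hidden-layer size changes; if so,
   a node is added or deleted with equal likelihood, within the bounds
   of one and 10 nodes. The caller initializes a new node's weights
   and bias to 0.0. */
void mutate_size(int *n_hidden)
{
    if (uniform01() >= 0.5)
        return;                       /* no structural change */
    if (uniform01() < 0.5) {
        if (*n_hidden < 10) (*n_hidden)++;
    } else {
        if (*n_hidden > 1)  (*n_hidden)--;
    }
}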
A rule-based procedure that played nearly perfect tic-tac-toe was implemented to
evaluate each contending network. The execution time with this format was linear with
the population size and provided the opportunity for multiple trials and statistical results.
The evolving networks were allowed to move first in all games. The first move was
examined by the rule base with the eight possible second moves being stored in an array.
The rule base proceeded as follows:
1. From the array of all possible moves, select a move that has not yet been played.
2. For subsequent moves:
   a) with a 10% chance, move randomly;
   b) if a win is available, place a marker in the winning square;
   c) else, if a block is available, place a marker in the blocking square;
   d) else, if two open squares are in line with an "O," randomly place a marker in either of the two squares;
   e) else, move randomly in any open square.
3. Continue with step 2 until the game is completed.
4. Continue with step 1 until games with all eight possible second moves have been played.
Each network was evaluated over four sets of these eight games. The payoff function
varied in three sets of experiments over {+1, -1, 0}, { +1, -10, 0}, and { +10, -1, 0},
where the entries are the payoff for winning, losing, and playing to a draw, respectively.
The maximum possible score over any four sets of games was 32 under the first two
payoff functions and 320 under the latter payoff function.
But a perfect score in any generation did not necessarily indicate a perfect algorithm
because of the random variation in play generated by step 2a). After competition against
the rule base was completed for all networks in the population, a second competition was
held in which each network was compared with ten other randomly chosen networks. If
the score of the chosen network was greater than or equal to its competitor, it received a
win. Those networks with the greatest number of wins were retained to be parents of the
successive generations. Thirty trials were conducted with each payoff function. Evolution
was halted after 800 generations in each trial.
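The second competition can be sketched in C as follows (our illustration; rand() stands in for whatever uniform generator the original code used):

#include <stdlib.h>

/* Each network is compared with ten randomly chosen networks; a score
   greater than or equal to the competitor's earns a win. Networks with
   the most wins become parents of the next generation. */
void second_competition(const double *score, int pop_size, int *wins)
{
    for (int i = 0; i < pop_size; i++) {
        wins[i] = 0;
        for (int k = 0; k < 10; k++) {
            int j = rand() % pop_size;       /* random competitor */
            if (score[i] >= score[j])
                wins[i]++;
        }
    }
}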
3. Modular Network
Consider the specific configuration of a modular network shown in Fig 3.1 [2]. The
structure consists of K supervised modules called expert networks, and an integrating unit
called a gating network that performs the function of a mediator among the expert
networks.
Let the training examples be denoted by input vector x of dimension p and desired
response (target output) vector d of dimension q. The input vector x is applied to the
expert networks and the gating network simultaneously. Let yi denote the q-by-1 output
vector of the ith expert network, let gi denote the activation of the ith output neuron of the
gating network, and let y denote the q-by-1 output vector of the whole modular network.
We may then write
y = \sum_{i=1}^{K} g_i y_i
The activations of the output neurons of the gating network are constrained to satisfy
two requirements (Jacobs and Jordan, 1991):
0 \le g_i \le 1 \quad \text{for all } i

and

\sum_{i=1}^{K} g_i = 1
These two constraints are necessary if the activations g_i are to be interpreted as a priori
probabilities.
Given a set of unconstrained variables, {uj | j=1,2,…,K}, we may satisfy the two
constraints by defining the activation gi of the ith output neuron of the gating network as
follows (Bridle, 1990a):
g_i = \frac{\exp(u_i)}{\sum_{j=1}^{K} \exp(u_j)}
where ui is the weighted sum of the inputs applied to the ith output neuron of the gating
network. This normalized exponential transformation may be viewed as a multi-input
generalization of the logistic function. It preserves the rank order of its input values, and
is a differentiable generalization of the “winner-take-all” operation of picking the
maximum value.
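In C, the transformation might be sketched as follows (subtracting the maximum first is our numerical-overflow guard, not part of the text):

#include <math.h>

/* Normalized exponential (softmax) gating: maps unconstrained u_1..u_K
   to activations g_i in [0, 1] that sum to 1. */
void softmax(const double *u, double *g, int K)
{
    double max = u[0], sum = 0.0;
    for (int j = 1; j < K; j++)
        if (u[j] > max) max = u[j];
    for (int j = 0; j < K; j++) {
        g[j] = exp(u[j] - max);
        sum += g[j];
    }
    for (int j = 0; j < K; j++)
        g[j] /= sum;
}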
Fig 3.1 Block diagram of a modular network: the input vector x is applied to expert modules 1..K and to the gating network; the outputs of the expert networks (modules), weighted by the gate activations g1..gK, are mediated by the gating network to form the output vector y.
We may define the mth element of the output vector y_i of the ith expert network as
the inner product of the corresponding synaptic weight vector w_i^{(m)} and the input
vector x, as depicted in the signal-flow graph of Fig 3.2(b); that is,

y_i^{(m)} = x^T w_i^{(m)}, \quad i = 1, 2, \ldots, K, \quad m = 1, 2, \ldots, q

where the superscript T denotes transposition, and the weight vector w_i^{(m)} is made up
of the elements w_{i1}^{(m)}, w_{i2}^{(m)}, \ldots, w_{ip}^{(m)} of the mth neuron in the ith expert network.
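A minimal C sketch of these two computations, the expert's inner product and the gated combination from the first equation of this section (names are ours):

/* Output m of expert i as an inner product: y_i^(m) = x^T w_i^(m),
   where w points at the p weights of output neuron m of expert i. */
double expert_output(const double *x, const double *w, int p)
{
    double y = 0.0;
    for (int j = 0; j < p; j++)
        y += w[j] * x[j];
    return y;
}

/* Element m of the overall output: y^(m) = sum_i g_i * y_i^(m),
   where yi[i] holds y_i^(m) for each of the K experts. */
double mixture_output(const double *g, const double *yi, int K)
{
    double y = 0.0;
    for (int i = 0; i < K; i++)
        y += g[i] * yi[i];
    return y;
}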
Fig 3.2 (a) Single layer of linear neurons constituting the expert network, mapping inputs x1..xp to outputs y_i^{(1)}..y_i^{(q)}; (b) signal-flow graph of a linear neuron, combining x1..xp through weights w_{i1}^{(m)}..w_{ip}^{(m)} into y_i^{(m)}.
4. Evolving a Modular Network for Tic-Tac-Toe
4.1 Coding Structure
In our project, we demonstrate a neural evolutionary strategy capable of playing
tic-tac-toe along the lines recommended by Fogel, and extend these results by employing a
partitioned neural architecture using the mixture-of-experts paradigm. Finally, we include
the definition of the partitions in the evolutionary process, so that the sub-sampling
relationship between neurons and stimuli is part of the evolutionary process.
In tic-tac-toe, as we know, the object is to put three markers in a row. Our coding
structure therefore uses eight expert networks and one gate network. Each expert
network has three inputs and nine outputs: the three inputs correspond to the three
positions of one line (a row, a column, or a diagonal) on the board, and the nine outputs
correspond to the squares of the board. The input and output nodes are fully connected.
The gate network has nine inputs, corresponding to the nine squares of the board, and
eight outputs; its inputs and outputs are fully connected as well. If we represent the nine
squares as a two-dimensional array board[0..2][0..2], the coding structure can be
illustrated as in Fig 4.1.
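One plausible C encoding of this partition, consistent with Fig 4.1 (the figure fixes Experts 1, 2, and 8; the ordering of the remaining lines is our assumption):

/* lines[e][k] gives the (row, col) of input k of expert e:
   experts 0-2 take the rows, 3-5 the columns, 6 the main diagonal,
   and 7 the anti-diagonal board[0][2], board[1][1], board[2][0]. */
static const int lines[8][3][2] = {
    {{0,0},{0,1},{0,2}}, {{1,0},{1,1},{1,2}}, {{2,0},{2,1},{2,2}},
    {{0,0},{1,0},{2,0}}, {{0,1},{1,1},{2,1}}, {{0,2},{1,2},{2,2}},
    {{0,0},{1,1},{2,2}}, {{0,2},{1,1},{2,0}}
};

/* Gather the three inputs of expert e from the board. */
void expert_inputs(const double board[3][3], int e, double in[3])
{
    for (int k = 0; k < 3; k++)
        in[k] = board[lines[e][k][0]][lines[e][k][1]];
}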
4.2 Initialization and Selection
On the board, an "X" was denoted by the value 1.0, an "O" was denoted by the value
-1.0, and an open space was denoted by the value 0.0. The initial population consists of 50
individual networks. The initial weighted connection strengths and bias terms of the expert
networks and the gate network were randomly distributed according to a uniform distribution
ranging over [-0.5, 0.5]. The selection method is a tournament of size 4. Each candidate
network played eight games with the rule base algorithm and one game with each of the
10 randomly selected networks from the population. The payoff function varied in three
sets of experiments over {+1, -1, 0}, {+1, -10, 0}, and {+10, -1, 0}, where the entries are
the payoff for winning, losing, and playing to a draw, respectively. The maximum
possible score for a candidate network was 18 under the first two payoff functions and
180 under the third payoff function. We compared the final scores of those four candidate
networks and selected the two with the higher scores as the parent networks. A single
offspring was copied from each parent and modified by mutation as follows:
• All weight and bias terms of the expert networks were perturbed by adding a
Gaussian random variable with zero mean and a standard deviation of 0.05.
• All weight and bias terms of the gate network were perturbed by adding a Gaussian
random variable with zero mean and a standard deviation of 0.05.
The two candidate networks with the lower scores were replaced by the two offspring,
giving the new population. Thirty trials were conducted with each payoff function.
Evolution was halted after 800 generations in each trial.
Fig 4.1 Coding structure. Each expert network sums three weighted inputs taken from one line of the board: Expert Network 1 takes board[0][0], board[0][1], board[0][2]; Expert Network 2 takes board[1][0], board[1][1], board[1][2]; ...; Expert Network 8 takes the diagonal board[0][2], board[1][1], board[2][0]. The gate network takes all nine squares board[0][0]..board[2][2] and produces the activations g1..g8.
4.3 Algorithm
There are six functions in our C source code:
1. ran1: generates a uniform random deviate between 0.0 and 1.0;
2. indi_init: initializes the network population of the first generation;
3. mutation: produces a single offspring;
4. feed_forward: selects a square in which to put an "X", as decided by the neural network;
5. rule_base: selects a square in which to put an "O", as decided by the rule-base algorithm;
6. main: initializes the 50 networks of the first generation; loops for 800 generations;
   in each generation selects 4 candidate networks and lets each play 8 games against
   the rule-base algorithm and 10 games against 10 randomly selected networks; selects
   the two better candidate networks as the parents; and substitutes the other two
   candidate networks with the offspring generated by mutating the parent networks.
The detailed algorithm is as follows:
function ran1
{ /* “Minimal” random number generator of Park and Miller with Bays-Durham shuffle
and added safeguards. Return a uniform random deviate between 0.0 and 1.0
(exclusive of the endpoint values). Call with idum a negative integer to initialize;
thereafter, do not alter idum between successive deviates in a sequence. RNMX
should approximate the largest floating value that is less than 1.*/
}
function indi_init
{ initialize the board, the value of each square is equal to 0.0;
initialize the weighted connection strengths and bias terms for the expert networks;
initialize the weighted connection strengths and bias terms for the gate network;
initialize other attributes of network to zero, such as pay_off, win_rule, lose_rule,
candidate and result;
}
function mutation
{ perturb all weight and bias terms of the expert networks by adding a Gaussian random
  variable;
  perturb all weight and bias terms of the gate network by adding a Gaussian random
  variable;
}
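The Gaussian deviates used by mutation can be derived from ran1's uniform deviates; one standard possibility is the Box-Muller transform (our sketch, not necessarily what the actual source does):

#include <math.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

extern double ran1(long *idum);  /* uniform deviate in (0,1), see above */
static long idum = -1;           /* generator state; negative to initialize */

/* Box-Muller: turn two uniform deviates into one Gaussian deviate
   with the given mean and standard deviation. */
double gauss(double mean, double sigma)
{
    double u1 = ran1(&idum);
    double u2 = ran1(&idum);
    return mean + sigma * sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}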
function feed_forward
{ give three inputs, which form a row, a column, or a diagonal on the board, to each of the
  eight expert networks;
  give nine inputs (the board pattern) to the gate network;
  calculate the nine outputs of each expert network;
  calculate the eight outputs of the gate network;
  multiply each output of the gate network by the corresponding outputs of the expert
  networks;
  calculate the final nine outputs of the whole modular network;
  put the nine outputs in a list sorted from the maximal strength to the minimal;
  check the list and find the empty square with the maximal output strength among all of
  the empty squares;  //select the square on which to put a marker
  if the network is the candidate network
  { set the square value to 1;
    if putting a marker on the square completes a line on the board, add the win score to
    the payoff of the candidate network, set the value of empty_square to 0 and exit;
  }
  else  //the network is playing as an opponent network
  { set the square value to -1;
    if putting a marker on the square completes a line on the board, add the lose score to
    the payoff of the candidate network, set the value of empty_square to 0 and exit;
  }
}
function rule_base
{ if there is a winning square, set its value to -1, add the lose score to the payoff of the
  candidate network, set the value of empty_square to 0 and exit;
  else if there is a blocking square, set its value to -1;
  else if two open squares are in line with an "O", randomly place a marker in either of
  the two squares;
  else randomly select an empty square and set its value to -1;
}
function main
{ do indi_init 50 times to generate the 50 networks of the first generation;
  do the following for 30 trials
  { while not reaching the maximum generation
    { randomly select 4 candidate networks from the population;
      for each candidate network do
      { //play with the rule-base algorithm
        feed_forward;
        for each of the eight possible second moves
        { put a marker on the square;
          while there is an empty square on the board
          { feed_forward;
            rule_base;
          }
        }
        randomly select 10 networks;
        for each selected network do
        { while there is an empty square on the board
          { to the candidate network do feed_forward;
            to the selected network do feed_forward;
          }
        }
      }
      get the final payoffs of the 4 candidate networks;
      select the two networks with the higher payoffs as the parents;
      to each of the parent networks do mutation, getting two offspring;
      substitute the other two networks with the two offspring;
      put those four networks back into the population, giving a new population;
    }
  }
}
4.4 Implementation
The key data structure in our program is:
typedef struct pop_network
{ double board[3][3];
  double expert_weight[8][3][9];
  double bias[8][9];
  double gate_weight[9][8];
  double gate_bias[8];
  int result;        // 1: win; 0: draw; -1: lose
  int pay_off;
  int candidate;
  int win_rule;
  int lose_rule;
} POP_NETWORK;
In this struct type:
• board[3][3] shows the board pattern. “1” means an “X” on the square, “-1”
means an “O” on the square, and “0” means the square is empty.
• expert_weight[8][3][9] shows the connection weights of the eight expert
networks. The first dimension shows there are eight expert networks, the second
dimension shows there are three inputs in each expert network, and the third
dimension shows the three inputs and nine outputs in each expert network are
fully connected.
• bias[8][9] shows the bias terms of the eight expert networks. The first dimension
shows there are eight expert networks, and the second dimension shows there are
nine outputs of each expert network.
• gate_weight[9][8] shows the connection weights of the gate network. The first
dimension shows there are 9 inputs in the gate network, the second dimension
shows there are 8 outputs, and the inputs and outputs are fully connected.
• gate_bias[8] shows the bias terms of the eight outputs of the gate network.
• result shows the result of the current game, with "1" for a win, "-1" for a loss,
and "0" for a draw.
• pay_off shows the final score of the candidate network after 18 games.
• candidate shows whether the network is a candidate network, with "1" for a
candidate network and "-1" for an opponent network.
• win_rule shows how many times the candidate network wins in the 8 games
against the rule base.
• lose_rule shows how many times the candidate network loses in the 8 games
against the rule base.
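Putting these fields together, the forward pass inside feed_forward can be sketched as follows (our reconstruction, reusing the expert_inputs() and softmax() helpers sketched earlier; the subtracted bias follows the convention of Section 2):

/* Compute the nine combined output strengths of the modular network. */
void modular_forward(const POP_NETWORK *net, double out[9])
{
    double u[8], g[8], in[3];

    /* Gate: nine board inputs -> eight activations, softmax-normalized. */
    for (int i = 0; i < 8; i++) {
        u[i] = -net->gate_bias[i];
        for (int s = 0; s < 9; s++)
            u[i] += net->gate_weight[s][i] * net->board[s / 3][s % 3];
    }
    softmax(u, g, 8);

    /* Experts: three line inputs -> nine outputs each, blended by the
       gate activations: out[m] = sum over i of g[i] * y_i[m]. */
    for (int m = 0; m < 9; m++)
        out[m] = 0.0;
    for (int i = 0; i < 8; i++) {
        expert_inputs(net->board, i, in);
        for (int m = 0; m < 9; m++) {
            double y = -net->bias[i][m];
            for (int k = 0; k < 3; k++)
                y += net->expert_weight[i][k][m] * in[k];
            out[m] += g[i] * y;
        }
    }
}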
4.5 Experiment Results
Fig 4.2 shows the best, worst, and mean scores against the rule-base algorithm (with the
random moves of step 2a) disabled) over the 8 games of each generation, across 30 trials,
as three curves (Series 1, Series 2, and Series 3), using payoffs of +1, -1, and 0 for
winning, losing, and playing to a draw. The best score is about 3, meaning the number of
games won exceeds the number of games lost by 3.
Fig 4.2 The best score, worst score and average score of each generation across 30 trials
on 8 games competed with rule base, using payoff function (win=+1, lose=-1, draw=0)
Fig 4.3 shows the best, worst, and mean scores against the rule-base algorithm (with the
random moves of step 2a) disabled) over the 8 games of each generation, across 30 trials,
as three curves (Series 1, Series 2, and Series 3), using payoffs of +10, -1, and 0 for
winning, losing, and playing to a draw. In this case, the variability of the score was greater
because a win receives ten more points than a draw, rather than only a single point more.
Fig 4.3 The best score, worst score and average score of each generation across 30 trials
on 8 games competed with rule base, using payoff function (win=+10, lose=-1, draw=0)
Fig 4.4 shows the best, worst, and mean scores against the rule-base algorithm (with the
random moves of step 2a) disabled) over the 8 games of each generation, across 30 trials,
as three curves (Series 1, Series 2, and Series 3), using payoffs of +1, -10, and 0 for
winning, losing, and playing to a draw. In this case, selection in light of the increased
penalty for losing purged losing strategies quickly from the evolving population. But even
after more than 700 generations the networks could not win more games; all they had
learned was not to lose.
We were not satisfied with the result obtained when using payoffs of +1, -10, and 0 for
winning, losing, and drawing, as shown in Fig 4.4. At the suggestion of our instructor, we
ran another experiment that used payoffs of +1, -10, and 0 for winning, losing, and
drawing in the first 800 generations and payoffs of +10, -1, and 0 in the next 800
generations. The result is shown in Fig 4.5. In this case, the performance of the networks
is much better. In the first 800 generations, the losing strategies were purged, but the
networks could not win more; the best individual earned a score near 0, meaning no losses
and no wins. In the next 800 generations, because a win earns ten more points than a
draw, the individual networks that were retained were those that could win more often.
In this case, the best individual earned a score near 50, meaning it won at least 5 of the 8
games. Clearly the payoff function influences the network's performance.
Fig 4.4 The best score, worst score and average score of each generation across 30 trials
on 8 games competed with rule base, using payoff function (win=+1, lose=-10, draw=0)
Fig 4.5 The best score, worst score and average score of each generation across 30 trials
on 8 games competed with rule base, using payoff function (win=+1, lose=-10, draw=0) in
generations 1..800 and payoff function (win=+10, lose=-1, draw=0) in generations 801..1600
5. Conclusion
In our project, a neural evolutionary strategy and a partitioned neural architecture are
combined to play tic-tac-toe. The coding structure used in our experiments includes
eight expert networks, each with three inputs and nine outputs, and one gate network
with nine inputs and eight outputs. In our experiments, no a priori information
regarding the object of the game was offered to the evolutionary algorithm. No hints
regarding appropriate moves were given, nor were there any attempts to assign values to
various board patterns. The only given information was that the three inputs of each
expert network lie in a row, a column, or a diagonal of the board. The final outcome (win,
lose, or draw) was the only information available regarding the quality of play.

From the results of these experiments, the influence of the payoff function on the
performance of the evolving neural networks is obvious. In future work, we could add
hidden layers to both the gate network and the expert networks in order to improve the
performance (win more, lose less) of the network, and we could use nonlinear activation
functions in the expert and gate networks.
References:
[1] Chellapilla, K. and Fogel, D.B., "Evolution, Neural Networks, Games, and Intelligence,"
Proceedings of the IEEE, vol. 87, no. 9, pp. 1471-1496, 1999.
[2] Haykin, S., Neural Networks: A Comprehensive Foundation, Macmillan, 1994,
ISBN 0-02-352761-7, Chapter 12.