
Particle Swarm Optimisation for Learning
Game Strategies
AP Engelbrecht
Department of Computer Science, University of Pretoria, South Africa
[email protected]
Contents
Introduction
Particle Swarm Optimization
Coevolution
Coevolutionary Game Learning
PSO Coevolutionary Algorithm
The Games
Performance Measures
Some Results
Conclusions and Future Developments
Introduction
What is the objective?
• to show that particle swarm optimization
(PSO) can be used to evolve game agents
in a coevolutionary context, and
• that it can be done by training agents from
zero knowledge of any playing strategies.
Sub-objectives
• to propose a generic framework for learning game strategies using a coevolutionary
PSO approach
• to show how the framework can be applied
to different games
• to propose performance measures
• to show results on some games
Assumptions
• Two players
• Turn-based (mostly)
• Minimal prior knowledge:
– no prior knowledge of playing strategies
– the rules of the game
– current state
– feedback: did you win, lose, or is it a draw?
Particle Swarm Optimization (PSO)
What is PSO?
• population-based, stochastic search algorithm
• modeled after simulations of the choreographic behavior of bird flocks
Components of PSO
• particle – position in n-dimensional search
space
• swarm of particles
• fitness function
• memory of previous best positions found
The search process:
• particles are “flown” through the hyperdimensional search space
• position update:
xi(t + 1) = xi(t) + vi(t)
• velocity update:
vi(t) = wvi(t − 1) + c1r1(t)(yi(t) − xi(t))
+ c2r2(t)(ŷi(t) − xi(t))
Movement is determined by three components:
• momentum: wvi(t − 1)
• cognitive component: c1r1(t)(yi(t)−xi(t))
• social component: c2r2(t)(ŷi(t) − xi(t))
• illustration: [figure: the inertia, cognitive, and social velocity components combining into the new velocity, shown in the (x1, x2) plane for (a) time step t, moving x(t) to x(t + 1), and (b) time step t + 1, moving x(t + 1) to x(t + 2)]
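As a concrete reading of the position and velocity updates above, here is a minimal single-particle update in Python (a sketch; the function name and parameter defaults are ours, not from the slides):

```python
import random

def update_particle(x, v, y, y_hat, w=0.7, c1=1.4, c2=1.4):
    """One velocity and position update for a single particle.
    x: current position, v: current velocity, y: personal best position,
    y_hat: neighbourhood (or global) best position. The parameter values
    are illustrative defaults, not taken from the slides."""
    new_v, new_x = [], []
    for j in range(len(x)):
        r1, r2 = random.random(), random.random()
        vj = (w * v[j]                           # momentum / inertia term
              + c1 * r1 * (y[j] - x[j])          # cognitive component
              + c2 * r2 * (y_hat[j] - x[j]))     # social component
        new_v.append(vj)
        new_x.append(x[j] + vj)                  # x(t+1) = x(t) + v(t+1)
    return new_x, new_v
```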
General PSO algorithm:
Create and initialize an n-dimensional
swarm, S;
repeat
for each particle i = 1, . . . , ns do
//set the personal best position
if f (S.xi) < f (S.yi) then
S.yi = S.xi;
end
//set the global best position
if f (S.yi) < f (S.ŷi) then
S.ŷi = S.yi;
end
end
for each particle i = 1, . . . , ns do
update the particle velocity ;
update the particle position ;
end
until stopping condition is true;
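The loop above, sketched for a minimisation problem and reusing the update_particle() helper from the earlier sketch; the swarm size, iteration count and initialisation box are illustrative choices:

```python
import random

def pso_minimize(f, n, ns=20, iters=200, lo=-1.0, hi=1.0):
    """gbest PSO sketch: minimize f over the box [lo, hi]^n,
    mirroring the algorithm above."""
    x = [[random.uniform(lo, hi) for _ in range(n)] for _ in range(ns)]
    v = [[0.0] * n for _ in range(ns)]
    y = [xi[:] for xi in x]              # personal best positions
    y_hat = min(y, key=f)[:]             # global best position
    for _ in range(iters):
        for i in range(ns):
            if f(x[i]) < f(y[i]):        # set the personal best position
                y[i] = x[i][:]
            if f(y[i]) < f(y_hat):       # set the global best position
                y_hat = y[i][:]
        for i in range(ns):
            x[i], v[i] = update_particle(x[i], v[i], y[i], y_hat)
    return y_hat
```

For example, pso_minimize(lambda p: sum(c * c for c in p), n=5) should drive the global best toward the origin.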
PSO parameters
• swarm size
• inertia weight
• acceleration coefficients
• velocity clamping
A note on PSO convergence:
• guaranteed convergence to the stable point
(c1yi + c2ŷ) / (c1 + c2)
provided that convergent parameters are used:
1 > w > (c1 + c2)/2 − 1
• no guarantee of local convergence
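For quick checks, the stable point and the parameter condition can be written directly (the helper names are ours):

```python
def stable_point(y, y_hat, c1=1.0, c2=1.0):
    """Per-dimension weighted average (c1*y + c2*y_hat) / (c1 + c2)
    to which the trajectories are attracted."""
    return [(c1 * yj + c2 * gj) / (c1 + c2) for yj, gj in zip(y, y_hat)]

def parameters_are_convergent(w, c1, c2):
    """True if 1 > w > (c1 + c2)/2 - 1, the condition quoted above."""
    return 1.0 > w > 0.5 * (c1 + c2) - 1.0
```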
Neighborhood topologies
• Star: gbest PSO
• Ring: lbest PSO
• Von Neumann
Different PSO implementations:
• Constriction coefficient:
vij(t + 1) = χ[vij(t) + φ1(yij(t) − xij(t)) + φ2(ŷj(t) − xij(t))]
where
χ = 2κ / |2 − φ − √(φ(φ − 4))|
with
φ = φ1 + φ2, φ1 = c1r1, φ2 = c2r2
under the constraints that φ ≥ 4 and κ ∈ [0, 1].
• GCPSO (guaranteed convergence PSO), which changes the update of the global best particle, τ:
xτj(t + 1) = ŷj(t) + wvτj(t) + ρ(t)(1 − 2r2j(t))
vτj(t + 1) = −xτj(t) + ŷj(t) + wvτj(t) + ρ(t)(1 − 2r2j(t))
where the search radius ρ(t) is adapted as
ρ(t + 1) = 2ρ(t) if #successes(t) > s
ρ(t + 1) = 0.5ρ(t) if #failures(t) > c
ρ(t + 1) = ρ(t) otherwise
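A small sketch of the constriction coefficient χ and the GCPSO search-radius update above; the helper names and the threshold defaults are ours, not values from the slides:

```python
import math

def constriction_coefficient(phi, kappa=1.0):
    """chi = 2*kappa / |2 - phi - sqrt(phi*(phi - 4))|,
    valid for phi >= 4 and kappa in [0, 1]."""
    return 2.0 * kappa / abs(2.0 - phi - math.sqrt(phi * (phi - 4.0)))

def update_rho(rho, successes, failures, s=15, c=5):
    """GCPSO search radius: doubled after more than s consecutive successes,
    halved after more than c consecutive failures, otherwise unchanged.
    The threshold defaults here are illustrative only."""
    if successes > s:
        return 2.0 * rho
    if failures > c:
        return 0.5 * rho
    return rho
```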
Coevolution
Multiple populations co-exist in the same environment, where limited resources are shared.
Two forms of coevolution:
• cooperative/symbiotic
• competitive
Credit assignment:
• not based on an absolute fitness measure
• uses a relative fitness measure, allowing for a moving target
• relative fitness evaluation
• fitness sampling
The PSO model employs both cooperative and
competitive coevolution
Coevolutionary Game Learning
Chellapilla and Fogel’s model for Checkers
Three components:
• a game tree
• a neural network evaluation function of
leaf nodes of the game tree
• a population of neural networks (evaluation functions)
Training process:
• competitive, based on only one population
• relative fitness function: score each individual based on performance against a randomly selected group of individuals
• weight adjustments using evolutionary programming (EP)
General PSO Coevolutionary
Algorithm
Used for the games: deterministic TicTacToe, Checkers, the iterated prisoner's dilemma (IPD), Bao, and probabilistic TicTacToe
Based on the model of Chellapilla and Fogel,
but using PSO to adjust weights
Create and randomly initialize a swarm of NNs;
repeat
Add all personal best positions to the competition
pool;
Add all particles to the competition pool;
for each particle (or NN) do
Randomly select a group of opponents from the
competition pool;
for each opponent do
Play a number of games against the opponent, playing as the first player;
Record if games were won, lost or drawn;
Play a number of games against the opponent, but as the second player;
Record if games were won, lost or drawn;
end
Determine a score for each particle;
Compute new personal best positions based on
scores;
end
Compute neighbor best positions;
Update particle velocities;
Update particle positions;
until stopping condition is true;
Return global best particle as game playing agent;
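The relative-fitness part of this loop, sketched in Python. The play(a, b) routine (assumed to return 'win', 'loss' or 'draw' from the first player's perspective) and the score mapping (for example the per-game scoring tables given below) are supplied by the specific game; all names are illustrative:

```python
import random

FLIP = {"win": "loss", "loss": "win", "draw": "draw"}

def coevolutionary_scores(particles, personal_bests, play, score,
                          n_opponents=5, n_games=5):
    """Relative fitness: each particle (an NN weight vector) plays a random
    sample of opponents drawn from the competition pool and accumulates a
    score for the games it plays as first and as second player."""
    pool = personal_bests + particles                 # competition pool
    scores = []
    for p in particles:
        total = 0
        for opponent in random.sample(pool, n_opponents):
            for _ in range(n_games):
                total += score[play(p, opponent)]        # p as first player
                total += score[FLIP[play(opponent, p)]]  # p as second player
        scores.append(total)
    return scores
```

The resulting scores are then used to update the personal best positions, after which the usual velocity and position updates follow.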
The Games
Deterministic TicTacToe
• Scoring mechanism:
1 for a win
0 for a draw
-2 for a loss
• Neural network architecture:
sigmoid activation functions
9 input units:
– 1.0 to represent your own piece
– 0.0 to represent opponent’s piece
– 0.5 to represent empty space
1 output unit
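A sketch of this input encoding, assuming the board is a flat list of 9 cells (the representation and the function name are ours):

```python
def encode_board(board, me, opponent):
    """Map a 9-cell TicTacToe board to the 9 NN inputs described above:
    1.0 for an own piece, 0.0 for an opponent piece, 0.5 for an empty square."""
    inputs = []
    for cell in board:
        if cell == me:
            inputs.append(1.0)
        elif cell == opponent:
            inputs.append(0.0)
        else:
            inputs.append(0.5)
    return inputs
```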
Probabilistic TicTacToe:
• Game rules:
– 4 × 4 × 4 board
– a fair four-sided die is used to determine the layer to play in
– if a move cannot be made on a layer, the
player misses that round
– game ends when there are no more empty
spaces
– the player with the most completed rows,
columns and diagonals of length 4 wins
• Neural network architecture:
sigmoid activation functions
64 input units:
– 0.5 to represent your own piece
– -0.5 to represent opponent’s piece
– 0.0 to represent empty space
1 output unit
Checkers:
• State space: 10^18
• Scoring mechanism:
1 for a win
0.5 for a draw
0.0 for a loss
• Neural network architecture:
sigmoid activation functions
32 input units:
– 1.0 to represent your king
– 0.75 to represent your man
– 0.5 to represent an empty space
– 0.25 to represent opponent’s man
– 0.0 to represent opponent’s king
1 output unit
Bao:
• State space: 10^25
• The Bao Board:
• Components:
– Kitchwa – pits on the sides of second
row
– Kimbi – second row pits
– Nyumba – 5th pit in second row
– 64 Kete (or seeds/stones)
• Goal:
– capture the Kete from the front row of
the opponent
– deprive the opponent of any valid moves
• Initial board state:
– each player starts with 10 kete on the
board
– 6 kete in the Nyumba
– 2 kete in the Kimbi to the right of the
Nyumba
– 2 kete in the next Kimbi
– South makes the first move
• Basic move: sowing
– clockwise or anti-clockwise
– direction cannot change during a move
– if a sowing ends in a non-empty pit in the second row, and the opposing pit in the opponent's second row is non-empty, then capturing occurs
– capturing: all counters of the opponent's pit are removed and sown in your own rows
– if the capturing pit is the left kitchwa or kimbi, sow clockwise, else anti-clockwise
– capturing is enforced when possible
– if capturing is not possible, sowing continues from the ending point of the previous sowing process – referred to as endelea
– sowing continues until it ends in an empty pit
• A move:
– is a sequence of sowing and capturing actions
– only ends when a sowing action ends in
an empty pit
• Phase 1 (Namua phase):
– each player has 22 kete in his store
– for each move, place one kete into a nonempty kitchwa or kimbi
– only second row pits are played
– if capture not possible, continue sowing
from current pit
• Phase 2 (Mtaji phase):
– starts when the stores of both players are empty
– only pits with more than one kete may
be sown
– may sow from any of the two rows
– may sow in any direction (except in case
of capture)
• Scoring mechanism:
5 for a win
-10 for a loss
• Neural network architecture:
sigmoid activation functions
32 input units, with values indicating the
number of seeds in the corresponding pit
Performance Measures
The problem:
• How can we fairly evaluate the performance
of a game agent?
• Without playing thousands of games against
human players or other computer-based
agents?
Record the number of games won, lost, or drawn:
• gives us three different measures
• how do we rank performance based on three
measures?
Evaluate performance against an agent making
random moves over many games:
• approximate the probability of winning,
losing, or ending in a draw
• play a random player 1 million games as
the first player against another random
player, and then as the second player
• record total number of games lost, won, or
drawn
• compute probabilities of winning, losing,
drawing
Game                    | Won as 1st player | Won as 2nd player | Drawn
Deterministic TicTacToe | 0.588             | 0.288             | 0.124
Probabilistic TicTacToe | 0.50776           | 0.44367           | 0.04857
Checkers                | 0.43              | 0.43              | 0.14
Performance measure 1: The M-measure
• Deterministic TicTacToe:
M = (w1 − 58.8) + (w2 − 28.8)
where
w1 is number of games won as 1st player
against random player
w2 is number of games won as 2nd player
against random player
• The higher the M-value, the more successful the agent
• Confidence intervals:
M ± zα/2 × (σ̂ / √n)
where zα/2 is such that P[Z ≥ zα/2] = α/2, with Z ∼ N(0, 1), and
σ̂ = √( π̂1(1 − π̂1) + π̂2(1 − π̂2) + 2σ̂12 )
where π̂1 and π̂2 represent the probabilities for winning as player one and player two respectively, and
σ̂12 = [ Σ_{i=1..m} x1i x2i − (Σ_{i=1..m} x1i)(Σ_{i=1..m} x2i) / m ] / (m − 1)
where m represents the total number of games played, and x1i and x2i represent the outcome for game i as player one and player two respectively
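A sketch of the M-measure and its confidence interval, assuming w1 and w2 are win counts out of 100 games in each role (so that 58.8 and 28.8 are the expected wins of a random player from the table above), and taking n to be the number m of games played; the names and the z value are ours:

```python
import math

# Expected wins (out of 100 games) of a random first and second player
# for deterministic TicTacToe, from the random-vs-random table above.
BASE_W1, BASE_W2 = 58.8, 28.8

def m_measure(w1, w2):
    """M = (w1 - 58.8) + (w2 - 28.8), with w1, w2 the agent's wins as
    first and second player (assumed counts out of 100 games per role)."""
    return (w1 - BASE_W1) + (w2 - BASE_W2)

def m_interval_halfwidth(x1, x2, z=1.96):
    """Half-width z * sigma_hat / sqrt(m) of the interval above, from
    per-game win indicators x1 (as player one) and x2 (as player two);
    z = 1.96 corresponds to a 95% interval."""
    m = len(x1)
    pi1, pi2 = sum(x1) / m, sum(x2) / m
    cov = (sum(a * b for a, b in zip(x1, x2))
           - sum(x1) * sum(x2) / m) / (m - 1)
    sigma = math.sqrt(pi1 * (1 - pi1) + pi2 * (1 - pi2) + 2 * cov)
    return z * sigma / math.sqrt(m)
```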
Performance measure 2: The F-measure
• The M-measure neglects draws, which should be taken into consideration for some games
• Calculate the mean of the observed probabilities using the standard mean function for discrete random variables:
F = Σ_{i=1..3} xi f(xi)
where xi represents a weight associated with losing, drawing and winning the game respectively, and f(xi) represents the corresponding outcome probability
• Calculate F1 to represent performance as player 1
Calculate F2 to represent performance as player 2
Compute the average of F1 and F2
Normalize to the range [0, 100]
• A value close to 100 indicates that the
player wins most of its games
• A value of 50 is obtained for random players
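A sketch of the F-measure; the outcome weights are not fixed on the slides, so the defaults below are an assumption chosen to reproduce the stated properties:

```python
def f_value(p_loss, p_draw, p_win, weights=(0.0, 0.5, 1.0)):
    """F = sum_i x_i * f(x_i) over the three outcomes (loss, draw, win).
    The slides leave the weights x_i open; (0, 0.5, 1) is one choice that
    reproduces the stated property that a random player scores 50."""
    return sum(x * p for x, p in zip(weights, (p_loss, p_draw, p_win)))

def f_measure(outcomes_as_p1, outcomes_as_p2):
    """Average the F-values obtained as player 1 and player 2 and scale to
    [0, 100]; each argument is a (p_loss, p_draw, p_win) probability triple."""
    return 100.0 * (f_value(*outcomes_as_p1) + f_value(*outcomes_as_p2)) / 2.0
```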
Some Results
Deterministic TicTacToe: M-measure
• c1 = c2 = 1.0, w = 1.0
• Varying swarm sizes (7 hidden units)
• Varying number of hidden units (Von Neumann); rows are swarm sizes s, columns are numbers of hidden units, entries are M-values with confidence intervals:

s   | 3                  | 7                  | 9                  | 11                 | 15                 | Averages
15  | 31.05670 ± 0.00198 | 31.31300 ± 0.00198 | 31.91870 ± 0.00197 | 32.29270 ± 0.00196 | 29.33330 ± 0.00198 | 30.43473 ± 0.00198
20  | 29.14400 ± 0.00199 | 34.84900 ± 0.00195 | 33.33230 ± 0.00196 | 33.44530 ± 0.00195 | 32.03730 ± 0.00196 | 32.05936 ± 0.00197
25  | 30.07170 ± 0.00198 | 32.63100 ± 0.00196 | 32.21170 ± 0.00195 | 31.04230 ± 0.00197 | 31.06070 ± 0.00198 | 31.50577 ± 0.00197
30  | 28.56370 ± 0.00201 | 31.69570 ± 0.00198 | 28.65230 ± 0.00200 | 31.56730 ± 0.00197 | 26.56270 ± 0.00202 | 29.57753 ± 0.00199
35  | 32.63000 ± 0.00196 | 27.64570 ± 0.00200 | 27.96570 ± 0.00200 | 29.15730 ± 0.00199 | 28.48070 ± 0.00201 | 29.80906 ± 0.00199
40  | 29.39830 ± 0.00199 | 30.28770 ± 0.00198 | 27.30300 ± 0.00200 | 29.94200 ± 0.00199 | 32.70270 ± 0.00196 | 29.70453 ± 0.00199
45  | 29.97770 ± 0.00198 | 29.61200 ± 0.00199 | 29.77130 ± 0.00199 | 29.71600 ± 0.00200 | 27.84630 ± 0.00201 | 29.05300 ± 0.00199
50  | 24.87400 ± 0.00201 | 30.99930 ± 0.00199 | 30.71830 ± 0.00199 | 28.92130 ± 0.00200 | 26.44130 ± 0.00201 | 28.66474 ± 0.00200
Avg | 29.46451 ± 0.00199 | 31.12918 ± 0.00198 | 30.23416 ± 0.00198 | 30.44489 ± 0.00198 | 29.30813 ± 0.00199 |
• Comparing neighborhood structures

Illustrating the arms race effect:

Checkers: F-measure
• c1 = c2 = 1.0, w = 1.0
• Varying swarm size (5 hidden units)
• Varying number of hidden units
• Comparing neighborhood structures
• Influence of velocity clamping
• Influence of c1 and c2 (Vmax = 0.2)
• Influence of inertia weight
• Performance based on:
Vmax = 0.2, c1 = c2 = 1.0, w = 0.9
• Using sliding windows (15 particles, 5 hidden units)
Window dimensions | Total inputs | Performance (F-value)
8-by-8            | 32           | 72.524
7-by-7            | 98           | 76.879
6-by-6            | 162          | 73.331
5-by-5            | 200          | 75.093
4-by-4            | 200          | 73.408
3-by-3            | 162          | 74.925
2-by-2            | 98           | 76.092
• Restrictions on the number of game moves
Comparing Checkers against other approaches
• Piece count evaluation function:
evaluation = (µ × pieceCountAdvantage) + pieceCount + η
where
µ biases the piece count advantage
pieceCountAdvantage is the difference in piece count
η is a random tie-breaker
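A sketch of this evaluation function; µ and the tie-breaker scale are illustrative values, and pieceCount is taken as the player's own piece count:

```python
import random

def piece_count_eval(own_pieces, opponent_pieces, mu=4.0, eta_scale=0.01):
    """evaluation = (mu * pieceCountAdvantage) + pieceCount + eta.
    mu and eta_scale are illustrative; neither is specified on the slides."""
    piece_count_advantage = own_pieces - opponent_pieces
    eta = random.uniform(0.0, eta_scale)      # random tie-breaker
    return mu * piece_count_advantage + own_pieces + eta
```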
• SmartEval – Martin Fierz
– Determines whether a back rank guard
is available.
– Determines if an intact double corner
exists.
– Evaluates control of the central area of
the board.
– Brings the number of pieces on the edges
of the board into consideration.
– Calculates material advantage.
Bao:
• hand-crafted evaluation function:
– If winning state for North player, return
100000.
– If winning state for South player, return
-100000.
– Initialize the evaluation value to 0.
– For each kete possessed by the North
player, add 1 to the evaluation value.
– For each kete possessed by the South
player, subtract 1.
– Add (44 - number of kete in the front row of the South player).
– Subtract (44 - number of kete in the front row of the North player).
– Return the evaluation value.
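The hand-crafted evaluation written out as a sketch; the state attribute names are assumptions made for illustration:

```python
def bao_eval(state):
    """Hand-crafted Bao evaluation following the steps above; positive
    values favour North. The `state` attributes (kete totals, front-row
    totals and winning-state tests) are assumed names."""
    if state.north_has_won():
        return 100000
    if state.south_has_won():
        return -100000
    value = 0
    value += state.north_kete                  # +1 per kete held by North
    value -= state.south_kete                  # -1 per kete held by South
    value += 44 - state.south_front_row_kete   # fewer kete in South's front row favours North
    value -= 44 - state.north_front_row_kete   # fewer kete in North's front row favours South
    return value
```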
• Different models, distinguished by what the swarm competes against:
Model A: competition pool
Model B: competition pool and random
agents
Model C: added expert player
Model D: expert player added after 250
iterations
• Performance, as the percentage of games won against opponents, using gbest PSO
• Model A & B against random player:
• Models against random players:
Model | Ply 1 | Ply 2 | Ply 3
B     | 75%   | 87%   | 92%
C     | 72%   | 82%   | 86%
D     | 78%   | 85%   | 92%
• Model B against the expert [table: percentage of games won by the game agent playing as South, as North, and averaged over both directions, at ply-depths 1-3, against the expert player at ply-depths 1-3]
• Model C against the expert, percentage of games won by the game agent playing as South, as North, and averaged over both directions (columns for expert ply-depths 3 and 4 are not legible):

Game agent     | Expert ply 1 | Expert ply 2
South, ply 1   | 80           | 80
South, ply 2   | 100          | 50
South, ply 3   | 90           | 70
North, ply 1   | 100          | 100
North, ply 2   | 70           | 30
North, ply 3   | 80           | 60
Average, ply 1 | 90           | 90
Average, ply 2 | 85           | 40
Average, ply 3 | 85           | 65
• Model D against the expert [table: percentage of games won by the game agent playing as South, as North, and averaged over both directions, at ply-depths 1-3, against the expert player at ply-depths 1-4]
Final Remarks
Have shown that a PSO-based approach can train game-playing agents in a coevolutionary manner
The approach has been applied successfully to the games: deterministic TicTacToe, probabilistic TicTacToe, Checkers, Bao, and the iterated prisoner's dilemma (IPD)
Issues:
• Computational complexity
• Stagnating behavior
• How well does the approach compare against
other approaches?
Future plans:
• Analyze games to identify playing strategies
• Further improve performance by addressing stagnation behavior
• Develop system to play and learn against
human players
• Include niching strategies to evolve multiple playing strategies
• Improve performance by using improved
PSO algorithms
• Investigate mechanisms to reduce computational complexity
• Perform elaborate evaluations against other
game agents
• Extend to more complex probabilistic games
• Evolve different strategies for the different
phases of Bao