Particle Swarm Optimisation for Learning Game Strategies

AP Engelbrecht
Department of Computer Science, University of Pretoria, South Africa
[email protected]

Contents
Introduction
Particle Swarm Optimization
Coevolution
Coevolutionary Game Learning
PSO Coevolutionary Algorithm
The Games
Performance Measures
Some Results
Conclusions and Future Developments

Introduction

What is the objective?
• to show that particle swarm optimization (PSO) can be used to evolve game agents in a coevolutionary context, and
• that this can be done by training agents from zero knowledge of any playing strategies

Sub-objectives
• to propose a generic framework for learning game strategies using a coevolutionary PSO approach
• to show how the framework can be applied to different games
• to propose performance measures
• to show results on some games

Assumptions
• Two players
• Turn-based (mostly)
• Minimal prior knowledge:
  – no prior knowledge of playing strategies
  – the rules of the game
  – the current state
  – feedback: did you win, lose, or is it a draw?

Particle Swarm Optimization (PSO)

What is PSO?
• a population-based, stochastic search algorithm
• modeled after simulations of the choreographic behavior of bird flocks

Components of PSO
• particle – a position in the n-dimensional search space
• a swarm of particles
• a fitness function
• memory of previous best positions found

The search process:
• particles are “flown” through the hyperdimensional search space
• position update: xi(t + 1) = xi(t) + vi(t)
• velocity update: vi(t) = w·vi(t − 1) + c1·r1(t)·(yi(t) − xi(t)) + c2·r2(t)·(ŷi(t) − xi(t))

Movement is determined by three components:
• momentum: w·vi(t − 1)
• cognitive component: c1·r1(t)·(yi(t) − xi(t))
• social component: c2·r2(t)·(ŷi(t) − xi(t))

[Figure: geometric illustration of how the inertia, cognitive and social velocity components combine into the new velocity at time steps t and t + 1]

General PSO algorithm:

Create and initialize an n-dimensional swarm, S;
repeat
    for each particle i = 1, ..., ns do
        // set the personal best position
        if f(S.xi) < f(S.yi) then
            S.yi = S.xi;
        end
        // set the global best position
        if f(S.yi) < f(S.ŷ) then
            S.ŷ = S.yi;
        end
    end
    for each particle i = 1, ..., ns do
        update the particle velocity;
        update the particle position;
    end
until stopping condition is true;

PSO parameters
• swarm size
• inertia weight
• acceleration coefficients
• velocity clamping

A note on PSO convergence:
• particle trajectories are guaranteed to converge to the stable point
  (c1·yi + c2·ŷ) / (c1 + c2)
  provided that convergent parameter settings are used:
  1 > w > (c1 + c2)/2 − 1
• there is no guarantee of convergence to a local optimum

Neighborhood topologies
[Figure: the star (gbest PSO), ring (lbest PSO) and Von Neumann neighborhood topologies]

Different PSO implementations:
• Constriction coefficient:
  vij(t + 1) = χ[vij(t) + φ1(yij(t) − xij(t)) + φ2(ŷj(t) − xij(t))]
  where
  χ = 2κ / |2 − φ − √(φ² − 4φ)|
  with φ = φ1 + φ2, φ1 = c1·r1, φ2 = c2·r2,
  under the constraints that φ ≥ 4 and κ ∈ [0, 1]
• Guaranteed convergence PSO (GCPSO), applied to the global best particle τ:
  xτj(t + 1) = ŷj(t) + w·vτj(t) + ρ(t)(1 − 2r2j(t))
  vτj(t + 1) = −xτj(t) + ŷj(t) + w·vτj(t) + ρ(t)(1 − 2r2j(t))
  with the search radius ρ adapted as
  ρ(t + 1) = 2ρ(t)    if #successes(t) > sc
             0.5ρ(t)  if #failures(t) > fc
             ρ(t)     otherwise
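The following is a minimal sketch of the gbest PSO loop described above, applied to a toy minimisation problem. The sphere function, the swarm size, and the parameter values w = 0.72 and c1 = c2 = 1.49 are illustrative choices (they satisfy the convergence condition above); they are not settings taken from the experiments reported later.

# Minimal gbest PSO sketch on the sphere function (illustrative settings only).
import numpy as np

def sphere(x):
    """Toy fitness function to be minimised."""
    return np.sum(x ** 2)

def gbest_pso(f, dim=10, n_particles=20, iters=200, w=0.72, c1=1.49, c2=1.49):
    rng = np.random.default_rng(0)
    x = rng.uniform(-5.0, 5.0, (n_particles, dim))   # particle positions
    v = np.zeros((n_particles, dim))                  # particle velocities
    y = x.copy()                                      # personal best positions
    y_fit = np.array([f(p) for p in x])               # personal best fitnesses
    g = y[np.argmin(y_fit)].copy()                    # global best position

    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(dim), rng.random(dim)
            # velocity update: momentum + cognitive + social components
            v[i] = w * v[i] + c1 * r1 * (y[i] - x[i]) + c2 * r2 * (g - x[i])
            x[i] = x[i] + v[i]                        # position update
            fit = f(x[i])
            if fit < y_fit[i]:                        # update personal best
                y[i], y_fit[i] = x[i].copy(), fit
                if fit < f(g):                        # update global best
                    g = x[i].copy()
    return g, f(g)

best, best_fit = gbest_pso(sphere)
print(best_fit)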
Coevolution

Multiple populations co-exist in the same environment, where limited resources are shared.

Two forms of coevolution:
• cooperative/symbiotic
• competitive

Credit assignment:
• not based on an absolute fitness measure
• uses a relative fitness measure, allowing a moving target
• relative fitness evaluation
• fitness sampling

The PSO model employs both cooperative and competitive coevolution.

Coevolutionary Game Learning

Chellapilla and Fogel’s model for Checkers

Three components:
• a game tree
• a neural network evaluation function for the leaf nodes of the game tree
• a population of neural networks (evaluation functions)

Training process:
• competitive, based on only one population
• relative fitness function: score each individual based on its performance against a randomly selected group of individuals
• weight adjustments using evolutionary programming (EP)

General PSO Coevolutionary Algorithm

Used for the games: deterministic TicTacToe, Checkers, the iterated prisoner’s dilemma (IPD), Bao, and probabilistic TicTacToe.

Based on the model of Chellapilla and Fogel, but using PSO to adjust the weights (a sketch of the relative-fitness scoring step appears after the game descriptions below):

Create and randomly initialize a swarm of NNs;
repeat
    Add all personal best positions to the competition pool;
    Add all particles to the competition pool;
    for each particle (or NN) do
        Randomly select a group of opponents from the competition pool;
        for each opponent do
            Play a number of games against the opponent as the first player;
            Record whether the games were won, lost or drawn;
            Play a number of games against the opponent as the second player;
            Record whether the games were won, lost or drawn;
        end
        Determine a score for each particle;
        Compute new personal best positions based on the scores;
    end
    Compute neighborhood best positions;
    Update particle velocities;
    Update particle positions;
until stopping condition is true;
Return the global best particle as the game playing agent;

The Games

Deterministic TicTacToe
• Scoring mechanism:
  – 1 for a win
  – 0 for a draw
  – -2 for a loss
• Neural network architecture: sigmoid activation functions
  – 9 input units:
    1.0 to represent your own piece
    0.0 to represent an opponent’s piece
    0.5 to represent an empty space
  – 1 output unit

Probabilistic TicTacToe
• Game rules:
  – 4 × 4 × 4 board
  – a four-sided fair die is used to determine the layer to play in
  – if a move cannot be made on that layer, the player misses that round
  – the game ends when there are no more empty spaces
  – the player with the most completed rows, columns and diagonals of length 4 wins
• Neural network architecture: sigmoid activation functions
  – 64 input units:
    0.5 to represent your own piece
    -0.5 to represent an opponent’s piece
    0.0 to represent an empty space
  – 1 output unit

Checkers
• State space: 10^18
• Scoring mechanism:
  – 1 for a win
  – 0.5 for a draw
  – 0.0 for a loss
• Neural network architecture: sigmoid activation functions
  – 32 input units:
    1.0 to represent your king
    0.75 to represent your man
    0.5 to represent an empty space
    0.25 to represent an opponent’s man
    0.0 to represent an opponent’s king
  – 1 output unit
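The sketch below illustrates how a Checkers board could be mapped to the 32-unit input coding just described and evaluated by a small sigmoid network whose weight vector would form one PSO particle. The hidden-layer size (5 units) and the flat weight layout are illustrative assumptions; the slides specify only the input coding and the use of sigmoid activations.

# Board-to-input encoding for Checkers and a 32-5-1 evaluation network
# whose weights form one PSO particle (layout is an assumption).
import numpy as np

# Input coding from the slides: own king 1.0, own man 0.75, empty 0.5,
# opponent's man 0.25, opponent's king 0.0.
PIECE_CODES = {"K": 1.0, "M": 0.75, ".": 0.5, "m": 0.25, "k": 0.0}

def encode_board(squares):
    """Map the 32 playable squares to the network's input vector."""
    return np.array([PIECE_CODES[s] for s in squares])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def evaluate(weights, squares, n_in=32, n_hidden=5):
    """Evaluate a board with a feed-forward network stored as a flat weight vector."""
    w1_end = n_in * n_hidden
    W1 = weights[:w1_end].reshape(n_hidden, n_in)          # input-to-hidden weights
    b1 = weights[w1_end:w1_end + n_hidden]                 # hidden biases
    W2 = weights[w1_end + n_hidden:w1_end + 2 * n_hidden]  # hidden-to-output weights
    b2 = weights[-1]                                       # output bias
    h = sigmoid(W1 @ encode_board(squares) + b1)
    return sigmoid(W2 @ h + b2)   # higher output = better board for the mover

# Example: the standard Checkers starting position from the mover's viewpoint.
start = ["M"] * 12 + ["."] * 8 + ["m"] * 12
n_weights = 32 * 5 + 5 + 5 + 1
particle = np.random.default_rng(1).uniform(-1, 1, n_weights)
print(evaluate(particle, start))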
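The relative-fitness scoring step of the coevolutionary algorithm shown earlier can be sketched as follows. The number of opponents, the number of games per side, and the play_game placeholder are illustrative assumptions; the scoring values follow the deterministic TicTacToe scheme above (win 1, draw 0, loss -2).

# Relative-fitness scoring against a sampled competition pool (sketch).
import random

def play_game(first_player, second_player):
    """Placeholder for a real game engine: outcome from first_player's viewpoint."""
    return random.choice(["win", "draw", "loss"])

SCORE = {"win": 1, "draw": 0, "loss": -2}

def relative_fitness(particle, competition_pool, n_opponents=5, games_per_side=1):
    """Score one particle against a random sample of opponents from the pool."""
    opponents = random.sample(competition_pool, n_opponents)
    score = 0
    for opponent in opponents:
        for _ in range(games_per_side):
            # play as the first player
            score += SCORE[play_game(particle, opponent)]
            # play as the second player (invert the outcome to this particle's viewpoint)
            outcome = play_game(opponent, particle)
            score += SCORE[{"win": "loss", "loss": "win", "draw": "draw"}[outcome]]
    return score

def score_swarm(particles, personal_bests):
    """One relative-fitness evaluation pass: pool = current particles + personal bests."""
    pool = list(particles) + list(personal_bests)
    return [relative_fitness(p, pool) for p in particles]

print(score_swarm(list(range(10)), list(range(10, 20))))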
Bao
• State space: 10^25
• The Bao board:
  [Figure: the Bao board]
• Components:
  – Kitchwa – the pits at the sides of the second row
  – Kimbi – second-row pits
  – Nyumba – the 5th pit in the second row
  – 64 Kete (seeds/stones)
• Goal:
  – capture the kete from the front row of the opponent, or
  – deprive the opponent of any valid moves
• Initial board state:
  – each player starts with 10 kete on the board:
    6 kete in the Nyumba
    2 kete in the Kimbi to the right of the Nyumba
    2 kete in the next Kimbi
  – South makes the first move
• Basic move: sowing
  – clockwise or anti-clockwise
  – the direction cannot change during a move
  – if sowing ends on a non-empty pit in the second row, and the opposing pit in the opponent’s second row is non-empty, then capturing occurs
  – capturing: all counters in the opponent’s pit are removed and sown in your own rows
  – if the capturing pit is the left kitchwa or kimbi, sow clockwise, otherwise anti-clockwise
  – capturing is enforced when possible
  – if capturing is not possible, sowing continues from the ending point of the previous sowing step – referred to as endelea
  – sowing continues until it ends in an empty pit
• A move:
  – is a sequence of sowing and capturing actions
  – only ends when a sowing action ends in an empty pit
• Phase 1 (Namua phase):
  – each player has 22 kete in his store
  – for each move, place one kete into a non-empty kitchwa or kimbi
  – only second-row pits are played
  – if a capture is not possible, continue sowing from the current pit
• Phase 2 (Mtaji phase):
  – starts when the stores of both players are empty
  – only pits with more than one kete may be sown
  – a player may sow from any of the two rows
  – a player may sow in any direction (except in the case of a capture)
• Scoring mechanism:
  – 5 for a win
  – -10 for a loss
• Neural network architecture: sigmoid activation functions
  – 32 input units, with values indicating the number of seeds in the corresponding pit

Performance Measures

The problem:
• How can we fairly evaluate the performance of a game agent?
• Without playing thousands of games against human players or other computer-based agents?

Record the number of games won, lost, or drawn:
• gives us three different measures
• how do we rank performance based on three measures?

Evaluate performance against an agent making random moves over many games:
• approximate the probability of winning, losing, or ending in a draw
• let a random player play 1 million games as the first player against another random player, and then 1 million games as the second player
• record the total number of games won, lost, or drawn
• compute the probabilities of winning, losing, and drawing:

  Game                      Won as 1st player   Won as 2nd player   Drawn
  Deterministic TicTacToe   0.588               0.288               0.124
  Probabilistic TicTacToe   0.50776             0.44367             0.04857
  Checkers                  0.43                0.43                0.14

Performance measure 1: The M-measure
• Deterministic TicTacToe:
  M = (w1 − 58.8) + (w2 − 28.8)
  where
  w1 is the number of games won as the 1st player against the random player
  w2 is the number of games won as the 2nd player against the random player
• The higher the M-value, the more successful the agent
• Confidence intervals:
  M ± zα/2 · σ̂/√n
  where zα/2 is such that P[Z ≥ zα/2] = α/2, with Z ∼ N(0, 1), and
  σ̂ = √( π̂1(1 − π̂1) + π̂2(1 − π̂2) + 2σ̂12 )
  where π̂1 and π̂2 are the probabilities of winning as player one and player two respectively, and
  σ̂12 = (1/(m − 1)) [ Σ_{i=1..m} x1i·x2i − (Σ_{i=1..m} x1i)(Σ_{i=1..m} x2i)/m ]
  where m is the total number of games played, and x1i and x2i are the outcomes of game i as player one and player two respectively

Performance measure 2: The F-measure
• The M-measure neglects draws, which should be taken into consideration for some games
• Calculate the mean of the observed outcome probabilities using the standard mean function for a discrete random variable:
  F = Σ_{i=1..3} xi·f(xi)
  where xi is a weight associated with losing, drawing and winning the game respectively, and f(xi) is the corresponding outcome probability
• Calculate F1 to represent performance as player 1 and F2 to represent performance as player 2; compute the average of F1 and F2, and normalize to the range [0, 100]
• A value close to 100 indicates that the player wins most of its games
• A value of 50 is obtained for random players
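A small sketch of both performance measures follows. The random-player baselines (58.8% and 28.8%) come from the table above; the assumption of 100 games per playing position for the M-measure, and the particular F-measure outcome weights (loss 0, draw 1, win 2, rescaled to [0, 100]), are illustrative choices rather than values stated in the slides.

# Sketch of the M-measure and F-measure described above.
def m_measure(wins_as_first, wins_as_second):
    """M-measure for deterministic TicTacToe, assuming 100 games in each role:
    wins above what a random player achieves against another random player."""
    return (wins_as_first - 58.8) + (wins_as_second - 28.8)

def f_side(p_loss, p_draw, p_win, weights=(0.0, 1.0, 2.0)):
    """F-value for one playing position: mean outcome weight, scaled to [0, 100]."""
    raw = weights[0] * p_loss + weights[1] * p_draw + weights[2] * p_win
    return 100.0 * raw / max(weights)

def f_measure(outcomes_as_first, outcomes_as_second):
    """Average of the first- and second-player F-values; about 50 for a random player."""
    return 0.5 * (f_side(*outcomes_as_first) + f_side(*outcomes_as_second))

# Example: a random deterministic-TicTacToe player against another random player,
# using the (loss, draw, win) probabilities from the table above.
print(f_measure((0.288, 0.124, 0.588), (0.588, 0.124, 0.288)))   # about 50
print(m_measure(58.8, 28.8))                                     # 0 for the random baseline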
Some Results

Deterministic TicTacToe: M-measure
• c1 = c2 = 1.0, w = 1.0
• Varying swarm sizes (7 hidden units)
• Varying the number of hidden units (Von Neumann topology); M-values ± confidence interval half-widths per swarm size s:

  s     3 hidden           7 hidden           9 hidden           11 hidden          15 hidden          Averages
  15    31.05670±0.00198   31.31300±0.00198   31.91870±0.00197   32.29270±0.00196   29.33330±0.00198   30.43473±0.00198
  20    29.14400±0.00199   34.84900±0.00195   33.33230±0.00196   33.44530±0.00195   32.03730±0.00196   32.05936±0.00197
  25    30.07170±0.00198   32.63100±0.00196   32.21170±0.00195   31.04230±0.00197   31.06070±0.00198   31.50577±0.00197
  30    28.56370±0.00201   31.69570±0.00198   28.65230±0.00200   31.56730±0.00197   26.56270±0.00202   29.57753±0.00199
  35    32.63000±0.00196   27.64570±0.00200   27.96570±0.00200   29.15730±0.00199   28.48070±0.00201   29.80906±0.00199
  40    29.39830±0.00199   30.28770±0.00198   27.30300±0.00200   29.94200±0.00199   32.70270±0.00196   29.70453±0.00199
  45    29.97770±0.00198   29.61200±0.00199   29.77130±0.00199   29.71600±0.00200   27.84630±0.00201   29.05300±0.00199
  50    24.87400±0.00201   30.99930±0.00199   30.71830±0.00199   28.92130±0.00200   26.44130±0.00201   28.66474±0.00200
  Avg   29.46451±0.00199   31.12918±0.00198   30.23416±0.00198   30.44489±0.00198   29.30813±0.00199

• Comparing neighborhood structures

Illustrating the arms race effect

Checkers: F-measure
• c1 = c2 = 1.0, w = 1.0
• Varying swarm size (5 hidden units)
• Varying number of hidden units
• Comparing neighborhood structures
• Influence of velocity clamping
• Influence of c1 and c2 (Vmax = 0.2)
• Influence of inertia weight
• Performance based on: Vmax = 0.2, c1 = c2 = 1.0, w = 0.9
• Using sliding windows (15 particles, 5 hidden units):

  Window dimensions   Total inputs   Performance (F-value)
  8-by-8              32             72.524
  7-by-7              98             76.879
  6-by-6              162            73.331
  5-by-5              200            75.093
  4-by-4              200            73.408
  3-by-3              162            74.925
  2-by-2              98             76.092

• Restrictions on the number of game moves

Comparing Checkers against other approaches
• Piece count evaluation function:
  value = (µ × pieceCountAdvantage) + pieceCount + η
  where
  µ biases the piece count advantage
  pieceCountAdvantage is the difference in piece counts
  η is a random tie-breaker
• SmartEval – Martin Fierz:
  – determines whether a back rank guard is available
  – determines if an intact double corner exists
  – evaluates control of the central area of the board
  – brings the number of pieces on the edges of the board into consideration
  – calculates material advantage

Bao:
• Hand-crafted evaluation function:
  – If a winning state for the North player, return 100000.
  – If a winning state for the South player, return -100000.
  – Initialize the evaluation value to 0.
  – For each kete possessed by the North player, add 1 to the evaluation value.
  – For each kete possessed by the South player, subtract 1.
  – Add (44 − number of kete in the front row of the South player).
  – Subtract (44 − number of kete in the front row of the North player).
  – Return the evaluation value.
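The listing above translates directly into code. In the sketch below, the board representation (per-player kete totals, front-row kete counts, and win flags) is an assumed structure for illustration only; the slides specify just the scoring steps themselves.

# Direct transcription of the hand-crafted Bao evaluation function listed above.
def bao_evaluation(board):
    """Static evaluation of a Bao position from the North player's viewpoint."""
    if board["north_wins"]:
        return 100000
    if board["south_wins"]:
        return -100000

    value = 0
    value += board["north_kete"]                    # +1 per kete owned by North
    value -= board["south_kete"]                    # -1 per kete owned by South
    value += 44 - board["south_front_row_kete"]     # reward depleting South's front row
    value -= 44 - board["north_front_row_kete"]     # penalise a depleted North front row
    return value

# Example position (illustrative numbers only):
example = {"north_wins": False, "south_wins": False,
           "north_kete": 30, "south_kete": 28,
           "north_front_row_kete": 18, "south_front_row_kete": 14}
print(bao_evaluation(example))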
• Different models – competition against:
  – Model A: the competition pool
  – Model B: the competition pool and random agents
  – Model C: the competition pool with an expert player added
  – Model D: the competition pool with an expert player added after 250 iterations

• Performance, as the percentage of games won against opponents, using gbest PSO

• Models A & B against a random player:
  [Figure: percentage of games won by Models A and B against a random player]

• Models against random players:

  Model   Ply 1   Ply 2   Ply 3
  B       75%     87%     92%
  C       72%     82%     86%
  D       78%     85%     92%

• Model B against the expert:
  [Table: percentage of games won by the game agent, playing South, North, and on average, at ply-depths 1–3, against the expert player at ply-depths 1–3]

• Model C against the expert:
  [Table: percentage of games won by the game agent, playing South, North, and on average, at ply-depths 1–3, against the expert player at ply-depths 1–4]

• Model D against the expert:
  [Table: percentage of games won by the game agent, playing South, North, and on average, at ply-depths 1–3, against the expert player at ply-depths 1–4]

Final Remarks

Have shown that a PSO-based approach can train game playing agents in a coevolutionary manner.

The approach has been applied successfully to the games: deterministic TicTacToe, probabilistic TicTacToe, Checkers, Bao, and the IPD.

Issues:
• Computational complexity
• Stagnating behavior
• How well does the approach compare against other approaches?

Future plans:
• Analyze games to identify playing strategies
• Further improve performance by addressing stagnating behavior
• Develop a system to play and learn against human players
• Include niching strategies to evolve multiple playing strategies
• Improve performance by using improved PSO algorithms
• Investigate mechanisms to reduce computational complexity
• Perform more elaborate evaluations against other game agents
• Extend to more complex probabilistic games
• Evolve different strategies for the different phases of Bao