Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games

Richard Mealing and Jonathan L. Shapiro
{mealingr,jls}@cs.man.ac.uk
Machine Learning and Optimisation Group, School of Computer Science, University of Manchester, UK


The Problem

- You play against an opponent
- The opponent's actions are based on previous actions
- How can you maximise your reward?
- Applications: heads-up poker, auctions, P2P networking, path finding, etc.


Possible Approaches

- You could use reinforcement learning to learn to take actions with high expected discounted rewards
- However, we propose to:
  - Model the opponent using sequence prediction methods
  - Look ahead and take actions which, probabilistically according to the opponent model, lead to the highest reward
- Which approach gives us the highest rewards?


Opponent Modelling using Sequence Prediction

- Observe the opponent's action and the player's action $(a_{\mathrm{opp}}, a)$
- Form a sequence over time $t$ (memory size $n$):
  $(a_{\mathrm{opp}}^{t}, a^{t}), (a_{\mathrm{opp}}^{t-1}, a^{t-1}), \ldots, (a_{\mathrm{opp}}^{t-n+1}, a^{t-n+1})$
- Predict the opponent's next action based on this sequence:
  $\Pr\left(a_{\mathrm{opp}}^{t+1} \mid (a_{\mathrm{opp}}^{t}, a^{t}), (a_{\mathrm{opp}}^{t-1}, a^{t-1}), \ldots, (a_{\mathrm{opp}}^{t-n+1}, a^{t-n+1})\right)$


Sequence Prediction Methods

We tested a variety of sequence prediction methods:
- Unbounded contexts: Lempel-Ziv-1978 (LZ78) [1], Knuth-Morris-Pratt (KMP) [2]
- Context blending: Prediction by Partial Matching C (PPMC) [3], ActiveLeZi [4]
- Context pruning: Transition Directed Acyclic Graph (TDAG) [5], Entropy Learned Pruned Hypothesis Space (ELPH) [6]
- Collection of 1- to N-Grams: N-Gram [7], Hierarchical N-Gram (H. N-Gram) [7]
- Implicit blending and pruning: Long Short-Term Memory (LSTM) [8]


Sequence Prediction Method Lookahead

- Predict with lookahead $k$ given a hypothesised context, i.e.
  $\Pr\left(a_{\mathrm{opp}}^{t+k} \mid (a_{\mathrm{opp}}^{t+k-1}, a^{t+k-1}), (a_{\mathrm{opp}}^{t+k-2}, a^{t+k-2}), \ldots, (a_{\mathrm{opp}}^{t+k-n}, a^{t+k-n})\right)$
- A hypothesised context may contain unobserved (predicted) symbols


Reinforcement Learning: Q-Learning

- Learns an action-value function that, given a state-action pair $(s, a)$, outputs the expected value of taking that action in that state and following a fixed strategy thereafter [9]
- Update rule, with state $s^{t}$, action $a^{t}$, reward $r^{t}$, learning rate $\alpha$ and discount $\gamma$:
  $Q(s^{t}, a^{t}) \leftarrow (1 - \alpha)\,Q(s^{t}, a^{t}) + \alpha\left[r^{t} + \gamma \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\right]$
  (the first term keeps a fraction of the old value; the second adds a fraction of the reward plus the maximum value of the next action)
- Select actions with high Q-values, with some exploration (sketched below)
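To make the update concrete, here is a minimal sketch of a tabular epsilon-greedy Q-learner in Python. It is illustrative only, not the implementation used in the experiments; the class name, the default hyperparameter values and the state encoding are assumptions.

```python
import random
from collections import defaultdict

class QLearner:
    """Minimal tabular Q-learner with epsilon-greedy action selection.
    Hyperparameter defaults are illustrative assumptions, not the paper's settings."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions          # list of available actions
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor
        self.epsilon = epsilon          # exploration rate
        self.q = defaultdict(float)     # Q[(state, action)] -> estimated value

    def select_action(self, state):
        # Explore with probability epsilon, otherwise pick the greedy action.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma * max_a' Q(s',a')]
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        self.q[(state, action)] = ((1 - self.alpha) * self.q[(state, action)]
                                   + self.alpha * (reward + self.gamma * best_next))
```

In the iterated games considered here, the state could simply be the opponent's last action, which matches the assumption used later when discussing Q-learning's implicit lookahead.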
Need for Lookahead (Prisoner's Dilemma Example)

Payoff matrix:
        D      C
  D    1,1    0,4
  C    4,0    3,3

- Defect (D) is the dominant action
- Cooperate-Cooperate is socially optimal (highest sum of rewards)
- Tit-for-tat (copy the opponent's last move) is good for iterated play
- Can we learn tit-for-tat?

Assume the opponent copies the player's last move (i.e. plays tit-for-tat).

[Figure: one-step lookahead tree. Against a predicted C, playing D yields 4 and C yields 3; against a predicted D, playing D yields 1 and C yields 0.]

- A lookahead of 1 shows that D has the highest reward
- With a lookahead of 2, the sequence (D,C,D,C) would have the highest total reward, but it is unlikely to occur

[Figure: two-step lookahead tree against tit-for-tat, with cumulative rewards 5, 4, 7, 6, 2, 1, 4, 3 at the leaves; the best total (7) is reached by playing C first and then D.]

- A lookahead of 2 against tit-for-tat shows that C has the highest reward


Q-Learning's Implicit Lookahead

- Recall the update rule:
  $Q(s^{t}, a^{t}) \leftarrow (1 - \alpha)\,Q(s^{t}, a^{t}) + \alpha\left[r^{t} + \gamma \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\right]$
- Assume each state is an opponent action, i.e. $s = a_{\mathrm{opp}}$
- Q-learning then learns (player action, opponent action) values as:
  - $\gamma = 0$: the payoff matrix ($\arg\max_{a} Q(a_{\mathrm{opp}}^{t+1}, a)$ is the same as a reward-maximising lookahead of 1)
  - $0 < \gamma < 1$: the payoff matrix plus future rewards with exponential decay
  - $\gamma = 1$: the payoff matrix plus future rewards
- Increasing $\gamma$ increases lookahead


Exhaustive Explicit Lookahead

- We use exhaustive explicit lookahead with the opponent model and action values to greedily select actions (to a limited depth) that maximise total reward (see the sketch below)

[Figure: example of an exhaustive lookahead tree expanding the player's actions (D/C) and the model's predicted opponent responses, with cumulative rewards at the leaves.]
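The exhaustive explicit lookahead just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: it pairs a simple frequency-count n-gram opponent model (a stand-in for the sequence predictors listed earlier) with a depth-limited search that weights each of the player's actions by the model's predicted opponent replies. The class and function names, and the payoff dictionary convention, are assumptions.

```python
from collections import defaultdict, Counter

class NGramOpponentModel:
    """Predicts Pr(opponent action | last n joint actions) from observed play.
    An illustrative stand-in for the paper's sequence predictors (LZ78, PPMC, ...)."""

    def __init__(self, n, opp_actions):
        self.n = n
        self.opp_actions = opp_actions
        self.counts = defaultdict(Counter)   # context -> counts of the opponent's next action
        self.history = []                    # list of (opponent action, own action) pairs

    def observe(self, opp_action, my_action):
        context = tuple(self.history[-self.n:])
        self.counts[context][opp_action] += 1
        self.history.append((opp_action, my_action))

    def predict(self, context):
        counter = self.counts[tuple(context[-self.n:])]
        total = sum(counter.values())
        if total == 0:                       # unseen context: assume a uniform distribution
            p = 1.0 / len(self.opp_actions)
            return {a: p for a in self.opp_actions}
        return {a: counter[a] / total for a in self.opp_actions}


def lookahead(model, payoff, context, my_actions, depth):
    """Exhaustive depth-limited lookahead: returns (best expected total reward,
    best first action), expanding every own action and weighting opponent
    replies by the model's predictions. `payoff[(my, opp)]` is the player's reward
    (a convention assumed for this sketch); `context` is a list of joint actions."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for my_a in my_actions:
        value = 0.0
        for opp_a, prob in model.predict(context).items():
            immediate = payoff[(my_a, opp_a)]
            # Hypothesised context: append the predicted joint action and recurse.
            future, _ = lookahead(model, payoff, context + [(opp_a, my_a)],
                                  my_actions, depth - 1)
            value += prob * (immediate + future)
        if value > best_value:
            best_value, best_action = value, my_a
    return best_value, best_action

# Hypothetical usage: after observing play, choose the next move with
#   value, action = lookahead(model, payoff, model.history, ["C", "D"], depth=2)
```

With the Prisoner's Dilemma rewards from the slide (4 for defecting against a cooperator, 3 for mutual cooperation, 1 for mutual defection, 0 for cooperating against a defector) and a model that has learned that the opponent copies the player's last move, a depth of 1 selects D while a depth of 2 selects C, matching the lookahead trees above.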
Experiments

- Iterated Rock-Paper-Scissors: the opponent's actions depend on its own previous actions
- Iterated Prisoner's Dilemma: the opponent's actions depend on both players' previous actions
- Littman's Soccer [10]: direct competition
- Which approach has better performance?

Rock-Paper-Scissors payoff matrix:
        R      P      S
  R    0,0    1,-1   -1,1
  P   -1,1    0,0    1,-1
  S    1,-1   -1,1   0,0

Prisoner's Dilemma payoff matrix:
        D      C
  D    1,1    0,4
  C    4,0    3,3


Iterated Rock-Paper-Scissors

The opponent plays a repeating pattern: {R,R,R,P,P,P,S,S,S} (order 3), {R,R,P,P,S,S} (order 2) or {R,P,S} (order 1); each agent uses a memory size of 1, 2 or 3.

Memory Size 1 against the order-3 pattern {R,R,R,P,P,P,S,S,S}:
  Name          Avg Payoff       Avg Time
  ELPH          0.666 ± 0.0003   56 ± 4
  PGA-APP       0.652 ± 0.005    62 ± 4
  WoLF-PHC      0.646 ± 0.004    71 ± 4
  ε Q-Learner   0.582 ± 0.008    48 ± 6
  WPL           0.393 ± 0.008    139 ± 7

[Table grid: average payoff and average time for each agent over the remaining memory-size and pattern-order combinations, with agents ordered from best to worst.]

- Agents cannot learn a best response when their memory size is smaller than the opponent model's order (illustrated by the sketch below)
- Our approach gains the highest payoffs, generally at the fastest rates
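The memory-size effect can be illustrated with a toy simulation (not the paper's experimental setup): a frequency-based predictor plays against an opponent that cycles through one of the patterns above. The function name, constants and round count are assumptions, and the exact numbers depend on the run length.

```python
from collections import Counter, defaultdict
from itertools import cycle

BEATS = {"R": "P", "P": "S", "S": "R"}  # the move that beats each opponent move

def play_vs_pattern(pattern, memory, rounds=9000):
    """Fraction of rounds won by a predictor that uses the last `memory`
    opponent moves as context, against an opponent cycling through `pattern`."""
    counts = defaultdict(Counter)       # context -> counts of the opponent's next move
    history, wins, opponent = [], 0, cycle(pattern)
    for _ in range(rounds):
        context = tuple(history[-memory:])
        guess = counts[context].most_common(1)
        my_move = BEATS[guess[0][0]] if guess else "R"   # default before any data
        opp_move = next(opponent)
        if my_move == BEATS[opp_move]:                   # did we beat the opponent?
            wins += 1
        counts[context][opp_move] += 1
        history.append(opp_move)
    return wins / rounds

# A memory shorter than the pattern's order leaves ambiguous contexts, capping the
# win rate; enough memory lets the predictor lock onto the pattern.
print(play_vs_pattern("RRRPPPSSS", memory=1))  # roughly 2/3
print(play_vs_pattern("RRRPPPSSS", memory=3))  # close to 1
```

In this toy setup the win rate with memory 1 against the order-3 pattern settles around two-thirds, while memory 3 exploits the pattern almost perfectly, mirroring the trend in the tables.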
Iterated Prisoner's Dilemma

Memory Size 1, Discount = 0, Depth = 1:
  Name          Avg Payoff      Avg Time   Position
  PGA-APP       2.03 ± 0.01     30 ± 3     13
  ε Q-Learner   1.94 ± 0.01     30 ± 4     16
  WPL           1.932 ± 0.007   20 ± 1     17
  TDAG          1.93 ± 0.01     30 ± 2     16
  WoLF-PHC      1.89 ± 0.01     20 ± 2     18

Memory Size 1, Discount = 0.99, Depth = 2:
  Name               Avg Payoff      Avg Time   Position
  ε Q-Learner        2.68 ± 0.01     180 ± 5    1
  TDAG + Q-Learner   2.63 ± 0.01     60 ± 4     1
  TDAG               2.607 ± 0.008   20 ± 1     1
  WPL                2.31 ± 0.01     30 ± 4     12
  PGA-APP            2.17 ± 0.02     30 ± 3     13
  WoLF-PHC           2.1 ± 0.02      40 ± 5     13

Memory Size 2, Discount = 0, Depth = 1:
  Name          Avg Payoff      Avg Time   Position
  PGA-APP       2.01 ± 0.01     30 ± 4     14
  WPL           1.949 ± 0.008   20 ± 1     17
  WoLF-PHC      1.92 ± 0.01     30 ± 4     17
  TDAG          1.902 ± 0.008   20 ± 2     16
  ε Q-Learner   1.822 ± 0.007   20 ± 2     18

Memory Size 2, Discount = 0.99, Depth = 2:
  Name               Avg Payoff      Avg Time   Position
  TDAG + Q-Learner   2.828 ± 0.009   120 ± 6    1
  ε Q-Learner        2.74 ± 0.01     180 ± 5    1
  TDAG               2.72 ± 0.01     20 ± 1     1
  WPL                2.34 ± 0.01     40 ± 4     12
  PGA-APP            2.18 ± 0.02     40 ± 5     13
  WoLF-PHC           2.14 ± 0.01     30 ± 3     13

Memory Size 3, Discount = 0, Depth = 1:
  Name          Avg Payoff      Avg Time   Position
  ε Q-Learner   2.02 ± 0.01     30 ± 3     14
  TDAG          1.958 ± 0.008   20 ± 3     17
  WPL           1.945 ± 0.009   20 ± 3     17
  PGA-APP       1.92 ± 0.009    20 ± 2     16
  WoLF-PHC      1.773 ± 0.007   20 ± 1     18

Memory Size 3, Discount = 0.99, Depth = 2:
  Name               Avg Payoff      Avg Time   Position
  TDAG + Q-Learner   2.847 ± 0.009   130 ± 5    1
  TDAG               2.74 ± 0.01     30 ± 3     1
  ε Q-Learner        2.65 ± 0.01     170 ± 5    1
  WPL                2.32 ± 0.01     30 ± 4     12
  PGA-APP            2.18 ± 0.02     40 ± 4     12
  WoLF-PHC           2.14 ± 0.02     40 ± 4     13

- Increasing lookahead (via discounting or search depth) increases rewards
- Combining our approach with Q-learning increases rewards but also increases time
- Our approach gains the highest payoffs, generally at the fastest rates


Soccer

Average payoff of our approach with each sequence predictor against four reinforcement learning opponents (predictors ordered from best to worst):

Against ε Q-Learner:
  Name         Avg Payoff
  PPMC         0.687 ± 0.006
  LSTM         0.635 ± 0.004
  TDAG         0.63 ± 0.004
  H. N-Gram    0.628 ± 0.003
  LZ78         0.621 ± 0.004
  N-Gram       0.62 ± 0.003
  ActiveLeZi   0.618 ± 0.003
  ELPH         0.601 ± 0.004
  FP           0.536 ± 0.003
  KMP          0.524 ± 0.002

Against WoLF-PHC:
  Name         Avg Payoff
  PPMC         0.701 ± 0.006
  LSTM         0.638 ± 0.005
  FP           0.637 ± 0.004
  N-Gram       0.614 ± 0.003
  H. N-Gram    0.612 ± 0.003
  ActiveLeZi   0.606 ± 0.004
  TDAG         0.606 ± 0.004
  LZ78         0.602 ± 0.004
  ELPH         0.576 ± 0.003
  KMP          0.564 ± 0.003

Against WPL:
  Name         Avg Payoff
  PPMC         0.717 ± 0.004
  H. N-Gram    0.674 ± 0.002
  N-Gram       0.665 ± 0.001
  LSTM         0.659 ± 0.003
  TDAG         0.659 ± 0.002
  FP           0.655 ± 0.003
  LZ78         0.653 ± 0.002
  ActiveLeZi   0.651 ± 0.002
  ELPH         0.637 ± 0.002
  KMP          0.62 ± 0.002

Against PGA-APP:
  Name         Avg Payoff
  PPMC         0.648 ± 0.006
  H. N-Gram    0.608 ± 0.003
  ActiveLeZi   0.599 ± 0.004
  FP           0.593 ± 0.003
  TDAG         0.589 ± 0.004
  LSTM         0.585 ± 0.004
  N-Gram       0.582 ± 0.003
  LZ78         0.574 ± 0.003
  ELPH         0.565 ± 0.003
  KMP          0.553 ± 0.003

- Our approach wins more than 50% of the games with every predictor
- PPMC has the highest performance


Conclusions

- We proposed sequence prediction and lookahead to accurately model, and effectively respond to, opponents with memory
- Empirical results show that, given sufficient memory and lookahead, our approach outperforms reinforcement learning algorithms


Future Work

We will apply our approach to domains with:
- Larger state spaces
- Hidden information

where the challenges are:
- Deeper lookahead (e.g. sampling techniques)
- Sequence predictor configuration (e.g. one predictor per state)
References

[1] Lempel and Ziv. "Compression of Individual Sequences via Variable-Rate Coding". 1978.
[2] Byron Knoll. "Text Prediction and Classification Using String Matching". 2009.
[3] Alistair Moffat. "Implementing the PPM Data Compression Scheme". In: IEEE Transactions on Communications 38 (1990), pp. 1917–1921.
[4] Karthik Gopalratnam and Diane J. Cook. "ActiveLeZi: An Incremental Parsing Algorithm for Sequential Prediction". In: 16th Int. FLAIRS Conf. 2003, pp. 38–42.
[5] Philip Laird and Ronald Saul. "Discrete Sequence Prediction and Its Applications". In: Machine Learning 15 (1994), pp. 43–68.
[6] Jensen et al. "Non-stationary Policy Learning in 2-Player Zero-Sum Games". In: Proc. of the 20th Int. Conf. on AI. 2005, pp. 789–794.
[7] Ian Millington. Artificial Intelligence for Games. Ed. by David H. Eberly. Morgan Kaufmann, 2006. Chap. Learning, pp. 583–590.
[8] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. "Learning Precise Timing with LSTM Recurrent Networks". In: JMLR 3 (2002), pp. 115–143.
[9] C. J. C. H. Watkins. "Learning from Delayed Rewards". PhD thesis. Cambridge, 1989.
[10] Michael L. Littman. "Markov Games as a Framework for Multi-Agent Reinforcement Learning". In: Proc. of the 11th ICML. Morgan Kaufmann, 1994, pp. 157–163.