Opponent Modelling by Sequence Prediction
and Lookahead in Two-Player Games
Richard Mealing and Jonathan L. Shapiro
{mealingr,jls}@cs.man.ac.uk
Machine Learning and Optimisation Group
School of Computer Science
University of Manchester, UK
The Problem
You play against an opponent
The opponent’s actions are based on previous actions
How can you maximise your reward?
Applications
Heads-up poker
Auctions
P2P networking
Path finding
etc
Possible Approaches
You could use reinforcement learning to learn to take actions with
high expected discounted rewards
However we propose to:
Model the opponent using sequence prediction methods
Look ahead and take actions that, according to the opponent model, probabilistically lead to the highest reward
Which approach gives us the highest rewards?
Opponent Modelling using Sequence Prediction
Observe the opponent's action and the player's action $(a_{\text{opp}}, a)$
Form a sequence over time $t$ (memory size $n$):
$$(a^{t}_{\text{opp}}, a^{t}),\ (a^{t-1}_{\text{opp}}, a^{t-1}),\ \ldots,\ (a^{t-n+1}_{\text{opp}}, a^{t-n+1})$$
Predict the opponent's next action based on this sequence:
$$\Pr\left(a^{t+1}_{\text{opp}} \,\middle|\, (a^{t}_{\text{opp}}, a^{t}),\ (a^{t-1}_{\text{opp}}, a^{t-1}),\ \ldots,\ (a^{t-n+1}_{\text{opp}}, a^{t-n+1})\right)$$
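As a concrete illustration of such a model, here is a minimal sketch of a frequency-table (n-gram style) predictor over joint actions; it is our own illustrative example, not one of the methods compared in this work, and the class and method names are hypothetical.

```python
from collections import Counter, defaultdict, deque

class JointActionPredictor:
    """Minimal frequency-based sequence predictor over (opponent, player) actions."""

    def __init__(self, memory_size):
        self.n = memory_size
        self.counts = defaultdict(Counter)        # context -> counts of next opponent action
        self.history = deque(maxlen=memory_size)  # last n (opponent action, player action) pairs

    def observe(self, opp_action, own_action):
        """Record which opponent action followed the current length-n context."""
        context = tuple(self.history)
        if len(context) == self.n:
            self.counts[context][opp_action] += 1
        self.history.append((opp_action, own_action))

    def predict(self, context=None):
        """Return {opponent action: probability} for the next step."""
        context = tuple(self.history) if context is None else tuple(context)
        counter = self.counts.get(context)
        if not counter:
            return {}                             # no observations for this context yet
        total = sum(counter.values())
        return {a: c / total for a, c in counter.items()}
```

Against a tit-for-tat opponent, for instance, a memory size of 1 is enough for such a table to concentrate all probability on the player's own previous action.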
Sequence Prediction Methods
We tested a variety of sequence prediction methods...
Lempel-Ziv-1978 (LZ78) [1] and Knuth-Morris-Pratt (KMP) [2] } unbounded contexts
Prediction by Partial Matching C (PPMC) [3] and ActiveLeZi [4] } context blending
Transition Directed Acyclic Graph (TDAG) [5] and Entropy Learned Pruned Hypothesis Space (ELPH) [6] } context pruning
N-Gram [7] and Hierarchical N-Gram (H. N-Gram) [7] } collection of 1 to N-Grams
Long Short-Term Memory (LSTM) [8] } implicit blending & pruning
Sequence Prediction Method Lookahead
Predict $k$ steps ahead given a hypothesised context, i.e.
$$\Pr\left(a^{t+k}_{\text{opp}} \,\middle|\, (a^{t+k-1}_{\text{opp}}, a^{t+k-1}),\ (a^{t+k-2}_{\text{opp}}, a^{t+k-2}),\ \ldots,\ (a^{t+k-n}_{\text{opp}}, a^{t+k-n})\right)$$
A hypothesised context may contain unobserved (predicted) symbols
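A minimal sketch of this k-step rollout, assuming a predictor with a `predict(context)` method like the sketch above and a hypothesised policy for the player's own actions (both names are ours):

```python
def predict_k_ahead(predict, context, own_policy, k):
    """Distribution over the opponent's action k steps ahead, using hypothesised symbols.

    predict(context)    -> {opponent action: probability} (the opponent model)
    own_policy(context) -> the player's hypothesised action for that step
    context             -> tuple of past (opponent action, player action) pairs
    """
    for _ in range(k - 1):
        dist = predict(context)
        if not dist:
            break                                 # unseen context: stop rolling forward
        opp = max(dist, key=dist.get)             # most likely (predicted) opponent action
        own = own_policy(context)                 # hypothesised own action
        context = context[1:] + ((opp, own),)     # hypothesised context for the next step
    return predict(context)
```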
Reinforcement Learning: Q-Learning
Learns an action-value function that, given a state-action pair (s, a), outputs the expected value of taking that action in that state and following a fixed strategy thereafter [9]
$$Q(s^t, a^t) \leftarrow \underbrace{(1-\alpha)\,Q(s^t, a^t)}_{\text{fraction of old value}} + \underbrace{\alpha\left[r^t + \gamma \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\right]}_{\text{fraction of reward and next max-valued action}}$$
where $s^t$ is the state, $a^t$ the action, $r^t$ the reward, $\alpha$ the learning rate and $\gamma$ the discount
Select actions with high q-values with some exploration
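A minimal tabular sketch of this update and an ε-greedy action choice; Q is assumed to be a defaultdict(float) mapping (state, action) pairs to values, and the function names are ours:

```python
import random
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * (reward + gamma * best_next)

def select_action(Q, state, actions, epsilon=0.1):
    """Pick the highest-valued action, exploring uniformly with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

Q = defaultdict(float)   # unseen (state, action) pairs default to a value of 0
```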
Need for Lookahead (Prisoner’s Dilemma Example)
      D     C
D    1,1   0,4
C    4,0   3,3
Defect is the dominant action
Cooperate-Cooperate is socially optimal (highest sum of rewards)
Tit-for-tat (copy opponent’s last move) is good for iterated play
Can we learn tit-for-tat?
[Figure: one-step lookahead tree over the Prisoner's Dilemma matrix. The opponent model predicts C or D (Pred. C / Pred. D); against a predicted C, playing D yields 4 and C yields 3, while against a predicted D, playing D yields 1 and C yields 0]
A lookahead of 1 shows D has the highest reward
With a lookahead of 2, the sequence (D,C),(D,C) has the highest total reward, but it is unlikely to occur
So assume the opponent copies the player's last move (i.e. tit-for-tat)
[Figure: two-step lookahead tree against a tit-for-tat opponent model. The first step branches on Pred. C (D yields 4, C yields 3) and Pred. D (D yields 1, C yields 0); tit-for-tat then predicts the player's own last move, giving two-step path totals of 5, 4, 7 and 6 under Pred. C and 2, 1, 4 and 3 under Pred. D]
A lookahead of 2 against tit-for-tat shows C has the highest total reward (7, from playing C then D)
Q-Learning’s Implicit Lookahead
$$Q(s^t, a^t) \leftarrow (1-\alpha)\,Q(s^t, a^t) + \alpha\left[r^t + \gamma \max_{a^{t+1}} Q(s^{t+1}, a^{t+1})\right]$$
(state $s^t$, action $a^t$, reward $r^t$, learning rate $\alpha$, discount $\gamma$; the first term keeps a fraction of the old value, the second adds a fraction of the reward and the next maximum-valued action)
Assume each state is an opponent action, i.e. $s = a_{\text{opp}}$
Learns (player action, opponent action) values as:
$\gamma = 0$: payoff matrix ($\arg\max_a Q(a^{t+1}_{\text{opp}}, a)$ is the same as a max over a lookahead of 1)
$0 < \gamma < 1$: payoff matrix plus future rewards with exponential decay
$\gamma = 1$: payoff matrix plus future rewards
Increasing $\gamma$ increases lookahead
Exhaustive Explicit Lookahead
We use exhaustive explicit lookahead with the opponent model and action values to greedily select actions (to a limited depth) maximising total reward
[Figure: exhaustive lookahead tree for the Prisoner's Dilemma, alternating the player's actions (D/C) with the opponent model's predictions and accumulating payoffs along each path down to the search depth]
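A recursive sketch of such a search for the Prisoner's Dilemma example, assuming an opponent model `predict(context)` that returns a distribution over the opponent's next action; the function names and payoff encoding are ours, not the authors' implementation:

```python
def lookahead_value(predict, payoff, context, depth):
    """Best expected total reward over `depth` steps, and the first action achieving it.

    predict(context) -> {opponent action: probability}  (the opponent model)
    payoff(own, opp) -> the player's immediate reward
    context          -> tuple of past (opponent action, player action) pairs
    """
    if depth == 0:
        return 0.0, None
    dist = predict(context)
    best_value, best_action = float("-inf"), None
    for own in ("C", "D"):                       # the player's candidate actions
        value = 0.0
        for opp, prob in dist.items():
            # Extend the hypothesised context with the predicted joint action.
            future, _ = lookahead_value(predict, payoff,
                                        context[1:] + ((opp, own),), depth - 1)
            value += prob * (payoff(own, opp) + future)
        if value > best_value:
            best_value, best_action = value, own
    return best_value, best_action

# Example: tit-for-tat opponent model with the player's payoffs from the lookahead
# trees above (D vs C = 4, C vs C = 3, D vs D = 1, C vs D = 0).
payoffs = {("D", "D"): 1, ("D", "C"): 4, ("C", "D"): 0, ("C", "C"): 3}
tit_for_tat = lambda ctx: {ctx[-1][1]: 1.0}      # opponent copies the player's last move
print(lookahead_value(tit_for_tat, lambda own, opp: payoffs[(own, opp)],
                      context=(("C", "C"),), depth=2))
# -> (7.0, 'C'): with a lookahead of 2, cooperating first gives the highest total reward
```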
Experiments
Iterated Rock-Paper-Scissors: the opponent's actions depend on its previous actions

      R     P     S
R    0,0   1,-1  -1,1
P   -1,1   0,0   1,-1
S    1,-1  -1,1   0,0

Iterated Prisoner's Dilemma: the opponent's actions depend on both players' previous actions

      D     C
D    1,1   0,4
C    4,0   3,3

Littman's Soccer [10]: direct competition

Which approach has better performance?
Iterated Rock-Paper-Scissors

[Results: a 3 × 3 grid of tables crossing agent memory sizes 1-3 with opponents that repeat the patterns {R,P,S} (order 1), {R,R,P,P,S,S} (order 2) and {R,R,R,P,P,P,S,S,S} (order 3); each table lists Avg Payoff and Avg Time for ELPH (our approach), ɛ Q-Learner, WoLF-PHC, PGA-APP and WPL, ordered from good to bad]
Agents cannot learn a best response when their memory size is less than the opponent pattern's order
Our approach gains the highest payoffs, generally at the fastest rates
Iterated Prisoner's Dilemma

Memory Size 1, Discount = 0 and Depth = 1
Name               Avg Payoff       Avg Time   Position
PGA-APP            2.03 ± 0.01      30 ± 3     13
ɛ Q-Learner        1.94 ± 0.01      30 ± 4     16
WPL                1.932 ± 0.007    20 ± 1     17
TDAG               1.93 ± 0.01      30 ± 2     16
WoLF-PHC           1.89 ± 0.01      20 ± 2     18

Memory Size 1, Discount = 0.99 and Depth = 2
Name               Avg Payoff       Avg Time   Position
ɛ Q-Learner        2.68 ± 0.01      180 ± 5    1
TDAG + Q-Learner   2.63 ± 0.01      60 ± 4     1
TDAG               2.607 ± 0.008    20 ± 1     1
WPL                2.31 ± 0.01      30 ± 4     12
PGA-APP            2.17 ± 0.02      30 ± 3     13
WoLF-PHC           2.1 ± 0.02       40 ± 5     13

Memory Size 2, Discount = 0 and Depth = 1
Name               Avg Payoff       Avg Time   Position
PGA-APP            2.01 ± 0.01      30 ± 4     14
WPL                1.949 ± 0.008    20 ± 1     17
WoLF-PHC           1.92 ± 0.01      30 ± 4     17
TDAG               1.902 ± 0.008    20 ± 2     16
ɛ Q-Learner        1.822 ± 0.007    20 ± 2     18

Memory Size 2, Discount = 0.99 and Depth = 2
Name               Avg Payoff       Avg Time   Position
TDAG + Q-Learner   2.828 ± 0.009    120 ± 6    1
ɛ Q-Learner        2.74 ± 0.01      180 ± 5    1
TDAG               2.72 ± 0.01      20 ± 1     1
WPL                2.34 ± 0.01      40 ± 4     12
PGA-APP            2.18 ± 0.02      40 ± 5     13
WoLF-PHC           2.14 ± 0.01      30 ± 3     13

Memory Size 3, Discount = 0 and Depth = 1
Name               Avg Payoff       Avg Time   Position
ɛ Q-Learner        2.02 ± 0.01      30 ± 3     14
TDAG               1.958 ± 0.008    20 ± 3     17
WPL                1.945 ± 0.009    20 ± 3     17
PGA-APP            1.92 ± 0.009     20 ± 2     16
WoLF-PHC           1.773 ± 0.007    20 ± 1     18

Memory Size 3, Discount = 0.99 and Depth = 2
Name               Avg Payoff       Avg Time   Position
TDAG + Q-Learner   2.847 ± 0.009    130 ± 5    1
TDAG               2.74 ± 0.01      30 ± 3     1
ɛ Q-Learner        2.65 ± 0.01      170 ± 5    1
WPL                2.32 ± 0.01      30 ± 4     12
PGA-APP            2.18 ± 0.02      40 ± 4     12
WoLF-PHC           2.14 ± 0.02      40 ± 4     13
Increasing lookahead (discounting, search depth) increases rewards
Combining our approach with Q-Learning increases rewards further but also increases time
Our approach gains the highest payoffs, generally at the fastest rates
Soccer
ɛ Q-Learner
Name         Avg Payoff
PPMC         0.687 ± 0.006
LSTM         0.635 ± 0.004
TDAG         0.63 ± 0.004
H. N-Gram    0.628 ± 0.003
LZ78         0.621 ± 0.004
N-Gram       0.62 ± 0.003
ActiveLeZi   0.618 ± 0.003
ELPH         0.601 ± 0.004
FP           0.536 ± 0.003
KMP          0.524 ± 0.002

WoLF-PHC
Name         Avg Payoff
PPMC         0.701 ± 0.006
LSTM         0.638 ± 0.005
FP           0.637 ± 0.004
N-Gram       0.614 ± 0.003
H. N-Gram    0.612 ± 0.003
ActiveLeZi   0.606 ± 0.004
TDAG         0.606 ± 0.004
LZ78         0.602 ± 0.004
ELPH         0.576 ± 0.003
KMP          0.564 ± 0.003

WPL
Name         Avg Payoff
PPMC         0.717 ± 0.004
H. N-Gram    0.674 ± 0.002
N-Gram       0.665 ± 0.001
LSTM         0.659 ± 0.003
TDAG         0.659 ± 0.002
FP           0.655 ± 0.003
LZ78         0.653 ± 0.002
ActiveLeZi   0.651 ± 0.002
ELPH         0.637 ± 0.002
KMP          0.62 ± 0.002

PGA-APP
Name         Avg Payoff
PPMC         0.648 ± 0.006
H. N-Gram    0.608 ± 0.003
ActiveLeZi   0.599 ± 0.004
FP           0.593 ± 0.003
TDAG         0.589 ± 0.004
LSTM         0.585 ± 0.004
N-Gram       0.582 ± 0.003
LZ78         0.574 ± 0.003
ELPH         0.565 ± 0.003
KMP          0.553 ± 0.003
Our approach wins more than 50% of the games with every predictor
PPMC achieves the highest performance against every opponent
Conclusions
We proposed sequence prediction and lookahead to accurately model
and effectively respond to opponents with memory
Empirical results show that, given sufficient memory and lookahead, our approach outperforms reinforcement learning algorithms
Future Work
Will apply our approach to domains with:
Larger state spaces
Hidden information
Where the challenges are:
Deeper lookahead (e.g. sampling techniques)
Sequence predictor configuration (e.g. 1 predictor per state)
References
[1] Jacob Ziv and Abraham Lempel. "Compression of Individual Sequences via Variable-Rate Coding". In: IEEE Transactions on Information Theory 24.5 (1978), pp. 530-536.
[2] Byron Knoll. Text Prediction and Classification Using String Matching. 2009.
[3] Alistair Moffat. "Implementing the PPM Data Compression Scheme". In: IEEE Transactions on Communications 38 (1990), pp. 1917-1921.
[4] Karthik Gopalratnam and Diane J. Cook. "ActiveLeZi: An incremental parsing algorithm for sequential prediction". In: 16th Int. FLAIRS Conf. 2003, pp. 38-42.
[5] Philip Laird and Ronald Saul. "Discrete Sequence Prediction and Its Applications". In: Machine Learning 15 (1994), pp. 43-68.
[6] Jensen et al. "Non-stationary policy learning in 2-player zero sum games". In: Proc. of the 20th Int. Conf. on AI. 2005, pp. 789-794.
[7] Ian Millington. Artificial Intelligence for Games. Ed. by David H. Eberly. Morgan Kaufmann, 2006. Chap. Learning, pp. 583-590.
[8] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. "Learning Precise Timing with LSTM Recurrent Networks". In: JMLR 3 (2002), pp. 115-143.
[9] C. J. C. H. Watkins. "Learning from delayed rewards". PhD thesis. Cambridge, 1989.
[10] Michael L. Littman. "Markov games as a framework for multi-agent reinforcement learning". In: Proc. of the 11th ICML. Morgan Kaufmann, 1994, pp. 157-163.