Combining Opponent Modeling and
Model-Based Reinforcement Learning in a
Two-Player Competitive Game
Brian Collins
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2007
Abstract
When an opponent with a stationary and stochastic policy is encountered in a two-player competitive game, model-free Reinforcement Learning (RL) techniques such as Q-learning and Sarsa(λ) can be used to learn near-optimal counter-strategies given enough time. When an agent has learned such counter-strategies against multiple diverse opponents, it is not trivial to decide which one to use when a new unidentified
opponent is encountered. Opponent modeling provides a sound method for accomplishing this in the case where a policy has already been learned against the new opponent; the policy corresponding to the most likely opponent model can be employed.
When a new opponent has never been encountered previously, an appropriate policy
may not be available. The proposed solution is to use model-based RL methods in
conjunction with separate environment and opponent models. The model-based RL
algorithms used were Dyna-Q and value iteration (VI). The environment model allows
an agent to reuse general knowledge about the game that is not tied to a specific opponent. Opponent models that are evaluated include Markov chains, Mixtures of Markov
chains, and Latent Dirichlet Allocation on Markov chains. The latter two models are
latent variable models, which make predictions for new opponents by estimating their
latent (unobserved) parameters. In some situations, I have found that this allows good
predictive models to be learned quickly for new opponents given data from previous
opponents. I show cases where these models have low predictive perplexity (high
accuracy) for novel opponents. In theory, these opponent models would enable model-based RL agents to learn best response strategies in conjunction with an environment model, but converting prediction accuracy to actual game performance is non-trivial.
This was not achieved with these methods for the domain, which is a two-player soccer
game based on a physics simulation. Model-based RL did allow for faster learning in
the game, but did not take full advantage of the opponent models. The quality of the
environment model seems to be a critical factor in this situation.
Acknowledgements
Many thanks to my supervisor Michael Rovatsos, who was extremely helpful and
supportive throughout the project. Thanks to Amos Storkey for a helpful discussion.
Thanks to my parents for funding my studies and generally being very supportive.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Brian Collins)
Table of Contents
1 Introduction                                          1
2 Related Work                                          5
3 Opponent Modeling                                     9
  3.1 Mathematical Notation                             12
      3.1.1 Multinomial Distribution                    13
      3.1.2 Dirichlet Distribution                      13
  3.2 First-Order Markov Chains                         14
  3.3 Mixtures of Markov Chains                         18
  3.4 Latent Dirichlet Allocation on Markov Chains      20
4 Reinforcement Learning                                23
  4.1 Formal Notation                                   26
  4.2 Model-Free Reinforcement Learning                 27
      4.2.1 Q-Learning                                  27
      4.2.2 One-Step Sarsa                              28
      4.2.3 Sarsa(λ)                                    29
      4.2.4 Action Selection                            30
  4.3 Model-Based Reinforcement Learning                31
      4.3.1 Dyna-Q                                      31
      4.3.2 Value Iteration                             34
5 Evaluation                                            39
  5.1 Domain - The Soccer Game                          39
  5.2 Opponents                                         42
  5.3 Opponent Model Evaluation                         44
      5.3.1 Predictive Performance                      46
  5.4 Reinforcement Learning Evaluation                 51
      5.4.1 Model-Free Reinforcement Learning Agents    54
      5.4.2 Model-Based Reinforcement Learning Agents   57
6 Conclusion                                            67
Bibliography                                            69
Chapter 1
Introduction
Consider a two-player competitive (zero-sum) game, where at each discrete time-step,
both players observe the current state of the game and independently choose from a
discrete set of actions. The subsequent state of the game will be a stochastic function of
the previous state and the actions of both players. The reward given to each player will
be a stochastic function of the same variables. Since the game is strictly competitive,
the sum of both players' rewards will be zero. In such a game, when an opponent follows a stationary stochastic policy, an agent¹ can learn an optimal policy (counter-strategy) using model-free Reinforcement Learning (RL) (Sutton and Barto, 1998).
The learned policy will be optimal in the sense that it will obtain, on average, the
maximum possible long term reward against an individual opponent. The disadvantage
of this method is that model-free RL, where the agent builds neither a model of its
environment (the game) nor its opponent, may take a very long time to converge to
the optimal policy. Additionally, when a new opponent is encountered, it is unclear
whether any particular previously learned policy is appropriate for reuse. Even though
past experience with a variety of disparate opponents may be extensive, model-free
RL does not facilitate effective reuse of this knowledge. The principal goal of this
project is to design an agent that obtains maximal total reward in short games against
unknown opponents, by effectively using past knowledge from games against other
opponents. An agent is assumed to have extensive past experience in the game. The
solution proposed and investigated in this project is to combine separate opponent
models and environment models with model-based RL. Some success was achieved in both opponent modeling and model-based RL, but the full potential of combining the methods has not been realized.

¹ The terms agent, player, and opponent are essentially interchangeable in this report. Player usually refers to a learning agent and opponent usually refers to a fixed-policy adversary.
Basic opponent modeling allows for intelligent policy reuse in model-free RL, by allowing a new opponent to be compared to past opponents. When policies have been
learned for each past opponent, an agent can employ the policy corresponding to the
opponent model which most closely matches the current opponent. Since the opponent models are probabilistic, the best opponent model will be the one that assigns
the highest probability to the observed sequences of behaviors. When a new opponent
has never been encountered previously, an appropriate policy may not be available.
The proposed solution is to use model-based RL methods in conjunction with separate environment and opponent models. The model-based RL algorithms used were
Dyna-Q (Sutton and Barto, 1998) and value iteration (VI) (Puterman and Shin, 1978;
Sutton and Barto, 1998). The environment model allows an agent to reuse general
knowledge about the game that is not tied to a specific opponent. Opponent models
that are evaluated include Markov chains, Mixtures of Markov chains (Cadez et al.,
2000), and Latent Dirichlet Allocation (LDA) on Markov chains (Girolami and Kaban, 2005). The latter two models are latent variable models, which make predictions
for new opponents by estimating their latent (unobserved) parameters. The assumption
is that differences in opponent behavior are due to differences in the underlying latent
parameters. Markov chains can represent stationary opponents that do not use hidden
internal states or an arbitrary length history of the game to make predictions. Markov
chains are only useful as predictive models against very similar opponents. Mixture (of
Markov Chain) models assume that a new opponent’s strategy is equivalent to exactly
one of the previous opponents' strategies. They are not flexible when this assumption
is violated and will not be good predictive models in this case. Latent Dirichlet Allocation (Blei et al., 2003) models assume that each opponent’s strategy is a convex combination (mixing proportions sum to one) of simpler strategy elements. In this case,
the simpler elements are Markov chains. Opponents can be identified by their mixing
proportions. If opponents’ strategies can actually be decomposed in this way, the LDA
model works well for prediction on a wide range of opponents. It is interesting to ask
when these types of opponents arise naturally, but this is still an open question. In
some situations, I have found that the latent variable methods allow good predictive
models to be learned quickly for new opponents given data from previous opponents.
I show cases where these models have low predictive perplexity (high accuracy) for
novel opponents. In theory, these opponent models would enable model-based RL
agents to learn best response strategies in conjunction with an environment model, but
converting prediction accuracy to actual game performance is non-trivial. This was not
achieved with these methods for the domain. The quality of the environment model
seems to be a critical factor in this situation. The specific game used for this project is
a two-player soccer game based on a physics simulation (Collins and Rovatsos, 2006).
It was chosen because I had done previous model-free RL work in the domain and
there was potential for improvement. In domains that are much simpler, the types of
opponent models investigated in this project would not be very useful, since stationary
policy opponents would be so simple that complex methods are not needed. I tried to
select a domain that could potentially benefit from these techniques. Because of the
complexity of the domain, some separate results were achieved in opponent modeling
and model-based RL, but no significant results were achieved in the combination of
the two methods. For model-based RL results, I found that value iteration performed
better than Dyna-Q for this domain and was more efficient.
The main contributions of this project are the following. It is a study of the application
of LDA on Markov chains to a new complex domain. To my knowledge, this is the
first report to describe the application of this method to opponent modeling in games.
Results are presented for both model-based and model-free reinforcement learning in
a game which is difficult to learn. To the best of my knowledge, this is the first study
to integrate these types of opponent models with model-based RL. The remainder of
the report is structured as follows. Chapter 2 presents a description of some related
work. Chapter 3 describes the opponent modeling methods used in a way that is largely
domain independent. Chapter 4 discusses the reinforcement learning methods, some
of which make use of the opponent modeling methods. This discussion is applicable
to a wide range of domains. Chapter 5 discusses the actual game used as a domain and
presents results in opponent modeling, model-free RL, and model-based RL. Chapter
6 concludes the dissertation and discusses possible further work in this area.
Chapter 2
Related Work
The ADHOC approach (Rovatsos and Wolf, 2002; Rovatsos et al., 2003) was developed to automatically classify opponents in the iterated prisoner's dilemma. It is a
heuristic approach which automatically decides how to classify agents and when new
classes are needed. Good results were obtained by assuming opponents could be modeled as deterministic finite automata and learning models of that form using methods
from (Carmel and Markovich, 1996). The game in this project is much more complicated, but similar opponent modeling techniques (Markov chains) can be applied.
The ADHOC approach was extended to D-ADHOC and applied to a simulated robotic soccer environment in Marín et al. (2005). This focused on predicting future locations
of players and seemed to work well. The domain for this project is simpler with only
two agents participating in a game instead of many, but I will attempt to learn more
complex opponent models.
Opponent modeling in games has been studied extensively. For example in the game
of poker, where opponent modeling is necessary to exploit highly non-optimal play of
most opponents, previous work successfully combined opponent modeling with game
tree search (Billings et al., 2004). The system was eventually able to defeat all previous
computer systems, but learning times were relatively long. Quickly matching a new
opponent with a previous model was suggested, but more analysis is needed. Outcomes
of games of poker have very large variance because of the randomness inherent in
the game. Even the best human players lose a lot of individual games due to this
randomness. An advantage of poker over the soccer domain used in this project is
that in poker, both players have an exact model of how the game is played. In the
soccer domain, an environment model must be learned. Recent work on Bayesian
opponent modeling in poker is described in (Southey et al., 2005) and is similar to this
work. In that paper, an assumption is made that opponents play stationary policies; the
same assumption made in this report. The learning agent constructs a model for a new
opponent by averaging over possible opponent strategies weighted by their degree of
belief.
Recent success in model-based reinforcement learning (RL) can be seen in (Abbeel
et al., 2006; Ng et al., 2004; Kalyanakrishnan et al., 2008). In (Abbeel et al., 2006),
the authors describe how the use of models can speed up RL significantly, even when
the models are not necessarily very accurate. In (Ng et al., 2004), model-based RL
is successfully used to learn to control a helicopter with highly stochastic dynamics.
Most similar to this project is the work in (Kalyanakrishnan et al., 2008), where model-based RL is shown to outperform model-free RL in a soccer keepaway domain. In this
domain, a team of players attempts to prevent another team from intercepting the ball.
As in this project, abstract high level actions are employed rather than low level control.
The authors found that the quality of the environment model is critically important, as
I discovered in this project. They used more sophisticated RL methods, which I did
not have time to investigate, since my focus was on combining model-based RL with
opponent modeling.
In (Wilson et al., 2007), the problem of transferring knowledge between related RL
tasks is investigated. This is very similar to the work presented here, since instances of
the same game with different opponents can be considered related tasks. The domain
investigated in the paper is much simpler than the one presented here, but they employ
an advanced Bayesian technique called Dirichlet process mixture models to model the
tasks. These mixture models can have a countably infinite number of components and
the true number is inferred from data. The models are used by averaging over all possible numbers of components weighted by their probability. This allows the true number
of tasks to be determined automatically. A related Bayesian model is the Hierarchical
Dirichlet Process (Teh et al., 2004), which can be used in the same situations as the
LDA models (Blei et al., 2003) investigated in this project. The main difference is
that they replace a Dirichlet prior on the mixing proportions with a Dirichlet process
prior, which allows the number of components to be inferred from data. It also allows
predictions to be made by averaging over all possible numbers of components. In (Teh
et al., 2004), the method is shown to perform as well as LDA for a text modeling task.
Note that multi-agent RL methods (Shoham et al., 2003) are available for use in complex domains involving multiple agents. These methods are typically used when the
other player is also learning. Simpler RL methods are appropriate for this
project, since they allow a best response to be learned against fixed policy opponents.
Chapter 3
Opponent Modeling
Opponent modeling can have many benefits in a competitive game situation. At each
discrete time-step t in a competitive stochastic game, an agent and their opponent observe the state of the game s_t and their rewards r_t^1 + r_t^2 = 0. Then they each independently select their discrete actions a_t^1 and a_t^2. The state of the game s_{t+1} at time t + 1 and the rewards r_{t+1}^1, r_{t+1}^2 depend on the previous state s_t and the actions of both players a_t^1 and a_t^2. If an agent knows the exact probability distribution that an
opponent uses to select their actions, then assuming they have perfect knowledge of
the game, they can compute a best-response strategy. This strategy will be optimal in
the sense that the agent will receive maximum long term rewards (over many trials).
More details of exactly how this can be done are presented in Chapter 4. Unfortunately, an agent will not usually have such detailed knowledge about their opponent.
Usually, an agent will have no knowledge about a new opponent’s strategy until they
begin playing a game and observing the opponent’s actions. Opponents will generally
not disclose their strategies by other means, because they can usually achieve higher
average rewards when the other player is uncertain of this information.
A game playing agent must model opponents based on observed data, if they wish
to compare opponents or compute estimated best-response strategies off-line. In this
project, only opponents with stationary policies are considered. Stationary policy opponents must use the same policy at all times and therefore cannot “learn.” Additionally, opponents are required to not maintain any hidden internal state. They must make
decisions based only on the history of the game, without changing internal variables
that are hidden from their opponent. Their behavior can still depend on their past actions, so they are allowed to maintain some internal state, but this information is also
available to their adversary. Such stationary policy opponents can still base decisions
on the entire history of the current game, which complicates the modeling process.
In this project, the Markov assumption is made that an opponent’s decisions depend
only on a finite number of past states. These restrictions are significant, but stationary
Markovian policies (without hidden states) in complex games can be very elaborate
and large quantities of observed data may be needed to learn adequate models “from
scratch.” The aim of using the methods described in this chapter is to reduce the amount
of data needed to build accurate models by using data generated in previously played
games with other opponents. This is done by considering models with latent variables
and assuming that variations in observed behavior between opponents are due to variations in their latent variables. These are variables that are assumed to exist but are
never explicitly observed. Under this type of model, a stationary opponent will have
fixed values for their latent variables, which are not explicitly communicated to an adversary and must be estimated from observed data. The intuition behind this choice is
that the latent variables can be estimated using much less data than would be needed to build new models “from scratch.” An agent using these models may therefore be
able to make accurate predictions for a variety of unknown opponents using less data.
Since the data is obtained at a constant rate over the course of a game, this implies that
an agent can have an accurate opponent model earlier in a game, which increases their
potential to exploit the model in the game.
Opponents are assumed to have stochastic policies, so that in a given situation there
will be different probabilities associated with various discrete actions. In the limit of
an infinite number of trials, the number of times an opponent selects each action in a
given state will be proportional to the probability of that action in that state. Probabilistic opponent modeling provides a mathematically sound framework for estimating an
opponent’s policy based on observed data. The estimate can then be used to compute
an estimated best-response strategy as described in the subsequent chapter. In particular, we focus on Bayesian methods (Bishop, 2006) such as Latent Dirichlet Allocation
(LDA) (Blei et al., 2003). Bayesian methods make particularly good predictions when
data is limited, by explicitly representing uncertainty in the model parameters. This is
done by giving the parameters prior distributions that reflect our prior beliefs. Once
actual data has been observed, a posterior distribution is calculated by combining the
prior distribution and the observed data via Bayes' rule (which is defined later in this chapter). Finally, to calculate the probability of future data, integration is performed over the entire parameter space of the
model using the posterior probability distribution over parameters. This allows predictions to be made by considering all possible parameter values, rather than point
estimates. The main difficulties in using Bayesian methods are computational:
computing an exact Bayesian solution is often intractable, even for problems of moderate size. For LDA, the required integral does not have an analytic solution and may
be high dimensional. We address this issue by using a variational approximation algorithm given in (Blei et al., 2003). Specifying a good prior distribution can also be
problematic, since this involves creating a formal description of actual beliefs.
Three types of models are described in this chapter – Markov chains, mixtures of
Markov chains (Cadez et al., 2000), and LDA models of Markov chains (Girolami and
Kaban, 2005). Markov chain models of order m make predictions based only on the m
most recent states of the game. Note that the state of the game includes the last action
of the opponent. Markov chains of order one are considered in this project. Latent
variable models, such as Mixture models and LDA models, make predictions based on
the history of the game and estimates of latent variables. In mixture models, a single
discrete latent variable represents the single mixture component that generated all of
the data. The number of values that this latent variable can take is referred to as the
number K of mixture components (more details are given later on how to choose this number). Opponents which are different will have different fixed values of this random variable. Therefore, a mixture model can represent
K different opponents. For mixtures of Markov models, the mixture components are
Markov models. If the value of the latent variable was known a priori, the mixture
model would be equivalent to a Markov model. Only the most recent m states of the
game would be needed to make predictions. Predictions will be conditionally independent of all past states given the most recent m states. However, since the value of the
latent variable is not known a priori, predictions will depend on the entire history of
the game, i.e. knowledge of the m most recent states will not be sufficient for any finite
m. Therefore, it is not possible to represent a mixture of Markov chains by a single
Markov chain of any order m.
In LDA models, K continuous latent variables represent proportions in which components are combined to produce the observed data. These K variables are constrained
to sum to one. LDA models can represent an infinite number of unique opponents and
are strictly more expressive than mixture models. Mixture models can be seen as LDA
models in which each of the K variables can only take the values zero and one. The
fact that the values must sum to one ensures that only K unique assignments of values
are possible. LDA models can also be viewed as a variation of mixture models, where
a mixture component used by each opponent is not fixed, but it is selected randomly at
each timestep by sampling from a multinomial distribution with unknown parameters.
A Dirichlet prior distribution is placed on these parameters. Further details of these
distributions and models are presented later in this chapter.
3.1 Mathematical Notation
Standard shorthand probability notation is used in this report (Bishop, 2006). For a
random variable x, p(x) represents the distribution over values that x can take and
∑x p(x) is shorthand for ∑y∈x p(x = y). Note that this sum over the possible values of
a variable equals unity for any probability distribution. A joint distribution of x and
y is written as p(x, y) and represents the distribution over pairs of values for the two
variables. For independent variables x and y, p(x, y) = p(x)p(y) and we say that the
distribution p(x, y) factors as p(x)p(y). Conditional distributions are written as p(x | z)
and represent the distribution over x given the value of z. If p(x, y | z) = p(x | z)p(y | z)
then we say x and y are conditionally independent given z and write this assertion as
I(x, y | z). Distributions can be conditioned on multiple variables (e.g. replace z with
α, β in the above equations). The following rules of probability (Bishop, 2006) are
used extensively in this chapter:
• The sum rule: p(x) = ∑_y p(x, y).
• The (continuous) sum rule: p(x) = ∫ p(x, y) dy.
• The product rule: p(x, y) = p(x | y) p(y) = p(y | x) p(x).
• Bayes' rule: p(x | y) = p(y | x) p(x) / p(y), which can also be stated as
  p(x | y) = p(y | x) p(x) / ∑_x p(y | x) p(x).
Often y represents observed data and x represents the parameters of a probability model. In this case, p(x) is called the prior distribution over parameters,
p(y | x) is called the likelihood of the data given the parameters, and p(x | y) is
the posterior distribution over parameters.
• Conditional forms of all of the above rules exist. These rules work in the same
way, except every term is conditioned on one or more variables. For example,
the conditional sum rule is p(x | z) = ∑y p(x, y | z).
The expected value operator E_x[f(x)] = ∑_x p(x) f(x) represents the value of a function f obtained by summing over the possible values of a random variable x. The continuous version is E_x[f(x)] = ∫ p(x) f(x) dx. The expected value of a function is also called the mean. It is possible to compute the expected value over multiple random variables, e.g. E_{x,y}[f(x, y)] = ∑_{x,y} p(x, y) f(x, y). The conditional version is defined as E_x[f(x) | z] = ∑_x p(x | z) f(x).
3.1.1 Multinomial Distribution

The multinomial distribution is a discrete distribution that represents the result of N trials with K possible outcomes in each trial. The variable m_i, for i = 1, . . . , K, represents the number of trials which resulted in outcome i. The variables are constrained as follows: m_i ∈ {0, . . . , N} and ∑_{i=1}^{K} m_i = N. The probability distribution function is:

\[
p(m_1, \dots, m_K \mid \mu_1, \dots, \mu_K, N) = \binom{N}{m_1\, m_2 \dots m_K} \prod_{k=1}^{K} \mu_k^{m_k}
\]

The parameters {µ_i} of the distribution are subject to the constraints 0 ≤ µ_i ≤ 1 and ∑_{i=1}^{K} µ_i = 1. The following term is used to normalize the equation:

\[
\binom{N}{m_1\, m_2 \dots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}
\]

and represents the number of ways to assign m_i of the N objects to each outcome i. For this project, the multinomial distribution is used to represent a choice of mixture components for mixture models and LDA models. For Markov chains, it also represents the choice of actions given the state of the game.
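As a concrete illustration (not part of the original text), the following minimal Python sketch simulates N opponent action choices from a hypothetical multinomial distribution and evaluates the log of the probability mass function above using the multinomial coefficient; the probabilities and counts are purely illustrative.

\begin{verbatim}
import numpy as np
from math import lgamma, log

def log_multinomial_pmf(counts, mu):
    """Log of p(m_1, ..., m_K | mu, N), using the normalization term above."""
    n = sum(counts)
    # log N! - sum_k log m_k!  (the log of the multinomial coefficient)
    log_coef = lgamma(n + 1) - sum(lgamma(m + 1) for m in counts)
    return log_coef + sum(m * log(p) for m, p in zip(counts, mu) if m > 0)

rng = np.random.default_rng(0)
mu = [0.5, 0.3, 0.2]                 # hypothetical action probabilities
counts = rng.multinomial(100, mu)    # N = 100 simulated opponent choices
print(counts, log_multinomial_pmf(counts, mu))
\end{verbatim}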
3.1.2 Dirichlet Distribution

The Dirichlet distribution can be viewed as a distribution over the parameters {µ_i} of a multinomial random variable. It is defined over K random variables µ_i as follows:

\[
p(\mu_1, \dots, \mu_K \mid \alpha_1, \dots, \alpha_K) = C(\alpha) \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}
\]

The variables are subject to the constraints 0 ≤ µ_i ≤ 1 and ∑_{i=1}^{K} µ_i = 1. The parameters α_i > 0 of the Dirichlet distribution are required to be positive. When the Dirichlet distribution is used as a prior over the parameters {µ_i} of a multinomial distribution, the parameters {α_i} are called hyperparameters to emphasize that they are parameters of a prior. Under a fully Bayesian solution, integration will be performed over the parameters of the underlying multinomial distribution. The function C(α) is used to normalize the distribution and is defined as follows:

\[
C(\alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
\]

where Γ(z) is the gamma function, defined as:

\[
\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt
\]

The gamma function is a generalization of the factorial function to the real numbers and Γ(z) = (z − 1)! for positive integers z. When the Dirichlet distribution is used as a prior for the multinomial distribution, for each i = 1, . . . , K, the value α_i − 1 can be interpreted as the number of observations of outcome i. See Figures 3.1, 3.2, and 3.3 for an illustration of the effect of the hyperparameters {α_i} on the distribution of {µ_i}. To express the belief that all actions are possible (have non-zero probability), a Dirichlet prior can be used with α_i > 1 for all i. Once actual outcomes have been observed, Bayes' rule can be used to compute the posterior distribution using the Dirichlet prior and the multinomial likelihood. The Dirichlet distribution is a conjugate prior for the multinomial, which means that the posterior distribution will have the same form (a Dirichlet distribution) as the prior. The posterior will be a Dirichlet distribution with parameters α_i + m_i (for all i), where m_i observations have been made of each outcome. This posterior distribution can then be used as a new prior distribution.

[Figure 3.1: Probability density (vertical axis) for a 3-component Dirichlet distribution with α_i = 0.1. In this case, it is very likely that a single µ_i value will be very high and the others will be very low. The three corners of the triangle represent the situation where one µ_i value is one and the rest are zero. In the center of the triangle, all µ_i values are equal. This figure is from (Bishop, 2006).]

[Figure 3.2: Probability density (vertical axis) for a 3-component Dirichlet distribution with α_i = 1. All allowed assignments of µ_i values are equally probable. This figure is from (Bishop, 2006).]

[Figure 3.3: Probability density (vertical axis) for a 3-component Dirichlet distribution with α_i = 10. In this case, it is unlikely that individual µ_i values will be extremely high or extremely low. This figure is from (Bishop, 2006).]
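Because of this conjugacy, the posterior update is a one-line computation. A minimal sketch (with hypothetical counts) showing that adding the observed counts m_i to the hyperparameters α_i yields the posterior, whose mean gives a smoothed estimate in which unobserved outcomes keep non-zero probability:

\begin{verbatim}
import numpy as np

alpha_prior = np.array([1.5, 1.5, 1.5])    # vague prior: all outcomes assumed possible
counts      = np.array([12, 0, 3])         # hypothetical observed counts m_i

alpha_post = alpha_prior + counts          # Dirichlet posterior parameters
post_mean  = alpha_post / alpha_post.sum() # posterior mean of the multinomial parameters

print(post_mean)  # the second outcome keeps non-zero probability despite zero observations
\end{verbatim}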
3.2 First-Order Markov Chains
For first-order Markov chains, the distribution over the next state only depends on the value of the previous state. Formally, p(x_{t+1} | x_t) = p(x_{t+1} | x_t, x_{t-1}, . . . , x_0). This assertion can also be written as I(x_{t+1}, x_{t-1}, . . . , x_0 | x_t), which additionally implies that
past values do not depend on future values. A sequence of state values at the first T + 1
discrete time steps is abbreviated as x = {x0 , . . . , xT }. The distribution over sequences
x can be factored as follows:

\[
p(x \mid \theta) = p(x_0 \mid \theta) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \theta)
\]
Where the dependencies on the model parameters θ are made explicit; θ represents the
initial state probabilities p(x0 ) and transition probabilities p(xt+1 | xt ). For simplicity,
we ignore the initial state x0 and write:
\[
p(x \mid \theta) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \theta) \tag{3.1}
\]
This corresponds to having a uniquely determined initial state. Equation 3.1 is referred
to as the likelihood of x and represents the probability that a sequence was generated
from a Markov model with parameter vector θ. This can be used, for example, to
identify an opponent by choosing the highest probability model. Note that these values
may be extremely small (e.g. exp(−20000)) for long sequences, so in practice we must store log p(x) rather than p(x) to avoid numerical problems. The values p(x_{t+1} | x_t) are stored in a conditional probability table (CPT), which has size |x|^{m+1}, where |x| is the
number of possible values of xt for each t. Note that the size of CPTs is exponential in
the order m of the Markov chain. As the order is increased, the amount of data needed
to learn all of the values grows no slower than exponentially. Large values of m can
therefore cause practical problems when used. The distribution used to predict a future
opponent action x_{t+1} based on observed data x is the following:

\[
p(x_{t+1} \mid x, \theta) = p(x_{t+1} \mid x_t, \theta) \tag{3.2}
\]
This can be used to predict the most likely action of the opponent at time t + 1, and
can also be used to generate samples from that distribution, which are useful for simulating opponent behavior off-line. For opponent modeling in two-player stochastic games, the opponent's action x_t at time t depends on the current discrete state of the game s_t in addition to their previous action x_{t-1}. This can be represented
explicitly by rewriting Equation 3.1 as:
\[
p(x \mid s_T, \dots, s_1, \theta) = \prod_{t=1}^{T} p(x_t \mid s_t, x_{t-1}, \theta)
\]
This notation becomes cumbersome when deriving the more sophisticated models,
so the explicit dependence on the game state st is dropped. For simplicity, we use
Equation 3.1, which has an implicit dependence on the state.
When the Markov chain parameters θ are defined a priori, the model is simple to
use. In practice, we do not know an opponent’s policy a priori and must estimate
it from data. The simplest way to do this is with maximum likelihood techniques,
which involve maximizing the value of the likelihood function (Equation 3.1) with
respect to the model parameters θ. Instead of maximizing the likelihood function p(x)
directly, we maximize the log likelihood L = log p(x). The result is equivalent since
log is a monotonically increasing function. The partial derivative of L is computed
with respect to each parameter of the model and set to zero to obtain the maximum
likelihood estimate. Precise details are given in (Bishop, 2006). The result is that the
transition probability estimates p(xt+1 |xt , θ) are proportional to the number of times
the transitions were observed. If a certain transition was never observed, then it will
be given zero probability by a maximum likelihood estimate. Any new sequence x
containing this transition would have zero probability, which is not desirable unless we
are absolutely certain of our model. To correct this, a maximum a posteriori (MAP)
estimate is used instead of a maximum likelihood estimate. For the MAP estimate,
we maximize the posterior likelihood of parameters using their prior distribution and
the likelihood of the data. Because p(xt+1 |xt , θ) is a separate multinomial distribution
for each xt , we can put a Dirichlet prior on each with α values greater than one. We
choose α_i = α̂ for each i, where 1 < α̂ < 2, which means that we believe we have observed less than one and more than zero instances of each transition. This means that we have a vague belief that all transitions are possible. The result of using an MAP
estimator is that each transition probability estimate is proportional to the number of
times the transition was observed plus a constant (between zero and one). This has the
consequence that all transition probabilities will be positive.
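In code, the MAP estimate described above amounts to adding a pseudo-count of α̂ − 1 to every transition count before normalizing each row of the CPT. A minimal sketch, which for simplicity conditions only on the opponent's previous action encoded as a small integer (in the actual domain the conditioning variable also includes the game state); the observed sequence is hypothetical:

\begin{verbatim}
import numpy as np

def map_transition_matrix(sequence, n_states, alpha_hat=1.5):
    """MAP estimate of first-order transition probabilities under a symmetric
    Dirichlet prior: each count starts at the pseudo-count alpha_hat - 1."""
    counts = np.full((n_states, n_states), alpha_hat - 1.0)
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        counts[prev, curr] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

seq = [0, 1, 1, 2, 0, 1, 2, 2, 0]          # hypothetical observed opponent actions
T = map_transition_matrix(seq, n_states=3)
print(T[1])   # predictive distribution over the next action given x_t = 1
\end{verbatim}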
3.3 Mixtures of Markov Chains
Mixtures of Markov chain models were proposed in (Cadez et al., 2000). In this model,
K Markov chain models are combined to produce a more complex model. A single
discrete latent variable z ∈ {1, . . . , K} represents the mixture component used by an
opponent. For each opponent, the value z is fixed, but since we don’t know the value we
must sum over it (considering the probability of each value). To compute the likelihood
of a sequence x under a mixture model with parameter vector θ, we use the following
equation. The rules of probability used in each step are shown below.
\[
\begin{aligned}
p(x \mid \theta) &= \sum_z p(x, z \mid \theta) && \text{sum rule} \\
&= \sum_z p(x \mid z, \theta)\, p(z \mid \theta) && \text{product rule} \\
&= \sum_z p(z \mid \theta) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, z, \theta) && \text{Markov property} \qquad (3.3)
\end{aligned}
\]
When a mixture model is used as an opponent model, this likelihood function can be
used to compare the likelihood of the model to others. The way it is used in this report
is to determine the model parameters θ by choosing the ones that maximize Equation
3.3. For mixtures of Markov chains, there is no closed form maximum likelihood or
MAP solution. The expectation maximization (EM) algorithm (Dempster et al., 1977)
is an iterative procedure that can compute these values. It is a very general procedure
that uses bootstrapping to improve estimates based on other estimates. This algorithm
will converge to a local maximum value, so it can be run several times with randomly
initialized parameters to choose the best result. The version of the EM algorithm used
is the one presented in Cadez et al. (2000) to compute an MAP estimate. As before,
we use the same Dirichlet prior on transition probabilities. It is helpful to define the
following values for each sequence, which represent the probability of each mixture
component for a given opponent based on the observed data.
\[
\gamma_z^t = p(x \mid z, \theta)\, p(z \mid \theta) = p(z \mid \theta) \prod_{j=1}^{t} p(x_j \mid x_{j-1}, z, \theta)
\]
This quantity can be computed efficiently using an equivalent recursive definition.
Since these variables can have very small values, we store log γ_z^t rather than γ_z^t. The recursive procedure to compute log γ_z^t is the following:

\[
\log \gamma_z^0 = \log p(z \mid \theta) \tag{3.4}
\]
\[
\log \gamma_z^t = \log p(x_t \mid x_{t-1}, z, \theta) + \log \gamma_z^{t-1} \quad \text{for } t > 0 \tag{3.5}
\]
To make predictions from a mixture model based on observed data, we write the joint
distribution of the next state xt+1 and the latent variable z given the observed history x
and apply a sequence of rules of probability. The dependence on parameter vector θ is
not shown for clarity.
\[
\begin{aligned}
p(x_{t+1} \mid x) &= \sum_z p(x_{t+1}, z \mid x) && \text{sum rule} \\
&= \sum_z p(x_{t+1} \mid z, x)\, p(z \mid x) && \text{product rule} \\
&= \sum_z p(x_{t+1} \mid z, x_t)\, p(z \mid x) && \text{ind. assumption} \\
&= \sum_z p(x_{t+1} \mid z, x_t)\, \frac{p(x \mid z)\, p(z)}{\sum_{z'} p(x \mid z')\, p(z')} && \text{Bayes' rule} \\
&= \frac{\sum_z p(x_{t+1} \mid z, x_t)\, p(x \mid z)\, p(z)}{\sum_{z'} p(x \mid z')\, p(z')} && \text{simplification} \\
&= \frac{\sum_z p(x_{t+1} \mid z, x_t)\, \gamma_z^t}{\sum_{z'} \gamma_{z'}^t} && \text{definition of } \gamma_z^t \qquad (3.6)
\end{aligned}
\]
This predictive distribution is not computationally expensive to use, since the log γ_z^t values are computed recursively. The remaining task is to compute the values γ_z^t / ∑_{z'} γ_{z'}^t from the log γ_z^t values while avoiding numerical problems. We adopt the following solution:

\[
\frac{\gamma_z^t}{\sum_{z'} \gamma_{z'}^t}
= \frac{\exp(\log \gamma_z^t)}{\sum_{z'} \exp(\log \gamma_{z'}^t)}
= \frac{\exp(\log \gamma_z^t - \max_{z'} \log \gamma_{z'}^t)}{\sum_{z'} \exp(\log \gamma_{z'}^t - \max_{z'} \log \gamma_{z'}^t)}
\]

where max_{z'} log γ_{z'}^t represents the maximum of log γ_{z'}^t over all possible z'. This solution prevents division by a very small number, since the denominator will be greater than or equal to one. By definition one of the log γ_{z'}^t values must be the maximum, so a term in the denominator will equal exp(max_{z'} log γ_{z'}^t − max_{z'} log γ_{z'}^t) = 1. If exp(log γ_z^t) / ∑_{z'} exp(log γ_{z'}^t) is computed directly, the denominator may be too small to represent using standard double-precision floating-point numbers.
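The following sketch (illustrative only, with randomly generated stand-ins for the learned components) implements the recursion of Equations 3.4–3.5 and the predictive distribution of Equation 3.6, using the max-subtraction trick above to turn the log γ_z^t values into normalized weights:

\begin{verbatim}
import numpy as np

def update_log_gamma(log_gamma, trans, x_prev, x_curr):
    """Recursive update of Equations 3.4-3.5 after observing one transition."""
    return log_gamma + np.log([trans[k][x_prev, x_curr] for k in range(len(trans))])

def mixture_predict(log_gamma, trans, x_t):
    """Predictive distribution of Equation 3.6: a weighted average of the component
    predictions, with weights obtained from log gamma via max-subtraction."""
    w = np.exp(log_gamma - log_gamma.max())   # unnormalized responsibilities
    w /= w.sum()
    return sum(w_k * trans[k][x_t] for k, w_k in enumerate(w))

K, S = 2, 3
rng = np.random.default_rng(1)
trans = [rng.dirichlet(np.ones(S), size=S) for _ in range(K)]  # stand-in components
log_gamma = np.log(np.full(K, 1.0 / K))                        # log p(z | theta)
log_gamma = update_log_gamma(log_gamma, trans, x_prev=0, x_curr=2)
print(mixture_predict(log_gamma, trans, x_t=2))
\end{verbatim}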
3.4 Latent Dirichlet Allocation on Markov Chains
LDA was first proposed in (Blei et al., 2003) with applications in document modeling.
In (Girolami and Kaban, 2005), LDA on Markov chains was described and used to
model navigation patterns on websites. In that paper, it was compared to Mixtures of
Markov chains (Cadez et al., 2000) and found to have superior predictive performance
on real-world (non-synthetic) website navigation data. In this project, we use the LDA
on Markov chain model as an opponent model. Like a mixture model, there are K components, but unlike a mixture model a single component is not responsible for generating all of the data. Instead, a component is selected at each discrete time by sampling
from a multinomial distribution with fixed latent parameters λ = {λ1 , . . . , λK }, which
are called the mixing proportions. The probability distribution of actions at each time
is written as follows, where T = {T1 , . . . , TK } is a set of the transition probability matrices for each component.
\[
p(x_t \mid x_{t-1}, T, \lambda) = \sum_{k=1}^{K} p(x_t \mid x_{t-1}, T_k)\, \lambda_k
\]
Since we don’t know the mixing proportions λ for opponents, we adopt a Bayesian
solution by placing a prior Dirichlet distribution with parameter vector α on the multinomial parameters and integrating over all possible λ values. The likelihood of data
sequence x under this model is the following:

\[
\begin{aligned}
p(x \mid T, \alpha) &= \int p(x, \lambda \mid T, \alpha)\, d\lambda && \text{sum rule} \\
&= \int p(x \mid T, \lambda)\, p(\lambda \mid \alpha)\, d\lambda && \text{product rule, } I(x, \alpha \mid \lambda) \\
&= \int \left[ \prod_{t=1}^{T} p(x_t \mid x_{t-1}, T, \lambda) \right] p(\lambda \mid \alpha)\, d\lambda && \text{Markov property} \\
&= \int \left[ \prod_{t=1}^{T} \left\{ \sum_{k=1}^{K} p(x_t \mid x_{t-1}, T_k)\, \lambda_k \right\} \right] p(\lambda \mid \alpha)\, d\lambda && \text{by definition} \qquad (3.7)
\end{aligned}
\]
Unfortunately, this integral does not have a closed form solution as noted in (Girolami
and Kaban, 2005) and numerical integration over the K dimensional parameter space
is intractable for moderately large K. Several approximation methods are available,
and we use the variational approximation scheme from (Blei et al., 2003; Girolami
and Kaban, 2005). This involves assuming a model has a simpler form and finding
the closest model of that type to the actual model. In this scheme, we replace the
above distribution with a lower bound. A family of lower bound functions is defined
with additional parameters called free parameters and an optimization procedure is
used to find the tightest lower bound. The amount of computation done under the
variational inference scheme grows linearly, rather than exponentially, as the number
of components K is increased. Variational methods are deterministic and can be fast
compared to sampling methods such as Markov Chain Monte Carlo (MCMC) methods
(Bishop, 2006). The advantage of MCMC over variational methods is that in the limit
of an infinite number of samples, the estimates converge to the exact Bayesian solution.
Variational approximations are fast to compute but will usually not result in an exact
solution.
The following shows how to use an LDA model to calculate a predictive distribution
given the observed history x.
\[
\begin{aligned}
p(x_{t+1} \mid x, T, \alpha) &= \int p(x_{t+1}, \lambda \mid x, T, \alpha)\, d\lambda && \text{sum rule} \\
&= \int p(\lambda \mid x, T, \alpha)\, p(x_{t+1} \mid x, T, \lambda, \alpha)\, d\lambda && \text{product rule} \\
&= \int p(\lambda \mid x, T, \alpha)\, p(x_{t+1} \mid x_t, T, \lambda)\, d\lambda && \text{Markov property, } I(x, \alpha \mid \lambda) \\
&= \int \left[ \sum_{k=1}^{K} p(x_{t+1} \mid x_t, T_k)\, \lambda_k \right] p(\lambda \mid x, T, \alpha)\, d\lambda && \text{by definition} \\
&= \sum_{k=1}^{K} p(x_{t+1} \mid x_t, T_k) \int \lambda_k\, p(\lambda \mid x, T, \alpha)\, d\lambda && \text{simplification} \\
&= \sum_{k=1}^{K} p(x_{t+1} \mid x_t, T_k)\, E_\lambda[\lambda_k \mid x, T, \alpha] && \text{by definition of } E \qquad (3.8)
\end{aligned}
\]
The posterior distribution p(λ | x, T, α) over the mixing parameters λ can be computed
via Bayes’ rule, but it is intractable to compute the expectation with respect to this
distribution. Instead we compute the expectation over a variational approximation to
the true distribution. The variational distribution is represented by q(λ).
\[
E_\lambda[\lambda_k \mid x, T, \alpha] = \int \lambda_k\, p(\lambda \mid x, T, \alpha)\, d\lambda
\approx \int \lambda_k\, q(\lambda \mid \gamma_n, T, \alpha)\, d\lambda = \frac{\gamma_{kn}}{\sum_{k'} \gamma_{k'n}}
\]
The value γkn defines the unnormalized mixing proportion for component k for the
sequence denoted by n. It is computed iteratively as follows (Girolami and Kaban,
2005), when the transition matrices are known:

\[
\gamma_{kn}^{i+1} = \alpha_k + \exp\psi(\gamma_{kn}^{i}) \sum_{s} \sum_{s'} N(s' \mid s, n)\, \frac{p(s' \mid s, T_k)}{\sum_{k'} p(s' \mid s, T_{k'}) \exp\psi(\gamma_{k'n}^{i})} \tag{3.9}
\]
Where N(s' | s, n) represents the number of times the transition from state s to s' was observed for sequence n. ψ(z) = (d/dz) log Γ(z) is the digamma function, which equals
the first derivative of the log gamma function. This arises when differentiating the log
of the likelihood function, which contains occurrences of the gamma function due to
the Dirichlet distribution over λ. The computation of the transition matrices and the
parameters α of the Dirichlet prior are described in (Girolami and Kaban, 2005).
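A sketch of the fixed-point iteration in Equation 3.9 for a single sequence, assuming the transition matrices T_k and the hyperparameters α are already known (their estimation follows (Girolami and Kaban, 2005) and is not shown here); the array shapes, random stand-in data, and fixed iteration count are implementation choices for illustration only.

\begin{verbatim}
import numpy as np
from scipy.special import digamma

def update_gamma(gamma, alpha, trans, counts, n_iters=50):
    """Fixed-point iteration of Equation 3.9 for one observed sequence n.
    gamma:  (K,)     current variational parameters gamma_kn
    alpha:  (K,)     Dirichlet hyperparameters
    trans:  (K,S,S)  trans[k, s, s2] = p(s2 | s, T_k)
    counts: (S,S)    counts[s, s2]   = N(s2 | s, n)"""
    for _ in range(n_iters):
        e_psi = np.exp(digamma(gamma))              # exp(psi(gamma_k))
        weighted = trans * e_psi[:, None, None]     # numerator terms, shape (K,S,S)
        resp = weighted / weighted.sum(axis=0)      # normalized over components k
        gamma = alpha + (resp * counts).sum(axis=(1, 2))
    return gamma

K, S = 3, 4
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(S), size=(K, S))            # stand-in components
counts = rng.integers(0, 5, size=(S, S)).astype(float)    # hypothetical N(s2 | s, n)
gamma = update_gamma(np.ones(K), alpha=np.ones(K), trans=trans, counts=counts)
print(gamma / gamma.sum())   # expected mixing proportions, as used in Equation 3.8
\end{verbatim}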
In this chapter, Markov chains, mixtures of Markov chains, and LDA on Markov chains
were described as they relate to opponent modeling in games. For Markov chains
and mixtures of Markov chains, methods for maximum a posteriori inference were
described. For the LDA models, a variational approximation to Bayesian inference
was described. Details were presented that show how these models can be used for
prediction. These methods are evaluated in Chapter 5.
Chapter 4
Reinforcement Learning
In domains where it is not trivial to specify examples of the behavior required by a
learning agent, it may be simple to specify a reward function that indicates how well
they are doing. In this case, a wide variety of reinforcement learning (RL) methods
(Sutton and Barto, 1998) can be used to maximize the expected total reward over time.
For this project, we assume that an opponent in a two player game follows a stationary
policy that does not change over time. At each discrete time step t, a learning agent
and a stationary policy opponent observe the state of the game s_t and their rewards r_t^1 + r_t^2 = 0. The rewards at each time step sum to zero, since we are interested in strictly competitive games. Both agents then select their discrete actions a_t^1 and a_t^2 independently. The state of the game s_{t+1} at time t + 1 and the rewards r_{t+1}^1, r_{t+1}^2 are stochastic functions of the previous state s_t and the actions of both players a_t^1 and a_t^2. The action a_t^2 of the stationary opponent depends on the state of the game s_t and their previous action a_{t-1}^2. Because the opponent's policy does not change over
time, they can be considered to be a feature of the game environment. This allows
the combination of the game and the opponent to be considered a Markov decision
process (MDP), which is the type of domain most RL algorithms are designed for.
Two classes of RL methods are examined in this chapter: model-free and model-based
methods. Model-free methods do not make use of explicit opponent or environment
models, they simply estimate the expected utility (long term reward) of choosing each
action in each state. When selecting an action, an agent can choose the action which
is currently estimated to be optimal; this strategy is called exploitation. Since the
estimates may not be accurate, an agent may need to choose random actions to learn
more about their environment; this process is referred to as exploration. Various
strategies are employed to balance the exploration vs. exploitation trade-off. Model-based methods build explicit models of the environment dynamics, which estimate
how the state and reward values change over time based on actions. The advantage
of model-based methods is that they allow an agent to maintain separate opponent
models and environment models. This enables an agent to reuse general knowledge of
the game when a new opponent is encountered. If the agent also has a good predictive
model of an opponent’s strategy, they can compute an estimated best-response strategy.
Using the methods described in the previous chapter, in some cases it is possible to
learn good predictive models quickly by estimating latent parameters. If the agent
also has a good environment model, then a best response strategy can potentially be
computed early in the game. Using model-free RL methods may take many orders
of magnitude longer than this, depending on the quality of the initial utility estimates.
Games against new opponents may not be very long, so the goal is to obtain maximum
rewards as quickly as possible.
Model-free RL techniques do not build explicit models of an agent’s opponent or environment. Instead, they estimate the expected utility of choosing each action in each
state. The expected utility is the expected total reward, which is weighted to prefer rewards that are received sooner. The model-free RL algorithms Q-learning and
Sarsa(λ) are called temporal difference (TD) methods. Temporal difference methods
learn from experience by continually improving an estimate of the optimal policy,
which is guaranteed to exist in an MDP. They can be used to learn optimal policies
against individual stationary policy opponents, since a game against such an opponent
can be considered an MDP. An MDP is essentially a one-player game, where at each
time t a single agent observes the state st , their reward rt and chooses an action at .
The next state st+1 and next reward rt+1 are stochastic functions of st and at . MDPs
have the Markov property, meaning that the future states and rewards are independent
of all previous states and rewards given knowledge of the current state. The policy of
the stationary opponent will affect the state transition probabilities p(st+1 | st , at ) and
the expected reward values E [rt+1 | st , at ] in the single player MDP. An agent using
the optimal policy would receive the highest reward averaged over time. Model-free
algorithms are used to learn policies as an agent plays the game and may require large
amounts of learning time to find near-optimal policies. An agent may not be able to
play against a new opponent for a sufficiently long amount of time to learn such a policy and may receive low rewards until a large quantity of time has elapsed. Another
drawback of using these methods in a two-player game is that there is no obvious way
to reuse the utility estimates when a new opponent is encountered. Because the new
opponent may behave according to a different policy, the state transition probabilities
and expected reward values can be different. Therefore, the utility of an agent choosing
each action in each state may also be different. It is possible to use the old utility estimates as initial estimates in a game against a new opponent, but the following problem
arises: if the agent has utility estimates from games against several disparate opponents, it is unclear which ones they should use as initial estimates. A principled way
to accomplish this is by making use of explicit opponent models. If an agent has utility estimates and opponent models for several previously encountered opponents, they
can follow a greedy policy and select the utility estimates that correspond to the most
likely opponent model.
In some cases it is possible that latent variable opponent models can be used to quickly
compute good models of new opponents, which have never been encountered previously. These models may allow an agent to predict their opponent’s choices accurately with less data than needed when learning “from scratch.” Although an agent
may compute good estimates of an opponent’s policy from data obtained in previous
games, good utility estimates for their own actions will not immediately follow from
this. Additionally, it would be helpful to reuse a model of the environment when a new
opponent is encountered. In some game situations, the outcome of a player selecting
an action will be independent of the opponent’s action. For example in a two-player
soccer game, an agent can always score a goal if they are right in front of it and the
opponent is very far away. By maintaining separate opponent and environment models, this type of information will not need to be relearned. Furthermore, it is possible
to compute an estimated best-response strategy from a good environment model and a
good opponent model. This allows an agent to attempt to convert predictive opponent
modeling success into actual in-game rewards. Various model-based RL algorithms
(Sutton and Barto, 1998) are available for this purpose and they fall into two broad
categories: sampling methods and full-backup methods. By sampling from an opponent model in conjunction with an environment model, an agent can simulate games
against a particular opponent. Using sample rewards and transitions generated by this
simulation, an agent can use model-free RL algorithms such as Q-learning to learn
value estimates. In (Sutton and Barto, 1998) the use of Q-learning in this way has been
named the Dyna-Q algorithm, which is one of the methods used for this project. The
alternative to sampling is full-backup methods, where an agent systematically considers all possible situations and their possible outcomes to compute value estimates. The
term backup refers to the propagation of value information to states from their successors. This is typically done by dynamic programming algorithms, which are efficient
since they employ bootstrapping, i.e. they systematically improve estimates based
on other estimates. The specific dynamic programming method used in this project
is value iteration (Puterman and Shin, 1978), which uses a series of sweeps through
the state space. Dynamic programming methods can be more efficient than sampling
methods and converge to near-optimal policies faster. For problems with many states,
single sweeps may take a very long time to compute and decent approximations can
be computed by sampling methods in much less time (Sutton and Barto, 1998). The
remainder of this chapter will discuss in detail the use of model-based and model-free
RL methods for a two-player competitive game.
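To illustrate the sampling-based planning idea described above, the following schematic Python sketch draws simulated transitions from an opponent model and an environment model and applies Q-learning backups to them. The model interfaces and toy stand-ins are hypothetical; the Dyna-Q variant actually used is described in Section 4.3.1.

\begin{verbatim}
import random
from collections import defaultdict

def planning_sweep(Q, sample_opponent, sample_env, states, actions,
                   n_samples=1000, alpha=0.1, gamma=0.9):
    """One batch of simulated experience: sample an opponent action from the
    opponent model, query the environment model for the next state and reward,
    and apply a Q-learning backup to each sampled transition."""
    for _ in range(n_samples):
        s, a = random.choice(states), random.choice(actions)
        a_opp = sample_opponent(s)                 # draw from the opponent model
        s2, r = sample_env(s, a, a_opp)            # draw from the environment model
        best_next = max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

# toy stand-ins for the learned models (purely illustrative)
states, actions = [0, 1, 2], [0, 1]
Q = planning_sweep(defaultdict(float),
                   sample_opponent=lambda s: random.choice(actions),
                   sample_env=lambda s, a, o: (random.choice(states),
                                               1.0 if a != o else 0.0),
                   states=states, actions=actions)
\end{verbatim}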
4.1 Formal Notation
An agent is not exclusively concerned with maximizing immediate reward. The goal
in RL is to maximize expected discounted reward (utility), which is defined as follows:

\[
R_t = E\!\left[ r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots \right] = E\!\left[ \sum_{i=1}^{\infty} \gamma^{\,i-1} r_{t+i} \right], \quad \text{where } 0 \le \gamma < 1 \tag{4.1}
\]
Where rt is the actual reward at time t, the parameter γ is called the discount factor,
and Rt is the expected discounted reward at time t. The expectation is computed with
respect to all possible sequences st+1 , . . . of future states. The choice of γ will determine precisely how much preference is given to rewards obtained in the near future
compared to rewards obtained in the distant future. The value of γ will therefore influence the policy that is derived from the utility estimates. If γ is 0, then only immediate
reward is considered. As γ approaches 1, the reward expected in the very distant future
has the same influence on perceived utility as the reward expected in the near future.
Now that Rt has been defined, the goal of learning can be stated explicitly. The goal is
to learn a policy π : S → A from states to actions that maximizes Rt for each starting
state s ∈ S. The problem is that the agent never obtains direct information specifying
the optimal action. The environment only provides the agent with reward values for
pairs of states and actions.
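For a finite reward sequence, Equation 4.1 can be evaluated directly; a minimal sketch with illustrative numbers:

\begin{verbatim}
def discounted_return(rewards, gamma=0.9):
    """Finite-horizon version of Equation 4.1: r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 1.0]))  # 0.81: the reward at the third step is weighted by gamma^2
\end{verbatim}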
The value of being in a state s and following a policy π is defined as:

\[
V^{\pi}(s_t) = E_{\pi}\!\left[ \sum_{i=1}^{\infty} \gamma^{\,i-1} r_{t+i} \,\middle|\, s_t \right] \tag{4.2}
\]
Where r_{t+1}, r_{t+2}, . . . is the sequence of rewards generated by following the policy π when s_t is the starting state. The optimal policy for an MDP is denoted π*.
\[
\pi^*(s_t) = \operatorname*{argmax}_{a_t} E\!\left[ r_{t+1} + \gamma\, V^*(s_{t+1}) \mid s_t, a_t \right] \tag{4.3}
\]
Where argmax_a[f(a)] denotes the choice of a which results in the maximum value of f(a). The value function V^{π*}(s), which corresponds to the optimal policy, is abbreviated as V*(s).
4.2 Model-Free Reinforcement Learning

4.2.1 Q-Learning
Q-learning was introduced by Watkins (1989). Instead of learning an estimate of the
value function V ∗ , an agent using Q-learning continually updates an estimate Q̂ of the
Q-function. The Q-function is also referred to as an action-value function and denotes
the expected value of choosing a given action in a state and then following the optimal
policy π∗ . It is defined as follows.
\[
Q(s_t, a_t) = E\!\left[ r_{t+1} + \gamma\, V^*(s_{t+1}) \mid s_t, a_t \right] \tag{4.4}
\]
Combining Equations 4.3 and 4.4 gives:

\[
\pi^*(s_t) = \operatorname*{argmax}_{a_t} Q(s_t, a_t) \tag{4.5}
\]
Which shows that the optimum policy π∗ is trivial to compute from the Q-function.
The function can be written recursively as:

\[
Q(s_t, a_t) = E\!\left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') \mid s_t, a_t \right] \tag{4.6}
\]
The Q-learning rule is given below, which defines how individual action-value estimates are used to improve other estimates.
\[
\hat{Q}(s_t, a_t) = (1 - \alpha)\, \hat{Q}(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') \right] \tag{4.7}
\]
This follows naturally from Equation 4.6, where 0 ≤ α ≤ 1 is a learning rate (or step-size)
parameter. For each timestep t, equation 4.7 is used to update Q̂(st , at ). Q-Learning
is an off-policy algorithm, since the policy that is learned is independent of the policy
that is followed during the learning process. This can be seen by noting that the above
rule does not refer to at+1 , the action taken by the agent at the subsequent timestep.
If the environment is deterministic, then α = 1 can be used. Otherwise, a smaller
learning rate should be used so that past estimates are not thrown away. The estimated
Q-function is often stored in a two dimensional table, which has a row for each s ∈ S
and a column for each a ∈ A(s), where A(s) denotes the set of allowed actions in
state s. The table can be initialized arbitrarily. If the agent has no prior experience
in the game, the table can be filled with zeros. If the agent has stored a Q-table from
a previous game, they can reuse it as an initial table in a new game. Since the Q-function estimate will not be exact, especially early in the learning process, an agent
must employ an exploration policy to determine when to choose actions estimated
to be optimal and when to choose other actions. Q-learning is a one-step method,
since updates to estimates are based only on information obtained one time-step in the
future. The parameters α and γ can be optimized by trying different combinations in
test games and measuring the total reward obtained. It is important to note that under
certain conditions, the policy learned through Q-learning will converge to the optimal
policy for an MDP. The conditions include reducing the learning rate α over time in a
certain way and experiencing every state-action pair infinitely often.
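The following is a minimal sketch of tabular Q-learning with the update of equation 4.7. It is an illustration rather than the implementation used for this project; the environment interface (reset, step, actions) and the parameter values are assumptions made for the example.

    import random
    from collections import defaultdict

    def max_q(Q, env, s):
        # Maximum estimated action-value in state s (0 if there are no actions, e.g. terminal states).
        return max((Q[(s, a)] for a in env.actions(s)), default=0.0)

    def q_learning_episode(env, Q, alpha=0.1, gamma=0.95, epsilon=0.1):
        # One episode of tabular Q-learning (equation 4.7) with epsilon-greedy exploration.
        s = env.reset()
        done = False
        while not done:
            actions = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(actions)                      # explore
            else:
                a = max(actions, key=lambda act: Q[(s, act)])   # exploit
            s_next, r, done = env.step(a)
            target = r + gamma * (0.0 if done else max_q(Q, env, s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next

    Q = defaultdict(float)  # Q-table initialized to zero, as described above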
4.2.2 One-Step Sarsa
One-step Sarsa is very similar to one-step Q-Learning. The difference is that Sarsa is
an on-policy algorithm, i.e. the Q-function that is learned does depend on the policy
that is followed. Sarsa also continually improves an estimate Q̂ of Q, which is often
stored in a table. The Sarsa update rule is as follows.
$$\hat{Q}(s_t, a_t) = (1-\alpha)\hat{Q}(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1})\right] \qquad (4.8)$$
There is only one difference between the Q-learning rule (equation 4.7) and the Sarsa
learning rule (equation 4.8). In the Sarsa learning rule, the term $\gamma \max_{a'} \hat{Q}(s_{t+1}, a')$ is replaced by $\gamma \hat{Q}(s_{t+1}, a_{t+1})$. This corresponds to updating estimates based on the
actual action taken at time t + 1, rather than the action estimated to be optimal. Sarsa
and Q-learning are both reasonable methods and their performance can be compared
empirically. For large problems, sometimes function approximators (Bishop, 2006)
are used to store Q-functions rather than tables. This is done to prevent tables from
becoming too large and also allows estimates to be generalized to new situations. It is
important to note that when using on-policy methods such as Sarsa with linear function
approximators, guarantees can be made about convergence to near-optimal policies
(Sutton and Barto, 1998). This is not the case for off-policy Q-learning. In addition,
it is straightforward to replace one-step updates in the Sarsa algorithm with multistep updates to obtain the Sarsa(λ) algorithm described in the next section. Sarsa(λ) is
typically more efficient than one-step Sarsa, due to more efficient use of observations,
which affect multiple estimates. It is possible, but much more awkward, to use multistep updates with off-policy methods such as Q-learning.
4.2.3 Sarsa(λ)
For one-step updates, a new estimate of the discounted reward Rt at time t has the
following form.
$$R_t^{(1)} = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})$$
It is perfectly reasonable to use a sequence of actual rewards received further in the
future to update predicted reward. An n-step update can be written as follows.
$$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n})$$
Sutton proposed a method of blending between updates with varying values of n via a
parameter 0 ≤ λ ≤ 1 (Sutton and Barto, 1998). A new type of update was defined as
follows.
$$R_t^{(\lambda)} = (1-\lambda)\left[R_t^{(1)} + \lambda R_t^{(2)} + \lambda^2 R_t^{(3)} + \dots\right] = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
This led to a new class of temporal difference algorithms called TD(λ). When λ is
zero, updates to estimates are identical to one-step TD algorithms. When λ is one,
the algorithms become identical to Monte Carlo methods (Sutton and Barto, 1998),
which involve the use of the entire sequence of actual rewards to update Q-values.
Monte Carlo methods do not bootstrap, i.e. they don’t update estimates based on other
estimates. The λ parameter allows for combinations of these extremes, which can lead
to better results.
$R_t^{(\lambda)}$ cannot be directly applied to real algorithms, since full estimates will not be available until all future reward values are known. In practice, an estimate of $R_t^{(\lambda)}$ can be improved by each future update of the Q-table. Eligibility traces are used to determine the extent to which the current Q-table update affects estimates of $R_{t-1}^{(\lambda)}, R_{t-2}^{(\lambda)}, \dots$ that were made previously.
Formally, the eligibility for a state-action pair $(s, a)$ at time $t$ is defined as $e_t(s, a)$. Note that initially all eligibility trace values are zero, i.e. $e_0(s, a) = 0$ for all $s \in S, a \in A(s)$.
$$e_t(s, a) \equiv \begin{cases} \gamma \lambda\, e_{t-1}(s, a) + 1 & \text{if } s = s_t \text{ and } a = a_t; \\ \gamma \lambda\, e_{t-1}(s, a) & \text{otherwise.} \end{cases}$$
A value δt is defined as follows, where Qt refers to the estimate of Q at time t.
$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \qquad (4.9)$$
At each time t, each Q-table entry is recomputed as follows.
$$Q_{t+1}(s, a) = Q_t(s, a) + \alpha\, \delta_t\, e_t(s, a) \quad \text{for all } s \in S,\ a \in A(s)$$
In section 7.4 of Sutton and Barto (1998), a proof is provided that for finite episodes, the updates converge to the theoretical updates that directly use $R_t^{(\lambda)}$ via advance knowledge of the future. It is easy to see that for λ = 0, equation 4.9 is equivalent to the one-step Sarsa rule (equation 4.8). In that case, $e_t(s_t, a_t) = 1$ and $e_t(s, a) = 0$ for all $(s, a) \neq (s_t, a_t)$.
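A single time-step of Sarsa(λ) with eligibility traces could be sketched as follows. The dictionary-based tables and the parameter values are illustrative assumptions, not the project's actual implementation; only entries with non-zero traces are updated, which is equivalent to sweeping over all state-action pairs.

    from collections import defaultdict

    Q = defaultdict(float)  # action-value estimates
    e = defaultdict(float)  # eligibility traces, all initially zero

    def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95, lam=0.7):
        # TD error delta_t (equation 4.9).
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        # Increase the eligibility of the state-action pair just visited.
        e[(s, a)] += 1.0
        # Update every entry with a non-zero trace, then decay all traces by gamma * lambda.
        for key in list(e.keys()):
            Q[key] += alpha * delta * e[key]
            e[key] *= gamma * lam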
4.2.4 Action Selection
A Q-table approximating the Q-function can be used to define a policy in conjunction
with an action selection strategy. Exploration versus exploitation is a recurring theme
throughout reinforcement learning and there is always a trade-off between the two.
Exploration involves making decisions that are estimated to be suboptimal in hopes
that additional knowledge and ultimately higher rewards can be acquired. Exploitation
entails choosing actions which are thought to be optimal in hopes of obtaining maximum reward, which may lead to low rewards if the estimates are wrong. The following
two methods are commonly used in reinforcement learning to balance exploration and
exploitation (Sutton and Barto, 1998).
Epsilon-Greedy (ε-greedy) exploration is a particularly simple way to choose actions
based on a Q-table. The parameter ε determines the probability that an agent will
choose an exploratory action. At each timestep, an agent chooses an action of maximal
Q̂-value for the current state with probability 1 − ε. With probability ε, the agent
chooses randomly from the set of possible actions. In this case, each action has an
equal probability of being selected. The disadvantage of this method is that when an exploratory action is taken, an action with a relatively high Q̂-value has the same probability of being selected as an action with a very low value. The probability of
selecting the action which is actually optimal may be low. Softmax exploration can
provide a solution to this.
Using Softmax exploration, when the environment is in state s, the probability of selecting action a ∈ A(s) is as follows.
$$\frac{\exp(\hat{Q}(s, a)/\tau)}{\sum_{a' \in A(s)} \exp(\hat{Q}(s, a')/\tau)}$$
The parameter τ is referred to as the exploration temperature. As τ approaches infinity,
there is a uniform probability of selecting each action. As τ approaches zero, the probability of choosing the action with the highest Q-value approaches one. Intermediate
values of τ allow for a blend of exploration and exploitation, where actions with higher
Q-values have exponentially higher probabilities of being selected. Actions with low
Q-values will rarely be selected, while there is a high probability of selecting an action
with a relatively high Q-value. This allows exploration to be focused on actions which
are estimated to be near-optimal. The temperature τ is usually decreased over time.
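The two exploration strategies above can be sketched as follows; the Q-table representation and the default values of ε and τ are illustrative assumptions.

    import math
    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # With probability epsilon choose uniformly at random, otherwise act greedily.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def softmax_action(Q, s, actions, tau=0.5):
        # Softmax (Boltzmann) selection; the maximum is subtracted for numerical stability.
        q_max = max(Q[(s, a)] for a in actions)
        weights = [math.exp((Q[(s, a)] - q_max) / tau) for a in actions]
        return random.choices(actions, weights=weights)[0]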
4.3 Model-Based Reinforcement Learning
4.3.1 Dyna-Q
The Dyna-Q algorithm (Sutton and Barto, 1998) combines regular one-step Q-learning
with planning. In this case, the term planning describes the use of a model of the environment to improve a value function estimate. The algorithm accomplishes this by
building environment models and then learning (updating value function estimates)
based on simulated experience from the model. The parameters α, γ, and the exploration strategy required for Q-learning must also be specified for the Dyna-Q algorithm. An additional parameter N specifies the number of planning steps to be performed in the simulated environment for each time-step of the actual environment.
The algorithm can be summarized as follows. At each time-step t, observe the state st
of the actual environment. Then choose an action at according to an exploration policy
and observe the next state st+1 and the next reward rt+1 . Update the Q-function estimate using the Q-learning rule (equation 4.7). Update the model of the environment
using the information st , at , st+1 and rt+1 . More details for this step are given later.
Finally, perform N planning steps. Each planning step consists of choosing a random
previously observed state s, a random action a previously used in state s, and sampling
from the model to obtain the next state s0 and the reward r. The Q-function estimate is
then updated using the Q-learning rule in conjunction with the information a, s, s0 and
r. After N planning steps have been completed, the agent waits to observe the next
state of the real environment and the process is repeated. For N = 0, this algorithm is
equivalent to standard one-step Q-learning.
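The Dyna-Q loop just described can be sketched as follows. The environment interface, the parameter values, and the simple model representation (which stores only the most recent observed outcome for each state-action pair) are illustrative assumptions; the project's actual model, with separate environment and opponent components, is described next.

    import random
    from collections import defaultdict

    def max_q(Q, env, s):
        return max((Q[(s, a)] for a in env.actions(s)), default=0.0)

    def dyna_q_episode(env, Q, model, N=10, alpha=0.1, gamma=0.95, epsilon=0.1):
        s = env.reset()
        done = False
        while not done:
            actions = env.actions(s)
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda act: Q[(s, act)]))
            s_next, r, done = env.step(a)
            # Direct RL: one-step Q-learning update from the real transition.
            target = r + gamma * (0.0 if done else max_q(Q, env, s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # Model learning: remember the observed outcome for (s, a).
            model[(s, a)] = (s_next, r)
            # Planning: N extra updates from experience sampled from the model.
            for _ in range(N):
                (ps, pa), (ps_next, pr) = random.choice(list(model.items()))
                p_target = pr + gamma * max_q(Q, env, ps_next)
                Q[(ps, pa)] += alpha * (p_target - Q[(ps, pa)])
            s = s_next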
For the algorithms previously described in this chapter, the opponent was considered
to be a feature of the environment. For a two-player competitive game situation, we
can maintain separate environment (game) and opponent models as discussed previously. This allows an agent to make use of prior knowledge of the game, which is
independent of the opponent. It also allows an agent to benefit from predictive opponent modeling techniques described in the previous chapter. For the model-free RL
methods, the agent did not explicitly observe the opponent’s action; it was simply a
factor which affected the dynamics of the overall environment. At this point, it is useful to make the opponent’s action explicit. At each time-step of the actual environment,
an agent observes the state $s_t$ and chooses an action $a_t^1$. Then they observe their opponent's action $a_t^2$, the next state $s_{t+1}$, and the rewards $r_{t+1}^1$ and $r_{t+1}^2$, where $r_{t+1}^1 + r_{t+1}^2 = 0$. The next state $s_{t+1}$ depends only on $s_t$, $a_t^1$ and $a_t^2$, because of the Markov property of the game. Note that $s_t$ includes the opponent's previous action $a_{t-1}^2$, but it is helpful to separate the terms by replacing $s_t$ with $g_t$ and $a_{t-1}^2$, where $g_t$ is referred to as the game state. The game state $g_t$ contains general information about the state of the game, but does not explicitly represent past opponent actions. The overall environment state $s_t$ contains the game state and the opponent's most recent action $a_{t-1}^2$. It can now be seen that $s_{t+1}$ is independent of $a_{t-1}^2$ given the opponent's current action $a_t^2$. This observation reduces the complexity of the model, which reduces the amount of data required to learn the model. The reward $r_{t+1}^1$ depends on $g_t$, $g_{t+1}$, $a_t^1$, $a_t^2$ and is conditionally independent of $a_{t-1}^2$ given the other information. The dependence of the reward on $g_{t+1}$ can be removed by computing the expectation over $g_{t+1}$ using the state transition probabilities. During model learning,
the dependence is explicit, but when policies are derived the dependency is effectively
removed.
The strategy for learning a game model which is independent of the opponent is the following. Define $T(g' \mid g, a^1, a^2)$ to be the number of times the transition from game state $g$ to $g'$ was observed given that actions $a^1$ and $a^2$ were chosen by each player. These counts are initially zero for all state transitions and action choices. When $g, g', r^1, a^1, a^2$ is observed, the environment model is updated using the following rule: $T(g' \mid g, a^1, a^2) \leftarrow T(g' \mid g, a^1, a^2) + 1$, where $x \leftarrow y$ indicates that estimate $x$ is replaced by estimate $y$. Expected reward values are also maintained of the form $E[r^1 \mid g, g', a^1, a^2]$ and all values are initially zero. After the transition counts $T$ have been updated, the expected rewards are updated as follows: $E[r^1 \mid g, g', a^1, a^2] \leftarrow (1-\beta)E[r^1 \mid g, g', a^1, a^2] + \beta r^1$, where $\beta = 1/T(g' \mid g, a^1, a^2)$. At any given time, the estimate computed using this rule is equivalent to the average reward received given $g$, $g'$, $a^1$ and $a^2$.
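These update rules might be implemented as follows; the dictionary keys and the function name are illustrative, and only the counting and running-average logic follows the text.

    from collections import defaultdict

    T = defaultdict(int)    # transition counts: T[(g, a1, a2, g_next)]
    E = defaultdict(float)  # expected rewards:  E[(g, a1, a2, g_next)]

    def update_environment_model(T, E, g, a1, a2, g_next, r1):
        # Increment the transition count, then update the running average reward,
        # which is equivalent to the mean of all rewards observed for this transition.
        T[(g, a1, a2, g_next)] += 1
        beta = 1.0 / T[(g, a1, a2, g_next)]
        E[(g, a1, a2, g_next)] = (1 - beta) * E[(g, a1, a2, g_next)] + beta * r1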
Sampling from the game model is performed as follows: for a previously observed combination of $a^1$, $g$ and $a^2_{-1}$ (the opponent's previous action), the opponent's action $a^2$ is determined by sampling from the opponent model, which has the form $p(a^2 \mid g, a^2_{-1})$. The subsequent state is then sampled from $p(g' \mid g, a^1, a^2)$, which is computed from $T$ by normalizing the distribution as follows:
$$p(g' \mid g, a^1, a^2) = \frac{T(g' \mid g, a^1, a^2)}{\sum_{g''} T(g'' \mid g, a^1, a^2)}$$
This maximum likelihood solution is sufficient for building the model. It is not necessary to put a Dirichlet prior on the state transition probabilities, since the model will
not be used for prediction. The model is only used for sampling, so it does not make
sense to assume transitions that were never observed are possible. The stored value $E[r^1 \mid g, g', a^1, a^2]$ is then used directly as the agent's reward in the simulated environment, i.e. the agent receives the reward $r^1 = E[r^1 \mid g, g', a^1, a^2]$ in the simulation.
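Sampling a simulated experience tuple from these models might then look like the following sketch. It takes the T and E dictionaries from the previous sketch as inputs, assumes a hypothetical opponent_model.sample(g, a2_prev) interface for drawing from $p(a^2 \mid g, a^2_{-1})$, and assumes the resulting (g, a1, a2) combination has been observed at least once.

    import random

    def sample_simulated_step(T, E, g, a1, a2_prev, opponent_model):
        # Draw the opponent's action from the opponent model p(a2 | g, a2_prev).
        a2 = opponent_model.sample(g, a2_prev)
        # Collect the observed successors of (g, a1, a2) and sample g' in proportion
        # to the transition counts, which is equivalent to the normalized distribution.
        successors = {gn: c for (gg, x1, x2, gn), c in T.items() if (gg, x1, x2) == (g, a1, a2)}
        g_next = random.choices(list(successors), weights=list(successors.values()))[0]
        # The stored expected reward is used directly as the simulated reward.
        r1 = E[(g, a1, a2, g_next)]
        return a2, g_next, r1

Scanning all of T for successors is inefficient; the sparse storage described below avoids this by keeping the successors of each observed (g, a1, a2) key together.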
The transition counts and expected rewards are independent of the opponent’s policy,
since they are only affected by the opponent’s most recent action, which is explicitly
represented. This allows an agent to reuse these models in new games against new
opponents. In fact, an agent only needs to initialize the values of the model to zero
the first time they play the game. In all subsequent games, they can simply load the
previous model at the start, update the model as they play the game, and then save
the model for use in the next game. The opponent’s policy will have some effect on
the parts of the model that are updated. If an opponent never chooses an action, then
transitions involving that action will never be observed. The result will be that some
parts of the models are based on more real data than others, but this will not bias the
results since the transition probabilities are normalized before use.
Practical issues must be considered when deciding how to store such a model. The
simplest way is by using a table, but the table will have size $|g|^2 |a|^2$, where $|g|$ is the number of states (not including the opponent's past action) and $|a|$ is the average number of actions available to each player in each state. For 1000 states and 5 actions per player per state,
the table would have 25 million entries. In practice the table will be very sparse, since
many transitions will never occur. It will not normally be possible for transitions to happen between any two arbitrary states; they must be near each other in some sense for games
which are reasonably natural (i.e. not completely artificial). The solution adopted
for this project is the implementation of sparse tables via hash maps. Hash maps are
essentially tables, but rather than being indexed by row number, they are indexed by
information (such as a state description). Efficient algorithms allow for hash maps to
grow automatically as the number of observations increases. In this project, for each
observed state and all pairs of actions, the successor states are stored in hash maps.
This reduces storage requirements by orders of magnitude compared to full-sized tables. An alternative is to store the transition count and expected reward values using
function approximators, such as neural networks (Bishop, 2006) or related methods.
This may allow good models to be learned from less data, but may introduce new
problems of accuracy. Such methods may generalize observations to new situations
in a way that is not accurate. For this project, we assume that an agent has enough
previous experience to learn a model which is stored via a hash map.
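A sparse version of this storage scheme might be sketched as follows, keeping only the successor states that were actually observed for each (g, a1, a2) key; the data layout is an illustrative assumption rather than the project's exact implementation.

    from collections import defaultdict

    # model[(g, a1, a2)] is a hash map from observed successor states to
    # [count, average reward]; unobserved transitions occupy no memory at all.
    model = defaultdict(dict)

    def record_transition(model, g, a1, a2, g_next, r1):
        entry = model[(g, a1, a2)].setdefault(g_next, [0, 0.0])
        entry[0] += 1
        beta = 1.0 / entry[0]
        entry[1] = (1 - beta) * entry[1] + beta * r1

Compared with the full-sized table, only transitions that actually occur consume memory, and looking up the successors of a given (g, a1, a2) requires a single hash-map access.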
4.3.2 Value Iteration
Dynamic programming (DP) methods, such as value iteration, provide an alternative to
estimating a value function from opponent and environment models based on samples.
DP methods allow estimates to be derived directly from opponent and environment
models via systematic sweeps through the state space. Recall that the value of being in state $s$ and following policy π is defined as follows (restating equation 4.2).
$$V^\pi(s_t) = E_\pi\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i} \,\middle|\, s_t\right]$$
Note that Eπ [x | st ] denotes the expectation of x over all future actions and states
at , st+1 , at+1 , st+2 , . . . given st and the policy π. This can be expanded into the recursive
Bellman equation (Bellman, 1957) for $V^\pi$ as follows (Sutton and Barto, 1998).
$$\begin{aligned}
V^\pi(s) &= E_\pi\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i} \,\middle|\, s_t = s\right] \\
&= E_\pi\left[r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1} \,\middle|\, s_t = s\right] \\
&= \sum_{a} p(a \mid s, \pi) \sum_{s'} p(s' \mid s, a) \left\{ E[r \mid s, s', a] + \gamma\, E_\pi\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1} \,\middle|\, s_{t+1} = s'\right] \right\} \\
&= \sum_{a} p(a \mid s, \pi) \sum_{s'} p(s' \mid s, a) \left[ E[r \mid s, s', a] + \gamma V^\pi(s') \right] \qquad (4.10)
\end{aligned}$$
The Bellman equation 4.10 implies a consistency requirement between the values of V
for different states. This equation can be used as an update equation, which computes
updated estimates based on past estimates. It is shown in (Sutton and Barto, 1998) that
there is an optimal deterministic policy in an MDP. For the optimal policy, the action
which maximizes the expected return in each state will be chosen. The optimal policy
π∗ has the following form, since only the greedy (estimated optimal) action will be
chosen in each state.
$$\pi^*(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\left[E[r \mid s, s', a] + \gamma V^*(s')\right] \qquad (4.11)$$
The value function V ∗ for the greedy policy π∗ can be written as follows.
$$V^*(s) = \max_a \sum_{s'} p(s' \mid s, a)\left[E[r \mid s, s', a] + \gamma V^*(s')\right] \qquad (4.12)$$
This is equivalent to equation 4.10 under the greedy policy. The value iteration algorithm (Sutton and Barto, 1998) follows directly from equation 4.12. Let Vi (s) be the
estimate of V ∗ (s) at iteration i of the algorithm. The initial values V0 (s) are initialized
arbitrarily, e.g. with the value zero. At each step of the algorithm, Vi (s) is computed
from Vi−1 for all s as follows.
$$V_i(s) = \max_a \sum_{s'} p(s' \mid s, a)\left[E[r \mid s, s', a] + \gamma V_{i-1}(s')\right]$$
Note the similarity to the Bellman equation 4.12 for the optimal policy π∗. Convergence can be tested by measuring $\max_s |V_i(s) - V_{i-1}(s)|$ and stopping when the value
is less than a small positive number. If convergence has not occurred, the process is repeated for another iteration. When the algorithm terminates, a policy is returned which
meets the criteria of equation 4.11. This can be computed easily by keeping track of
which action has the maximum value in each state during execution of the algorithm.
For this project, the value iteration algorithm is modified to use separate environment
and opponent models. This is done by replacing $s$ with $g, a^2_{-1}$ and ultimately factoring the joint distribution of $g'$ and $a^2$, where $g$ represents the state of the game and $a^2_{-1}$ represents the opponent's last action. Starting with the Bellman equation 4.12 for π∗, occurrences of $s$ are replaced with $g, a^2_{-1}$ and occurrences of $s'$ are replaced with $g', a^2$ as follows. Note that $a$ has been replaced with $a^1$ for clarity since both players are now considered.
$$V^*(g, a^2_{-1}) = \max_{a^1} \sum_{g', a^2} p(g', a^2 \mid g, a^2_{-1}, a^1)\left[E[r \mid g, a^2_{-1}, g', a^2, a^1] + \gamma V^*(g', a^2)\right]$$
The joint distribution of $g'$ and $a^2$ is then factored by using the product rule and then applying the independence assumptions $I(a^2, a^1 \mid g)$ (the players make independent action choices), $I(g', a^2_{-1} \mid a^2)$ (the next state is independent of the opponent's last action given their current action), and $I(r, a^2_{-1} \mid a^2)$ (the next reward is also conditionally independent of the last action).
$$V^*(g, a^2_{-1}) = \max_{a^1} \sum_{g', a^2} p(a^2 \mid g, a^2_{-1})\, p(g' \mid g, a^2, a^1)\left[E[r \mid g, g', a^2, a^1] + \gamma V^*(g', a^2)\right]$$
Finally, the above can be simplified by moving one of the summations.
$$V^*(g, a^2_{-1}) = \max_{a^1} \sum_{a^2} p(a^2 \mid g, a^2_{-1}) \sum_{g'} p(g' \mid g, a^2, a^1)\left[E[r \mid g, g', a^2, a^1] + \gamma V^*(g', a^2)\right]$$
This Bellman equation can then trivially be made into the following update equation
for value iteration.
$$V_i(g, a^2_{-1}) = \max_{a^1} \sum_{a^2} p(a^2 \mid g, a^2_{-1}) \sum_{g'} p(g' \mid g, a^2, a^1)\left[E[r \mid g, g', a^2, a^1] + \gamma V_{i-1}(g', a^2)\right]$$
This equation represents the form of value iteration as derived and used for this project.
Note that $p(a^2 \mid g, a^2_{-1})$ corresponds directly to the opponent model, $p(g' \mid g, a^2, a^1)$ corresponds to the transition probabilities of the environment model, and $E[r \mid g, g', a^2, a^1]$ represents the reward expectations of the environment model. The factoring of the joint distribution of $g'$ and $a^2$ is important, since it reduces the complexity of the environment model that is required. The same environment model can be used given $g$, $a^2$ and $a^1$, independently of the value of $a^2_{-1}$. This assumption is reasonable because of the
Markov property and permits the use of a simpler environment model, which can be
learned from less data.
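One sweep of this factored value-iteration update might be sketched as follows; the data layout of the opponent model, transition probabilities, and reward expectations, as well as the parameter values, are illustrative assumptions.

    # opp[(g, a2_prev)]            -> {a2: p(a2 | g, a2_prev)}        (opponent model)
    # trans[(g, a2, a1)]           -> {g_next: p(g_next | g, a2, a1)} (environment model)
    # reward[(g, g_next, a2, a1)]  -> E[r | g, g_next, a2, a1]        (environment model)
    def value_iteration_sweep(V, states, agent_actions, opp, trans, reward, gamma=0.95):
        V_new, max_change = {}, 0.0
        for (g, a2_prev) in states:
            best = float("-inf")
            for a1 in agent_actions:
                total = 0.0
                for a2, p_a2 in opp.get((g, a2_prev), {}).items():
                    inner = 0.0
                    for g_next, p_g in trans.get((g, a2, a1), {}).items():
                        inner += p_g * (reward.get((g, g_next, a2, a1), 0.0)
                                        + gamma * V.get((g_next, a2), 0.0))
                    total += p_a2 * inner
                best = max(best, total)
            V_new[(g, a2_prev)] = best
            max_change = max(max_change, abs(best - V.get((g, a2_prev), 0.0)))
        return V_new, max_change

Sweeps would be repeated until max_change falls below a small threshold, at which point the action achieving the maximum in each state defines the greedy policy.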
This chapter described model-free and model-based RL methods as they relate to a two-player competitive game. For model-based methods, the use of separate opponent and environment models was described.
Chapter 5
Evaluation
5.1 Domain - The Soccer Game
The two player competitive game that was used for this project is a one-on-one soccer
game (Figure 5.1). I have previously developed this three-dimensional soccer game
and it is based on a physics (rigid-body dynamics) simulator. The physics objects in the
game include spheres with linear and angular momentum, as well as stationary planes,
cylinders and boxes. Relevant physical equations are solved numerically and the states
of the game’s physical objects are updated at a rate of 100 times per game-second.
Players in the game are represented as spheres and the soccer ball is represented as a
smaller sphere. Each player has their own goal to defend. There are also posts in each
corner with which spheres can collide. To control a player, a human or an automated
agent provides a two-dimensional vector representing the desired acceleration. They
also provide a Boolean value representing their desire to kick the ball. Once a player
or the soccer ball is set into motion, they are gradually slowed by friction. Stopping a
player requires acceleration in the opposite direction of the velocity. Players and the
soccer ball can collide with the walls or corner posts and will lose a portion of their
momentum in the process.
Players acquire the ball by colliding with it. Once this happens, the ball is automatically
attached to the player and follows them around. The player will automatically reposition the ball as their velocity changes, trying to keep the direction from the player to
the soccer ball similar to the player’s direction of movement. Kicking the ball releases
the soccer ball in a straight line. The direction of travel is equal to the angle from
Figure 5.1: The Soccer Game – The blue player is about to score a goal. The soccer
field is flat and the white lines don’t have any particular meaning. Walls along the four
sides become visible when there is a collision with them.
the player to the soccer ball plus a small amount of Gaussian noise. The velocity of
a released ball is proportional to the player’s momentum. The ball can also be stolen
from an opponent. To do this, a player needs to first touch the ball while the opponent
has it. Then the player needs to get it away from the opponent before they can touch it
again. This can happen very quickly and the success or failure of stealing can seem to
be somewhat arbitrary. Players score one point for kicking the ball into the opponent’s
goal. They lose a point if they kick the ball into their own goal, which could happen
accidentally since the ball can rebound from the walls.
In the simplified version of the game used for this project, agents do not directly provide acceleration vectors for control. Instead, they make use of a set of low-level
controllers which compute acceleration vectors from game state information. At each
time-step of the game, an agent will select the controller that is used until the following
time-step. At any given time, an agent has approximately four controllers to choose
from. This allows agents to learn to play the game much more easily, by simplifying
the action space from a continuous two-dimensional space to a discrete space with a
small set of allowed actions. Agents observe the game state and update their action
choices ten times per game-second. The controllers are each linked to common behaviors in the game, such as intercepting the ball or retreating to the player's own goal. This allows
an agent to express their policy more concisely than when acceleration is used directly
for control, but it can prevent agents from performing actions that are optimal in the
underlying game. In a previous project (Collins and Rovatsos, 2006), model-free RL
agents using Q-learning and Sarsa(λ) were implemented for the simplified game. The
low-level controllers were implemented with neural networks, which were trained from
data. An alternative implementation of the controllers was also provided, in which the
controllers used rules and hard-coded transformations of the game-state data to acceleration commands. Both sets of controllers worked reasonably well, but the hard-coded
controllers were chosen for this project since their output was more consistent. The
agents using model-free RL to learn policies based on these controllers were always
able to defeat opponents using hand-designed policies given enough learning time.
The amount of learning time needed was typically long (hours of game-time) and it
wasn’t always clear when value function estimates could be reused. The model-based
RL methods described in this project have the potential to reduce the learning time
to fractions of a game-hour and provide a principled way of reusing opponent and
environment models when a new opponent is encountered.
The set of low-level controllers is designed to facilitate a wide range of strategies (mappings from situations to controllers) that could be used for playing the game. The following is the list of controllers designed for the one-on-one soccer game: intercepting the ball (Intercept), retreating to the player's own goal (Retreat), attempting to get close to the potential path of a kicked ball (Defend), aiming the ball toward the opponent's goal (LineUp), advancing toward the opponent's goal (Advance), stealing the ball (Steal), and preventing an opponent from stealing (KeepBallSafe). Each of these controllers is available to an agent as a primitive action, with the exception of LineUp, for which
three variations are provided to aim the ball toward different sections of the goal, thus
allowing for a greater variety of strategies. Finally, there is a Kick behavior that simply
executes the kick command in the game.
Agents could potentially use continuous information about the state of the game to
make decisions, but for simplicity a discrete state space is used which preserves some
of the most important information. The state space defined for the domain has 1294
states and was based on quantized values of the most important information, such as
the player’s proximity to the soccer ball and some historical information about recent
ball possession. Without any historical information, an agent may observe that they do
not have the ball immediately before they score a goal and may learn that not having
the ball is good. A fairly simple reward function is used for the purposes of learning
a policy in the soccer game. It would be possible to simply provide the agent with
a positive reward when they score a goal and a negative reward when the opponent
does. Slightly more information is included, which increases the speed of learning.
The reward given to an agent at each time-step is represented in table 5.1. In actual
games, I have found that the total reward given to each player is typically proportional
to the final score difference. This suggests that the extra information in the reward
signal will not dramatically affect the types of policies that are learned.
Table 5.1: The reward signal, for each timestep of the game
+10  If a goal has just been scored in the opponent's goal.
-10  When a goal is scored in the player's goal.
-5   If a player's shot has missed the opponent's goal.
+5   If an opponent's shot has missed the player's goal.
+1   If the player has just obtained the ball.
-1   If the player has just lost possession of the ball.
-1   If the opponent has just obtained the ball.
+1   If the opponent has just lost possession of the ball.
0    If none of the above conditions apply.
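As an illustration only, the reward signal of Table 5.1 could be expressed as a simple function of per-timestep event flags; the flag names are hypothetical and the ordering assumes at most one condition holds per timestep.

    def reward_signal(scored, conceded, shot_missed, opp_shot_missed,
                      gained_ball, lost_ball, opp_gained_ball, opp_lost_ball):
        # Values follow Table 5.1; the boolean event flags are hypothetical names.
        if scored:           return 10
        if conceded:         return -10
        if shot_missed:      return -5
        if opp_shot_missed:  return 5
        if gained_ball:      return 1
        if lost_ball:        return -1
        if opp_gained_ball:  return -1
        if opp_lost_ball:    return 1
        return 0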
5.2 Opponents
The opponents used for the learning agents are stationary policy hand-designed opponents, which can be modeled perfectly by first-order Markov models as described
in Chapter 3. These opponents use stochastic policies with distributions of the form
$p(a_t^2 \mid g_t, a_{t-1}^2)$, where $a_t^2$ is the opponent's action at time $t$, $a_{t-1}^2$ is the opponent's previous action, and $g_t$ represents the state of the game at time $t$. Note that $g_t$ does not
explicitly include any information about past actions of either player. It includes information related to recent possession of the ball and the positions of both players and
the ball. These players essentially have some internal state, since their actions can be
based on their past actions. For the same game state gt , this allows them to choose
actions from different distributions depending on what their last action was. Opponents that can maintain hidden internal states constitute a richer class of agents. These
agents would be able to base their decisions on hidden internal information, which
they can update as the game progresses. For instance, an agent may be more aggressive if it is winning a game and more defensive if it is losing. An opponent would
not necessarily know that the agent is basing their decisions on this information. Such
opponents could be properly modeled with hidden Markov models (HMMs) (Rabiner,
1989), which have been studied extensively in natural language processing. There are
efficient methods for performing inferences with HMMs, but the focus of this project
was to show that the combination of model-based RL and opponent modeling is helpful. To this end, we decided to use simple opponents (with no hidden state) so that it
would be easier to investigate the main issues.
The set of hand designed opponent policies used in this chapter is the PS class of opponents, where PS stands for parameterized stochastic opponent. The main parameter
for the opponent is the defensiveness parameter, which takes values between zero and
one, including the extreme values zero and one. When the defensiveness parameter is
zero, the agent never selects defensive behaviors. If they don’t have the ball (the opponent may have it or nobody has it) they always choose to attempt to intercept or steal
it. When the parameter is one, the agent will always defend its own goal rather than
attempting to obtain the ball. As the parameter varies from zero to one, the probability
of the agent choosing a defensive action in each situation (when they don’t have the
ball) increases nonlinearly from zero to one. The behavior is nonlinear in the parameter to allow an agent to prefer defensive behaviors in certain situations more than other
situations. Using a nonlinear function allows the probabilities to change at different
rates as the parameter changes, while still allowing them to be zero and one at the limits. This is illustrated in Figure 5.2. The parameter values are written in parentheses,
so the opponent with parameter zero is named PS(0) and opponents with other values
are named similarly. All PS class opponents employ the same strategy while they have
the ball, which is a combination of moving closer to the opponent’s goal (Advance),
aiming for an open spot on the opponent's goal (LineUp) and retreating from the opponent when they are close (KeepBallSafe). Because all of the PS opponents have
this common pattern of behavior, it should not be necessary to relearn this when a new
one is encountered. The LDA model provides the potential to do this. Note that it
is possible to attempt to infer the value of the parameter directly, rather than learning
an LDA model. This approach was not chosen, because it would only work for the
specific class of opponents, while the LDA approach has the potential to work well in general.
Figure 5.2: The effect of the defensiveness parameter on the PS opponent’s action
choices. The probability of selecting a defensive action is shown for selected situations.
[Figure: probability of selecting a defensive action (y-axis, 0 to 1) versus the defensiveness parameter (x-axis, 0 to 1), for three situations: nobody has the ball and the opponent is closer to it; the opponent has the ball and is close to the player; the opponent has the ball and is far from the player.]
5.3 Opponent Model Evaluation
This section evaluates the performance of the opponent modeling methods described
in Chapter 3. Some of the data used in this section was produced through running the game,
but actual game results are not presented until the next section. Instead, predictive
performance is presented in terms of predictive perplexity and percent accuracy of
predictions. In many cases the LDA models have better predictive performance than
the mixture models. However, when new opponents are encountered in new games,
better predictive performance can be achieved by using a Markov model trained with
observed data from the current game (i.e. learning from scratch).
Figure 5.3 shows typical progress of the variational learning algorithm for the LDA
models. Each mixture component has a corresponding transition probability matrix
which is randomly initialized. The log likelihood of the data always increases for each
iteration until convergence is detected (the change decreases below a small positive
number). For typical data from opponents in the game (1294 states, 4 actions per state
Figure 5.3: Typical progress of the variational algorithm for learning gamma values (approximate mixing proportions) for each training sequence and a single set of transition
matrices. For 1000 states, 5 actions per state, and 15 training sequences of length
20000 each, every iteration takes up to 30 seconds on a 2.4 GHz machine. The log of
the likelihood function of the training data is shown on the y-axis.
[Figure: log likelihood of the training data (×10^4, y-axis) versus iterations on a logarithmic scale (x-axis).]
on average), the algorithm may take a couple hours to converge on a 2.4GHz machine,
when using 25 sequences of length 20000 for training. Running the algorithm multiple
times with different initial transition matrices usually leads to similar log likelihood
values. For the mixture models, the EM algorithm typically converges in less than
20 iterations and each iteration is usually faster than an iteration for the LDA model
variational algorithm. Similarly to the LDA model, each mixture component has a
corresponding transition matrix which is initialized randomly. It is usually necessary
to run the EM algorithm for mixture models up to 10 or 20 times and then choose
the model with the highest log likelihood of the data. The algorithm tends to quickly
converge to a local maximum, and the results of some runs of the algorithm can be
significantly better than others.
5.3.1 Predictive Performance
Predictive perplexity is a measure of how well a predictive model fits the observed data
and was used in (Blei et al., 2003; Girolami and Kaban, 2005) to measure LDA model
performance. It is defined as follows.
$$\exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(a_t^2 \mid g_t, a_{t-1}^2)\right) \qquad (5.1)$$
Perplexity can be interpreted as how “perplexed” or “surprised” a person using a model
is upon learning an actual outcome. If the model always assigns probability one to the
actual outcome, the perplexity will have the optimum value of one. Otherwise, the
perplexity will be higher. Consider the rolling of a die with K sides. If a model assigns
uniform probability to each of the K possible outcomes, the predictive perplexity will
be K. For a fair K-sided die, the best possible predictive perplexity will be K. When a
model in any domain has predictive perplexity K (a real number ≥ 1), the interpretation
is that the model is as good at predicting the results as the optimal model for predicting
the outcome of K-sided die rolls. It is also worth looking at percent accuracy for
predictions, which simply measures the number of correct predictions relative to the
total number of predictions made. If the penalty for a wrong prediction is constant, i.e.
does not depend on what the actual results are, then a rational agent’s predictions will
always correspond to the action with the highest posterior probability. The rational
agent does this to minimize the expected penalty. If an opponent performs certain
rare actions with low (but non-zero) probability in certain states, then a rational agent
will never predict these actions. Prediction accuracy alone is not sufficient to measure
the quality of a model. Two models may have the same prediction accuracy, but the one with more accurate probability estimates will be less "perplexed" when the rare actions occur. This difference in quality is quantifiable by examining predictive perplexity.
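Equation 5.1 can be computed directly from the probabilities a model assigned to the actions the opponent actually took; a minimal sketch with an illustrative example follows.

    import math

    def predictive_perplexity(probs):
        # probs[t] is the probability the model assigned to the opponent action
        # actually observed at time t (equation 5.1).
        T = len(probs)
        return math.exp(-sum(math.log(p) for p in probs) / T)

    # A uniform model over 4 actions is exactly as "perplexed" as a fair 4-sided die.
    print(predictive_perplexity([0.25] * 100))  # approximately 4.0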
In the following discussion, LDA models and mixture models were trained off-line,
i.e. before the test games began. After this training, the transition probability matrices
of both models were constant. The predictive performance was compared to Markov
models which were trained on-line with data in the test games. Such Markov models
are referred to as observed models in this chapter. For the observed models, a Dirichlet
prior was used for each row of the transition matrix in conjunction with maximum a
posteriori predictions (MAP). Note that all fixed policy opponents in the game can be
modeled with Markov models. The difference with LDA and mixture models is that
they can represent multiple heterogeneous opponents. For any individual opponent, a
Markov model will be as good as any other models given enough data. For the LDA
and mixture models, three sets of training opponents were used. The models were
then learned from data generated by opponents in the training set. The next paragraphs
describe the different training sets used and the motivation for their use. For LDA
models, k = 10 components was selected, which was arbitrary since the opponents
may not actually fit an LDA model. The number was chosen to be large enough that
opponents could use more than just a couple components to compose their strategies.
For mixture models, the number of components was set to the number of unique training opponents. This works well in the case that opponents are not duplicated in the
training set. The choice of k is discussed in more detail later in this chapter.
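For the observed models mentioned above, a per-row Dirichlet prior amounts to smoothing the observed action counts. The following sketch illustrates the idea; the prior strength and the exact estimator used in the project are not specified here, so the values and data layout shown are assumptions.

    from collections import defaultdict

    counts = defaultdict(lambda: defaultdict(int))  # counts[(g, a2_prev)][a2]

    def observe(g, a2_prev, a2):
        counts[(g, a2_prev)][a2] += 1

    def predict(g, a2_prev, actions, alpha=0.5):
        # Dirichlet-smoothed estimate of p(a2 | g, a2_prev); alpha is an assumed prior strength.
        row = counts[(g, a2_prev)]
        total = sum(row.values()) + alpha * len(actions)
        return {a2: (row.get(a2, 0) + alpha) / total for a2 in actions}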
The training set Train1 includes the opponents PS(0) and PS(1), which have the most
extreme parameter values allowed. LDA and mixture models have high predictive
performance for opponents in the training set. The LDA model from this set has good
predictive performance for PS opponents with parameters other than 0 and 1, while the
mixture model does not. This is because such opponents have behaviors that are used
by either PS(0) or PS(1), but not both. For example, some action choices of the PS(0.5)
opponent are never made by PS(0), while others are never made by PS(1). Because
all choices of PS(0.5) are sometimes made by either PS(0) or PS(1), the LDA model
performs well. A mixture model does not allow the information to be combined, since
the distribution of the discrete latent variable (for the mixing component) will usually
converge to assign probability one to a single component. The LDA model allows for
a better combination through the continuous latent variables. Perplexity results are
shown in Figure 5.4 and prediction accuracy results are presented in Figure 5.5. For
PS(0) and PS(1), both the LDA and mixture models perform better than the observed model for over 20 game minutes. For other parameter values,
the mixture model performance degrades rapidly, but LDA performance can be better
than observed performance for between 2 and 10 game minutes. In theory, model-based RL agents should benefit from using the LDA model in short 10-minute games
against PS opponents, since they will have a more accurate model earlier in the game.
If the observed model becomes more accurate than the LDA model, they can switch to
using the observed model. In practice, it is not trivial to convert predictive performance
into higher game scores. The reasons for this are examined in the next section.
The second training set, Train2, includes the opponents PS(0), PS(0.25), PS(0.5),
Figure 5.4: Predictive perplexity for modeling PS opponents. The mixture models and
LDA models were trained with two opponents: PS(0) and PS(1). The blue line represents the predictive perplexity of the observed model (the Markov model learned
from scratch). The red line shows mixture model performance. The green line shows
LDA model performance. Results are averaged over 500 trials. The lines show perplexity measured over 30 second intervals, i.e. only the most recent 30 seconds are
considered. For PS(0) and PS(1), the mixture and LDA models perform better than the observed model for over 20 game minutes. As the parameter becomes closer to 0.5, the mixture model performs very poorly, occasionally outperforming the observed
model for a short time. The performance of the LDA model degrades more gracefully,
with performance better than the observed model for several minutes.
[Figure: nine panels, one per opponent from PS(0.0) to PS(1.0), showing perplexity (y-axis) versus game minutes (x-axis) for the observed, LDA, and mixture models.]
Figure 5.5: Percentage prediction accuracy for modeling PS opponents. As in Figure
5.4, the mixture models and LDA models were trained with two opponents: PS(0) and
PS(1). The blue line represents the observed model. The red line shows mixture model
performance. The green line shows LDA model performance. Results are averaged
over 500 trials. Prediction accuracy is measured over the most recent 30 seconds of
game time. For PS(0) and PS(1), the mixture and LDA models perform better than
the observed models for over 20 game minutes. As the parameter becomes closer to
0.5, the mixture model performs very poorly, but the performance of the LDA model
is much better. The observed model performance usually surpasses the LDA model
performance eventually.
[Figure: nine panels, one per opponent from PS(0.0) to PS(1.0), showing percent prediction accuracy (y-axis) versus game minutes (x-axis) for the observed, LDA, and mixture models.]
Figure 5.6: This figure shows predictive perplexity and prediction accuracy for PS(0.5).
The mixture and LDA models were trained with five opponents: PS(0), PS(0.25),
PS(0.5), PS(0.75), PS(1.0). Performance is only shown for PS(0.5), but the results
are virtually identical for all parameter values shown in Figures 5.4 and 5.5. Note that
performance is much better compared to when PS(0) and PS(1) alone were used for
training. The mixture model seems to have the best perplexity by a small margin. This
could be due to the approximation technique employed for the LDA model.
[Figure: two panels for PS(0.5), showing perplexity and percent prediction accuracy (y-axes) versus game minutes (x-axis) for the observed, LDA, and mixture models.]
PS(0.75) and PS(1.0). This set includes a wider range of PS class opponents than
Train1. Both mixture models and LDA models trained from this set have good performance for all PS opponents. Usually PS opponents with other parameter values are
similar enough to one of these for predictions to be good. This is shown in Figure 5.6,
which shows typical prediction performance for PS opponents. Note that predictions of
these models are not very useful when the opponent is a reinforcement learning agent.
In that case, it is usually advantageous to employ the observed model immediately.
The perplexity of the LDA model is typically 2 times greater than the perplexity of the
observed model, while the mixture model perplexity is usually 5 times greater. This
happens when using models trained with Train1 and Train2. This was measured after
the RL agents converged to fixed policies. Predictions were only based on data obtained after convergence, so the observed model was performing better than the LDA model even with a very small amount of data.
In order to attempt to make LDA predictions useful for RL opponents, the training
set Train3 was created. It includes all of the opponents in Train2 (PS(0), PS(0.25),
PS(0.5), PS(0.75), PS(1.0)), with the addition of five policies that were learned by
Sarsa(λ) agents in long games against the PS opponents in Train2. After these games
were completed, the policies of the Sarsa(λ) agents were fixed. This was an attempt
to improve modeling performance for RL agents, by extending the training set with
typical counter strategies that may be learned by RL agents. This improved the performance of the LDA models relative to the observed models, but the LDA model
perplexity was still 1.5 times higher than the observed model perplexity on average. It
seems that the LDA models are not too useful for modeling RL opponents. Perhaps
this would have worked better if the training set had included a significantly larger
number of RL opponents.
In general, the LDA models work very well when a new opponent actually fits the
model, i.e. the transition matrix of their Markov model can be written as a convex
combination of the transition matrices of the LDA components. In this case, the predictions of the LDA model are nearly as good as the predictions of the actual Markov
model. A small decrease in performance may be due to the approximation method
used for LDA. It could also be due to the amount of data needed to estimate the latent
variables. For the Markov models, there are no latent variables so predictions do not
get better over time as they do with the LDA model. Figures 5.7, 5.8 and 5.9 illustrate
these points through the use of opponent models randomly generated from an arbitrary
5-component LDA distribution. Predictive performance on data generated by these
opponent models is shown for various types of models. Additionally, figure 5.10 investigates the entropy (Shannon, 1948) of the transition matrices learned by the LDA
and mixture model learning algorithms.
5.4 Reinforcement Learning Evaluation
This section examines the in-game performance of model-free and model-based reinforcement learning agents in terms of final scores. All results in this section are
for games lasting 10 game-minutes. Note that games lasting 10 game-minutes can be
simulated at up to 700 times faster than real-time (on a 2.4 GHz machine), i.e. such a
game could be simulated in less than a second. We say that there is a speed-up of 700 in
this case. For model-free agents, this is typically possible. However, for model-based
agents, the speed-up can be much lower depending on how much planning is done.
Figure 5.7: Average predictive perplexity for 15 training opponents from a 5-component
LDA distribution (the “actual” LDA distribution). The LDA component transition matrices
were generated randomly, then 15 Markov opponent models (the “actual” training models) were created such that their transition matrices were random convex combinations
of the components. The straight dark blue line shows the average predictive perplexity
for the true Markov models, which does not depend on k (the x-axis). The straight green
line shows the perplexity for the true LDA model (which consists of the component matrices). Having this model is nearly as good as having the actual 15 Markov models.
The red line shows the perplexity of a k-component LDA model trained on data generated by the 15 training opponents. With more data, the performance of the learned LDA
model could approach the actual LDA model. The light blue line shows the perplexity of
a k-component mixture model on the training data. The learned mixture models are not
useful, as expected, because the data is generated from an LDA distribution and not a
mixture distribution. Note that performance would be much better if opponents were actually generated from a mixture model, by choosing a single mixture model component
rather than combining the models. In this case, mixture model and LDA performance
would be very similar.
[Figure: predictive perplexity (y-axis) versus number of components k from 3 to 15 (x-axis), with lines for the actual models, the actual 5-component LDA model, the learned LDA model, and the learned mixture model.]
Figure 5.8: Predictive perplexity for LDA test opponents. Similar to Figure 5.7, but the
test opponents are not from the training set, i.e. their transition matrices are novel convex combinations of the component matrices. Having the true 5-component LDA model
is nearly as good as having the individual transition matrices of the new opponents. For
the learned LDA model, the performance on the new opponents is slightly worse than
the old opponents. This shows the learned LDA distribution works well for opponents
that fit the model but were not in the training set. Predictive perplexity of the mixture
model is worse than in Figure 5.7.
[Figure: predictive perplexity (y-axis) versus number of components k from 3 to 15 (x-axis), with lines for the actual models, the actual 5-component LDA model, the learned LDA model, and the learned mixture model.]
Figure 5.9: This figure shows predictive perplexity when new opponents are generated
independently of the LDA distribution. In this case, neither the LDA nor the mixture
models are useful for prediction.
[Figure: predictive perplexity (y-axis) versus number of components k from 3 to 15 (x-axis), with lines for the actual models, the learned LDA model, and the learned mixture model.]
Recall that planning refers to updating value function estimates based on the models.
Simply calculating the LDA gamma parameters (the mixing coefficients) at each timestep reduces the speed-up by a factor of 10. As the amount of planning increases, the
speed-up can reduce to between 1 and 10. Therefore significantly more computation
is required to simulate these games. If the agents were playing games in real-time, this
would not be an issue since planning can be reduced until the agents make decisions
fast enough. The speed-up factor was an issue in this project; it restricted the number
of simulated games which could be run for each experiment. Because the game has a
high level of stochasticity, experimental results should ideally be averaged over 100 or
more trials. Due to computational constraints, some games could only be run 10 times.
5.4.1 Model-Free Reinforcement Learning Agents
The first experiment involves the model-free Sarsa(λ) agents. In long "training" games (1000 game-hours), policies were learned against five PS class opponents (parameter values 0.0, 0.25, 0.5, 0.75, and 1.0). Reasonable values for the
Sarsa(λ) parameters found in the past project (Collins and Rovatsos, 2006) were used
in these games. The values were λ = 0.7, learning rate α = 0.003, discount factor
Figure 5.10: This figure shows the entropy of the rows of the transition matrices of the
models from Figure 5.8. The entropy in binary digits (bits) of a probability distribution
p(x) is H(x) = − ∑x p(x) log2 p(x) (Shannon, 1948). It is the theoretical minimum number of bits needed to encode the result, on average, of sampling from the distribution.
Entropy is zero if a single value has probability one. It is maximal when all possible values have uniform probability. Each row of a transition matrix represents a probability
distribution over actions for a given situation. The average entropy is calculated over the
rows of a transition matrix. Then, the average results over all transition matrices in a set
are presented. Results are shown ± one standard deviation. Low entropy LDA components (green) are generated by sampling their rows from a Dirichlet prior with {αk } < 1.
Convex combinations of these components then result in higher entropy “actual” models
(dark blue). Note that combining low entropy models naturally results in higher entropy
models. Data is then generated from these actual models and mixture and LDA models
are learned for different numbers of components k. As shown below, the component
transition matrices of the learned mixture models have high entropy. The matrices of
the learned LDA model have lower entropy, which is closer to the entropy of the original
LDA components. The entropy of the learned LDA model’s matrices is lower than the
entropy of the actual models. Thus, by learning the LDA model, simpler (low entropy)
components have been learned from the higher entropy individual actual models.
[Figure: average entropy in bits per transition-matrix row (y-axis), with error bars of one standard deviation, versus number of components k from 3 to 15 (x-axis), for the actual models, the actual 5-component LDA model, the learned LDA model, and the learned mixture model.]
γ = 0.95. Softmax exploration was used with exploration temperature τ decreasing linearly from 1.0 to 0.0 over the course of the game. While these policies were learned,
opponent models (Markov models) were also learned for each of the PS opponents.
Given an infinite amount of computation time, the policy learned against an agent will
perform at least as well against that agent compared to a policy learned against a different agent. In practice, this did not always occur; sometimes policies learned against other agents worked slightly better. This will be discussed further later in this
section. Three methods have been tested for deciding which policy to follow in a game
against a new opponent. Usually the identity of the new opponent is not known and
must be inferred from data. The greedy approach is to employ the policy that corresponds to the most likely opponent model. Note that each policy produces a probability distribution over actions for each situation. The voting approach is to weight
these probabilities by the likelihood of each opponent model. The third approach uses
a priori knowledge of the opponent’s identity and employs the corresponding policy.
It is not realistic to have this knowledge, but the method is included as a baseline performance measure. The approach is labeled single in the figures because it only makes
use of a single policy.
The results of these methods, averaged over 100 games, are shown in Figure 5.11 for
PS opponents with parameter values 0.0, 0.25, 0.5, 0.75, and 1.0. I did not find a significant difference in performance between the methods. In these games, which lasted
10 game-minutes each, opponents were usually identified after 30 - 300 game-seconds.
The opponents with extreme parameter values were identified quickly. Around 5 game-minutes were needed to discriminate PS(0.5) and PS(0.75), which have similar behavior. An agent's performance increases as the defensiveness parameter of its PS opponent increases. Figure 5.11(b) shows that the
agents usually score similar numbers of goals against all PS opponents, but the more
defensive PS opponents (with higher parameters) score fewer goals since they are busy
defending their goals. The more defensive opponents are evidently not very good at
preventing the other player from scoring, since the RL agent's final score
is largely independent of the defensiveness of the PS opponent. This seems to be because
the low-level controllers are not of high enough quality. I attempted to fix this by
adding new controllers, but the new ones were so good that goals were very rarely
scored. Due to this, I focused on the original set of controllers since I knew that learning a strategy with them was possible. Figure 5.12 shows the results of games when
policies learned in long games against specific opponents are used. These are the same
“training” games described earlier in this paragraph. Policies learned against PS(0.25),
PS(0.5), and PS(0.75) essentially work equally well against all of the other opponents.
Using a different policy from the set doesn’t produce significantly different results. In
contrast, the policies learned against PS(0) and PS(1) perform poorly against PS
opponents with other parameter values. The reason seems to be that in both cases,
certain situations never occur in training because the training opponents never make
certain choices. Poor performance then results when these situations are encountered
in testing, since the agent did not learn about them in training and the new games are
too short to learn how to deal with this.
5.4.2 Model-Based Reinforcement Learning Agents
Two types of model-based reinforcement learning agents (Chapter 4) are evaluated in
this section – Value Iteration (VI) agents and Dyna-Q agents. All the games in this section are against PS opponents. Both of these agents used the same environment model,
which was learned in a game lasting 20,000 game-hours that took 2 days to compute. The learned
environment model covers 1294 states, but only 630 of these ever occurred in practice. Agents
in this game followed a range of strategies to cover all behaviors. For this game, it is
necessary to use an opponent model in conjunction with the environment model. An
expected opponent decision must be used as input to the environment model, because of
the way it was designed (described in Chapter 4). Three different types of opponent
models are used throughout the remainder of this chapter: observed models, which use
only the current game history; mixture models trained with training set Train1; and
LDA models trained with the same training set. Recall that training set Train1 only includes opponents PS(0) and PS(1). This training set was chosen because a predictive performance
difference can be seen between the LDA and mixture models for these PS opponents
(Figure 5.4). For the other training sets, any difference in performance is minimal. I
was expecting that the model-based RL agents using the LDA model would have the
highest performance in the 10 minute test games, since they typically have the best
predictive performance. In practice, I did not find any significant difference between
the use of the three different opponent models. I also tested values of γ between
0.9 and 0.999 for both types of model-based agents and did not find a significant difference. The VI agent generally performed better than the Dyna-Q agent and required
less computation. For the Dyna-Q agent, there were issues in setting the learning rate,
Figure 5.11: Game results when using policies learned with Sarsa(λ) in games against
PS opponents. Results are shown ± one standard deviation. The policies were learned
in long previous games against various other PS opponents. The strategies for combining these policies are shown: single (use the policy learned for the specific opponent),
greedy (use the policy corresponding to the most likely opponent model), and voting
(weight action probabilities by opponent model likelihood). All three strategies have virtually the same game performance. Part (a) shows the average difference in final score
in favor of the RL agent. Note that the overall performance increases for increasing
parameter values. Part (b) of the figure shows the average final score for each player.
Solid lines represent the RL player’s score and dashed lines represent the opponent’s
score. It seems that the final scores for the RL players are usually similar, and the PS
opponents score fewer goals as the parameter value increases (due to their increased
defensiveness).
[Figure 5.11 plots: (a) score difference in favor of the learning agent and (b) final score for each player, both against the PS defensiveness parameter (0 to 1), with lines for the greedy, voting, and single strategies.]
Figure 5.12: Game results when using policies learned with Sarsa(λ) in games against
PS opponents. Results are shown ± one standard deviation. The performance of
the greedy strategy from Figure 5.11 is shown, along with the results of using policies
learned in games against specific “training” opponents. The parameter values of the
training opponents are shown in the legend. For policies learned against training opponent PS(0), performance decreases rapidly as the parameter of the test opponent
(x-axis) changes. The same holds for policies learned against training opponent PS(1).
The performance of all other policies is similar across the range of test opponents. Performance is usually higher as the defensiveness parameter increases and the opponent
scores fewer goals. The policy learned against training player PS(0.5) seems to be better than using the greedy policy. At any given time, the greedy policy uses one of the
other policies shown in this figure by selecting the one with the most likely opponent
model. The low performance of the policies learned against PS(0) and PS(1) is most
likely due to certain situations never occurring in training, i.e. the state space wasn't exhausted in training because the opponents were particularly simple.
[Figure 5.12 plot: score difference in favor of the learning agent against the PS defensiveness parameter (0 to 1), with lines for policies trained against PS(0), PS(0.25), PS(0.5), PS(0.75), and PS(1.0), and for the greedy strategy.]
which is not a parameter for the VI agent. Note that exploration is not used by any
of the model-based agents; at each time-step they simply select the action which is
estimated to be optimal. I did not expect exploration to be very useful in games of
such short length. There would not be enough time to learn anything significant and
the agents already had reasonable opponent and environment models. Note that the
quality of the opponent model prediction typically improves over time, but explicit exploration is not used to facilitate this. In some cases, it may be useful to test how an
opponent responds in certain situations. I found that model-free Sarsa(λ) agents (using
previously learned policies) outperformed the model-based RL agents, which did not
make use of previously learned policies or value functions. The model-based agents
only made use of previously learned environment and (mixture and LDA) opponent
models.
I believe that the quality of the environment model was the main factor which prevented
the agents from making full use of their accurate predictive models. The environment
model was not very reliable at predicting the next game state, with predictive perplexity
around 2.5, due to the fact that the discrete state space was coarse-grained. Given much
more detailed information about the underlying game states, predictions could be much
more accurate. Much of the stochasticity of the game is due to the agents not having full
information about the underlying continuous game state. I thought that it might not
matter that the environment models were coarse-grained, since the agents use the same
state space internally. It seems that it would be possible for an agent using a coarse-grained model to perform as well as an agent using model-free RL with the same
coarse-grained state space. It is possible that the model-based RL agents would have
performed better if the environment model had been trained for significantly longer.
I noticed that the model-based RL agent performance was reduced significantly when
using an environment model that was trained for a shorter length of time (5000 game
hours compared to 20000). This suggests that, ideally, the environment model should
have been trained for a significantly longer amount of time.
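For reference, predictive perplexity, as quoted above for the environment model, follows directly from the average log-probability the model assigns to what actually happened. The sketch below is only an illustration of the standard definition, not the evaluation code used in the project.

    import numpy as np

    def predictive_perplexity(probs_of_observed_events):
        """2 raised to the average negative log2 probability assigned to the events
        that actually occurred. A perplexity of about 2.5 means the model is, on
        average, roughly as uncertain as a uniform choice over 2.5 outcomes."""
        p = np.asarray(probs_of_observed_events, dtype=float)
        return float(2.0 ** (-np.mean(np.log2(p))))

    # Example: a model that always spreads its belief uniformly over 4 outcomes
    # has perplexity 4, whatever actually happens.
    print(predictive_perplexity([0.25] * 100))   # -> 4.0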
For the VI agent, a parameter was the number n of game-minutes between runs of
the VI algorithm. The algorithm is run to convergence each time using the current
predictive opponent model (given the observed data). At the beginning of a game, the
VI algorithm is run until convergence. For n = 1, the algorithm is run ten times total
for a ten minute game. Note that it is also possible to run single iterations (sweeps)
without waiting for convergence. For this game, results did not change significantly
after the initial convergence at the beginning of the game. In theory, re-running the algorithm would be
helpful as the predictions of the opponent model change. The performance of the VI
agent in games against PS opponents is shown in Figure 5.13. The figure shows that the
performance appears to be independent of the opponent model that is used. The value
n is averaged over. The figure also shows that the Sarsa(λ) agents, which learned their
policies in long training games, outperformed the VI agents slightly. As previously
discussed, I suspect that the main reason for this is that the environment model was of
insufficient quality. Note that Sarsa(λ) agents starting with no initial knowledge are completely incapable of winning
a game lasting ten game-minutes, regardless of their parameter settings. The VI agents,
which do not make use of previous policies or value function estimates, are always able
to win these games against PS opponents, regardless of their parameter settings. Figure
5.14 shows what happens as the parameter n decreases and the VI algorithm is run to
convergence more often. This should be helpful as the predictive opponent model is
changing and becoming more accurate. Surprisingly, results were not significantly
different when this was done. It seems that the VI agents essentially learn the same
policy regardless of the opponent model. The strategy they learn seems to be mostly an
aggressive policy, but it is also defensive at times. This could be a close approximation
of a single policy that is optimal in all situations in the simplified game (with the
discrete state and action space). Unfortunately, it is difficult to know whether or not
this is the case because of the complexity of the game. The initial intuition was that
although there may be an equilibrium policy that could never be defeated on average,
it would be possible to do better against specific opponents. This should be possible
when the opponents do not follow optimal policies. The game was chosen because
it was complex enough that the opponent modeling techniques would be useful for
stationary opponents, but the consequence is that learning in the game is difficult.
Overall, the value iteration method has a lot of potential, but making effective use
of an accurate opponent model in a real game is non-trivial. Converting prediction
performance into actual in-game performance is not a simple process in a complex
game.
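To make the procedure concrete, the sketch below shows one way value iteration can be combined with an environment model and a predictive opponent model, in the spirit of the approach described in Chapter 4. The array layout and names are assumptions for illustration only, not the project's implementation; the whole routine would be re-run every n game-minutes with the updated opponent model.

    import numpy as np

    def plan_with_opponent_model(env_model, opponent_probs, rewards,
                                 gamma=0.95, tol=1e-6):
        """env_model[s, a, o, s2]: P(s2 | state s, own action a, opponent action o)
        opponent_probs[s, o]:   opponent model's current P(o | s)
        rewards[s, a]:          expected immediate reward
        Returns a greedy policy (one action per state) and the state values."""
        # Marginalise the opponent action out of the environment model:
        # P(s2 | s, a) = sum_o P(o | s) * P(s2 | s, a, o)
        trans = np.einsum('so,saot->sat', opponent_probs, env_model)

        values = np.zeros(env_model.shape[0])
        while True:                                 # run to convergence
            q = rewards + gamma * trans @ values    # Q[s, a]
            new_values = q.max(axis=1)
            if np.max(np.abs(new_values - values)) < tol:
                break
            values = new_values
        return q.argmax(axis=1), values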
Overall, the Dyna-Q agent did not perform as well as the VI agent. Figure 5.15 shows
Dyna-Q performance in games against PS opponents, compared to the performance of
VI and Sarsa(λ) agents. As noted for the VI agent, the choice of opponent model does
not significantly affect the results. This method was computationally very expensive
for n = 5000 planning steps, which was the number required for the Dyna-Q agent to
Figure 5.13: This figure shows the performance of the value iteration (VI) agent against
PS opponents in terms of final score difference. Results are shown ± one standard
deviation. The different lines correspond to the type of opponent model used. Note that
an opponent model must be used in conjunction with the game (environment) model.
Results are for games of 10 game-minutes. They are averaged over 60 trials, where
the number of game-minutes between VI algorithm runs was averaged over. No significant difference was found when the different types of opponent models were used.
The performance of the greedy Sarsa(λ) agent discussed previously is included for
comparison. Greedy Sarsa(λ) performs better, but that agent had experience in
long training games and is completely incapable of winning a 10-minute game with no
initial knowledge. The VI agent has no initial value estimates, only initial models.
[Figure 5.13 plot: score difference in favor of the learning agent against the PS defensiveness parameter (0 to 1), with lines for the LDA, mixture, and observed opponent models and for the greedy Sarsa(λ) agent.]
Figure 5.14: This figure shows the performance of the value iteration (VI) agent against
PS opponents in terms of final score difference. Results are shown ± one standard
deviation. The different lines correspond to the number of game-minutes n between VI
algorithm runs. When n is 10, the algorithm is run once in a ten minute game. For n = 3
it is run 4 times and for n = 1 it is run 10 times. Other parameters are averaged over.
There is no significant difference in performance. The agent seems to essentially learn the same policy regardless of the value of n.
[Figure 5.14 plot: score difference in favor of the learning agent against the PS defensiveness parameter (0 to 1), with lines for n = 10, 3, and 1 game-minutes between VI runs.]
defeat PS(0). This limited the number of experiments that could be run, so the differences in
performance between the opponent models are not statistically significant. Figure 5.16
shows that the performance of the Dyna-Q agent increases as the number of planning
steps is increased. For the Dyna-Q agent, the learning rate α must be set. This is
not necessary for VI, where updates are obtained directly from the Bellman optimality
equations as described in Chapter 4. The solution adopted when updating an estimated
Q-value was to set α equal to the inverse of the number of times the Q-value had
been updated. The result is that the Q-value estimates are averaged. This allows an
agent to make full use of their observations immediately. This may not be desired as
the predictions of the opponent model change, because updates made further into the
future have smaller weight. A strategy for dealing with this is to set α to a small
constant, but this led to reduced performance in short games. The VI algorithm avoids
this issue entirely, but has the problem that completing a single full sweep of the state
space may take a prohibitively long time.
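A minimal sketch of this averaging update (illustrative only; the tables and names are assumptions): keeping a visit count per state-action pair and setting α to its inverse makes each Q-value the running average of the sampled targets. Replacing α with a small constant instead weights recent targets more heavily, which is attractive while the opponent model is still changing, but as noted above it reduced performance in these short games.

    from collections import defaultdict

    Q = defaultdict(float)        # Q[(state, action)] -> value estimate
    counts = defaultdict(int)     # number of updates applied to each Q-value

    def dyna_q_update(state, action, reward, next_state, next_actions, gamma=0.95):
        """One Q-learning style backup with alpha = 1 / (number of updates so far),
        so the estimate is the average of all targets sampled for this pair."""
        counts[(state, action)] += 1
        alpha = 1.0 / counts[(state, action)]
        target = reward + gamma * max(Q[(next_state, a)] for a in next_actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])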
Self-play between RL agents has not been evaluated due to technical issues. In the
previous project, Sarsa(λ) agents were able to learn reasonable policies with no initial
knowledge in games where both agents use the same algorithm. I expect that the
model-based RL agents will perform reasonably well against each other, since they
compute viable policies almost immediately, which are capable of scoring goals. The
other player will have a hard time preventing this, because of the limited selection
of effective defensive actions. Therefore, I expect the results to be fairly balanced.
Because the policies of the model-based RL agents don’t seem to depend on their
opponent for this game, I don't expect anything particularly different to happen in self-play. Self-play would be interesting to look at for model-based RL agents in other
domains, where the opponent model has an observable effect on a player's behavior.
Figure 5.15: This figure shows the performance of the Dyna-Q agent in games against
PS opponents, compared to the VI and Greedy Sarsa(λ) agents. Results are shown ±
one standard deviation. The greedy Sarsa(λ) agent performs best as previously noted.
The Dyna-Q agent does not perform better than the VI agent. The Dyna-Q agent used
n = 5000 planning steps, which took a very long time to run, and results for Dyna-Q are only averaged over 10 games. Results are shown for the different opponent
models used by Dyna-Q. These results are not significantly different because of the small number of trials.
[Figure 5.15 plot: score difference in favor of the learning agent against the PS defensiveness parameter (0 to 1), with lines for the Dyna-Q agent using the LDA, mixture, and observed opponent models, and for the value iteration and greedy Sarsa(λ) agents.]
Figure 5.16: This figure shows the performance of the Dyna-Q agent in games against
PS opponents. Results are shown ± one standard deviation. Each line represents
results for a given number of planning steps n. Note that performance increases as n
increases, but does not exceed the performance of VI as shown in Figure 5.15. Results
are averaged over 30 games, which use the three types of opponent models. Note that the amount of computation time increases linearly as n is increased.
[Figure 5.16 plot: score difference in favor of the learning agent against the PS defensiveness parameter (0 to 1), with lines for n = 50, 100, 500, 1000, and 5000 planning steps.]
Chapter 6
Conclusion
This dissertation examined the potential for combining opponent modeling and model-based RL in a complex two-player competitive domain. For some groups of stationary
opponents, I showed that LDA on Markov chains can produce good predictive models for novel opponents when trained on data from other opponents. I also showed
that when an LDA model is used as a generative model to create an opponent population, a good approximation for the underlying model can be recovered by variational
methods. Furthermore, the learned model is useful as a predictive model for novel opponents generated from the original LDA model. It would be interesting to investigate
when these types of models arise naturally. The intuition was that a population of opponents would base their strategies on simpler elements and combine these elements
in different proportions. In this case, the simpler elements are global Markov models,
which define an agent’s action distribution for all situations. It would be interesting to
investigate whether local models could be combined instead. This would eliminate the
need to define default behavior for all situations and allow strategy elements to be specialized to specific situations. A possible method for accomplishing this is to partition
the space of situations and learn separate LDA models for each partition. This would
not be ideal, since the partitioning would need to be known in advance. It would also
be interesting to look at opponents with hidden internal states. In this case, Hidden
Markov Models (HMMs) (Rabiner, 1989) would replace the Markov chains used in
this report. Additional latent parameters would represent the hidden state of the opponent.
The model-based RL agents using Dyna-Q and value iteration (VI) algorithms were
able to learn to play the game in a short amount of time using environment and opponent
models. They were able to do this based purely on models, without any actual
in-game experience. In contrast, the model-free Sarsa(λ) agent performed better overall, but only after large amounts of in-game experience. It is not realistic to play
games against new opponents for such extended lengths of time. The model-based RL
agents have the potential to adapt to new opponents in a short time, but this potential
was not realized in the specific domain used for this project. The learned policies were
largely independent of the opponent model. It is possible that modeling the opponent
is not important in this specific domain. It is also possible that the quality of the environment model was not high enough to fully make use of good predictive opponent
models. For these reasons, it is worth investigating the same methods in a simpler
domain where the results would be easier to interpret.
I would highly recommend further research on the techniques used in the project,
which have great potential for success. The specific game used in this project is not
an ideal candidate for further study on this topic, since it is complex and not well-understood. The game was simplified for the project by reducing the continuous state
and action spaces to discrete spaces. Unfortunately, it is not clear whether a player
should modify their policy based on the opponent’s behavior or employ a policy which
is independent of the opponent. It should be possible to engineer a situation where the
influence of the opponent’s behavior on the results can be strictly measured and controlled. It would be advantageous to study these techniques in a simpler environment,
where a perfect environment model is known a priori. This would eliminate the link
between the results and the quality of the environment model, thereby making the results easier to interpret and providing more insight into the underlying issues. It would
also be helpful to look at domains which have been studied extensively and are better
understood. There is much interesting work to be done in both opponent modeling
and model-based reinforcement learning. I strongly believe that, eventually, these two
classes of methods will be combined to produce agents that can quickly adapt to novel
opponents in multi-player game situations.
Bibliography
Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using inaccurate models in reinforcement learning. In ICML ’06: Proceedings of the 23rd international conference on
Machine learning, pages 1–8, New York, NY, USA. ACM Press.
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Billings, D., Davidson, A., Schauenberg, T., Burch, N., Bowling, M., Holte, R. C.,
Schaeffer, J., and Szafron, D. (2004). Game-tree search with adaptation in stochastic
imperfect-information games. In van den Herik, H. J., Björnsson, Y., and Netanyahu,
N. S., editors, Computers and Games, volume 3846 of Lecture Notes in Computer
Science, pages 21–34. Springer.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Cadez, I., Heckerman, D., Meek, C., Smyth, P., and White, S. (2000). Model-based
clustering and visualization of navigation patterns on a web site. Technical Report
MSR-TR-00-18, Microsoft Research.
Carmel, D. and Markovich, S. (1996). Learning models of intelligent agents. In
Shrobe, H. and Senator, T., editors, Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial
Intelligence Conference, Vol. 2, pages 62–67, Menlo Park, California. AAAI Press.
Collins, B. and Rovatsos, M. (2006). Using hierarchical machine learning to improve
player satisfaction in a soccer videogame. In Proceedings of the SAB 2006 Workshop on Adaptive Approaches for Optimizing Player Satisfaction in Computer and
Physical Games, Rome.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
Series B, 39(1):1–38.
Girolami, M. and Kaban, A. (2005). Sequential activity profiling: Latent Dirichlet
allocation of Markov chains. Data Mining and Knowledge Discovery, 10:175–196.
Kalyanakrishnan, S., Stone, P., and Liu, Y. (2008). Model-based reinforcement learning in a complex domain. In Visser, U., Ribeiro, F., Ohashi, T., and Dellaert, F.,
editors, RoboCup-2007: Robot Soccer World Cup XI. Springer Verlag, Berlin. To
appear.
Marín, C. A., Peña Castillo, L., and Garrido, L. (2005). Dynamic adaptive opponent
modeling: Predicting opponent motion while playing soccer. In Alonso, E. and
Guessoum, Z., editors, Fifth European Workshop on Adaptive Agents and Multiagent Systems, Paris, France.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and
Liang, E. (2004). Inverted autonomous helicopter flight via reinforcement learning.
In International Symposium on Experimental Robotics.
Puterman, M. L. and Shin, M. C. (1978). Modified policy iteration algorithms for
discounted Markov decision problems. Management Science, 24(11):1127–1137.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications
in speech recognition. Proceedings of the IEEE, 77(2).
Rovatsos, M., Wei, G., and Wolf, M. (2003). Multiagent learning for open systems: A
study in opponent classification. Adaptive Agents and Multi-Agent Systems, LNAI
2636.
Rovatsos, M. and Wolf, M. (2002). Towards social complexity reduction in multiagent
learning: the adhoc approach. Technical Report SS-02-02, AAAI Press, Stanford,
CA.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27:379–423, 623–656.
Shoham, Y., Powers, R., and Grenager, T. (2003). Multi-agent reinforcement learning:
a critical survey. Technical report, Stanford University.
Southey, F., Bowling, M., Larson, B., Piccione, C., Burch, N., Billings, D., and Rayner,
D. C. (2005). Bayes’ bluff: Opponent modelling in poker. In UAI, pages 550–558.
AUAI Press.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT
Press.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2004). Hierarchical Dirichlet
processes. Technical Report 653, Department of Statistics, University of California
at Berkeley.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge
University, Cambridge, England.
Wilson, A., Fern, A., Ray, S., and Tadepalli, P. (2007). Multi-task reinforcement learning: a hierarchical Bayesian approach. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 1015–1022, New York, NY, USA.
ACM Press.