THE APPLICATION OF REINFORCEMENT LEARNING TO THE SIMPLE SOCCER GAME
Title: The Application of Reinforcement Learning to the Simple Soccer Game
For Qualification of: Master of Science
Author: Eamon Costello
Date: August 2007
Supervisor: Dr. Michael Madden
Department: Department of Information Technology
Head of Department: Professor Gerard Lyons
Institution: National University of Ireland, Galway
Abstract
This thesis examines Reinforcement Learning (RL) and its application to computer-simulated soccer games. RL itself is explained and its main points detailed. Research into computer-simulated soccer is then summarized. The Simple Soccer game is introduced and an RL technique (tabular Q-learning) is applied to it. Aspects of the research literature are reviewed in the context of Simple Soccer, such as the issues involved in designing and constructing RL experiments using games. Finally, the results of the application are presented and an analysis of the findings is made.
Contents

Abstract
1 Reinforcement Learning (RL)
  1.1 RL Introductory Literature
  1.2 The RL Problem
    1.2.1 Definition
    1.2.2 Rewards
    1.2.3 Reward Discounting
    1.2.4 Balancing Exploitation with Exploration
    1.2.5 Dynamic Programming (DP)
    1.2.6 Temporal Difference (TD) Learning
  1.3 Advances in RL and Current Research
    1.3.1 Function Approximation
    1.3.2 Hierarchical RL
    1.3.3 Least Squares
    1.3.4 Relational Reinforcement Learning (RRL)
2 Application of RL to Games
  2.1 Types of Games
  2.2 TD-Backgammon
  2.3 Soccer and RL
    2.3.1 Approaches to Simulated Soccer
    2.3.2 Keepaway
    2.3.3 Half Field Offense
    2.3.4 Progress in Keepaway
  2.4 Summary
3 Implementation and Verification of C++ Learner
  3.1 Design of the Implemented Learner
  3.2 Testing Against CliffWorld
  3.3 GridWorld
    3.3.1 Overview
    3.3.2 Comparison Tests
    3.3.3 Findings and Implementation Revision
4 Application of RL to Simple Soccer
  4.1 Simple Soccer
    4.1.1 Overview
    4.1.2 States
    4.1.3 Actions
  4.2 Application of RL to Simple Soccer
    4.2.1 Differences with GridWorlds
      4.2.1.1 Adversarial
      4.2.1.2 Complex Object Environment
      4.2.1.3 Real-Time Processing
    4.2.2 Benchmarking Results
    4.2.3 Simple State/Action Structure for Novel Policy Discovery
      4.2.3.1 States
      4.2.3.2 Actions
      4.2.3.3 Rewards
    4.2.4 Architecture
    4.2.5 Results
      4.2.5.1 Performance
      4.2.5.2 State/Action Representation Findings
    4.2.6 Testing with Reduced Action Space
5 Conclusions
  5.1 Evaluation of Simple Soccer
  5.2 Application of Tabular Q-Learning to Games
References
Appendix A. Simple Soccer Class Diagrams
  SoccerTeam class from Buckland (2005)
  PlayerBase Class from Buckland (2005)
  StateMachine Classes from Buckland (2005)
Appendix B. Functions of the Learner Specific to Simple Soccer
  Functions added to Simple Soccer Code to implement RL
Appendix C. Original Project Plan
  Thesis Statement
  Need for the Project
  Research methodology
  Project Completion
  Project Plan
    Project Phases
    Gantt Chart
1 Reinforcement Learning (RL)
Reinforcement Learning (RL) refers most generally to computational models of learning
from interaction. Kaelbling et al. (1996) define the RL problem as that “faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment”. The concept of trial-and-error learning has intuitive appeal and has analogues in psychology and the behavioural sciences. However, it is the field of computer science that has seen the most research into RL, as an idealized form of learning that can be studied with mathematical models or computational experiments.
1.1 RL Introductory Literature

Sutton and Barto (1998) give one of the most comprehensive reviews of RL, while Kaelbling, Littman and Moore (1996) and Russell and Norvig (2003) provide good overviews. Sutton and Barto’s approach is meticulous in grounding RL within the older areas of research to which it is indebted. In general, Sutton and Barto’s book is accessible and detailed.
1.2 The RL Problem

In an RL system consisting of an agent and its environment, there are four main elements: a policy; a reward function; a value function; and, in some cases, a model of the environment. As the agent interacts with its environment it receives a state signal s. The agent then takes an action a, which effects a state transition to a successive state s'. This transition is non-deterministic, which means here that the same successive state might not be obtained even if the same action is taken in a given preceding state. The agent receives a reward value r for this transition. The agent’s aim, or goal, is to receive a maximum sum of these rewards over its lifetime. As we shall see, this summing of rewards can be formulated in different ways.
1.2.1 Definition

In formal terms the problem can be stated as one that happens over a series of discrete time steps $t = 1, 2, 3, \ldots$. At each $t$ the agent receives a representation of its state $s_t \in S$, where $S$ is the set of all possible states. The agent selects an action $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in $s_t$. The agent receives a reward $r_{t+1} \in \Re$ at all states after the start state, and finds itself in a new state $s_{t+1}$. At each $t$ the agent has a list of probabilities of choosing an action $a$ given a particular state $s$. This is known as the agent’s policy $\pi$, where $\pi_t(s, a)$ is the probability that $a_t = a$ given that $s_t = s$.
1.2.2 Rewards

We have already mentioned that there is more than one way to think about how an agent may maximize its returns. Kaelbling et al. (1996, p. 240) refer to this maximizing of rewards over time as ‘models of optimal behaviour’. Approaches are categorized according to whether there is a defined end-state $S^+$ or whether the task continues indefinitely. If the task continues indefinitely, or the end state is not known, then the sum of rewards must be structured in such a way that future rewards are less desirable than immediate ones. Kaelbling et al. (1996) refer to this scenario of discounted reward as an infinite horizon model, and its terminal-state counterpart as a finite horizon model. In Sutton and Barto’s more widely adopted terminology, these are known as problems characterized by either continuing or episodic tasks respectively (Sutton & Barto, 1998, 3.4).
1.2.3 Reward Discounting

There are two main types of reward discounting. The reward may be diminished over time according to a discount factor γ (where 0 ≤ γ < 1). If we denote the total return at time step t by $R_t$:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (1)$$

Alternatively, the limiting case as γ approaches 1, taken over a horizon of h time steps, gives the average-reward formulation (Bertsekas, 1995):

$$R_h = \frac{1}{h} \sum_{t=0}^{h} r_t \qquad (2)$$
This method has the advantage of not needing a discount factor. However, because of the averaging, it is not easy to distinguish policies in which more rewards are gained in the initial phases (which is often more desirable). For this reason the discount factor γ model has gained the most attention (Kaelbling et al., 1996, p. 241).
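As a brief illustration of the discounted model, suppose γ = 0.9 and the agent receives a constant reward of +1 at every step. The return is then finite even though the task continues indefinitely:

$$R_t = \sum_{k=0}^{\infty} 0.9^k \cdot 1 = \frac{1}{1 - 0.9} = 10$$

A reward received n steps in the future is weighted by $0.9^n$, so the same reward is worth only about a third as much if it arrives ten steps later ($0.9^{10} \approx 0.35$).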
1.2.4 Balancing Exploitation with Exploration

Sutton and Barto (1998) contrast supervised learning with RL by characterizing the latter as a form of learning that evaluates its feedback. The RL agent is not told which action would have been best in the long term after it takes an action; rather, it receives only an immediate reward and a subsequent state. For this reason the agent must choose its actions with care, attempting to exploit strategies that have proved successful but also to explore new ones periodically. This problem has been extensively studied in mathematics and probability in a form that uses only a single state, known as the n-armed bandit problem. In practical terms this is not directly useful to RL, which involves multiple states and associated policy mappings, but researchers have adapted aspects of these theories to the full RL problem.
When an agent chooses the action with the highest estimated value it is said to choose greedily. An alternative is to behave greedily most of the time but every so often, with small probability ε, to select an action at random. This method is known as ε-greedy. It promotes exploration of parts of the state space not currently known to be useful, but which may be. Its downside is that even though the agent may learn a good policy it does not always follow that policy, as the exploration rate is fixed. Neither does it discriminate between the actions that it selects in an exploring step.
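As an illustration, a minimal C++ sketch of ε-greedy selection over a row of tabular Q-values is given below. The function and variable names (epsilonGreedy, qRow) are assumptions for illustration only and do not correspond to the learner implemented later in this thesis.

#include <cstdlib>
#include <vector>

// Minimal sketch of epsilon-greedy selection over one row of a tabular Q function.
// With probability epsilon a random action is explored; otherwise the action with
// the highest estimated value is exploited.
int epsilonGreedy(const std::vector<double>& qRow, double epsilon)
{
    double u = static_cast<double>(std::rand()) / RAND_MAX;
    if (u < epsilon)
        return std::rand() % static_cast<int>(qRow.size());   // explore

    int best = 0;
    for (int a = 1; a < static_cast<int>(qRow.size()); ++a)
        if (qRow[a] > qRow[best]) best = a;
    return best;                                               // exploit
}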
Another approach, known as Boltzmann selection, is again to choose actions probabilistically, but based on a graded function of estimated value. This means that the actions’ relative values are taken into account during selection: if one value is much higher than the others it is more likely to be selected, while if two values are similar they have a more nearly equal chance of being selected. If choosing action a in state s has probability p then:

$$p = \frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}} \qquad (3)$$

where τ is a positive parameter called the temperature. As τ approaches zero the selection becomes effectively greedy, so there is less exploration; a large τ makes the actions nearly equiprobable.
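A corresponding C++ sketch of Boltzmann selection, implementing equation (3), is shown below; again the names are assumptions for illustration rather than code from the thesis implementation.

#include <cmath>
#include <cstdlib>
#include <vector>

// Illustrative Boltzmann (softmax) selection, equation (3): each action's
// probability is proportional to exp(Q(a)/tau). A large tau flattens the
// distribution (more exploration); as tau approaches zero the choice
// becomes effectively greedy.
int boltzmannSelect(const std::vector<double>& qRow, double tau)
{
    std::vector<double> weight(qRow.size());
    double sum = 0.0;
    for (std::size_t a = 0; a < qRow.size(); ++a) {
        weight[a] = std::exp(qRow[a] / tau);
        sum += weight[a];
    }
    double u = sum * static_cast<double>(std::rand()) / RAND_MAX;
    for (std::size_t a = 0; a < qRow.size(); ++a) {
        if (u <= weight[a]) return static_cast<int>(a);
        u -= weight[a];
    }
    return static_cast<int>(qRow.size()) - 1;   // numerical edge case
}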
Kaelbling et al. (1998, 2.2) term these types of approaches “ad hoc” while acknowledging that they are useful in practical terms. This means that some supervision or tuning is required if these techniques are to be used. A more formally justified concept for optimization is Dynamic Programming (DP).
1.2.5 Dynamic Programming (DP)

The value of a state s, following a policy π, can be written $V^\pi(s)$. This is known as the value function. A value function that is greater than or equal to all other value functions is the optimal value function $V^*(s)$. DP is concerned with optimizing the value function. The value function itself can be written:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \;\Big|\; s_t = s\Big\} \qquad (4)$$

where $E_\pi\{\}$ represents the expected value when the policy π is followed at each t, for all $s \in S$. This is known as the state-value function. Given a state and a policy, it returns a value. The optimal state-value function maximizes over actions a:

$$V^*(s) = \max_a E\big\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \big\} \qquad (5)$$

This recursive function is an application of the Bellman equation used in DP, which gives a systematic method for finding optimal solutions (Bellman, cited in Kirk, 2004). Sutton and Barto (1998, 3.7-8) draw upon existing concepts from the literature on Markov Decision Processes (MDPs) to create another representation of this:

$$V^*(s) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^*(s') \big] \qquad (6)$$

This embodies the important assumptions that the state and action sets are finite and that a set of transition probabilities $P^a_{ss'}$ and subsequent rewards $R^a_{ss'}$ is given. In order for this formulation to hold, the environment must also have the Markov property, that is, the state at t + 1 depends only on the state and action at t. The use of the MDP formalism in RL comes from Watkins (1989).
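To make the role of equation (6) concrete, the following is a minimal C++ sketch of value iteration over a small finite MDP whose transition probabilities and expected rewards are assumed to be available as arrays; the names P, R and valueIteration are hypothetical and do not refer to any code discussed later in this thesis.

#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of value iteration based on equation (6), assuming a small finite MDP
// whose transition probabilities P[s][a][s2] and expected rewards R[s][a][s2]
// are known. Iteration stops when the largest value change falls below theta.
std::vector<double> valueIteration(
    const std::vector<std::vector<std::vector<double>>>& P,
    const std::vector<std::vector<std::vector<double>>>& R,
    double gamma, double theta)
{
    const std::size_t nStates  = P.size();
    const std::size_t nActions = P[0].size();
    std::vector<double> V(nStates, 0.0);

    double delta;
    do {
        delta = 0.0;
        for (std::size_t s = 0; s < nStates; ++s) {
            double best = -1e300;
            for (std::size_t a = 0; a < nActions; ++a) {
                double q = 0.0;
                for (std::size_t s2 = 0; s2 < nStates; ++s2)
                    q += P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]);
                best = std::max(best, q);
            }
            delta = std::max(delta, std::fabs(best - V[s]));
            V[s] = best;   // in-place (sweep-by-sweep) update
        }
    } while (delta > theta);
    return V;
}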
Sutton and Barto (1998) identify the DP concept of bootstrapping, where estimates of the values of states are based on estimates of the values of successor states, as an important characteristic of many RL techniques. DP techniques can be used to enumerate both policies and value functions. When these separate processes work together to converge on optimal solutions, this is known as Generalized Policy Iteration (GPI). Policy evaluation is also referred to as the prediction problem, while value iteration is known as the control problem, from its history in control theory (Kaelbling et al., 1998).
The problem with DP is that it assumes that a model of the environment is available. Monte Carlo methods, on the other hand, require only experience, and therefore have characteristics of interest to RL. They sample sequences of states, actions, and rewards from on-line or simulated interaction with an environment. Monte Carlo methods are classified, by the manner in which they explore, into off-policy and on-policy types. An on-policy method always explores while trying to optimize its policy, while an off-policy method also explores but learns a different deterministic, optimal policy that is not necessarily the one it is following. It is these methods, added to the bootstrapping feature of DP, which can be used to classify the most popular RL algorithms currently in use under the heading of Temporal Difference (TD) learning.
1.2.6 Temporal Difference (TD) Learning
Temporal Difference (TD) methods bootstrap in that they estimate values and then
update their estimates based on results. Once a TD learner enters a new state it updates its
estimate with the reward it has received. Unlike Monte Carlo methods, which wait until the
episode is over before apportioning the rewards for each state within that episode, TD
methods assign rewards after each state transition.
Sarsa (Sutton & Barto, 1998, 6.4) is an example of an on-policy TD learning method. Its name comes from the letters s, a, r, s', a', which represent successive state-action pairs and the reward received in the successive state:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big] \qquad (7)$$

Sarsa updates the state-action value of the previous time step based on the current reward and state-action pair, at all times after the initial state and before the terminal state. This is slightly different notation from that used so far: Q(s, a) is used instead of V(s) when we are talking about state-action pairs, but it is otherwise the same concept.
Another, similar-looking algorithm is an off-policy TD method known as Q-learning (Watkins, 1989):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \big] \qquad (8)$$
It can be seen that, unlike Sarsa, this algorithm maximizes over next state action pairs
so that the Q it is learning tends towards the optimal Q*. This algorithm has proved one of
the most popular tabular approaches with researchers. It is proposed here to investigate the
application of this algorithm to a computer game with which it has not been used.
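The practical difference between equations (7) and (8) lies only in the bootstrap target, as the following C++ sketch illustrates; the table type and function names are assumptions for illustration and not part of the implementation described later.

#include <algorithm>
#include <vector>

using QTable = std::vector<std::vector<double>>;

// Sarsa (on-policy): bootstrap on the action actually chosen in the next state.
void sarsaUpdate(QTable& Q, int s, int a, double r, int s2, int a2,
                 double alpha, double gamma)
{
    double target = r + gamma * Q[s2][a2];
    Q[s][a] += alpha * (target - Q[s][a]);
}

// Q-learning (off-policy): bootstrap on the best available next action.
void qLearningUpdate(QTable& Q, int s, int a, double r, int s2,
                     double alpha, double gamma)
{
    double target = r + gamma * *std::max_element(Q[s2].begin(), Q[s2].end());
    Q[s][a] += alpha * (target - Q[s][a]);
}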
1.3 Advances in RL and Current Research

An overview of some RL fundamentals has been given above, along with two sample algorithms. There are many other aspects of research that follow from the application of such algorithms, for example:

• Methods for dealing with continuous state/action spaces, e.g. function approximation.
• Heuristic search (from traditional machine learning techniques) as part of a policy computation.
1.3.1 Function Approximation
One of the problems in applying RL is “the curse of dimensionality” identified by
Bellman (1961). In the context of RL this means that tabular approaches (such as the TD
methods above) are only practical with limited state/action spaces because the computational
complexity grows exponentially as the states increase. One solution to this is function
approximation, where examples of a function (value and/or policy function) are generalized
to give an approximation of the function, instead of a simple table. There are many examples
of this from Machine Learning such as neural networks.
Off-policy tabular Q-learning performs well and is proven to converge (Watkins, 1989), but with function approximation it is more problematic. It has been shown that in certain cases, even with infinite play, the optimal policy may not be found (Baird, 1995). On-policy methods have had more theoretical success with function approximation (Gordon, 2001), but research persists in trying to find sounder off-policy methods combined with function approximation (Precup, Sutton & Dasgupta, 2001).
1.3.2 Hierarchical RL

Research in applying function approximation to off-policy methods is important because off-policy algorithms are used in “multi-scale, multi-goal, learning frameworks” (Precup et al., 2001). These techniques are based on the concept of hierarchical RL, which involves efforts to scale RL to large state spaces by reusing learned subtasks, or behaviours, as primitive actions at the next level. Because composite sub-actions violate the Markov property, hierarchical RL makes use of Semi-Markov Decision Processes (SMDPs). State transitions (successive decisions) are assumed to be instantaneous in MDPs, but in SMDPs this is generalized, in the simplest case so that the amount of time between one decision and the next is a random integer (Barto & Mahadevan, 2003). Several hierarchical RL frameworks exist, such as HAMs (Parr, 1998), options (Precup, Sutton & Singh, 1999) and MAXQ (Dietterich, 2000), which, though different, all rely on the concept of an SMDP.
1.3.3 Least Squares

Another recent theoretical advance in RL is based on least squares, a statistical technique that determines the values of unknown quantities in a model by minimizing the sum of the squared differences between predicted and observed values. Several researchers have developed algorithms based on least squares techniques with success, and there is now a lot of activity in this area (Nedic & Bertsekas, 2003; Yu & Bertsekas, 2006; Geramifard, Bowling & Sutton, 2006). Lagoudakis and Parr (2003), for instance, demonstrate an algorithm for least-squares policy iteration (LSPI) and compare its performance to Watkins’ Q-learning on several tasks. In the more complex task of balancing and riding a bicycle to a target location, LSPI achieves fairly consistent good performance, whereas Q-learning was rarely able to balance the bicycle for more than a small fraction of the time needed to reach the target location.
1.3.4 Relational Reinforcement Learning (RRL)

A criticism of some hierarchical approaches is that if they do not “explicitly represent objects and relationships between them [they] are fundamentally limited in their ability to transfer learning about one object to similar related objects” (Tadepalli, Givan & Driessens, 2004). Relational Reinforcement Learning (RRL) attempts to address this, and one of its aims is better transfer of learning from one task to another. Tadepalli et al. (2004) point to limitations of function approximation techniques, which require careful choice of propositional features (state/action representation) to be hand-crafted for specific tasks.[1]
The field of relational learning also goes broadly under the names of inductive logic programming, relational data mining and probabilistic relational modeling. Q-learning has been combined with relational regression, allowing for structured information in the Q-values (and policies), such as objects with properties and relations. First-order regression trees and Gaussian processes have been used in these approaches as methods of performing computations based on the relational Q-values (Driessens, Ramon & Blockeel, 2001).
[1] Sutton makes an illuminating comment on this issue in the RL FAQ on his website, where he answers a question about the common failure of newcomers attempting to apply backpropagation neural networks: “The primary reason for the failure is that backpropagation is fairly tricky to use effectively, doubly so in an online application like reinforcement learning. It is true that Tesauro used this approach in his strikingly successful backgammon application, but note that at the time of his work with TD-gammon, Tesauro was already an expert in applying backprop networks to backgammon. He had already built the world's best computer player of backgammon using backprop networks. He had already learned all the tricks and tweaks and parameter settings to make backprop networks learn well. Unless you have a similarly extensive background of experience, you are likely to be very frustrated using a backprop network as your function approximator in reinforcement learning.” (Sutton, n.d.)
2 Application of RL to Games

2.1 Types of Games
Researchers take two broad approaches in the application of RL. In one case a simple toy game is created or identified. Sutton and Barto (1998), for instance, make extensive use of a game called gridworld in their canonical introduction to RL.[2] Gridworld typically consists of a grid of cells that represent individual states; four possible actions (left, right, up and down); and a start and end state (with associated reward), or a reward for the transition into each state. These games are particularly suited to expository texts, as their simplicity presents the reader with minimal conceptual overhead. Gridworlds and variants (Madden & Howley, 2004; De Comite, 2005) are useful for exploring fundamental properties of RL algorithms or for illustrating approaches. With these games an idealized environment can be created in which to explore a problem. Although such games may present simplified environments, they can also easily pose challenging problems.
Other research applies RL to existing problems, that is, problems that have not been,
in some sense, specifically devised with the study of RL in mind. This research however is
concerned with questions of the broad utility, through application, of RL research. It also
aims to discover properties of RL that may not be apparent from more limited application. In
contrast to the first approach, the second type generally assumes that the environment is
fixed.
All types of research are done in conjunction with mathematical models. Experiments
by themselves are insufficient for fundamental research. For instance in the field of Search,
Nau (1980) discovered that increasing search depth does not necessarily lead to better
decisions, though the converse was thought to be true from all empirical research at the time
using games such as chess, Go and Backgammon.

[2] The idea of focusing on artificially simple models known as micro-worlds goes back to Minsky and Papert (1970), and the name gridworld is an obvious nod to its famous 3-D antecedent, Minsky’s block worlds.

2.2 TD-Backgammon

One of the most famous applications of RL to games was by Tesauro, whose backgammon player achieved successes in gameplay that far outstripped previous efforts,
which had used supervised learning techniques (Tesauro, 1995). Tesauro’s program went on to play backgammon at the level of the world’s top human players. Tesauro used Sutton’s TD(λ) method for training multilayer neural networks in a program called TD-Gammon. TD-Gammon played against itself and learned from the outcome, training itself to be an evaluation function. Tesauro initially tested TD-Gammon with no hand-crafted domain knowledge (i.e. strategies from human experts) and was surprised to find that it quickly learned simple behaviours, and after more extended runs it was equivalent to the top computer backgammon player of the time. Tesauro recounts that his program “appeared to be capable of automatic ‘feature discovery,’ one of the longstanding goals of game learning research since the time of Samuel”.[3]
Tesauro’s evaluation of the importance of TD-Gammon’s performance was borne out. Campos and Langlois (2004), in an overview of the impact of TD-Backgammon and the subsequent research it influenced, describe it as causing “a small revolution in the field of RL”. They also explain why its technique did not find equivalent success in games such as Go and chess. Backgammon is not a completely deterministic game like chess, as it involves the rolling of dice. Tesauro pointed out that “the learner explores more of the state space than it would in the absence of such a stochastic noise source” and that learners are liable to search pathologies in games in which randomness is not a feature of play, such as Go and chess. The speed of the game-play (TD-Gammon played 1.5 million training games), and a smooth evaluation function appropriate for neural networks, are also cited as advantages of backgammon for RL self-play. It may also be worth noting Sutton’s comments above about Tesauro’s expertise with neural networks. The game which TD-Gammon displaced as world computer champion, and which was used as a supervised learning benchmark, was Tesauro’s own Neurogammon.

[3] Arthur L. Samuel, the machine learning pioneer famous for his research on computer checkers play in 1952.
2.3 Soccer and RL

The game of soccer is an area of much applied AI research. Its popularity is partly due to its suitability for multi-agent robotics study, which typically uses either human sports or insect behaviour as real-world references. The RoboCup framework, which simulates soccer games, is often used as a benchmark for multi-agent simulation and research
challenges. It is described as a “research and educational tool for multi-agent systems and
artificial intelligence” (Kitano et al, 1995). Using this framework, researchers test their
techniques by competing in the annual RoboCup challenge.
The RoboCup Challenge presents three broad challenges focused on:

• Learning
• Real-time planning
• Opponent modelling
Kitano, Tambe and Stone (1997) categorize the challenges according to whether they can be
learned off-line or on-line. Certain skills, particularly those on an individual agent level, are
invariant between games, but others are adversarial and must be learned on-line because
opponents can change between games. Kitano et al (1997) make the point that on-line
adversarial learning techniques must generalize quickly due to the short game length of
RoboCup (20 minutes). It is not surprising, therefore, that many researchers have attempted to apply RL techniques to RoboCup (Balch, 1997; Stone & Veloso, 1999; Stone, Sutton & Singh, 2001).
2.3.1 Approaches to Simulated Soccer
One of the features of soccer is that it is a complex real-time problem. Stone (1998,
2000) has worked on tackling this complexity using hierarchical, layered learning. It is a
modular approach described as allowing for “a bottom-up definition of subtasks at different
levels” (Stone, 2000, p. 92). Stone provides principles and a formalism, for what he terms a
layered learning approach. One of its defining features is that outputs of one layer form the
inputs for another layer. All approaches to soccer can be roughly grouped as forms of a
divide-and-conquer strategy. The identification of the important behaviours and subtasks are
often a significant aspect of this research.
2.3.2 Keepaway
Intercepting the ball, where the agent moves to take possession of the ball, which may
be moving, is an example of a lower level behaviour of an individual player. Keepaway,
where a group of agents attempt to keep the ball from an opposition group, is a higher level
behaviour. During keepaway the agent in possession looks for suitable team-mates to pass to,
while its team-mates attempt to move into good positions to receive the ball. When a pass is
made the agent uses its lower-level interception behaviour to meet the ball. The keepaway
behaviour is in turn an important building block for attaining scoring positions, preventing
the opposition from scoring, and ultimately winning games.
Several researchers have studied the keepaway subtask using TD methods (Stone,
Sutton & Singh, 2001; Valluri & Babu, 2002; Young & Pollani, 2006) and Genetic
Algorithms (GA) (Gustafson & Hsu, 2001; Hsu, Harmon, Rodriguez & Zhong, 2004) or both
(Taylor, Whiteson & Stone, 2006). They often use Stone et al’s (2001) formulation of the
problem as a benchmark, using three keepers and two takers, on a restricted playing area.
This is known as 3 versus 2 keepaway. Stone cites keepaway as challenging for the following
reasons (which also mostly apply to the more general problem of simulated soccer):
• The state space is far too large to explore exhaustively
• Each agent has only partial state information
• The action space is continuous
• Multiple team-mates need to learn simultaneously
(Stone, n.d.)
Although studying keepaway is useful in itself for the reasons outlined above, there is
a separate question of whether teams with good keepaway will perform well in full
tournament match-play. The stated goal of the full robotic version of RoboCup is to beat the
world human soccer champions in a game by 2050 (RoboCup, 2007), showing that
researchers’ ultimate aspirations are wider than 3 versus 2 keepaway. Research in this area is
still at a relatively formative stage and the utility of keepaway, as a component of match-play
proper, has not yet been fully examined. In some sports, such as rugby, field position is often
preferred to possession, and similar tactics are sometimes used in soccer (the so-called “long-ball” game). Researchers have to date followed Stone’s lead in assuming that possession is at
all times desirable.
2.3.3 Half Field Offense

A recent study extended the keepaway task to a larger, novel task called half field offense[4] (Kalyanakrishnan, Liu & Stone, 2007). In half field offense the team attempts to outplay a
defending team in order to shoot, which involves moving the ball up the field while retaining
possession. Like keepaway it has a continuous state space, noisy actions, and multiple agents, but it is a much harder problem because of its sparse rewards, larger state space, richer action set, and the more complex policy required. The best previous keepaway algorithm did not transfer well to half field offense, and the study presented a new algorithm in which the agents communicated during play to speed up learning.

[4] The American spelling is followed here.
2.3.4 Progress in Keepaway

Finally, it is worth noting that Kuhlmann and Stone’s (2004) study showed progress on the keepaway task using the same technique as Stone and Sutton (2001) – SMDP Sarsa(λ) with tile-coding function approximation. Kuhlmann and Stone’s state and action representation was simply better and, with the benefit of developments subsequent to the original study, they had a greater understanding of the problem. They outperformed the results of the first study despite not removing control noise, as Stone and Sutton had originally done to simplify the problem. This result shows the continued importance of hand-coded parameters to RL successes.
Moreover, Kuhlmann and Stone, while finding their study successful, profess themselves unable to compare it directly to the findings of others on keepaway. In one case this was because of differing fitness functions and game dynamics (Gustafson & Hsu, cited in Kuhlmann & Stone, 2004). Another study was more closely correlated (Di Pietro, While & Barone, cited in Kuhlmann & Stone, 2004), using the same high-level behaviours and performance measure (average episode duration), but neither could it be empirically compared, because its "high-level behaviours and basic skills were implemented independently" (Kuhlmann & Stone, 2004, p. 7). This would suggest that comparing the results of applications to complex problems is difficult and that benchmarks such as keepaway do not yet work as well as might be hoped.
2.4 Summary

A survey has been presented here of some of the literature on RL, together with a selection of work relating to its application to games and computer soccer and some of the issues that have arisen. It has been seen that there has been much progress in RL, though as yet not enough to match the high ambitions of its proponents.
3 Implementation and Verification of C++ Learner

3.1 Design of the Implemented Learner

Tabular Q-learning was identified as a straightforward RL technique to implement. The basic algorithm is shown below (Sutton & Barto, 1998, 6.5):

Figure 1 Q-Learning Algorithm

// s, s'  → state and successive state
// a, a'  → action and successive action
// Q      → state-action value
// α, γ   → learning parameters (learning rate, discount factor)
Initialize Q(s, a) arbitrarily
Repeat (for each episode)
    Initialize s
    Repeat (for each step of episode) until s is terminal
        Choose a from s using a policy derived from Q (e.g. ε-greedy)
        Take action a, observe reward r and state s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
The inclusion of a verification stage in the planning phase, where the implemented
learner is tested against some toy world from the literature, determined the initial
architecture. This initial architecture allowed for a simple representation of world, learner and
test as separate objects (see Figure 2).

Figure 2 Initial architecture showing most important operations
3.2 Testing Against CliffWorld

A toy world called CliffWorld, used to demonstrate Q-learning in Sutton and Barto (1998), was identified as a simple sample world to implement and use as a benchmark. CliffWorld is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right and left, as shown in Figure 3 below. Reward is -1 on all transitions except those into the region marked "The Cliff": stepping into this region incurs a reward of -100 and sends the agent instantly back to the start. The lower part of the figure shows the performance of the Sarsa and Q-learning techniques with ε-greedy action selection, where ε = 0.1. After an initial transient phase, Q-learning shows values for the optimal policy, the one that travels right along the edge of the cliff. The learner still occasionally falls off the cliff because of the ε-greedy action selection, even after learning the optimal path.
Figure 3 The cliff-walking task, taken from Sutton and Barto (1998). The results are from a single run, but smoothed.
Bugs (logic errors) were found in the implemented learner when testing against this world. Although this was a simple world, and straightforward to implement, full details of all the Q-learning parameters used were not given. Nor was a detailed list of outputs given, and the smoothing scheme used on the single run shown in the diagram is unspecified. At this point it was decided that another example, which showed complete details, would be beneficial for greater understanding and for debugging the implementation.

A second example world, Poole's GridWorld (see 3.3 below), was identified, which contained full source code of the learner and world (Poole, n.d.). After using Poole's GridWorld to debug and verify the learner, the fully converged optimal path was shown in
CliffWorld after 5 × 10⁴ episodes; see Figure 4 below.

Figure 4 C++ implemented learner after 5 × 10⁴ CliffWorld episodes (optimal path)
3.3 GridWorld

3.3.1 Overview

GridWorld is a Q-learning demonstration program written in Java by David Poole of the University of British Columbia. It consists of a 10 by 10 grid. It has four rewarding states (apart from the walls): one worth +10 (at position (9,8); 9 across and 8 down), one worth +3 (at position (8,3)), one worth -5 (at position (4,5)) and one worth -10 (at position (4,8)). In each of these states the agent gets the reward when it carries out an action in that state (when it leaves the state, not when it enters). If it bumps into the outside wall (i.e., the square computed as above is outside the grid), there is a penalty (i.e., a reward of -1) and the agent does not change its position. When the agent acts in one of the states with positive reward, it is taken, at random, to one of the four corners of the grid world (regardless of the action it tries to take). There is some control noise in the system: with a chance of 0.3 the agent will move in a direction other than the one in which it tries to act.
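The dynamics just described can be summarized in a short C++ sketch. This is an illustration reconstructed from the description above, not Poole's Java code, and details such as the action coding and how the wall penalty combines with other rewards are assumptions.

#include <cstdlib>

// Sketch of the GridWorld dynamics described above (Poole, n.d.).
struct Step { int x, y; double reward; };

Step act(int x, int y, int action /* 0..3 */)
{
    // Reward is earned for acting *in* a rewarding square, i.e. on leaving it.
    double reward = 0.0;
    if      (x == 9 && y == 8) reward = +10;
    else if (x == 8 && y == 3) reward = +3;
    else if (x == 4 && y == 5) reward = -5;
    else if (x == 4 && y == 8) reward = -10;

    // Acting in a positively rewarding square teleports the agent to a random corner.
    if (reward > 0) {
        int corner = std::rand() % 4;
        return { (corner % 2) * 9, (corner / 2) * 9, reward };
    }

    // Control noise: with probability 0.3 a different direction is taken.
    if (static_cast<double>(std::rand()) / RAND_MAX < 0.3)
        action = std::rand() % 4;

    static const int dx[4] = { -1, 0, +1, 0 };   // left, up, right, down (assumed coding)
    static const int dy[4] = {  0, -1, 0, +1 };
    int nx = x + dx[action], ny = y + dy[action];

    // Bumping into the outside wall: penalty of -1 and no change of position.
    if (nx < 0 || nx > 9 || ny < 0 || ny > 9)
        return { x, y, reward - 1.0 };

    return { nx, ny, reward };
}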
3.3.2 Comparison Tests

To make the world more predictable, and hence easier to compare, the control noise and the randomization of the restart states were removed. Both learners were run for a set number of iterations several times, and differences were found between runs. Poole's GridWorld has a larger state space than Sutton and Barto's CliffWorld (100 squares, compared to 48), which indicated that it would take longer for reproducible behaviour to appear. However, even at 10⁵ iterations differences remained. (Note: Poole's program is designed to be run for a specified number of iterations/steps rather than episodes – though it is an episodic learner.)
To see if the results of Poole’s learner could be duplicated exactly, all remaining
random computations were removed from both the implemented learner and Poole’s. There
were three of these (excluding random start states and control noise which had already been
removed):
• To decide whether to exploit or explore
• To decide which action to take when exploring
• To decide which element to start from when performing the sort that maximizes over Q-values
A randomly generated sequence of 800 actions (0..3) was created and added as a fixed
array to both implementations. Whenever a random action was needed it was taken from this
fixed sequence. Exploit/explore and the maximize algorithm also used this sequence in lieu of
calls to pseudo-random functions (combining successive values to simulate sequences other
than those of probability 0.25).
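The following C++ sketch indicates, under stated assumptions, how such a fixed sequence can stand in for the pseudo-random calls; the class name FixedRandom and the particular combination scheme shown are illustrative only.

#include <utility>
#include <vector>

// Sketch of a fixed "random" source used to make both learners reproducible.
// The 800 pre-generated values in the range 0..3 are consumed in order; the
// exact scheme for combining successive values into other probabilities is an
// implementation detail and is only indicated here.
class FixedRandom {
public:
    explicit FixedRandom(std::vector<int> seq) : m_seq(std::move(seq)), m_pos(0) {}

    // Next raw value in 0..3, used directly as an exploratory action.
    int nextAction()
    {
        int v = m_seq[m_pos];
        m_pos = (m_pos + 1) % m_seq.size();
        return v;
    }

    // Example of combining two successive values to approximate a 1-in-16 test,
    // in lieu of a call to a pseudo-random function.
    bool oneInSixteen() { return nextAction() == 0 && nextAction() == 0; }

private:
    std::vector<int> m_seq;
    std::size_t      m_pos;
};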
This implementation was debugged for a while (the C++ implementation's action coding scheme was made exactly the same as Poole's, e.g. go left on 0, up on 1, etc.). At this point, however, the learners were still not identical. Eventually a small bug was found in Poole's implementation: changing the values of alpha or the discount through the GUI had no effect on the actual learner. This meant that the implementation and the reference were using different discount rates. A minor change was made to Poole's code so that altering the discount (and alpha) through the GUI passed these values properly to the learner. Following this change both learners performed identically (see Figure 5 and Figure 6).
Figure 5 Poole's Q-learner after 10³ steps with no random behaviour. Discount = 0.9, alpha = 1/(s, a)
Figure 6 C++ Q-learner after 10³ steps with no random behaviour. Discount = 0.9, alpha = 1/(s, a)
At this point the three random steps were added back into the learner and tests were performed to see whether the implemented learner showed evidence of learning and whether it was similar to Poole's learner. Some tests found that after approximately 10⁷ steps the optimal policy was reliably shown (see Figure 7 and Figure 8).

Figure 7 Poole's Q-Learner after 10⁷ steps (top rows not shown). Exploit = 90%, Discount = 0.9, alpha = 1/(s,a)
Figure 8 C++ Q-Learner after 10⁷ steps (top rows not shown). Exploit = 90%, Discount = 0.9, alpha = 1/(s,a)
3.3.3 Findings and Implementation Revision

One of the lessons learned from this test phase is that, because of the three required points where the learner executes code probabilistically, individual runs can vary greatly in their exact details. Problems with even relatively modest state spaces (e.g. 10² in the case of GridWorld) and very small action spaces (4 in both cases) can require a lot of iterations to find the optimal policy (10⁷ iterations for GridWorld).
The architecture of the implemented learner was modified at this time to become more general. Commonalities in the representation of CliffWorld and GridWorld were identified and re-factored into an abstract parent class. These commonalities included utility functions, such as for printing grids of Q-values and showing which value is optimal in a state. The resultant code was simple and modular: a world object is passed to the learner, and functions that correctly belong to the world and to the learner can be cleanly separated. This allowed for effective experimentation on grid-type problems. However, this object-orientated approach does not transfer as easily to problems that have more complex environments, such as soccer.
Figure 9 Revised architecture showing most important operations

WorldLearnerTest (-learner, -worlds[]): +test()
World (contained by WorldLearnerTest, associated 1-to-1 with QLearner): +getEpisode(), +print(in Qlearner, in filename)
QLearner (-world): +updateQ(in state1, in state2, in action, in reward), +learn(in totalSteps), +maxAction(in state), +maxQ(), +pickAction()
CliffWorld, GridWorld (subclasses of World): +act(in state, in action), +getReward(), +getStartState()
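A header-style C++ sketch consistent with Figure 9 is given below. The parameter and return types are assumptions inferred from the diagram rather than the actual implementation, and the subclass bodies are omitted.

#include <string>

class QLearner;   // forward declaration

// Header-style sketch of the revised architecture in Figure 9. The abstract
// World class holds the operations shared by CliffWorld and GridWorld.
class World {
public:
    virtual ~World() = default;
    virtual int    act(int state, int action) = 0;   // returns the successive state
    virtual double getReward() = 0;                  // reward for the last transition
    virtual int    getStartState() = 0;
    virtual void   getEpisode() = 0;
    virtual void   print(const QLearner& learner, const std::string& filename) = 0;
};

class CliffWorld : public World { /* grid-specific overrides of act, getReward, getStartState */ };
class GridWorld  : public World { /* grid-specific overrides of act, getReward, getStartState */ };

class QLearner {
public:
    explicit QLearner(World* world) : m_world(world) {}
    void   updateQ(int state1, int state2, int action, double reward);
    void   learn(int totalSteps);
    int    maxAction(int state) const;
    double maxQ() const;
    int    pickAction();
private:
    World* m_world;
};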
4 Application of RL to Simple Soccer

4.1 Simple Soccer

4.1.1 Overview

For the scope of this exercise the RoboCup Soccer Simulator was deemed too complex (like games such as Quake, the RoboCup Soccer Simulator runs to hundreds of thousands of lines of code and is a client-server architecture designed for online multiplayer gaming). Buckland's (2005) Programming Game AI by Example is a book that provides a practical and accessible introduction to AI techniques. It includes code samples, including one for a restricted version of a soccer game known, aptly, as Simple Soccer. The game is specifically designed to be developed and adapted by AI students and enthusiasts. Buckland's stated aim is that the game should enable the user to "design and implement a team sports AI framework capable of supporting your own ideas" (Buckland, 2005, p. 133). Although it is not a sophisticated research platform at the level of RoboCup, Simple Soccer is still a relatively complex environment, which informed decisions about experiment design. A copy of an early version of the project plan is included in Appendix C.
Similar to keepaway (Stone et al., 2001), Simple Soccer can be seen as a simplification of the full soccer problem. It has no offside rule, for instance, only five players per team, and the ball cannot travel out of play but simply rebounds off the sidelines. The game environment is comprised of the following:

• A soccer pitch
• Two goals
• One ball
• Two teams
• Eight outfield players (four on each team)
• Two goalkeepers

All of these entities are represented as objects. Goalkeepers and outfield players are both derived from a base player type (PlayerBase). Players are also coupled to a team entity (SoccerTeam); see Appendix A, Simple Soccer Class Diagrams.
Each player owns a state machine object (StateMachine) which gives it the ability to
change its behaviours based upon the current state. The SoccerTeam entity also owns a
StateMachine object giving the team the ability to change its state based upon the current
state of play. The implementation of behaviour at the team level, in addition to the individual
player level, is according to Buckland a form of “tiered AI” (Buckland, 2005). It is similar in
general terms to the layered approaches of Stone (2000) and others when trying to represent
complex systems.
4.1.2 States

The tiered approach makes the application of RL more complex. An individual player (PlayerBase) polls its environment directly to determine aspects of the current state, such as:

• Whether the ball is within kicking range
• Which is the closest team-mate to the ball
• Whether it is possible to pass, etc.

But the team entity (SoccerTeam) receives other state signals, such as:

• which player should request a pass
• whether it is possible to shoot
• which player should support the player in possession, etc.
We can see that the state representation is spread across entities which each take actions.
It is also to be noted that these are high-level state signals, calculated in turn from low-level sensory perceptions. For instance, to determine whether it is possible to shoot from a given position, a function samples random positions along the opponent's goal-mouth, checking whether a goal could be scored if the ball were kicked in that direction with a particular power, given the current vector of the ball. There are many such calculations made on the basis of low-level sensory data about the game's mechanics. It would be unfeasible for the RL agent to consider all of the environment data. The study Unsupervised Learning of Realistic Behaviour in Soccer (Crowley & Southley, n.d.), for instance, foundered upon an underestimation of the difficulty involved in representing the state space.[5]

[5] The authors' frankness about their failure in this unpublished work is in many ways as educational as other studies deemed sufficiently 'successful' for publication.

In the initial formulation of the state/action representations, calculations based on low-level data were considered part of the environment and unknown to the agent. In
considering the boundary between agent and environment, these high level sensory functions
(detailed later) were defined as a component of the state. Sutton & Barto (1998) provide a
discussion of the agent-environment boundary issue pointing out that the boundary is often
drawn closer to the agent than its physical boundary, giving the example where “the motors
and mechanical linkages of a robot and its sensing hardware should usually be considered
parts of the environment rather than parts of the agent” (Sutton & Barto, 1998, 3.1).
The functions that return state information for the SoccerTeam entity are slightly
more complex than those of the PlayerBase entity. Several of these functions return a
reference to a PlayerBase entity such as:
• DetermineBestSupportingAttacker
• PlayerClosestToBall
• SupportingPlayer (the player making himself available for a pass)
• Receiver (the player to whom a pass is being made)
As there are four outfield players, these functions can each return four possible values. Other state components are not discrete values, however, such as the function GetSupportSpot, which returns a vector wrapper class containing variables of type double: continuous values representing x and y pitch co-ordinates. Moreover, other functions take both complex and continuous datatypes as arguments, e.g.:

bool isPassSafeFromAllOpponents(Vector2D from,
                                Vector2D target,
                                const PlayerBase* const receiver,
                                double PassingForce);
Though this function returns a simple Boolean value, in effect it represents many possible
values of state components due to its parameters.
The functions of the PlayerBase class provide a more convenient discretisation of the
state space. The following no-argument functions of PlayerBase act as state sensors that
return Boolean values:
• BallWithinKickingRange
• BallWithinReceivingRange
• InHomeRegion
• isAheadOfAttacker
• AtSupportSpot
• AtTarget (player has reached its steering target)
• isClosestTeamMemberToBall
• PositionInFrontOfPlayer
• isClosestPlayerOnPitchToBall
• InHotRegion (player is close to opponent's goal)
These were initially identified as functions that could provide a useful state
representation. The player also has one Boolean sensor function that takes an argument,
PositionInFrontOfPlayer (Vector2D position), and three functions that return a continuous
value representing a distance:
• double DistSqToBall (values are often left in squared form to spare expensive square-root calculations)
• double DistToOppGoal
• double DistToHomeGoal
PlayerBase initially appeared a more useful entity than SoccerTeam to which to apply RL. It has mostly simple binary sensor functions, which are suitable for tabular Q-learning, and three functions that return a continuous value, which it would be possible to discretise by dividing into ranges, as sketched below.
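One possible scheme, given here only as an illustration, packs each Boolean sensor into one bit of a discrete state index and coarsely bins a squared distance; the particular sensors chosen and the bin boundaries are assumptions, not the representation actually used later.

// Illustrative packing of a few PlayerBase sensors into a single discrete
// state index for a tabular learner. Bin boundaries are arbitrary examples.
int discretiseDistance(double distSq)
{
    if (distSq < 50.0 * 50.0)   return 0;   // near
    if (distSq < 150.0 * 150.0) return 1;   // medium
    return 2;                               // far
}

int buildState(bool ballWithinKickingRange,
               bool isClosestTeamMemberToBall,
               bool inHotRegion,
               double distSqToBall)
{
    int state = 0;
    state |= ballWithinKickingRange    ? 1 : 0;
    state |= isClosestTeamMemberToBall ? 2 : 0;
    state |= inHotRegion               ? 4 : 0;
    state |= discretiseDistance(distSqToBall) << 3;   // 3 bins in the top bits
    return state;                                     // 8 x 3 = 24 possible states
}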
4.1.3 Actions
Behaviours in Simple Soccer are implemented using a state machine design-pattern.
The idea behind this is that high level behaviours can be created and added easily. It also
allows effective management of the transition between behaviours: states fire entry and exit
methods automatically as they are instantiated and destroyed. Although the terms ‘transition’
and ‘state’ are used in Simple Soccer, their meaning is not exactly the same as that in the
terminology of the RL problem. The concept of a state in Simple Soccer is a mode of
behaviour where the agent follows a defined series of actions. The most interesting of these
behaviours for players are:
• Attacking
• ChaseBall
• Dribble
• KickBall
• ReceiveBall
• SupportAttacker
• Wait
In Simple Soccer, because of the state-machine design pattern, the logic for transitions between much of the high-level behaviour is distributed between multiple classes. This pattern in effect wraps a behavioural method in a class. For instance, when a player is in the ChaseBall state and reaches the ball, it may transition to the KickBall state. It relies on only one sensor to make this decision – the player's Boolean method BallWithinKickingRange. It is a modular design with a high degree of cohesion and low coupling between classes.
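The following C++ sketch shows the general shape of such a state-machine pattern: an Enter/Execute/Exit interface driven by a StateMachine owned by the entity. It mirrors the idea rather than reproducing Buckland's classes, and the names here are assumptions.

// Sketch of the state-machine design pattern used for behaviours.
template <class Entity>
class State {
public:
    virtual ~State() = default;
    virtual void Enter(Entity* owner)   = 0;   // fired once on transition in
    virtual void Execute(Entity* owner) = 0;   // fired every update tick
    virtual void Exit(Entity* owner)    = 0;   // fired once on transition out
};

template <class Entity>
class StateMachine {
public:
    StateMachine(Entity* owner, State<Entity>* initial)
        : m_owner(owner), m_current(initial) {}

    // Changing state automatically fires the exit and entry methods.
    void ChangeState(State<Entity>* next)
    {
        m_current->Exit(m_owner);
        m_current = next;
        m_current->Enter(m_owner);
    }

    void Update() { m_current->Execute(m_owner); }

private:
    Entity*        m_owner;
    State<Entity>* m_current;
};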
4.2 Application of RL to Simple Soccer

4.2.1 Differences with GridWorlds
The effort to apply the learner from the verification stage to Simple Soccer proved
greater than expected. Three factors showing differences from grid-type worlds that became
apparent were:
• adversarial environment
• complex environment of interacting objects
• real-time processing
4.2.1.1 Adversarial
Because the agents each must take a turn, rewards and successive states are not the
instantaneous result of actions. In grid-worlds an algorithm can take a reward or successive
state directly as the result of an act method. In adversarial games the learner may have to wait
until the opponent has taken their turn before the computation of the reward, and possibly
successive state also.
The Q-learner must store the previous reward and state, give the environment an action, and then pass control to the environment. The environment then takes the learning agent's action, passes control to the opponent for its turn, and returns control to the Q-learner to compute and update Q-values. Because the learner and environment interact asynchronously like this, the architecture is slightly more complex than for grid-worlds, as sketched below.
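The following C++ sketch illustrates this asynchronous interaction under simple assumptions: the learner remembers its previous state-action pair and only completes the Q update when control returns to it with the accumulated reward. Exploration is omitted for brevity and all names are illustrative rather than taken from the implementation.

#include <algorithm>
#include <vector>

class DelayedQLearner {
public:
    DelayedQLearner(int nStates, int nActions, double alpha, double gamma)
        : m_Q(nStates, std::vector<double>(nActions, 0.0)),
          m_alpha(alpha), m_gamma(gamma), m_hasPrevious(false) {}

    // Called each time control returns to the learner with the current state
    // and the reward accumulated since its previous action.
    int step(int state, double reward)
    {
        if (m_hasPrevious) {
            // Complete the update deferred from the previous turn.
            double target = reward +
                m_gamma * *std::max_element(m_Q[state].begin(), m_Q[state].end());
            m_Q[m_prevState][m_prevAction] +=
                m_alpha * (target - m_Q[m_prevState][m_prevAction]);
        }
        int action = greedyAction(state);   // exploration omitted for brevity
        m_prevState   = state;
        m_prevAction  = action;
        m_hasPrevious = true;
        return action;                      // control now passes back to the game
    }

private:
    int greedyAction(int state) const
    {
        const std::vector<double>& row = m_Q[state];
        return static_cast<int>(std::max_element(row.begin(), row.end()) - row.begin());
    }

    std::vector<std::vector<double>> m_Q;
    double m_alpha, m_gamma;
    int  m_prevState = 0, m_prevAction = 0;
    bool m_hasPrevious;
};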
4.2.1.2 Complex Object Environment
Originally it was planned to replace all of the default behaviour at some level with Q-learning. Before even considering the state space this would involve, it became apparent that doing so within the existing architecture of Simple Soccer is more difficult than in a grid-type world. The state signal functions and the action functions are spread across multiple classes, which makes adding the learner code into these objects difficult. To do so in an object-oriented fashion would require the learner to be passed as an argument between methods of all of these objects, requiring changes to many method signatures in the game. A quicker but less elegant option would be to create the learner as an object with global scope. However, both of these options still require small amounts of very similar code in many different places to invoke the learner whenever an action needs to be taken, which is an inelegant design. A possible solution might be Aspect-Oriented Programming (discussed later).
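As a rough illustration of the global-scope option, a single learner instance could be exposed through a simple accessor and invoked from any class that holds a state signal or action component. The class below is a hypothetical stand-in, not code from the project:

// Hypothetical sketch of the "global scope" option: one shared learner reachable
// from any class that needs to report a state or request an action.
class GlobalLearner
{
public:
    static GlobalLearner& Instance()
    {
        static GlobalLearner learner;      // created on first use
        return learner;
    }
    int ChooseAction(int state) { return 0; /* placeholder: would wrap pickAction/updateQ */ }
private:
    GlobalLearner() {}
};

// The repeated call that would then appear in many places (the inelegance
// described above):
//     int action = GlobalLearner::Instance().ChooseAction(state);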
4.2.1.3 Real-Time Processing
Large numbers of episodes could be generated in grid-worlds in minutes. In Simple Soccer, however, there are many built-in time delays to simulate the physics of real-time play. Some sleep commands were removed and the graphics rendering was disabled, apart from a window outputting current variables such as states, actions, rewards and Q-values. With this, the speed can be increased to 175 steps per second before the program crashes. It still remains difficult to generate large numbers of goals per team, as goals take by default close to a minute to score at this speed. Runs of up to 627 goals were generated, but generally the program would hang after running for more than 200 goals.
Given that full convergence to the optimal policy took 10 x 54 episodes in CliffWorld, which has only 48 states and 4 actions, it is clear that optimal behaviours can be difficult to find. However, the frequency with which the learner reaches the goal over time is also a useful observation. In the grid-type problems some learning can be shown after only a short initial transient phase, and near-optimal policies can be seen quickly (see Figure 3 above). It was hoped here to find evidence of learning after extended runs.
4.2.2 Benchmarking Results
In Simple Soccer the optimal policy would see the learner perform equally to the default, i.e. score and concede goals at the same rate. We can consider scoring and conceding as a combined ability: winning. The performance of the learner will always be relative to the opponent, so we can gauge ability through the difference in their scoring rates. If both teams score with the same average frequency (allowing for the exploration rate) the learner is behaving optimally: it has learned the same policy as its opponent, the benchmark. If the learner scores at a rate greater than random then it exhibits some learning. There is also the possibility that the learner could learn a novel policy that outperforms its opponent. For this to be possible the learner would need to be able to take actions other than those that its opponent takes.
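A minimal sketch of the benchmark measure is given below: the conceded-to-scored ratio is simply the opponent's goal count divided by the learner's, tracked as goals accumulate. The struct and its names are illustrative:

// Illustrative tracking of the conceded-to-scored ratio used as the benchmark
// measure. A ratio near 1 means the learner matches the default opponent;
// larger ratios mean it concedes more than it scores.
struct BenchmarkTracker
{
    int learnerGoals;
    int opponentGoals;
    BenchmarkTracker() : learnerGoals(0), opponentGoals(0) {}

    double ConcededToScored() const
    {
        if (learnerGoals == 0) return 0.0;   // undefined until the learner scores
        return static_cast<double>(opponentGoals) / learnerGoals;
    }
};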
4.2.3 Simple State/Action Structure for Novel Policy Discovery
4.2.3.1 States
One simple way to structure the learner to look for novel policies is to take actions that are by default only possible in one state and make them available in another. To test this, the area of code where actions and state signals are most closely situated was identified: the OnMessage method of FieldPlayerStates. This is a method of a class that specifically deals with transitions between states (see Actions, 4.2.3.2, below). The method examines several sensory inputs during its operation:
• Whether the current player's state machine is an instance of SupportAttacker
• Whether the current player's team has a valid receiver for a pass
• Whether the current player is in kicking range of the ball
• Whether the current player is receiving a message that says:
  o Receive ball
  o Support attacker
  o Wait
  o Pass to me
This provides a relatively clear presentation of states and actions. There is one exception, where the player receives the message return-to-home-region. The required action in this case is treated as part of the environment, because under the rules of soccer all players must return to their home region when a goal is scored, before the restart.
As these are Boolean methods we can formulate a state based on whether each one returns true or not. The default code includes branching which makes the player examine only one or two of these functions before returning an action. Here this branching was removed and all of the sensors were taken into account, to see whether there might be extra information available to the player.
Some of these functions still return mutually exclusive results, e.g. the player can only receive one message at a time. There is also some further redundancy in the encoding (see the state table below), as not all combinations of sensory inputs are possible. However, by examining the original method it is also clear that the learner now has more information on which to base its actions than the default (though this may be worthless data). See Appendix B, Functions of the Learner Specific to Simple Soccer, for the code of the default and learner methods.
An example of a basic ‘useful’ transition between states would be where a player
transitions successfully from a receiver to an attacker.
4.2.3.2 Actions
The action encoding is a binary number, with each digit representing whether each of the following action components is to be taken or not (the first bullet below covers three components that are mutually exclusive):
• Change the current player's state machine to:
  o ReceiveBall
  o SupportAttacker
  o Wait
• Make the current player find the support spot (a good position to receive a pass)
• Make the current player look for support players
• Make the current player receive a pass
• Make the current player intercept the ball
This gives an action space of 2^7 (128). Some of these actions will be useless, some will be the default behaviour, and others will be variants – subsets or supersets – of default behaviour.
Early tests found two behaviours that could be taken only if specific state signals were present: if the player attempted to intercept a pass, or to request a pass, in an invalid state the program would crash. Code was added so that if the agent attempted either of these actions when the required state signal was not present, the player would do nothing.
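The guard has the shape sketched below; the names are taken from the act method in Appendix B, but the wrapper function itself is hypothetical:

// Sketch of guarding an unsafe action component: if the required state signal
// (a pass-to-me message) is absent, the component is skipped rather than
// risking a crash. See the act method in Appendix B for the full context.
void TryRequestPass(FieldPlayer* player, const Telegram& telegram)
{
    if (telegram.Msg != Msg_PassToMe) return;   // signal absent: do nothing

    FieldPlayer* receiver = static_cast<FieldPlayer*>(telegram.ExtraInfo);
    player->Ball()->Kick(receiver->Pos() - player->Ball()->Pos(),
                         Prm.MaxPassingForce);
}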
4.2.3.3 Rewards
Something that became apparent from the initial application is that 0 must be the default reward, not -1 as in the grid-worlds. If a soccer team received a small negative reward at each time-step it might learn to concede goals quickly in order to minimize the accumulation of negative reward per episode. Episodes were defined to end upon the learner scoring a goal, which is initially a less likely event for the learner than conceding one.
4.2.4 Architecture
To add RL to the game the Q-learner was embedded entirely within the code of Simple Soccer. The Simple Soccer StateMachine class was subclassed by an RLGlobalPlayerState class, to which the functions of a Q-learner (e.g. maxQ, maxAction, updateQ) were added. A flag was passed to the constructor of each Team during game set-up to indicate whether it was the learner. The learner Team instantiated its players array with instances of RLGlobalPlayerState, while the default team used the pre-existing GlobalPlayerState class. If the learner were to be applied to other classes, the learner code could be re-factored into an abstract parent class. (However, no other class contains more than one or two actions and state signals, and hence these were not considered good candidates for Q-learning with flat action representations.)
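An abbreviated skeleton of this arrangement is shown below. The member list is a simplified assumption and the base class is omitted; the full listings are in Appendix B:

// Abbreviated sketch of the learner class described above. It subclasses the
// game's state machinery as stated in the text (base class omitted here); the
// member list is a simplified assumption rather than the full class.
class RLGlobalPlayerState
{
public:
    bool   OnMessage(FieldPlayer* player, const Telegram& telegram); // entry point for the learner
private:
    double maxQ(int state);                           // best Q-value reachable from a state
    int    maxAction(int state);                      // action achieving that value
    void   updateQ(int s1, int s2, int a, double r);  // one-step Q-learning update
    int    previousState;
    int    previousAction;
};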
The action components and state signals are spread across different classes. This is one reason why it is useful for the learner to be embedded in the game and to be part of the team/player class hierarchy. The significant operations of RLGlobalPlayerState and the methods used by the learner in other classes are shown in Figure 10 below (more details of RLGlobalPlayerState are given in Appendix B, Functions of the Learner Specific to Simple Soccer).
Figure 10 Methods used by the Simple Soccer Q-Learner (RLGlobalPlayerState)
4.2.5 Results
4.2.5.1 Performance
Several runs of the game were made to generate data. After some trials, a fixed exploitation rate of 90% was used with a fixed discount rate of 0.9 and a learning rate alpha of 1/n(s, a), where n(s, a) is the number of visits to that state-action pair. A single run of the learner, which took about five hours, generated 727 goals, counting those of both the learner and the opposition. (This was the longest run obtained, as the program often hung after generating a certain number of episodes.) On average, for every goal scored by the learner the default opponent scored 3.77 (the conceded-to-scored ratio). This was compared with a player acting at random, whose conceded-to-scored ratio was 6.66, showing that the learner performed better than random. The default playing against itself would have an average conceded-to-scored ratio of 1. Allowing for the 90% fixed exploitation rate and the 6.66 ratio found for random play, the average optimal performance for the learner would be a conceded-to-scored ratio of (9 + 6.66)/10 = 1.57. This means that:
• For every nine exploiting steps the learner should average 1 (playing against itself it scores and concedes the same number of goals over time).
• Every tenth (exploring) step, it will average 6.66 (as found above).
• Adding the nine default steps to the exploring step and dividing by ten gives the average in this case, as the expression below summarises.
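Written as a weighted average over exploiting and exploring steps, this is simply

$0.9 \times 1 + 0.1 \times 6.66 = 1.566 \approx 1.57$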
The graph below, Figure 11, shows the goals conceded-to-scored ratio, averaged over every fifteen goals, for:
• the learner
• random action selection
• the optimal possible for an exploring learner, as a single averaged value
Figure 11 Goals conceded-to-scored ratios playing against the default Simple Soccer team
[Chart: "QLearner Performance" – conceded-to-scored ratio (y-axis, 0 to 12) plotted against total goals scored (x-axis, 1 to 637), with series for Random, Q-Learning and Optimal Possible with Exploration]
4.2.5.2 State/Action Representation Findings
Table 1 shows a selection of the state space discovered. These are states from which the learner transitioned into a state where a goal was scored or conceded, i.e. states directly preceding reward states. They are interesting because we can examine the best actions that the agent learned for these states and compare them with the default. The states are described in 4.2.3.1 States above, and the binary encoding of each state is shown. For example, the state with decimal value 36 is binary 0100100, where the first three digits correspond to No or Yes for the first three state components reading left to right. The final four digits of each binary number indicate whether the state component Msg_ReceiveBall, Msg_SupportAttacker, Msg_Wait or Msg_PassToMe is present. An alphabetic label is also given to each state.
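As a quick check of the encoding, the seven Boolean sensor readings for a state can be converted back to its decimal value with the same logic as the toDec helper listed in Appendix B (the standalone function below is a self-contained restatement of it):

// Worked example of the state encoding used in Table 1.
// Sensor order: IsSupportAttacker, Receiver()==NULL, BallWithinKickingRange,
// Msg_ReceiveBall, Msg_SupportAttacker, Msg_Wait, Msg_PassToMe.
#include <vector>
using std::vector;

int ToDec(const vector<bool>& v)        // same logic as toDec in Appendix B
{
    int total = 0, power = 1;
    for (int i = (int)v.size() - 1; i >= 0; --i)
    {
        if (v[i]) total += power;
        power *= 2;
    }
    return total;
}
// Building the vector for state A (No, Yes, No, No, Yes, No, No) and calling
// ToDec on it gives binary 0100100, i.e. decimal 36, matching the first row
// of Table 1.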
Table 1 States found that transitioned to a goal-scored or goal-conceded state

Decimal | Binary  | Label | Is Support Attacker | Receiver() == NULL | Ball Within Kicking Range | Msg
36      | 0100100 | A     | No  | Yes | No  | Msg_SupportAttacker
52      | 0110100 | B     | No  | Yes | Yes | Msg_SupportAttacker
33      | 0100001 | C     | No  | Yes | No  | Msg_PassToMe
97      | 1100001 | D     | Yes | Yes | No  | Msg_PassToMe
104     | 1101000 | E     | Yes | Yes | No  | Msg_ReceiveBall
40      | 0101000 | F     | No  | Yes | No  | Msg_ReceiveBall
120     | 1111000 | G     | Yes | Yes | Yes | Msg_ReceiveBall
97      | 1100001 | H     | Yes | Yes | No  | Msg_PassToMe
8       | 0001000 | I     | No  | No  | No  | Msg_ReceiveBall
113     | 1110001 | J     | Yes | Yes | Yes | Msg_PassToMe
100     | 1100100 | K     | Yes | Yes | No  | Msg_SupportAttacker
56      | 0111000 | L     | Yes | Yes | Yes | Msg_ReceiveBall
49      | 0110001 | M     | No  | Yes | Yes | Msg_PassToMe
17      | 0010001 | N     | No  | Yes | Yes | Msg_PassToMe
1       | 0000001 | O     | No  | No  | No  | Msg_PassToMe
The state most frequently encountered before a positive reward state was state A (the first row of Table 1 above). The learner was then run again to generate a file showing state-action mappings. The best run of this test yielded 356 goals. The optimal action found in state A was to perform none of the possible functions described in 4.2.3.2 Actions above, i.e. a decimal encoding of zero. The optimal action which the default player would take in this state is decimal two, which is very similar, differing by only one bit. These actions are shown in
Table 2 below. An examination of the code of the default revealed that they were actually
identical actions. The only function that action two caused to be triggered was to change the
current player’s state to that of an attacker. However the current player is already an attacker
in this state so doing nothing is equivalent. This shows that there is some redundancy in the
code of the game. It also shows that there may be multiple optimal actions available to the
learner (notwithstanding redundancy in the encoding scheme).
Table 2 Optimal action of the default (2) and optimal action discovered by the learner (0)

Decimal | Binary  | Label | Get support spot | Get support players | Receive pass | Intercept ball | Change behaviour (state machine)
2       | 0000010 | A     | No | No | No | No | SupportAttacker::Instance()
0       | 0000000 | B     | No | No | No | No | -
Unfortunately the tests did not show the learner finding a new policy that outperforms the default (allowing for exploration). It had been hoped that this would prove more interesting than simply watching the learner rediscover the default behaviours. This option involved widening the boundary of the learner's capabilities. It introduced the possibility that the learner could try actions in states that would cause the program to enter infinite loops or crash in other ways, e.g. if certain operations were only allowed in certain combinations or sequences. The learner could also discover actions that, whilst not causing a crash, violated rules central to the game's operation, e.g. the environmental rule that all players must return to the home region after a goal is scored.
Figure 11 shows better performance than random but not necessarily strong evidence of learning. For that, performance would need to be shown to improve after an initial phase in which knowledge was being gathered and hence no real exploitation was possible. The general trend is of slightly improved performance over the lifetime of the run, but not of the magnitude seen in examples such as Figure 3 from Sutton & Barto above.
4.2.6 Testing with Reduced Action Space
To see whether better performance could be achieved by the implemented learner, a simpler task was chosen. The action set of the previous task was reduced from a possible 128 actions to four. The state and reward formulation were left unchanged, as were the learning parameters of 0.9 discount and 90% exploitation rate. A run of the learner was made which generated 161 goals. This was the same number of goals as scored in the previous test, but in this case the learner reached this number in less time and with a lower conceded-to-scored ratio of 1.81. The run is shown in Figure 12 below against a random run, with averaging over every fifteen goals.
It can be seen that, as expected, the learner performs better on the task with four actions available to it than with 128. In addition to scoring more goals in a shorter space of time (which shows improved performance) there is also clearer evidence of improvement in behaviour as time goes by (which shows learning). A steep initial decline in the conceded-to-scored ratio, over the first 100 goals, shows the learner improving as it gains more knowledge to exploit. After 100 goals the learner has settled near its optimal policy and only deviates slightly from it for the duration of the run, due to its fixed exploration rate.
Figure 12 Conceded-to-scored ratios playing against the default with four actions
[Chart: conceded-to-scored ratio (y-axis, 0 to 12) plotted against total goals scored by both teams (x-axis, 1 to 469), with series for Random, Q-Learning and Averaged Optimal Possible for Learner]
5 Conclusions
5.1 Evaluation of Simple Soccer
Unpredictable program crashes made Simple Soccer difficult to work with. This is frustrating for Q-learning, where learning is online and the results of experiments are cumulative, often requiring the generation of many episodes. With some other machine learning techniques the results of short runs could be added together to form larger training sets. Although it may take many runs to find fully converged optimal policies, tabular Q-learning can learn reasonable policies quickly, which confirms that it is useful for online real-time tasks such as Simple Soccer. Although the game logic can be decoupled from the graphics rendering, and the frame rate consequently increased, the game still runs in real time and it takes many hours to generate useful data sets.
The game was originally evaluated on a PC running Windows 2000. The game worked reasonably well at this point, and installation and set-up were quite easy with a Visual Studio 6 project file supplied. After an upgrade to Windows XP, Visual Studio 6 could no longer be installed, as this product has been discontinued in favour of Visual Studio .NET, for which a free Express edition is available. Converting the Simple Soccer project from version 6 to Visual Studio Express proved more difficult. Changes in how the compiler handles vector iterators required some fixing of the Simple Soccer code and seem to be a source of crashes. Ultimately Visual Studio 6 is now a deprecated and unsupported application, which weakens Simple Soccer's value as an educational tool or as a platform for conducting RL experiments.
The time allocated for programming was exceeded many times over. Some of this was due to the steep learning curve of both RL and C++, and to problems encountered with Visual Studio Express and Simple Soccer. The author found that there were nuances to the implementation of the algorithms which might not be apparent to newcomers. Seeding pseudo-random number generators, and basic techniques such as ensuring that max selections break ties by starting from random positions in a range, were not immediately obvious until the implementation was compared in detail with benchmarks in which all randomness was removed.
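One such nuance, randomised tie-breaking when selecting a greedy action, can be sketched as follows; the function name is illustrative rather than taken from the project code, and it assumes the generator has been seeded once at start-up:

// Illustrative greedy selection with randomised tie-breaking: scanning from a
// random starting offset means that when several actions share the maximum
// Q-value, each has a chance of being chosen. Assumes srand() was called once.
#include <cstdlib>

int GreedyActionWithTieBreak(const double* qRow, int numActions)
{
    int offset = rand() % numActions;        // random starting position in the range
    int best   = offset;
    for (int i = 1; i < numActions; ++i)
    {
        int idx = (offset + i) % numActions;
        if (qRow[idx] > qRow[best]) best = idx;
    }
    return best;
}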
A lot was learned about C++ during the project. Perhaps the biggest positive outcome
of the thesis was a greater understanding and appreciation of RL through its application.
Some studies indicate that application of RL is a skill in itself that takes much practice
(Kuhlmann & Stone, 2004; Tesauro, 1995).
5.2 Application of Tabular Q-Learning to Games
It is relatively straightforward to build a generalized architecture for testing RL on different grid-type worlds, and several frameworks exist (Hackman, 2006; Neumann & Neumann, 2003; De Comite, n.d.). Applying RL to a game composed of interacting objects is more difficult. The rules and dynamics of a simple grid system may be easily understood and described, but it is harder to understand the mechanisms of games where many objects interact in various ways. Determining whether novel super-optimal policies exist, relative to the default, might be difficult through simply examining the code.
A possible tool for changing the default state/action representations exposed by games might be Aspect-Oriented Programming (AOP). AOP provides a mechanism for the modularization of cross-cutting concerns in programs (Filman et al., 2004). In AOP new code can be injected into existing classes in a systematic way without changing the code of those classes. This may have applications for applying RL to computer games where action and state sensor functions are not located contiguously but are spread throughout classes. AOP could possibly be used to combine actions and state signals in novel ways in a non-invasive manner. This might open up the application of RL techniques to more computer games or other software.
Tabular Q-learning may require many iterations to converge to an optimal policy, and its performance suffers as the state or action space grows. Its big advantages are that it can discover reasonable policies quickly, and that it can learn without a model of its environment while it is playing. This makes it a useful algorithm for real-time games and one that is relatively easy to implement.
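For reference, the update applied throughout (and implemented in updateQ in Appendix B, with $\alpha = 1/n(s,a)$) is the standard one-step Q-learning rule:

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a')\right)$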
References
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function
approximation. Proceedings of the Twelfth International Conference on Machine
Learning, pp. 30-37. Morgan Kaufmann
Balch T. (1997). Integrating RL and behavior-based control for soccer. Proc. of the IJCAI
Workshop on RoboCup, Nagoya, Japan, 1997
Barto A. G. & Mahadevan S. (2003). Recent advances in hierarchical reinforcement
learning. Discrete Event Systems, Special issue on reinforcement learning, Vol. 13:,
pp. 41–77, 2003
Bellman, R.E. (1961). Adaptive control processes. Princeton University Press, Princeton, NJ
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Athena Scientific, Belmont, Massachusetts. Vols 1 and 2.
Buckland M. (2002). AI techniques for game programming. Premier Press. 2002
Buckland M. (2005). Programming game AI by example. Wordware Publishing. Texas, 2005
Crowley M., & Southley T. (n.d.) Unsupervised learning of realistic behaviour in soccer.
Retrieved January 10, 2007, from
<http://www.cs.ubc.ca/~crowley/academia/papers/cpsc540_proj_crowley.pdf>
De Comite F. (2005). PIQLE : A platform for implementation of Q-learning experiments,
Neural Information Processing Systems 2005, Workshop on Reinforcement Learning
benchmarks and Bake-off II
De Comite F. (n.d.) A platform for implementation of Q-Learning experiments. Retrieved
January 15, 2007 from <http://www2.lifl.fr/~decomite/piqle/piqle.html>
Dietterich T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function
decomposition. International Journal of Artificial Intelligence Research, Vol. 13, pp.
227–303
Driessens K., Ramon J., & Blockeel H. (2001). Speeding up relational reinforcement learning
through the use of an incremental first order decision tree learner. In Raedt L. D., &
Flach P. Editors, Proceedings of the 13th European Conference on Machine Learning,
Vol 2167 of Lecture Notes in Artificial Intelligence. pp. 97–108. Springer-Verlag,
2001
Filman E., Elrad T., Clarke S. & Aksit M. (2004). Aspect-oriented software development.
Addison-Wesley Professional. October 06, 2004
Geramifard A., Bowling M., & Sutton R. S. (2006). Incremental Least-Squares Temporal
Difference Learning. In Proceedings of the Twenty-First National Conference on
Artificial Intelligence (AAAI-06), pp 356-361
Gordon G. J. (2001). Reinforcement learning with function approximation converges to a
region, Advances in Neural Information Processing Systems, Vol. 13
Gustafson S. M., & Hsu W. H. (2001). Layered learning in genetic programming for a
cooperative robot soccer problem. In J. F. Miller et al, editors, Proceedings of
EuroGP'2001, v. 2038 of LNCS, pages 291--301, Lake Como, Italy, 18-20 April
2001. Springer-Verlag
Hackman L. (2006) Standard software protocol for benchmarking and interconnecting
reinforcement learning agents and environments. University of Alberta. Retrieved
August 04, 2007 from <http://rlai.cs.ualberta.ca/RLBB/top.html>
Hsu W. H., Harmon S.J., Rodriguez E., & Zhong C. (2004). Empirical comparison of
incremental reuse strategies in genetic programming for keep-away soccer. Breaking
Papers at the 2004 Genetic and Evolutionary Computation Conference, 26 July 2004
Kaelbling L. P., Littman M. L., & Moore A. W. (1996). Reinforcement learning: a survey.
Journal of Artificial Intelligence Research, Vol. 4, pp. 237–285
Kirk D. E. (2004). Optimal control theory: an introduction, Prentice-Hall, 1970 reprinted
2004
Kitano H., Asada M., Kuniyoshi Y., Itsuki N., & Osawa E. (1995). RoboCup: the robot world
cup initiative. in Proc. Of IJCAI-95 Workshop on Entertainment and AI/Alife,
Montreal, 1995
Kitano H., Tambe M., Stone P., Veloso M., Coradeschi S., & Osawa E. (1997). The RoboCup synthetic agent challenge 97. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pp. 24–29, San Francisco, CA, 1997. Morgan Kaufmann
Kostiadis K., & Hu H. (2001). KaBaGe-RL: Kanerva-based generalisation and
reinforcement learning for possession football. Proceedings IEEE/RSJ International
Conference on Intelligent Robots & Systems (IROS 2001), Hawaii, October 2001
Kuhlmann G., & Stone P. (2004). Progress in learning 3 vs. 2 keepaway. In Polani D.,
Browning B., Bonarini A., & Yoshida K., Editors, RoboCup-2003: Robot Soccer
World Cup VII, Springer Verlag, Berlin, 2004
Lagoudakis M. G., & Parr R. (2003) Least-Squares Policy Iteration. Journal of Machine
Learning Research Vol. 4, pp. 1107 - 1149
Madden M. G., & Howley Tom (2004). Transfer of experience between reinforcement
learning environments with progressive difficulty. Artificial Intelligence Review,
Springer Netherlands, Vol 21, June 2004
Nau S. D. (1980). Pathology on game trees: a summary of results. Proceedings AAAI-80,
Stanford, CA, 102-104, 1980
Nedic A., & D. P. Bertsekas (2003). Least squares policy evaluation algorithms with linear
function approximation, Discrete Event Dynamic Systems. Vol. 13, pp. 79–110
Neumann G. (2005). The reinforcement learning toolbox: Reinforcement learning for optimal
control tasks, MSc Thesis, Graz University of Technology, Graz, 2005. Retrieved
August 29, 2007 from <http://www.igi.tugraz.at/ril-toolbox/thesis/DiplomArbeit.pdf>
Parr R. (1998). Hierarchical control and learning for markov decision processes.
PhD Thesis, University of California, Berkeley, 1998
Poole D. (n.d.) Q-Learning Applet. Retrieved May 05, 2007 from
<http://www.cs.ubc.ca/spider/poole/demos/rl/q.html>
Precup, D., Sutton, R., & Dasgupta S. (2001). Off-policy temporal-difference learning with
function approximation. Proceedings of the 18th International Conference on Machine
Learning, (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July
1, 2001. Morgan Kaufmann 2001
RoboCup.org (2007). The RoboCup federation. Retrieved February 09, 2007, from
<http://www.robocup.org/>
Russell S., & Norvig P. (2003). Artificial intelligence: a modern approach. Pearson
Education, Upper Saddle River, New Jersey, second edition, 2003
Shivaram K., Yaxin L., & Stone P. (2007) Half field offense in roboCup soccer: A multiagent
reinforcement learning case study. In Lakemeyer G., Sklar E., Sorenti D.,
&Takahashi T., Editors, RoboCup-2006: Robot Soccer World Cup X, Springer
Verlag, Berlin, 2007
Stone P. (2000). Layered learning in multiagent systems: A winning approach to robotic
soccer. MIT Press, 2000
Stone P. (1998). Learning in multi-agent systems. PhD thesis, Computer Science Department,
Carnegie Mellon University
Stone P. (n.d.). Learning to play keepaway. Retrieved February 07, 2007, from
<http://www.cs.utexas.edu/users/AustinVilla/sim/keepaway/>
Stone P., & Sutton R. S. (2001). Scaling reinforcement learning toward RoboCup soccer. In
Proceedings of the Eighteenth International Conference on Machine Learning, pp
537-544. Morgan Kaufmann, San Francisco, CA, 2001
Stone P., Sutton R. S., & Singh S. (2001). Reinforcement learning for 3 vs. 2 keepaway.
RoboCup2000: robot soccer world cup IV, Stone P., Balch T., and Kreatzschmarr G.,
editors, Springer Verlag, Berlin, 2001
Stone P., & Veloso M. (1999). Team partitioned opaque-transition reinforcement learning.
RoboCup-98: Robot Soccer World Cup II, Asada and H. Kitano, editors, Springer
Verlag, Berlin, 1999
Sutton R. S. (n.d.). Reinforcement Learning FAQ: Frequently Asked Questions about
Reinforcement Learning. Retrieved August 13, 2007 from
<http://www.cs.ualberta.ca/~sutton/RL-FAQ.html#backpropagation>
Sutton R. S, & Barto A. G. (1998). Reinforcement learning: An introduction. A Bradford
Book, The MIT Press, Cambridge, Massachusetts, London, England, 1998
Sutton R. S., Precup D., & Singh S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, Vol. 112, pp. 181–211, 1999
Tadepalli P., Givan R., & Driessens K., (2004) Relational reinforcement learning: An
overview, In Tadepalli, P & et al. Editors, Proceedings of the ICML'04 Workshop on
Relational Reinforcement Learning, pp. 1-9, 2004
Taylor M. E., Whiteson S. & Stone P. (2006). Comparing evolutionary and temporal
difference methods in a reinforcement learning domain. ACM Press. Proceedings of
the 8th annual conference on Genetic and evolutionary computation GECCO '06, July
2006
Tesauro G. J. (1995). Temporal difference learning and TD-Gammon. Communications of the
ACM 38, 1995
Tsitsiklis J. (n.d.). Neuro-dynamic programming. Retrieved February 01, 2007, from
<http://web.mit.edu/jnt/www/ndp.html>
Valluri S., & Babu S. K. (2002). Reinforcement learning for keepaway soccer problem.
Department of Computing and Information Sciences, Kansas State University. 2002
Watkins, C.J.C.H. (1989). Learning from delayed rewards. PhD Thesis, University of
Cambridge, England.
Young T., & Polani D. (2006). Learning RoboCup-keepaway with kernels. In Proceedings of
the Gaussian Processes in Practice Workshop. June 12, 2006
Yu H., & D. P. Bertsekas (2006). Convergence results for some temporal difference methods
based on least squares. MIT, LIDS Technical Report 2697, 2006
Appendix A. Simple Soccer Class Diagrams
SoccerTeam class from Buckland (2005)
PlayerBase Class from Buckland (2005)
StateMachine Classes from Buckland (2005)
Appendix B. Functions of the Learner Specific to Simple Soccer
Functions added to Simple Soccer Code to implement RL
/*
  Method:      OnMessage
  Description: The point at which the agent receives the impetus to act.
               Invokes the Q-Learner if a choice is required.
*/
bool RLGlobalPlayerState::OnMessage(FieldPlayer* player, const Telegram& telegram)
{
    if (telegram.Msg == Msg_GoHome)   //one of the environmental rules
    {
        player->SetDefaultHomeRegion();
        player->GetFSM()->ChangeState(ReturnToHomeRegion::Instance());
        return true;
    }

    //invoke the learner
    int state  = getState(player, telegram);
    int action = updateQPickAction(player, state);
    act(action, player, telegram);
    return true;
}
/*
  Method:      updateQPickAction
  Description: Updates the Q-value for the previous step and returns the
               action for the current step.
*/
int RLGlobalPlayerState::updateQPickAction(FieldPlayer* player, int state)
{
    int    currState = state;
    int    i         = episodeSteps.size() - 1;
    double r         = getReward(player);

    updateQ(previousState, currState, previousAction, r);

    int action     = pickAction(currState);
    previousState  = currState;
    previousAction = action;

    episodeRewards[i] += r;
    episodeSteps[i]++;
    return action;
}
/*
  Method:      getState
  Description: Queries the state signals and encodes them as a decimal number.
*/
int RLGlobalPlayerState::getState(FieldPlayer* player, const Telegram& telegram)
{
    vector<bool> v;
    v.push_back(player->GetFSM()->isInState(*SupportAttacker::Instance()));
    v.push_back(player->Team()->Receiver() == NULL);
    v.push_back(player->BallWithinKickingRange());
    v.push_back(telegram.Msg == Msg_ReceiveBall);
    v.push_back(telegram.Msg == Msg_SupportAttacker);
    v.push_back(telegram.Msg == Msg_Wait);
    v.push_back(telegram.Msg == Msg_PassToMe);
    return toDec(v);
}
/*
  Method:      act
  Description: Takes a decimal number representing an encoded action
               and carries out the corresponding action components.
*/
void RLGlobalPlayerState::act(int action, FieldPlayer* player, const Telegram& telegram)
{
    //Note: the repetition of if(i >= len) indicates that re-factoring of this
    //code to a more elegant design would be possible
    vector<bool> v = toVect(action);
    int i   = 0;
    int len = (int) v.size();

    if (i >= len) return;
    if (v[i++]) player->GetFSM()->ChangeState(ReceiveBall::Instance());

    if (i >= len) return;
    if (v[i++]) player->Steering()->SetTarget(player->Team()->GetSupportSpot());

    if (i >= len) return;
    if (v[i++]) player->GetFSM()->ChangeState(SupportAttacker::Instance());

    if (i >= len) return;
    if (v[i++]) player->GetFSM()->ChangeState(Wait::Instance());

    if (i >= len) return;
    if (v[i++]) player->FindSupport();

    if (i >= len) return;
    if (v[i++])
    {
        if (telegram.Msg == Msg_PassToMe)
        {
            FieldPlayer* receiver = static_cast<FieldPlayer*>(telegram.ExtraInfo);
            player->Ball()->Kick(receiver->Pos() - player->Ball()->Pos(),
                                 Prm.MaxPassingForce);
            Dispatcher->DispatchMsg(SEND_MSG_IMMEDIATELY,
                                    player->ID(),
                                    receiver->ID(),
                                    Msg_ReceiveBall,
                                    &receiver->Pos());
        }
    }

    if (i >= len) return;
    if (v[i++])
    {
        if (telegram.Msg == Msg_ReceiveBall)
            player->Steering()->SetTarget(*(static_cast<Vector2D*>(telegram.ExtraInfo)));
    }
}
/*
  Method:      toDec
  Description: Converts a boolean vector to the corresponding decimal number
               (binary to decimal conversion).
*/
int RLGlobalPlayerState::toDec(vector<bool> v)
{
    int total = 0;
    int power = 1;
    for (int i = (int)(v.size()) - 1; i >= 0; i--)
    {
        int val = v[i] ? 1 : 0;
        total  += val * power;
        power  *= 2;   //could be generalized to ranges of values > 2
    }
    return total;
}
/*
  Method:      toVect
  Description: Converts a number (integer) to a boolean vector corresponding
               to that number in binary.
*/
vector<bool> RLGlobalPlayerState::toVect(int num)
{
    vector<bool> v;
    while (num > 0)
    {
        bool val = ((num % 2) == 0) ? false : true;
        v.push_back(val);
        num = (int) num / 2;
    }
    reverse(v.begin(), v.end());
    return v;
}
/*
  Method:      getReward
  Description: Calculates the reward: 0 by default, -1 for conceding a goal
               and 1 for scoring.
*/
int RLGlobalPlayerState::getReward(FieldPlayer* player)
{
    int reward           = 0;
    int newOpponentScore = player->Team()->HomeGoal()->NumGoalsScored();
    int newOurScore      = player->Team()->OpponentsGoal()->NumGoalsScored();

    //This assumes that this function will be called in every timestep and
    //that more than one goal cannot be scored concurrently by either team
    if (newOurScore > ourScore)
    {
        reward = 1;
        log();              //log results
        incrementEpisode();
    }
    else if (newOpponentScore > opponentScore)
    {
        reward = -1;
        log();              //log results
    }

    ourScore      = newOurScore;
    opponentScore = newOpponentScore;
    return reward;
}
/*
  Method:      updateQ
  Description: Does the main work of the Q-learning algorithm. Takes two
               successive states, the action that caused their transition,
               and the reward gained, and uses these to update the qTable.
*/
void RLGlobalPlayerState::updateQ(int state1, int state2, int action, double reward)
{
    double currQ, maxNextQ, newQ, alpha, newVal;

    currQ    = qTable[state1][action];
    maxNextQ = maxQ(state2);
    newVal   = reward + (DISCOUNT * maxNextQ);

    visits[state1][action]++;
    alpha = 1.0 / visits[state1][action];

    newQ = ((1 - alpha) * currQ) + (alpha * newVal);
    qTable[state1][action] = newQ;
}
Appendix C. Original Project Plan
Thesis Statement
The thesis statement is: The Application of Reinforcement Learning (RL) to
Computer Soccer: A Case Study using the Simple Soccer Game
Reinforcement Learning, as defined by Kaelbling et al. (1996), is “the problem faced by an agent that must learn behaviour through trial-and-error interactions with a dynamic environment”. This proposal aims to look at a game, Simple Soccer (Buckland, 2005), in order to study the application of RL to game agents.
The thesis will attempt to answer the following questions:
• Can game players learn to play soccer through trial and error (using RL)?
• To which aspects of soccer game-play can RL be successfully applied?
• How do RL learners compare to the ‘default’ players in this case?
Need for the Project
RL has proved successful in solving problems in games, most famously backgammon
(Tesauro, 1995), and has been applied to many game types. RL researchers have
recommended the identification of games to which RL can be applied. This would help build
a test suite to examine how RL techniques perform in different contexts.
Soccer is a popular area of AI and robotics research. Researchers compete annually in
a robotics soccer tournament and also in computer simulated soccer leagues. The RoboCup
Soccer Simulator describes itself as a ‘research and educational tool for multi-agent systems
and artificial intelligence’ (Robocup, 2005). However due to its complexity there is a need for
an exploration of more lightweight frameworks for experimenting with AI in soccer. This
approach is appropriate here given the scope of this project.
Research methodology
One of the primary research methods to be used in the project is experimentation. As
the project is computational in nature it will lend itself to this method. This method is one in
which a researcher manipulates a variable (anything that can vary) under highly controlled
conditions to see if this produces (causes) any changes in a second variable. Experiments in
Computer Science may also involve construction and manipulation of the world in which the
experiment’s subject operates.
There are toolsets available for carrying out RL experiments. These contain aids for
setting up experiments for the RL researcher such as implementations of RL algorithms. For
the purposes of this project an experimentation toolset will not be used. This follows from the
aim that the project involves the construction of a light-weight RL implementation. The
implementation of the Learner in C++ will form a significant part of the work involved.
Experiments may be carried out to determine:
• State representation and an appropriate RL algorithm
• Useful actions
• Different parameters to the algorithm, such as learning rate and discount factor (in the case of a Q-Learning algorithm)
• Different reward structures
Before experiments are carried out a ‘gold standard’ of behaviour must be identified and
criteria decided on that will measure behavioural success e.g. goals scored, possession of ball.
Project Completion
Research closure will come about upon successful construction and completion of the
appropriate experiment or experiments. A successful experiment will be one which has a
measurable outcome and as such will have a defined start and end.
There will also be subjective criteria as to whether anything useful has been learned or
contributed to the field by the results of the experiments. The results will need to be
interpreted in the context of the current literature and with the help of the project supervisor.
The questions answered might include:
• Has anything been learned from this that could not have been more easily determined by studying the existing literature?
• Has anything been learned that differs from the findings of similar work?
• What are likely directions for useful future research given these results?
A qualitative evaluation of the process of implementing the learner will also be made.
Project Plan
The following phases include secondary research at each step as required and under
guidance of my supervisor:
Project Phases
• Review the codebase and architecture of the Simple Soccer framework
• Conduct research and submit literature review
• Implement the RL learner
• Design and construct experiments to determine an appropriate RL framework (experimental cycle 1)
• Run experiments (cycle 1)
• Interpret experiment results (cycle 1)
• Depending on the results in step 5 it may be necessary to alter or create new experiments (back to step 3)
• Design and construct experiments to determine valid rewards and parameters (experimental cycle 2)
• Repeat the experiment cycle as per steps 3–6 for cycle 2
• Conduct discussion with supervisor and further literature review if necessary to contextualize results
• Reflect on and qualitatively evaluate the implementation issues
• Write thesis first draft
• Write final thesis
The final date for delivering the thesis is August 31st; however, I hope to have my thesis completed before this date. I intend to schedule a lot of the workload during breaks in the schedule of my taught and project modules. The following is proposed:
Between December 18th and January 27th, carry out project phases 1–4.
By January 27th, contact my supervisor with the results of my initial experiments and implementation of the RL learner.
In January and February, receive feedback and incorporate it into my research for delivery of the term paper due February 19th.
On January 18th Karen Young provided us with details of the Term Paper due as part of our thesis, specifying that this paper should consist of a “reasonably comprehensive literature review, plus a solid draft of your research methodology and problem statement(s) for your primary research (altogether, 30+ pages and 15+ references, as a guideline)”.
Gantt Chart
[Gantt chart: project phases scheduled across October to August. Phases: Develop Research Question/Plan; Literature Review; Primary Research (Implement RL Learner, Design Experiments, Run Experiments, Interpret Results, Receive & Incorporate Feedback, Analyze/Interpret Data); Draft Thesis (Create first draft, Advisor review); Finalize Thesis (Complete modifications, 3rd party review, Advisor review, Modifications, Final Thesis Preparation). Deliverables: Final Research Question complete (Nov 30th); Literature Review Term Paper (Feb 19th); Complete Literature Review; Initial Experiments complete; Feedback incorporated; Analysis (extra experiments) complete; First Draft finished (May 30th); Final Thesis (July 20th). Other milestone dates shown: Jan 15th, Feb 26th, April 28th.]