Learning Goals in Sports Games
Jack van Rijswijck1
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6G 2E8
[email protected]
Abstract: The illusion of intelligence is most easily destroyed by predictable or
static behaviour. The ability for game characters to learn and adapt to the game
player's actions will become increasingly important as the Artificial Intelligence
(AI) in games develops. Yet in many games, specifically in all sports games, the
AI must be kept in a "sandbox": it must not be allowed to evolve in nonsensical
directions. This paper focuses on a strategy learning experiment as part of an AI
architecture under design in collaboration with Electronic Arts for their series of
sports games.
Keywords: Artificial Intelligence, Learning, Strategy.
Introduction
The ability to learn from experience is generally regarded as one of the most important future
developments in game AI [9]. In most game genres, especially the genre of sports games, any
adaptive AI must be prevented from developing nonsensical behaviours. The purpose of this
paper is to describe a strategy learning method as part of an AI architecture under
development for sports games. The method was tested with the game engine of Electronic
Arts’ FIFA 2002 [5]. It uses a behavioural model [1], in which the strategy is one of a number
of existing drives that are all active simultaneously. The drives are implemented as force fields,
similar to ones that have for example been used in The Sims [7] and in the “Robo-Cup” robotic
soccer tournaments [6,8]. This model can be extended to other sports games, and one can
also think of using it in Real Time Strategy games.
Where AI is sometimes informally defined as "anything that is not graphics", this paper will adopt the
definition that AI refers to those and only those decisions that the human gamer makes as
well. This in effect regards the AI as just another gamer. It has the advantage of providing a
clean interface between the AI and the rest of the code, and necessarily avoids cheating on
the part of the AI.
In the case of sports games, the characters within the game are sports players. To avoid
confusion, in this paper the human game player will be referred to as the gamer. The players
on the soccer field are the game characters, including the ones controlled by the gamers as
well as the ones controlled by the AI. Whenever the word player is used, it refers to a
character, not a gamer.

1 The author gratefully acknowledges the support of Electronic Arts Canada, the Alberta Research Council, iCORE, IRIS, and
NSERC.
Learning
In this paper, learning refers to acquiring truly new behaviour, as opposed to just modifying the
parameters of already existing behaviours. Learning is an attractive prospect, but there are
serious concerns that need to be addressed. However, it is possible to create a successful
game that features real learning, as for example Black & White [3] demonstrates.
Gamers often prefer online gaming against other humans over playing against AI opponents.
However, sports games involve entire teams of characters, out of which the gamer only
controls one character at any given time. Good AI is then still important, since it must be able
not only to work against the opponent, but also to work with the human. In addition, contrary to
many other games, the characters in a sports game tend to not die during the game. This
gives them ample time to display their level of intelligence.
One of the concerns of releasing a game that learns after it ships is that the AI is much more
difficult to test. It is worth noting, though, that the developers of Black & White feel that the
testing problem is actually quite manageable [2]. Another concern is that the AI must not learn
the wrong lessons from gamers who are, possibly deliberately, being incompetent. Finally, the
learning method must of course be able to run in realtime within the hardware constraints of
game platforms.
Commercial games are different from Robo-Cup soccer and most other games that academics
have traditionally studied. The goal in a commercial game is not to win, but to entertain. When
learning from a loss, an adaptive AI should not ensure that it never loses again — just that it
never again loses in the same way.
Sports games
Sports game AI faces one specific challenge that is absent from many other game genres.
The game simulates events that actually happen in the real world, and most gamers will
be quite familiar with what those events look like. Every feature of a sports game carefully tries
to re-create its real-world example as faithfully as possible; the AI should be no exception.
This goes not only for the behaviour of the individual players, but also for the team as a whole.
One unique feature of sports games is that ideally the characters and teams should behave
like their specific real-world counterparts. In a basketball game, the Shaquille O'Neal character
should never go for a lay-up instead of a dunk, and Dennis Rodman should never take a 3-point shot. In a soccer game, playing against Brazil should feel different from playing against
Italy. This makes the strategy sandbox even smaller; not only should the adapting Brazilian
strategy stay sensible, it should stay Brazilian. Behavioural models make this possible.
Behavioural models
The Finite State Machine (FSM) is a commonly used paradigm for game AI, appearing in
classical and fuzzy varieties. One feature of the FSM is that a character is by definition always
in just one particular state. Suppose the creature in Figure 1 can be in states like "afraid: flee
from enemy" and "hungry: search for food". If the creature chooses to flee, it will also move
away from all the food. If it chooses to find food, the hunger state might tell it to head for the
nearest food source, but that will take it dangerously close to the enemy. While it is hungry,
it forgets all about its other goals in life.
Figure 1: Various influences
In a behavioural model [1], such drives are all active simultaneously, and they have varying
levels of influence. In the model known as "schema-based coordination", each behaviour
generates a force field, pushing the creature in a certain direction. The force driving the
creature is the sum of all these influences. In Figure 1 there would be attractive forces from
both food items and a repelling force from the enemy. The result is that the creature can
satisfy both drives. Models like these are used for example in the Sims, with their
attractiveness landscape, and in Robo-Cup soccer [8].
If a game AI engine attempts to calculate “whether or not” predictions, such as whether or not
a character can reach a particular goal or whether or not it should execute a certain task, the
resulting behaviour can be rigid and predictable. The problem is analogous to that of the FSM: since the decision
is always one or the other, the character is confined to a patchwork of regions of identical
behaviour. Behavioural approaches can produce more dynamic and fluid behaviour.
In sports games like soccer, hockey, and basketball, the drives that govern an agent can be
things like "intercept ball", "stay onside", "attack goal", etc. The resulting behaviour satisfies all
these goals as much as possible. Each drive has a certain direction and strength. The strength
can depend on how important the drive is at a given instant. For example, "intercept ball" is not
important when a teammate has the ball.
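As a concrete illustration of this weighted-sum idea, the following Python sketch combines several drives into a single steering force. The Drive structure and the callable urgency functions are invented for this illustration and are not taken from the FIFA 2002 code.

# Illustrative sketch only: schema-based coordination as a weighted sum of 2D
# drive vectors. The Drive structure and field names are hypothetical.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec2 = Tuple[float, float]

@dataclass
class Drive:
    name: str
    direction: Callable[[dict], Vec2]  # direction suggested by this drive, given the game state
    urgency: Callable[[dict], float]   # how important this drive is right now

def steering_force(drives: List[Drive], state: dict) -> Vec2:
    """Sum all active drives, each scaled by its current urgency."""
    fx = fy = 0.0
    for d in drives:
        ux, uy = d.direction(state)
        w = d.urgency(state)           # e.g. zero for "intercept ball" while a teammate has the ball
        fx += w * ux
        fy += w * uy
    return (fx, fy)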
An additional advantage of behavioural models is maintainability. When a new state is added
to an FSM, the programmer needs to figure out all the new state transitions and remember
which other already existing goals need to be satisfied at the same time in the
new state. By contrast, adding a new behaviour in a behavioural model does not invalidate or
duplicate the already existing behaviours, since they are all active simultaneously. Each
behaviour just needs to know how important it is at any given time. These urgency levels
present an important opportunity: They can be subject to learning.
A learning example
When something undesirable happens, such as a goal scored by the opponent or maybe even
just a shot on goal, it may be possible to go back and adjust the urgency levels of the various
behaviours in order to stop the same thing from happening next time. One could even add a
new behaviour for each mistake, taking care of avoiding the same mistake in the future. Each
learning sample contains information about the situation in which it occurred, and information
about what the agents should have done about it. The latter piece of information is a drive
pushing the agents in a direction that hopefully avoids the mistake. The urgency of this drive
depends on how similar the current situation is to the learning sample.
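A minimal way to store such a learning sample, assuming the situation is summarised by the recorded ball trajectory and reusing the 1/(1+d) fall-off from the pseudocode in the appendix, could look as follows; the class and method names are hypothetical.

# Hypothetical sketch of a stored learning sample. The situation is summarised
# by the ball trajectory of the recorded play; the corrective drive becomes
# more urgent the closer the current ball position is to that trajectory.
from dataclasses import dataclass
from typing import List, Tuple

Vec2 = Tuple[float, float]

@dataclass
class LearningSample:
    trajectory: List[Vec2]  # salient ball positions from the recorded play

    def urgency(self, ball: Vec2) -> float:
        """Higher when the current ball position is near the recorded trajectory."""
        d = min((ball[0] - t[0]) ** 2 + (ball[1] - t[1]) ** 2 for t in self.trajectory)
        return 1.0 / (1.0 + d)  # squared distance, matching the appendix pseudocode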
One can think of ways to determine how similar two situations are, and to do so quickly
enough that the whole procedure is computationally feasible under real-time constraints. But how
to determine the drives: what should the agents have done to avoid repeating a certain mistake?
Figure 2 shows a learning example in FIFA2002. It encodes a situation that occurred in the
past, where the blue team ended up scoring. The trace shows the trajectory of the ball during
that play. When the trace is solid blue, one of the players of team Blue, indicated by the jersey
number, has the ball. When the trace is dotted, the ball is on its way from one player to
another. White dots indicate that the ball is close enough to the ground to be intercepted by a
player; if the ball is too high in the air, the dots are red. The sequence starts with a goal kick by
the blue goalkeeper, and ends with player 6 scoring a goal.
Figure 2: A learning example
If the AI has not learned anything from that sequence, then the same events can happen
again. Figures 3 and 4 show two snapshots of the play sequence as it unfolded. In both cases,
one of the blue players is about to send off a pass.
Figure 3: Blue 3 passes to Blue 11
Figure 4: Blue 11 crosses to Blue 6
Figure 3 shows player 3 about to pass the ball to the location indicated by the dotted circle.
Player 11 will run to that location to receive the pass. He can do this because he is closer to
the dotted circle than any of the opponent’s players. Figure 4 shows player 11 about to cross
the ball in front of the goal. The cross is targeted at player 6, who connects and scores the
goal. Again, the player was able to do this because he was closer to the reception point than
any of the opponents.
What might the other team have learned from this? The mistake was mostly due to defender
5, who allowed attacker 6 to slip past him. In the first snapshot, attacker 6 was still far behind
him. Defender 5 could have prevented this by moving closer to the key spot in Figure 4. At an
earlier point in the play, defender 2 or 6 could have interfered with attacker 11’s activities by
moving closer to the key spot in Figure 3.
The defending team does not need to prescribe who goes to those spots, just that someone
does. By the same token, it does not matter which one of the attacking players has the ball,
just that he has the ball in a location that resembles the one in the learning example. It is the
proximity of the ball to one of the key points that matters. Thus the learning example does not
give strategic hints to specific players, but rather to specific areas of the field. The same
learning example can become active in situations that are not identical but similar, when the
ball is near the trajectory of the learning example. It can also be used by the attacking team, in
order to try and repeat a successful play.
Force fields
The various drives of the players act as force fields. One of the drives is the one that
represents the adaptive strategy. Pseudocode for this force field is given in the appendix. The
behaviour learned from the example in the previous section is a force field that is anchored to
the playing field, not to any player in particular. This follows the adage "the intelligence is in
the environment, not in the ant", as in the Sims, where the instructions on how to use an
object are contained in the object, not in the character that uses it.
A force field specifies its influence on the characters, as a function of their location on the
soccer field. In addition, it also needs to specify the context in which it applies. A particular
learning example is relevant only if the current situation is similar to the one that happened in
the learning example. Thus the force depends on two parameters: location, and context.
When learning is triggered, the algorithm first needs to choose when the relevant play
sequence started. In this case the start of the sequence is defined as the latest time when the
scoring team gained possession of the ball, or the latest re-start of play, whichever happened
later. Next, the force field is calculated by making all the points on the ball trajectory attractors.
This excludes the points where the ball was high in the air, indicated by red dots, since the ball
could not be intercepted there. The attractor forces can be calculated as a gravity field,
diminishing with the square of the distance. This ensures that not all players head towards the
same spots, and also deals with the problem of a single player caught between having to
defend two spots: Instead of staying in the middle and defending neither spot, other forces will
make the player drift towards one of the two spots and then the stronger attraction will force a
commitment to that one.
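An illustrative Python sketch of this attractor construction (not the shipped implementation; the author's pseudocode is given in the appendix): every interceptable trajectory point pulls a field location towards it with a strength that diminishes with the square of the distance.

# Illustrative sketch: force contributed by one learning example at field
# location p. Points where the ball was too high to intercept are skipped.
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]

def attractor_force(p: Vec2, trajectory: List[Tuple[Vec2, bool]]) -> Vec2:
    """trajectory holds (point, interceptable) pairs from the recorded play."""
    fx = fy = 0.0
    for (tx, ty), interceptable in trajectory:
        if not interceptable:
            continue                      # red dots: ball too high in the air
        dx, dy = tx - p[0], ty - p[1]
        r = math.hypot(dx, dy)
        if r < 1e-6:
            continue                      # already at the attractor
        s = 1.0 / (r * r)                 # magnitude falls off with the square of the distance
        fx += s * dx / r                  # unit direction towards the attractor
        fy += s * dy / r
    return (fx, fy)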
For the learning example of Figure 2, the resulting force field is shown in Figure 5. The field is
indicated only on the points of a grid that covers the field. Since the field is well-behaved, it is
sufficient to store the field values only on those grid points and use interpolation elsewhere.
The grid resolution can be chosen to meet any storage capacity limits. Figure 5 shows the ball
trajectory in white, and the salient points on the trajectory are indicated as white dots. Salient
points are those where the ball changes possession; the ball can be possessed by a player or
by “ground” or “air”. Thus the points where the ball changes from close to the ground to high in
the air are also salient points. In this example, the attractive forces are calculated not for all
the points on the ball trajectory, but just for the salient points. This may be a sufficiently
effective summary of the trajectory. Using the full trajectory requires more computation, but it
only needs to be done once.
Figure 5: Force field resulting from the learning example
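The grid storage and interpolation described above can be sketched as follows; a bilinear scheme and the helper name sample_field are assumptions made for this illustration.

# Illustrative sketch: a force field precomputed on a regular grid and
# bilinearly interpolated elsewhere. Assumes a grid of at least 2x2 samples,
# where grid[i][j] holds the force at field position (i*cell, j*cell).
from typing import List, Tuple

Vec2 = Tuple[float, float]

def sample_field(grid: List[List[Vec2]], cell: float, p: Vec2) -> Vec2:
    x, y = p[0] / cell, p[1] / cell
    i = max(0, min(int(x), len(grid) - 2))
    j = max(0, min(int(y), len(grid[0]) - 2))
    u, v = x - i, y - j

    def lerp(a: Vec2, b: Vec2, t: float) -> Vec2:
        return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)

    top = lerp(grid[i][j], grid[i + 1][j], u)             # interpolate along x at row j
    bottom = lerp(grid[i][j + 1], grid[i + 1][j + 1], u)  # and at row j+1
    return lerp(top, bottom, v)                           # then along y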
The second parameter of the force field, its context, specifies how much influence it carries in
any given situation. In this example, the field is relevant to the Yellow team if Yellow does not have possession
of the ball, and the ball is near the trajectory of the learning example. Thus the strength of the
force field depends on the position of the ball. Figure 6 shows the force field resulting from two
learning examples, for a situation where the ball is closer to the original trajectory than to the
new one. In this situation, the former force is stronger than the latter.
Figure 6: Two learning examples
As Figure 6 suggests, it fortunately is not necessary to maintain one force field for each
learning example, since force fields are additive. For each position of the ball, the strengths of
all the force fields can be calculated and the fields can be added together. This results in one
net force field for each position of the ball. The collection of positions of the ball can, in turn,
also be sampled in a grid and interpolated. At runtime, the position of the ball indexes into one
of the force fields. In order to find the resulting force during game play, all that is needed are
four direct look-ups and an interpolation.
Temporal discounting can be introduced by making the earlier points more attractive than the
later points, to encourage the players to disrupt the play as soon as possible. This increases
the computational cost, but again that does not matter since it is an offline computation that
only needs to be done once. Figure 7 shows such a field.
Figure 7: Temporally discounted force field
Later in the same sequence it is no longer necessary to control the points that have already
been passed. Only the remaining points on the trajectory are interesting. This can be
addressed by adding the partial trajectory from each salient point to the end point as a
learning example, which is equivalent to temporal discounting where the later points in the
sequence are more attractive. This discounting factor is a parameter that can be played with.
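One simple way to realise such discounting is a per-point weight that multiplies the corresponding attractor term when the field is built. The discount factor gamma below is an illustrative choice, not a value taken from the experiments.

# Illustrative sketch: temporal discount weights for the trajectory points.
from typing import List

def discount_weights(num_points: int, gamma: float = 0.9) -> List[float]:
    """With gamma < 1, earlier points get larger weights, encouraging players
    to disrupt the play as soon as possible; gamma > 1 shifts the emphasis
    towards the later points instead."""
    return [gamma ** k for k in range(num_points)]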
Before and after
After the forces resulting from the learning example have been calculated, the same situation
can be re-started to see if the Yellow team have learned anything. Figure 8 shows the result.
The play starts out approximately the same. It begins to deviate a little when the ball gets to
player Blue-3, but the pass to the outside left wing is still fired off. However, defender Yellow-6
has now altered his behaviour sufficiently to get there in time and even manage to intercept
the ball. Figures 9 and 10 show a closer look at Yellow-6’s trajectory. In the first case Yellow-6
wastes too much time before heading over to Blue’s pass reception. The second case starts
the same, but then sees Yellow-6 turning around sooner and getting back in time.
Figure 8: After learning, Yellow intercepts the ball
Figure 9: Yellow-6's path before...
Figure 10: ... and after learning
Discussion
The main focus of this learning experiment is to augment, not replace, any existing strategy.
This may be compared to the subsumption architecture as used in robotics [4], and
corresponds to the behavioural approach of allowing all behaviours to be active, instead of
only one. An additional benefit of this is that the learning component is easy to add to an
existing commercial games program.
The learning model is deliberately kept simple in order to be cheaply calculated at runtime. It
involves a one-time calculation which can be done offline, for instance during goal celebration
animations. At runtime, a table lookup and an interpolation suffice. The memory requirements
can be adjusted as needed, by modifying the resolution of the grid on which the force fields are
sampled.
No attempt is made to discover theoretically optimal solutions, nor to model or predict game
situations as they occur. The goal is to make the game characters behave in a human-like way, as well as
to make sure that they do not become unbeatable; for both purposes, optimality is undesirable.
It is less important to make sure that the AI does not lose than to make sure that it does not lose
in the same way repeatedly.
Another requirement that is kept in mind is that the model should be able to learn from very
few examples, since the examples are to be provided by a human gamer playing the game in
realtime. The examples shown in this paper all involve changes that occurred as the result of a
single training example. In general, very few examples are needed to adjust the playing style
sufficiently to disrupt previous mistakes, while not disturbing the overall strategy.
There are several parameters and options to play with in the force field model. The decay rate
in space, time, and relevance can all be adjusted. The forces can be modeled as gravity fields,
with the strength of the force proportional to 1/r² where r is the distance to the attractor point,
but other fields are also possible. For instance, the field could be proportional to 1/r² when r > R
and to r/R³ when r < R, corresponding to the gravity field of an object of radius R. The intuition
behind this type of field is to diminish the attractive force when the location is already under
control of the player; near the centre of attraction, the force becomes zero.
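Expressed as a force magnitude, that alternative field could be sketched as follows; R is the hypothetical control radius and the function name is illustrative.

# Illustrative sketch: attraction strength of a "solid object" of radius R.
def field_strength(r: float, R: float = 5.0) -> float:
    """1/r^2 outside the radius, r/R^3 inside, so the force is continuous at
    r = R and fades to zero at the centre, i.e. at spots already under the
    player's control."""
    if r >= R:
        return 1.0 / (r * r)
    return r / (R ** 3)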
The experiments in this paper were performed using Electronic Arts’ FIFA 2002 software, but
they could also be applied to other sports games, such as hockey and basketball, as well as
other game genres, such as Real Time Strategy games.
Acknowledgments
I am indebted to Electronic Arts Canada, and in particular to John Buchanan, Jason Rupert,
and Matt Brown, for their cooperation, and to Jonathan Schaeffer for providing feedback on
early drafts of this paper.
References
1. Arkin, Ronald C. Behavior-Based Robotics. MIT Press, 1998.
2. Barnes, Jonty, and Jason Hutchens. Testing Undefined Behavior as a Result of Learning.
In Steve Rabin, editor, AI Game Programming Wisdom, pages 615-623. Charles River
Media, 2002.
3. Black & White. Lionhead Studios / Electronic Arts, 2001. See www.bwgame.com.
4. Brooks, Rodney A. Challenges for Complete Creature Architectures. From Animals to
Animats: Proceedings of the First International Conference on Simulation of Adaptive
Behavior, MIT Press, 1990.
5. FIFA 2002. Electronic Arts, 2001. See www.fifa2002.ea.com.
6. Robo-Cup soccer tournament. See www.robocup.org.
7. The Sims. Maxis / Electronic Arts, 2000. See www.thesims.com.
8. Stone, Peter, and David McAllester. An Architecture for Action Selection in Robotic Soccer.
In Jörg P. Müller, Elisabeth Andre, Sandip Sen, and Claude Frasson, editors, Proceedings
of the Fifth International Conference on Autonomous Agents, pages 316-323, Montreal,
Canada, 2001. ACM Press.
9. Woodcock, Steven. Game AI: The State of the Industry 2001-2002. Game Developer
Magazine, July 2002, pages 26-31.
Pseudocode
Below follows high-level pseudocode for calculating the force field associated with a new
training example, updating the existing force fields, and determining the forces at runtime. Let
FieldGrid and BallGrid be discrete sets of points both covering the soccer field at some
arbitrary resolution.
When a new training example arrives, containing a ball trajectory, a force field NewField[p] is
calculated where p is the position on the field.
foreach p in FieldGrid {
    foreach t in trajectory {
        /* pull field location p toward trajectory point t */
        NewField[p] += (t-p) / |t-p|²;
    }
}
Note that p and t are vectors, and that |t-p| denotes the length of the vector.
The existing forces are encoded in MainField[b,p], which encodes the force at field location p
when the ball is in position b. The parameter b specifies the context. The new field is added to
the main field with its influence depending on the distance of b to the trajectory.
foreach b in BallGrid {
    /* determine d = squared distance of the ball position b to the trajectory */
    d = infinity;
    foreach t in trajectory {
        d = min(d, |b-t|²);
    }
    /* add NewField to MainField[b] with strength 1/(1+d) */
    foreach p in FieldGrid {
        MainField[b,p] += NewField[p]/(1+d);
    }
}
At runtime, the force at field position p when the ball is in position b can be found by snapping p
to the FieldGrid and b to the BallGrid, and looking up MainField[b,p] directly. If the two grids have
low resolution, the field can instead be looked up at the surrounding grid points and then interpolated.
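As a rough Python rendering of that runtime step, using an assumed data layout rather than the shipped implementation, the simple nearest-grid-point variant looks as follows.

# Illustrative sketch: nearest-grid-point lookup of the learned force.
from typing import List, Tuple

Vec2 = Tuple[float, float]
GridField = List[List[Vec2]]              # force vectors sampled on the FieldGrid
BallIndexedField = List[List[GridField]]  # one GridField per BallGrid point

def runtime_force(main: BallIndexedField, ball: Vec2, p: Vec2,
                  ball_cell: float, field_cell: float) -> Vec2:
    """The ball position selects one stored field via the BallGrid; the player
    position then indexes the FieldGrid. With coarse grids one would instead
    interpolate between the surrounding grid points, as described above."""
    def snap(x: float, cell: float, n: int) -> int:
        return max(0, min(int(round(x / cell)), n - 1))

    field = main[snap(ball[0], ball_cell, len(main))][snap(ball[1], ball_cell, len(main[0]))]
    i = snap(p[0], field_cell, len(field))
    j = snap(p[1], field_cell, len(field[0]))
    return field[i][j]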