Artificial Intelligence for Pac-Man
Callum R. Breen - 1122265
BSc Computer Science - Cardiff University
Supervisor: Yukun Lai
Moderator: Frank C Langbein
May 6, 2014
Abstract
This report demonstrates the effectiveness of applying the artificial intelligence technique Reinforcement Learning to a variation of the popular arcade game Pac-Man. The solution uses a combination of Q-Learning and function approximation. The approach proved successful, as the agent learned to interact with its environment after only a small amount of training.
Contents
1 Introduction
  1.1 Overview
  1.2 The Game
  1.3 Reinforcement Learning
  1.4 Function Approximation
  1.5 Framework
2 Approach
  2.1 Requirements
  2.2 Q-Learning Algorithm
  2.3 State Generalisation
  2.4 Reward Function
  2.5 Features
3 Implementation
  3.1 Episode Logic
  3.2 Calculate State Worth
  3.3 Update Weights
  3.4 Calculate Transition Reward
  3.5 Choose Move
  3.6 Features
4 Results & Evaluation
  4.1 Experiments
    4.1.1 Experiment One
    4.1.2 Experiment Two
    4.1.3 Experiment Three
  4.2 Evaluation
5 Future Work
6 Conclusions
7 Reflection on Learning
1 Introduction
1.1 Overview
The aim of this project is to develop an artificially intelligent autonomous agent which is able to learn to play the game of Ms. Pac-Man successfully. The implemented agent should be able to improve its score by seeking out pills, power pills, and edible ghosts, whilst avoiding non-edible ghosts.
In order to develop this agent, the "Reinforcement Learning" technique (particularly the Q-Learning algorithm [4]) was used to allow the agent to learn how to respond to situations within its environment. This was combined with "Function Approximation", as it was necessary to generalise states; without it, the amount of time taken for the agent to learn how to respond in each situation would be far too great.
Many papers [8][9] have used the game of Pac-Man (or a variation of it) to demonstrate the effectiveness of artificial intelligence techniques. The reason for this is that even though the game has very simple rules, it still requires a good solution for an agent to achieve a decent score. This is because the environment is stochastic (ghosts in Ms. Pac-Man have semi-random movements) and has an enormous state space (many variables make up a single state).
The initial plan was to develop a rule-based approach which used an evolutionary algorithm [7] to adjust the rules' parameters. However, after further research I found that the Reinforcement Learning approach would be a more rewarding experience, as I had no prior knowledge of the subject and already knew a good deal about evolutionary algorithms from a previous module.
This report discusses in more detail the approach used to solve the learning problem. It also shows the design of the agent, gives an account of the implementation process, and presents the results and findings from the research carried out.
1.2 The Game
Pac-Man is an arcade game developed by Namco and first released in Japan on May 22nd, 1980. It is considered a landmark in the history of video games and is among the most famous arcade games of all time (others being Space Invaders, Asteroids, and Pong).
The objective of the game is for Pac-Man (controlled by the player) to collect all the "pills" within the maze whilst being chased by Pac-Man's enemies (known as Blinky, Pinky, Inky, and Clyde). Collecting all the pills and power pills allows the player to progress to the next stage. By performing certain actions (e.g. collecting pills, power pills, and bonus items, and eating vulnerable enemies) the player accumulates a higher score; the higher the score, the better the player/agent has performed.
The framework used to implement the solution is based on a variation of the original game called "Ms. Pac-Man". The main differences between the versions are that Ms. Pac-Man has four different mazes instead of one, and the ghosts are harder to beat because they have semi-random movements built into their behaviour, unlike the original game, whose ghosts follow a strict pattern. The in-game rewards for Ms. Pac-Man are:
Event              Reward   Description
Pill Eaten         +10      Ms. Pac-Man has eaten a pill.
Power Pill Eaten   +50      Ms. Pac-Man has eaten a power pill.
Ghost Eaten        +200     Ms. Pac-Man has eaten one ghost.
After one ghost has been eaten, the score doubles for every other ghost which
is eaten for the duration of that single power pill.
1.3 Reinforcement Learning
Reinforcement Learning [1] is a branch of Artificial Intelligence which focuses on how an agent learns to make the right decisions by interacting with its environment; the main goal is for the agent to maximise its cumulative reward.
An agent learns by recording situations (known as states) and assigning each a corresponding value which represents how good or bad that state is, so the next time the agent needs to decide whether to move to a certain state, it can decide based on that value. To determine how good or bad a state is, a reward function specifies what value should be returned when a certain event occurs; for example, if Pac-Man were to be eaten, a value of -99 might be returned and that state's value would be updated accordingly.
The environment is formulated as a Markov Decision Process [5], which has four main elements (a minimal sketch of these elements follows the list):
• The first is the set of states (S), the representation of the environment. States can be as basic as just the position of the agent, or the positions of both the agent and the enemies, or how far away something is, or something far more complicated (as long as it represents something within the environment).
• The second is the set of actions (A) an agent can take within its environment. Actions are what allow the agent to transition from one state to another. For example, in the game of Pac-Man, the actions are to move UP, DOWN, LEFT, or RIGHT, or to remain NEUTRAL.
• The third is the discrete time step (t). Whenever a transition is made from one state to another, the time is counted up by one. For example, if Pac-Man were in the fifth state he has visited since the game began, that state would be referred to as S5; if he then moved to another state, the new state would be called S6.
• The fourth is the transition reward (R). When the agent moves from one state (S1) to another state (S2), a reward is returned depending on the result of that transition. For example, if Pac-Man were to move to another state and be eaten by a ghost, the transition reward would be -99.
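As referenced above, a minimal sketch of how these four elements might be represented in Java is given below; the type and method names (GameState, Action, rewardFor) are illustrative assumptions, not part of the framework described later.

enum Action { UP, DOWN, LEFT, RIGHT, NEUTRAL }       // the action set A

class GameState                                      // one element of the state set S
{
    int pacmanNode;                                  // position of the agent
    int[] ghostNodes;                                // positions of the enemies
}

class TransitionReward
{
    // R(s, a, s'): the reward returned for the transition just made.
    double rewardFor(GameState s, Action a, GameState sNext)
    {
        if (agentEaten(sNext))
        {
            return -99;                              // e.g. Pac-Man was eaten
        }
        return 0;                                    // default reward
    }

    boolean agentEaten(GameState s)
    {
        return false;                                // placeholder check against ghost positions
    }
}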
1.4 Function Approximation
Function approximation [2] is an approach which takes a given function and "generalises" its output so that similar values are grouped together, reducing the number of distinct outputs. This is very useful when combined with the reinforcement learning algorithm, as it can be applied to states, which both reduces the number of states (saving memory) and the time taken to learn their values. For example, suppose Pac-Man is in a state directly next to a state occupied by a non-edible ghost, and Pac-Man moves into that ghost and gets eaten. Without function approximation, Pac-Man would only learn that the move was bad in that particular place in the environment; with approximation, Pac-Man learns that all states with a ghost next to him are bad to move into.
Function approximation is not only beneficial to the problem, it is necessary. If only the position of Pac-Man were recorded as the state in an environment with one hundred possible locations, there would be one hundred different states; taking into consideration the position of a single ghost as well makes it ten thousand (100 * 100) different states just for two positions. The number of states grows exponentially as more (and more complicated) features are added.
1.5 Framework
To implement the solution to this problem, the framework from http://www.pacman-vs-ghosts.net/software was used. This framework was chosen because it is well made, provides plenty of functionality for querying the state of the game, and allows the solution to be implemented easily. The biggest benefit of using it is that the paths between every node have been pre-computed and saved to a file, making parts of the implementation very efficient. The framework is written in Java, and so is the solution.
2 Approach
2.1 Requirements
The final system should have:
• A successful implementation of the reinforcement learning algorithm.
• A successful implementation of the state approximation function.
• An agent which learns to interact successfully within a small amount of training.
• A feature which allows the agent to learn to eat pills.
• A feature which allows the agent to learn to eat power pills.
• A feature which allows the agent to learn to chase after edible ghosts.
• A feature which allows the agent to learn to avoid or run away from non-edible ghosts.
• All the features working together.
2.2 Q-Learning Algorithm
Q-Learning is a reinforcement learning algorithm which aims to learn an optimal policy for every given state. The optimal policy selects the action the agent can take from the given state which results in the most reward, max_a Q(s, a).
It uses temporal differences [6] to calculate the policy value. It begins with a default estimate for each state-action pair, Q(s, a), and then receives an episode: an episode is when an agent is in state s, performs action a, receives reward r, and ends up in state s', written (s, a, r, s'). Using both the previous estimate and this new episode, it updates the old estimate (to be used again the next time the state-action pair is visited) according to the Q-Learning formula:
Figure 1: Q-Learning Formula.
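For reference, the standard form of this update rule, which matches the expansion given in section 2.3, is:

Q(s, a) = Q(s, a) + α[R(s, a, s') + γ * max_a' Q(s', a') - Q(s, a)]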
The actual algorithm is as follows:
Figure 2: Q-Learning Algorithm.
The learning rate (α) is a number between 0 and 1; the higher the value, the faster the agent learns.
The discount factor (γ) is also a number between 0 and 1; the higher the value, the more the agent takes future rewards into consideration, rather than just the immediate ones.
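The following sketch illustrates the tabular Q-Learning loop that Figure 2 describes. It is a self-contained toy example; the placeholder transition and reward logic are assumptions and not part of the Pac-Man solution.

import java.util.Random;

// A minimal toy sketch of the tabular Q-Learning loop shown in Figure 2.
// The transition and reward logic here are placeholders, not the Pac-Man environment.
class TabularQLearning
{
    static final int NUM_STATES = 100;
    static final int NUM_ACTIONS = 5;

    double[][] q = new double[NUM_STATES][NUM_ACTIONS];   // the estimates Q(s, a)
    double alpha = 0.5;                                    // learning rate
    double gamma = 0.5;                                    // discount factor
    Random rng = new Random();

    void runEpisode(int steps)
    {
        int s = 0;                                         // start state
        for (int t = 0; t < steps; t++)
        {
            int a = bestAction(s);                         // greedy action selection
            int sNext = rng.nextInt(NUM_STATES);           // placeholder transition
            double r = (sNext == NUM_STATES - 1) ? 10 : 0; // placeholder reward
            double target = r + gamma * q[sNext][bestAction(sNext)];
            q[s][a] += alpha * (target - q[s][a]);         // the Q-Learning update
            s = sNext;
        }
    }

    int bestAction(int s)
    {
        int best = 0;
        for (int a = 1; a < NUM_ACTIONS; a++)
        {
            if (q[s][a] > q[s][best]) best = a;
        }
        return best;
    }
}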
2.3 State Generalisation
In order to implement the Q-Learning algorithm, adjustments had to be made to keep the state space manageable. To do this, features were introduced: a feature is a function which returns a numerical value representing a certain characteristic of the environment. The combination of features is known as the "state". For example, a feature representing whether or not a ghost is in a particular location would return 1 if it were there, and 0 if it were not.
Each feature is given its own "weight" [3]; the weight is a value which represents how good or bad that feature is. For example, if a feature represents whether a power pill is in that particular part of the environment, then since a power pill has a positive event reward (section 2.4), the weight should learn a positive value.
To calculate each feature's weight, the Q-Learning algorithm is applied to each weight separately; the weight acts as the "estimate". So instead of:

Q(s, a) = Q(s, a) + α[R(s, a, s') + γ * max_a' Q(s', a') - Q(s, a)]
If there were two feature functions, it would be:

Difference = R(s, a, s') + γ * max_a' Q(s', a') - Q(s, a)

W1 = W1 + α * Difference * F1(s, a)
W2 = W2 + α * Difference * F2(s, a)

"W" represents the weight estimate and "F" represents the value returned by the feature function.
To calculate what a state is worth, each feature function's value is multiplied by its weight and the results are summed. For example, with two features, one indicating that a ghost is in the state and one indicating that a pill is also there, it would be:

(GhostFeatureFunction = 1) * (GhostFeatureWeight = -99) = -99
(PillFeatureFunction = 1) * (PillFeatureWeight = 5) = 5
-99 + 5 = -94 = StateWorth
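In general, under this linear function approximation the worth of a state-action pair is the weighted sum of the feature values; this general form is implied by the example above rather than stated explicitly:

Q(s, a) = W1 * F1(s, a) + W2 * F2(s, a) + ... + Wn * Fn(s, a)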
2.4 Reward Function
In order for the agent to learn whether its actions were good or bad, a reward function must be designed so that after every transition a reward is returned depending on what occurred. The rewards are:
Event              Reward   Description
Initial Reward     0        The default value for transitioning between states.
Pill Eaten         +5       The agent has eaten a regular pill.
Power Pill Eaten   +25      The agent has eaten a power pill.
Agent Ate Ghost    +25      The agent has eaten a vulnerable ghost.
Agent Was Eaten    -500     The agent was eaten by a ghost.
For example, if the agent ate a pill and a ghost in the latest transition, it would be:

InitialReward + PillEaten + AgentAteGhost = TotalReward
0 + 5 + 25 = 30
The rewards were chosen to reflect how good or bad each event is relative to the others.
2.5 Features
Careful consideration had to be taken when designing these features, as not only did they have to work effectively on their own without any unexpected results, they also had to work in harmony with the others. Thanks to the approximation function, the features did not need to be designed around the number of different outputs they produce, since a state's worth is calculated on demand from the weights (section 2.3) rather than having each state-value saved to memory. It is important to note that any feature output greater than 1 is converted to a value less than 1 with the formula:

ConvertedOutput = (1.0 / OutputToConvert) / 10

The reason for converting the output is that values greater than 1 cause the results to diverge wildly. If any of the features below output anything other than 0 or 1, this formula is used.
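A minimal sketch of this scaling step is shown below; the method name normalise is an illustrative assumption, and only the formula itself comes from the text above.

// Sketch of the output conversion described above (method name is illustrative).
// Distances greater than 1 are mapped into (0, 0.1], so closer targets yield larger feature values.
static double normalise(double featureOutput)
{
    if (featureOutput > 1.0)
    {
        return (1.0 / featureOutput) / 10.0;   // e.g. a distance of 5 becomes 0.02
    }
    return featureOutput;                      // binary (0/1) features pass through unchanged
}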
Go after Pill
In order for the agent to learn to actively seek out pills, the agent must be
told where those pills are. This feature will return the distance from the
agent to the closest pill – the distance will be returned as the shortest path
distance. The agent will learn that as it gets closer to the destination the
greater the reward it may receive, and this will be shown to work as the
feature weight takes on a positive value.
Go after Power Pill
In order for the agent to learn to actively seek out power pills, the agent
must be told where they are. This feature will return the distance from the
agent to the closest power pill – the distance will be returned as the shortest
path distance. The agent will learn that as it gets closer to the destination
the greater the reward it may receive, and this will be shown to work as the
feature weight takes on a positive value.
Go after Edible Ghosts
In order for the agent to learn to actively seek out edible ghosts, the agent must be told whether any ghosts are edible; if any are, the feature function returns the shortest path distance between the agent and the nearest edible ghost. The agent will learn that as it gets closer to the destination the greater the reward it may receive, and this is shown to work as the feature weight takes on a positive value.
Ghost Is Close
This feature will allow the agent to learn whether being in proximity of a
non-edible ghost is good or bad. This feature will return 1 to represent that
a ghost is very close, and 0 if it is not. The weight for this feature function
should learn a negative value, meaning that the agent will avoid all states in
which this feature returns a 1.
Escape from Ghosts
In order for the agent to stay alive as long as possible, it is important to create a feature which allows it to keep a good distance between itself and the ghosts when it has the opportunity to do so. The feature returns the average shortest path distance to all non-edible ghosts. This feature's weight should learn a negative value.
Nearest Safe Junction
This feature evaluates the nearest junctions in all directions from the current position of the agent; the evaluation finds a junction which the agent can reach before any ghost can. Once that junction is found, the function returns the shortest path distance from the agent to that junction. The idea behind this is that the agent will be able to move from junction to junction safely. The weight for this feature should take on a positive value.
Dangerous Direction
This feature evaluates each direction from the current position of the agent; it checks whether there is a ghost between the agent and the nearest junction in that direction. This should prevent the agent from moving towards a junction which a ghost is blocking. The weight for this feature should take on a negative value.
3 Implementation
During implementation, each section of code was thoroughly tested to make sure there weren't any errors or undesirable results. Each feature was tested separately with the learning algorithm to make sure it was working as intended.
3.1 Episode Logic
The code below shows what each episode consists of. It starts with the agent choosing the best possible move it can make from the current state (line 3). The ghost controller's moves are also obtained (line 5), and the worth of the current state is stored (line 7). The agent then "transitions" between states based on the chosen move (line 9). Using both the previous state's worth and the reward just received, the weights are updated with the Q-Learning function (line 11).
1  while (!game.gameOver())  // For each episode
2  {   // Get the agent's move (the move with the most reward)
3      MOVE pacMove = getMove(game.copy(), -1);
4
5      EnumMap<GHOST, MOVE> ghostMove = ghostController.getMove(game.copy(), -1);
6
7      oldQValue = calculateQValue(game.copy(), game.getPacmanCurrentNodeIndex());
8
9      game.advanceGame(pacMove, ghostMove);
10
11     updateWeights(game.copy());
12 }
3.2 Calculate State Worth
This method calculates the state's worth (Q-Value) for the given node. Lines 4-6 obtain the feature function values and assign them to the features array. Each value is then multiplied by its corresponding weight and accumulated into the Q-Value (line 12). The Q-Value is then returned to the calling method.
1   calculateQValue(Game game, int node)
2   {
3       qFeatures fts = new qFeatures();
4       features[0] = fts.closestFoodDistance(game, node);
5       features[1] = fts.averageGhostDistance(game, node);
6       // Other features here...
7
8       double qValue = 0.0;
9
10      for (int i = 0; i < weights.length; i++)
11      {
12          qValue += features[i] * weights[i];
13      }
14
15      return qValue;
16  }

3.3 Update Weights
This method takes the previous estimate and reward to calculate the new
estimate. Line 5 calculates the difference between the expected result and
the actual result. Line 9 uses that difference to calculate the new estimate
for each weight.
1   updateWeights(Game game)
2   {
3       double[] oldFeatures = features;
4
5       double difference = (calculateReward(game) + discountFactor * calculateQValue(game, game.getPacmanCurrentNodeIndex())) - oldQValue;
6
7       for (int i = 0; i < weights.length; i++)
8       {
9           weights[i] = weights[i] + learningRate * difference * oldFeatures[i];
10      }
11  }
3.4 Calculate Transition Reward
This method returns the reward for the transition just made. It uses some of the framework's methods to check whether an event has happened, and adds the corresponding reward if it has. Line 11 returns the total reward to the updateWeights() method which called it.
1   calculateReward(Game game)
2   {
3       double reward = 0;
4
5       if (game.wasPillEaten())
6       {
7           reward = reward + 5;
8       }
9
10      // Other event checks here...
11      return reward;
12  }
3.5 Choose Move
In order for the agent to choose its next move, the getMove() method was created. To calculate what the next move should be, all states which are immediately reachable from the agent's current state are evaluated, the one with the highest state worth is chosen (a greedy policy), and the move required to get to that state is returned.
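A minimal sketch of how such a greedy getMove() might look is given below. The neighbour-lookup calls (getPossibleMoves, getNeighbour) are assumed framework methods and may not match the actual API exactly; calculateQValue is the method from section 3.2.

// Sketch of greedy move selection; neighbour-lookup calls are assumptions.
MOVE getMove(Game game, long timeDue)
{
    int current = game.getPacmanCurrentNodeIndex();
    MOVE bestMove = MOVE.NEUTRAL;
    double bestWorth = Double.NEGATIVE_INFINITY;

    // Evaluate every state immediately reachable from the current node (greedy policy).
    for (MOVE move : game.getPossibleMoves(current))          // assumed framework call
    {
        int neighbour = game.getNeighbour(current, move);     // assumed lookup of the adjacent node
        double worth = calculateQValue(game.copy(), neighbour);
        if (worth > bestWorth)
        {
            bestWorth = worth;
            bestMove = move;
        }
    }
    return bestMove;
}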
3.6 Features
All features share the same structure. They take a copy of the game, the
index of the state which is being calculated (int node), and possibly a direction.
featureFunction(Game game, int node, MOVE move)
{
    featureValue = 0;

    // Perform feature calculations

    return featureValue;
}
Many of the features were implemented as described in the design section. However, a few features needed adjustments to make them work correctly with the system.
The first was the feature which returns the distance to the closest food. The problem was that its worth would override other features, causing the agent to move into a ghost. This was solved by disabling the feature when a non-edible ghost is very close, and the same solution was applied to other features which caused the same effect. A sketch of this gating logic is shown below.
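A sketch of the gating applied to the closest-pill feature follows; the helper methods and the GHOST_PROXIMITY_LIMIT threshold are hypothetical, not framework methods, and the output conversion follows section 2.5.

// Sketch of the "Go after Pill" feature with the ghost-proximity gate described above.
// The helper methods and GHOST_PROXIMITY_LIMIT are hypothetical, not framework APIs.
double closestPillFeature(Game game, int node)
{
    // Disable the feature when a non-edible ghost is very close, so it cannot
    // override the danger-related features.
    if (distanceToNearestNonEdibleGhost(game, node) < GHOST_PROXIMITY_LIMIT)
    {
        return 0;
    }

    double distance = shortestPathDistanceToNearestPill(game, node);
    return (distance > 1.0) ? (1.0 / distance) / 10.0 : distance;   // conversion from section 2.5
}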
The second was the feature which returns the average shortest-path distance to all of the ghosts. The problem was that if any of the ghosts were very far from the agent's position, the feature would produce unintended effects. The solution was to disregard ghosts which were more than 50 nodes away.
The third was the feature which returns the distance to the closest power pill. It was very difficult to get this feature's weight to take on a positive value. The solution was to only activate the feature when the agent is less than 20 nodes away; this also stopped the agent from moving straight from one power pill to the next, as experimentation found that eating power pills in quick succession was detrimental to the agent's score. Another adjustment was to only allow the agent to purposely go to a power pill when a non-edible ghost is in close proximity, which helps the agent get out of dangerous situations.
The biggest problem when implementing these features was the inability to get the "Nearest Safe Junction" and "Dangerous Direction" features to work with the algorithm. A copious amount of time was spent trying to fix them; many things were tried, and they were rewritten at least a dozen times. Even after spending so much time on them, the problem remains unsolved.
4 Results & Evaluation
4.1 Experiments
In order to demonstrate the effectiveness of the approach, three experiments
were carried out. The first was to show how fast the agent could learn to
play the game at a reasonable level with the best combination of features.
The second was to test how well the agent played against different ghost
behaviours. The third was to see how long the agent takes to calculate the
best move from its given state. These experiments also demonstrate that the
implementation is fully functional.
In order to ensure the accuracy of the results, for each experiment training was run n times, training was then turned off, and using the learned weights the game was run 5000 times and the scores averaged. All experiments were carried out on the same computer (64-bit, quad-core 3.4GHz, 8GB RAM).
The learning rate and discount factor were both set to 0.5 as this would
strike the right balance between the agent absorbing new information and
taking into consideration future rewards.
The features used in these experiments were the ones which were successfully implemented: "Go after Pill", "Go after Power Pill", "Go after Edible Ghost", "Ghost is Close", and "Escape from Ghosts" (see section 2.5). The initial weight values were set to 0.
4.1.1 Experiment One
The purpose of this experiment is to determine how fast the agent learns to play the game, that is, how long it takes the algorithm to find the weights which give the maximum average score. Each training run can consist of thousands of episodes; a training run begins when the game starts and ends when the agent runs out of lives. The table below (Figure 3) shows the different in-game scores achieved by the agent, as well as the average number of pills eaten per game.
Figure 3: Experiment 1 Score Results.
As the results above show, it only takes one additional training run to improve the score by a significant amount. The reason for the massive improvement from 1 training run to 2 training runs is that in the first run the weight which leads the agent to pills was negative and had not had enough time to become positive; it becomes positive after the second training run. However, as each training run consists of many episodes, it is possible for it to take on a positive value within the first training run.
Another important thing which affected the score was that the weight for "Go after Power Pill" remained 0, as the agent did not encounter states containing a power pill. Only after the agent learnt that chasing normal pills leads to a positive score did it have the chance to discover that power pills also lead to a positive score. The power pills themselves were not what made the score significantly larger; it was that they made the ghosts edible.
An interesting aspect of the agent's behaviour is that from 10 training runs to 2000 training runs the agent slowly learns to maximise its rewards by disregarding pills (the average pill count gets lower) and favouring edible ghosts instead. This is done by lowering the weight for "Go after Pill" and increasing the weight for "Go after Edible Ghost"; even if the change in weights is slight, it shows over 5000 runs.
If you wanted to make the goal of the agent to collect the greatest number of pills, you could lower the reward for eating a ghost in the reward function and increase the reward for eating a pill. However, it may be that going after edible ghosts is what allows the agent to survive longer and collect the most pills; it is very hard to get right as there are so many variables. The table below (Figure 4) shows the difference in weights learned.
Figure 4: Experiment 1 Weight Results.
4.1.2 Experiment Two
The purpose of this experiment is to determine how well the agent plays the game against different ghost behaviours. The framework in which the solution is implemented provides a number of ghost behaviours.
From the previous experiment, 250 training runs was found to be an ideal amount of training for obtaining good weights; this is the number of runs used in this experiment. Below is a table of the results from this experiment.
Figure 5: Experiment 2 Results.
Looking at the average scores, you can see that the agent performs best against "RandomGhosts"; this is because those ghosts have no strategy for trapping the agent. However, after a certain amount of time the agent does make unfortunate moves and gets trapped in a tunnel, because the "Nearest Safe Junction" and "Dangerous Direction" features are not working to prevent that scenario. You can also see that the agent performs much worse against "Legacy2TheReckoning", because that ghost behaviour has a very effective strategy.
Despite the agent's inability to avoid being trapped by ghosts, the basic behaviours such as going after pills, chasing edible ghosts, and steering away from danger work quite well. The "Escape from Ghosts" feature is not useless either, as without it the scores can be much lower. The table below (Figure 6) shows the results from playing against all the ghost behaviours without the feature.
Figure 6: Experiment 2 Results Without Escape Feature.
The first noticeable change in the results was that, on average, the agent performed better against "RandomGhosts". The reason is that the ghosts' sporadic behaviour would cause the agent (with the feature) to move around seemingly at random, trying to avoid them; without the feature, the agent would ignore the ghosts (except immediate danger) and focus on improving its score. Even though the escape feature does not completely prevent the agent being trapped, the results show that it does help considerably, as without it the agent's scores against "Legacy" and "StarterGhosts" become considerably lower. This was also shown by the minimum score, which dropped by about 15 thousand.
The second noticeable change (or lack of one) was the agent's score against "Legacy2TheReckoning". The reason is that the ghosts' behaviour was so good that it made the feature completely ineffective. Just as in the matches against "RandomGhosts", disabling the feature allowed the agent to focus on improving its score (the score is slightly higher, and a greater number of pills is collected).
4.1.3 Experiment Three
The purpose of this experiment is to determine how long the agent takes to make decisions. Since the worth of states is calculated on demand using weights rather than storing state-value pairs in memory, it is important to investigate how long the agent takes to calculate its next move. The reason for doing so is that in some circumstances there may be a time limit within which the agent must make its decision.
One example of such a time limit is the competition from which the framework is taken: users upload their own agent or ghost AI to compete to see who has developed the best controller. In the competition, a controller must make a decision every 40 milliseconds.
In order to record the time taken for the agent to make a decision, the Java method System.currentTimeMillis() [12] was used to measure how long the getMove() method took to execute. With the 5 active features, the agent takes at most 2 milliseconds to make a decision. Even though the agent makes its decisions with considerable time to spare, it is still important to minimise the time taken, as this allows for more features, and more complicated ones. A sketch of the timing measurement is shown below.
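A minimal sketch of how this measurement might be wrapped around the move selection is shown below; System.currentTimeMillis() is the actual Java method used, while the surrounding variable names are illustrative.

// Sketch of timing a single decision; variable names are illustrative.
long start = System.currentTimeMillis();
MOVE pacMove = getMove(game.copy(), -1);                 // the decision being timed
long elapsed = System.currentTimeMillis() - start;
maxDecisionTime = Math.max(maxDecisionTime, elapsed);    // track the worst case over a run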
Currently, the implementation does not include any speed optimisation techniques. One technique to improve the performance of decision making could be "multi-threading" [10], which allows multiple calculations to be performed at the same time.
4.2 Evaluation
The first experiment shows that the solution was a success, as it demonstrates many of the algorithm's strengths. The first strength is that the agent can learn quite quickly (1-2 training runs) which elements of the environment are good and bad. The second is that it can find a satisfactory set of weight values which lead to a very good average score; after just 5 training runs it gets close to its peak score (Figure 3). The third is that the agent constantly tries to improve its cumulative reward (altering the weights to favour "Go after Edible Ghost"). The fourth is that, through the reward function, it is possible to change the goal of the agent quite easily. It is also worth noting that the solution did all this with the initial weights set to 0; the weights could have been set beforehand (e.g. +50 for "Go after Pill") to find good weight values even quicker.
The second experiment also shows that the project was a success, as it demonstrates the effectiveness of the features. Even with just the successfully implemented features, the agent was able to interact fluidly with the environment. Another positive is that features can easily be added and removed when needed, making implementations very flexible. However, the successfully implemented features are not enough for the agent to get a very good score. To achieve that, more advanced features need to be implemented, such as whether moving to a certain junction would be a good or bad decision, or which parts of the environment are more likely to get the agent trapped.
The third experiment shows that the agent can make decisions very quickly, but it is important to try to limit the time taken, as there may be time constraints in some situations. However, much of the efficiency is due to the framework having pre-computed all paths and distances; this may not be the case in other problems.
Overall, I think that the solution to the problem was a success, and
the only hindrance to it was the inability to successfully implement those
advanced features which would have gotten the agent a much greater score.
5 Future Work
The first major thing I would do is continue working on getting the failed features to work with the algorithm. Another important aspect to investigate would be finding a way to create features which represent certain areas of the different mazes; some mazes are more difficult than others because parts of their environment offer fewer ways for the agent to escape. A feature which represents a certain part of a maze would allow the agent to learn that that part of the environment is bad, and avoid it.
The second major thing I would do is investigate ways to make the implementation faster; I believe this could be done with the "java.util.concurrent" [11] library. I would also revisit the existing code to look for any redundancy or inefficiency, and replace it.
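As one possible direction, a sketch of evaluating the candidate moves in parallel with java.util.concurrent is shown below. This is an illustrative idea only; evaluateMove is a hypothetical helper wrapping the state-worth calculation from section 3.2.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: score each candidate move on its own thread.
// evaluateMove() is a hypothetical helper returning the worth of the resulting state.
List<Future<Double>> scoreMovesInParallel(List<MOVE> candidateMoves) throws InterruptedException
{
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Callable<Double>> tasks = new ArrayList<>();
    for (MOVE move : candidateMoves)
    {
        tasks.add(() -> evaluateMove(move));              // one task per possible move
    }
    List<Future<Double>> worths = pool.invokeAll(tasks);  // runs the tasks and waits for completion
    pool.shutdown();
    return worths;
}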
6 Conclusions
The aim of this project was to develop an artificially intelligent autonomous agent which is able to play the game of Pac-Man, and to demonstrate its effectiveness. To develop the AI, the reinforcement learning technique Q-Learning was used to teach the agent how to respond within its environment. Q-Learning also had to be combined with function approximation, as the number of states was too great. To demonstrate its effectiveness, experiments were carried out to measure the main aspects of the solution.
The project was a success, as the reinforcement learning algorithm combined with function approximation was fully implemented and working correctly. Five features were also successfully implemented and working in combination.
From the experiments, much was learnt about the effectiveness of the approach. The first lesson was that the Q-Learning technique is well suited to the problem, as many of its strengths were identified: the agent learns to interact reasonably fast, finds good weights very quickly, constantly strives for the best cumulative reward, and the algorithm is flexible in that the goal of the agent can be changed quite easily.
Another major lesson was that function approximation is essential to reinforcement learning tasks, as most problems have an enormous number of states.
The experiments also demonstrated the effectiveness of features. With
just the basic features, the agent was able to perform quite well. However, in
order for the agent to get a much better score, more advanced features need
to be implemented.
7 Reflection on Learning
The first major lesson I learnt from this project was that my perception of time was wrong. Originally, I thought that three months would be plenty of time to get everything done. There were a number of things which I didn't take into consideration. The first was the possibility that I could be ill; I hadn't been ill in over five years (although I think it may have been food poisoning which made me ill for a week). The second was that I was overconfident about how long it would take to complete an assignment from another module; it took me two weeks as it was really hard (I did get 93% in it, though), which also taught me that I'm terribly slow at writing reports and essays. Thankfully, I used that knowledge for this report and cut my development time short so I could finish the report in time. If I do a project like this again, I'll be more realistic about what's possible in the time available, and account for contingencies. Unfortunately, I have no idea how to increase my report-writing speed.
The second thing I learnt was that Reinforcement Learning has a very steep learning curve. It was very hard to understand many of the concepts, as most reports and tutorials seemed to be written for people who already had a good understanding of the subject; it probably didn't help that the last time I did Maths was at GCSE, and mathematical notation can be very confusing. It was also very hard to find answers, as the subject is quite niche. Now that I understand it, I can read papers on RL without much difficulty, and I'm going to apply the knowledge to other problems for programming experience.
The third thing I learnt was that, for this project, a thorough understanding of how everything works is required, as one small mistake can upset the whole system. At the beginning of development, if I encountered a problem I would often "brute force" it by changing things until it worked; this approach is very time consuming and causes more problems than it solves. That is why I started going through each piece of code bit by bit and checking its output; as well as solving problems quicker, this gave me a much better understanding of how everything works.
The fourth was that not having any real direction made things harder and slowed progress. My first implementation was one whose state consisted of just the agent's position and one ghost; after implementing this and fixing all the errors which came with it, I discovered that the solution was useless and needed to be discarded. A lot of time was spent on things such as creating unique keys for states and then retrieving their values. A lot of time was also wasted through my inexperience in programming and general knowledge of certain things; one big problem I had was trying to use the hashCode() method on a string built with StringBuilder, only to find out that hashCode() doesn't work on it the way I expected. Next time I do a similar project, I'll fully research every element before implementing it.
References
[1] Reinforcement Learning – http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node7.html – accessed 27/04/14, 16:50
[2] Function Approximation – http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node85.html – accessed 27/04/14, 16:50
[3] Q-Learning Approximation – https://www.youtube.com/watch?v=8IakYZnMdIk – accessed 27/04/14, 16:50
[4] Q-Learning – http://link.springer.com/article/10.1007/BF00992698 – accessed 27/04/14, 16:50
[5] R. Bellman. A Markovian Decision Process – http://www.iumj.indiana.edu/IUMJ/FTDLOAD/1957/6/56038/pdf – accessed 27/04/14, 16:50
[6] Temporal-Difference Learning – http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node60.html – accessed 27/04/14, 16:50
[7] Evolutionary Algorithms – http://www.cs.vu.nl/~gusz/ecbook/Eiben-Smith-Intro2EC-Ch2.pdf – accessed 27/04/14, 16:50
[8] Low-Complexity Rule-Based Policies – http://www.jair.org/media/2368/live-2368-3623-jair.pdf?q=pacman – accessed 27/04/14, 16:50
[9] Higher-Order Action-Relative Inputs – http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6615002 – accessed 27/04/14, 16:50
[10] Multi-Threading – http://en.wikipedia.org/wiki/Multithreading_%28software%29#Multithreading – accessed 27/04/14, 16:50
[11] Concurrency – http://docs.oracle.com/javase/tutorial/essential/concurrency/ – accessed 27/04/14, 16:50
[12] currentTimeMillis – http://www.tutorialspoint.com/java/lang/system_currenttimemillis.htm – accessed 27/04/14, 16:50