
TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY.
Department of Electrical Engineering
Control and Robotics lab.
Project Book: Special Project (Project C)
Artificial Intelligence via Reinforcement Learning
Intellibot – Reinforcement Learning in RoboCode
By
Avishay Livne
Instructor
Tomer Merkado
Spring semester 2006
1 Table of contents
2 Figures and tables list
3 Introduction
3.1 Abstract
3.2 Theoretical Background
4 Project goal
5 Project design
5.1 The RoboCode environment
5.2 Problem formulation
6 System architecture
7 Simulation results
8 Summary
8.1 Conclusions
8.2 Future work
9 Appendices
9.1 Result graphs
9.2 RoboCode user manual
9.2.1 Setting up the environment
9.2.2 Creating a battle
9.2.3 Adding robots to RoboCode
10 Bibliography
Introduction
Abstract
This project is a study of Machine Learning, with a focus on Reinforcement Learning (RL) algorithms.
The first part of the project is theoretical and includes a literature review of papers and academic material on the subject.
The second part of the project is practical. In this part the theoretical topics from the first part were implemented in order to solve a real problem.
The environment that was used for implementing the RL algorithms is called RoboCode. RoboCode is an environment that was developed by IBM as a tool for learning Java. In RoboCode, Java objects (called robots) battle each other under a set of given rules. Every programmer can develop his own robot, using the interface the environment supplies. Most robots use various movement patterns (in order to dodge bullets) and aiming patterns (predicting the enemy's position in the near future). It has been shown empirically that robots which use such patterns outperform robots that act randomly.
In this project robots that aim to learn such patterns were developed. These robots
were designed to improve by playing against existing robots.
In order to compare different RL algorithms, several different robots were developed.
After the implementation phase the robots were battled against existing robots in order to train their agents and improve. A few aspects were investigated during the simulations: how training a robot against robot A affects its performance against robot A, against a learning robot that learned to play against robot B, and against some other non-learning robot.
Moreover, the learning times of the different robots were compared. The learning time of a given robot is the time that passes from the moment the robot starts to learn until its winning percentage against the robot it is learning from becomes stable.
In some of the result graphs a real improvement is visible, while in others the outcome is less clear-cut.
Theoretical Background
Reinforcement learning – a survey with focus on Temporal Differences and Evolutionary Algorithms
Introduction
Reinforcement learning (RL) is an approach to automating goal-directed
learning and decision-making. This approach addresses problems in which an agent interacts with an environment. The agent's actions
receive a reward from the environment (usually represented as a real
number). RL algorithms aim to find a policy that maximizes the
cumulative reward for the agent's actions.
The basic RL model consists of:
1. Set of environment states.
2. Set of possible agent actions
3. Set of real scalar rewards.
At any given time the agent perceives its state and a set of possible actions. It chooses an action and receives a reward and a new state.
Based on this interaction with the environment the agent should develop
a policy that maps states into actions, such that the cumulative reward
for the agent's actions is maximal.
RL has two main classes of methods: methods that search in the space of value functions and methods that search in the space of policies. The evolutionary algorithms approach is an example of the second class, where the policy is modified as a whole. Temporal difference (TD) is an example of the first class, in which the attempt is to learn the value function (which determines the rewards). The function is learnt through trial and error while exploring the environment.
Difficulties and tradeoffs in RL
RL has some main problems:
1. The reward lets the agent know how good its action was, but not whether it was the best action or how good it was compared to other possible actions. The agent has to evaluate the rewards over time and develop a policy based on its evaluations, which could be wrong.
2. Exploitation vs. Exploration: In a given state the agent has to decide on one possible action. Sometimes it will have a favorable action, for which it received a high reward in the past. The agent could pick this favorable action, but then it forgoes the chance to explore other actions, which could lead to a better reward in the future. Exploitation is using the accumulated knowledge to choose a greedy action; this behavior aims to maximize the reward for a single play. Exploration is choosing a random action; this way the agent learns more about the environment and its database becomes more accurate. In terms of genetic algorithms, exploitation is analogous to recombination, while exploration is analogous to mutation. Since the agent can't explore and exploit on the same step, this tradeoff is one of the main problems of RL.
3. Large state space: in some systems the space of states is too large to be explored and the agent can't develop a policy that maps every state to an action. In such cases the agent should be able to apply a single rule to a set of states.
4. Incomplete state information: the agent bases its actions on the way it perceives the environment, but sometimes it isn't aware of all the information in the environment. The agent may think its current state is s1 while the actual state is some different s2, and it can't distinguish between the two because of its limited perception. This situation leads the agent to select an inappropriate action.
Methods and approaches in RL
As mentioned above, RL has two main classes of methods.
1. Methods that search in the policy space, such as Evolutionary Algorithms, perform the learning after a complete execution. These methods don't refer to the reward given at every single step during the run, but examine only the cumulative reward after completion.
2. Methods that search in the value-function space, such as Temporal Differences, consider the reward received at each of the steps.
Every approach has its own advantages and disadvantages, so choosing an approach should be based on the specific problem itself. Moreover, some methods consider both the step reward and the final cumulative reward, so there's no strict distinction.
Policy space case: Evolutionary Algorithms
In order to find an optimal solution (algorithm) to a given RL problem, the EA method searches for a solution based on the idea of evolution by natural selection. In the initialization step one or more algorithms are created, either randomly or user-defined. In every iteration the next generation of algorithms is created and evaluated. Most of the new algorithms are produced by recombination of two older algorithms that received high evaluations. A few of the new substructures are created by mutation of old substructures. After the new generation is created every algorithm is evaluated, usually by simulating the algorithms and ranking them according to their performance.
The rationale behind recombination is taking the best from parents that perform well compared to other algorithms. The rationale behind mutation is to keep exploring new genes and policies; in the mathematical representation of the problem this is a way to avoid local maxima.
The pseudo code for a general EA:
procedure EA
begin
initialize P(t); t = 0;
evaluate structures in P(t);
while termination condition not satisfied do
begin
t = t + 1;
select P(t) from P(t-1);
alter structures in P(t);
evaluate structures in P(t);
end
end
Major design decisions are how to represent the space of policies using
chromosomes, and how to evaluate the population of algorithms. In
addition the user has to set different control parameters like population
size, recombination rate, mutation rate and parent selection rules.
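
As an illustration only, a minimal Java sketch of this loop is given below. The chromosome representation (a policy encoded as a double array), the fitness interface and the control-parameter values are assumptions made for the sketch, not part of RoboCode or of this project's implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch of the generic EA loop above. The representation and the
// control parameters are illustrative assumptions.
public class SimpleEA {
    interface Fitness { double evaluate(double[] chromosome); }

    private static final Random RND = new Random();
    private static final int POPULATION_SIZE = 50;     // assumed control parameter
    private static final double MUTATION_RATE = 0.05;  // assumed control parameter

    public static double[] run(int length, int generations, Fitness fitness) {
        // initialize P(0) randomly
        List<double[]> population = new ArrayList<>();
        for (int i = 0; i < POPULATION_SIZE; i++) {
            population.add(randomChromosome(length));
        }
        for (int t = 1; t <= generations; t++) {
            // evaluate structures in P(t-1): sort best-first by fitness
            population.sort((x, y) -> Double.compare(fitness.evaluate(y), fitness.evaluate(x)));
            // select P(t): keep the better half as parents
            List<double[]> next = new ArrayList<>(population.subList(0, POPULATION_SIZE / 2));
            // alter structures: recombination of two well-evaluated parents, plus mutation
            while (next.size() < POPULATION_SIZE) {
                double[] a = next.get(RND.nextInt(POPULATION_SIZE / 2));
                double[] b = next.get(RND.nextInt(POPULATION_SIZE / 2));
                double[] child = new double[length];
                for (int j = 0; j < length; j++) {
                    child[j] = RND.nextBoolean() ? a[j] : b[j];                         // recombination
                    if (RND.nextDouble() < MUTATION_RATE) child[j] = RND.nextDouble();  // mutation
                }
                next.add(child);
            }
            population = next;
        }
        population.sort((x, y) -> Double.compare(fitness.evaluate(y), fitness.evaluate(x)));
        return population.get(0);   // best policy found
    }

    private static double[] randomChromosome(int length) {
        double[] c = new double[length];
        for (int j = 0; j < length; j++) c[j] = RND.nextDouble();
        return c;
    }
}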
Value function space case: Temporal Difference
TD is a method for predicting the future reward. For every step the agent reaches, a reward prediction is made, and whenever an observation is available the prediction is adjusted to better match the observation. The main difference between TD and other RL methods is that in TD the prediction changes not only according to the actual value found at the end of the time segment, but also according to the more recent predictions, which are assumed to be more accurate. A simple example is weather prediction: suppose a weatherman attempts to predict on each day of the week whether it will rain on Saturday. The conventional approach is to compare each prediction to the actual outcome (whether or not it rained on Saturday). In the TD approach we compare each day's prediction with the prediction of the following day. If on Monday we predicted a 50% chance of rain and on Tuesday we predicted a 75% chance, then a TD method would increase the prediction for days similar to Monday. Because TD methods use the gathered experience more efficiently, they tend to converge faster.
The formula for updating the prediction, which is also known as the recursive Bellman equation, is
V(s_t) ← V(s_t) + α·(r_t + V(s_{t+1}) − V(s_t)).
We update the prediction of each step, V(s_t), according to its previous estimate, the new temporal difference V(s_{t+1}) − V(s_t), and the immediate reward r_t. α represents the learning rate.
In the Q-function viewpoint, where we evaluate every (state, action) pair, this can be written as
Q(s_t, a_t) ← Q(s_t, a_t) + α·(r(s_t) + max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)).
The decision to take the maximal Q(s_{t+1}, a_{t+1}) is justified by the desire to increase convergence speed.
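
A hedged Java sketch of this tabular update is given below. The table layout and the parameter values (alpha as the learning rate, gamma as the discount factor that appears in the robots' update later in this report) are assumptions used only for illustration.

// Illustrative tabular form of Q(s,a) <- Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a)).
// State/action indices and parameter values are assumptions of the sketch.
public class TabularQ {
    private final double[][] q;          // q[state][action]
    private final double alpha = 0.1;    // learning rate (assumed value)
    private final double gamma = 0.9;    // discount factor (assumed value; set to 1 for the undiscounted form above)

    public TabularQ(int numStates, int numActions) {
        q = new double[numStates][numActions];
    }

    /** Applies the temporal-difference update for one observed transition. */
    public void update(int s, int a, double reward, int nextState) {
        double best = q[nextState][0];
        for (double v : q[nextState]) best = Math.max(best, v);   // max over a_{t+1}
        q[s][a] += alpha * (reward + gamma * best - q[s][a]);
    }

    /** Greedy (exploitation) action for a state. */
    public int bestAction(int s) {
        int best = 0;
        for (int a = 1; a < q[s].length; a++) if (q[s][a] > q[s][best]) best = a;
        return best;
    }
}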
The case described above is known as the special case TD(0), while the general case TD(λ) weights the n most recent steps. In TD(λ) the reward is calculated using the future rewards of the following steps; the future rewards tell us more accurately whether our decisions were correct, therefore the reward applied to a given decision would be
R_t^(n) = Σ_{i=t+1}^{t+n−1} λ^(i−t−1)·r_i,
where 0 ≤ λ ≤ 1 is the weight parameter. Alternatively it is possible to use
R_t^(n) = Σ_{i=t+1}^{t+n} λ^(i−t−1)·r_i + λ^n·V_t(s_{t+n}),
so that the value of the n-th following state is considered as well. With the new reward we can re-evaluate the state value: V(s_t) = V(s_t) + α·(R_t − V(s_t)). The tradeoff of using TD(λ) over simple TD(0) is more demanding computation and higher memory consumption, against faster convergence and more accurate prediction.
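
Assuming the per-step rewards and state values are already stored in arrays, the second form of the weighted return above could be computed as in the following sketch (array names are illustrative):

// Sketch only: R_t^(n) = sum_{i=t+1}^{t+n} lambda^(i-t-1) * r[i] + lambda^n * v[t+n].
// r[i] is the reward observed at time i, v[i] the current value estimate of the state at time i.
public static double nStepReturn(double[] r, double[] v, int t, int n, double lambda) {
    double ret = 0.0;
    double weight = 1.0;                     // lambda^(i-t-1), starting at lambda^0
    for (int i = t + 1; i <= t + n; i++) {
        ret += weight * r[i];
        weight *= lambda;
    }
    ret += Math.pow(lambda, n) * v[t + n];   // value of the n-th following state
    return ret;
}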
The pseudo code for the basic TD(λ) algorithm:
procedure TD(λ)
begin
initialize V(s) arbitrarily for all s ∈ S; t = 0;
repeat for each step of the episode
choose action a from the possible set of actions for state s.
take action a, observe reward r and next state s'.
V(s_t) ← V(s_t) + α·(r_t + V(s_{t+1}) − V(s_t))
s ← s'
until s is a terminal state
end
Project Goals
The main goal of the project is to get familiar with ML, with a focus on RL algorithms, and with how to use it for Artificial Intelligence (AI) purposes. Below is a list of goals that were defined in the project:
 Literature review
o Read a set of selected papers and books on the theoretical part of
the project.
o Summarize the relevant material in a report.
 Selecting a suitable problem
o Finding a problem that could be solved via RL algorithms.
This problem should meet a few requirements:
 It should be feasible to formulate the problem as an RL problem.
 It should be feasible to implement the problem.
 It is preferable to find a problem that is also solvable with non-RL techniques, in order to have a benchmark for the RL solution.
 Solving the problem
o Learning the problem's environment.
o Designing a non-learning solution.
o Formulating the problem as an RL problem.
o Designing learning solutions.
o Implementing the learning solutions.
 Investigating performance
o Running different simulations for different scenarios.
o Collecting results for later comparison.
o Producing graphs to ease the examination.
 Summary and conclusions
o Explaining the results
Project design
RoboCode environment
"RoboCode is an easy-to-use robotics battle simulator that runs across
all platforms supporting Java 2. You create a robot, put it onto a
battlefield, and let it battle to the bitter end against opponent robots
created by other developers. RoboCode comes with a set of pre-fab
opponents to get you started, but once you outgrow them, you can enter
your creation against the world's best in one of the leagues being
formed worldwide." – From IBM's RoboCode website.
The RoboCode platform supplies programmers with an environment of tanks that battle each other on a plain, object-less battlefield. The programmer writes only the code that defines the behavior of the tank. The platform provides different tank sensors and controllers. The sensors are represented as a set of get functions (e.g. getEnergy(), getHeading()) and a set of events (e.g. BulletHitEvent, HitByBulletEvent). The programmer uses these inputs to determine an action and then outputs it to the game using action functions such as fire(), turnLeft() etc.
A detailed manual explaining how to use RoboCode is attached in
appendix 2.
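
For orientation, the sketch below shows a minimal, non-learning robot written against this interface; it only demonstrates the sensor and action calls mentioned above (MinimalBot is a hypothetical name and is not one of the robots developed in this project).

package intelligence;

import robocode.HitByBulletEvent;
import robocode.Robot;
import robocode.ScannedRobotEvent;

// Minimal illustration of the RoboCode interface: sensors are read through
// get methods and events, actions are issued through methods like fire().
public class MinimalBot extends Robot {
    public void run() {
        while (true) {
            ahead(100);          // action: move forward
            turnGunRight(360);   // action: sweep the gun to scan for enemies
            back(100);           // action: move backward
        }
    }

    public void onScannedRobot(ScannedRobotEvent e) {
        // sensor input arrives as an event; respond with an action
        if (getEnergy() > 10) {
            fire(1);
        }
    }

    public void onHitByBullet(HitByBulletEvent e) {
        turnLeft(90 - e.getBearing());   // simple evasive turn
    }
}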
Problem formulation
RL aims to learn and optimize a mapping from states to actions by trial
and error interactions with the environment.
In order to implement RL the following should be formulated:
 States
 Actions
 Reward
 Q-function
Below are two approaches to the problem formulation. The first one is detailed and intended to map the world with high accuracy, therefore aiming to achieve an accurate policy; the second reduces the level of detail in order to keep the Q-function table small enough to be practical.
First attempt – detailed function
States
The robot's state is defined according to the following parameters:
Heading – the direction of the robot. 4 possible values {up / right / down
/ left}.
Gun heading – the direction of the robot's gun. 4 possible values {up /
right / down / left}.
Radar heading – the direction of the robot's radar. 4 possible values {up
/ right / down / left}.
Robot's location – the (x, y) coordinates of the robot. Under the assumption that the battlefield's dimensions are 800x600, we divide the ground into 40 areas of size 40x15. 40 possible values.
Robot's energy – the amount of energy left. Instead of 100 different
values we shall group together every 10 sequential levels, resulting in
10 different possible values {>90, >80 … >10, >0}.
Enemy's location, relative angle – The place of the enemy relative to the
robot's heading. 4 possible values {up / right / down / left}.
Enemy's location, distance – The distance from the robot to the enemy. By drawing imaginary radii we can use 4 different values for this parameter: {x ≤ 150, 150 < x ≤ 300, 300 < x ≤ 450, 450 < x} (values are in units of pixels).
Enemy's energy – the amount of energy left to the enemy. 10 possible
values.
Was hit by a bullet? – 2 possible values {was hit / wasn't hit} by a bullet in the last turn.
Collided with a wall? – 2 possible values {collided / didn't collide} with a wall in the last turn.
Summing up, in this approach there are 4x4x4x40x10x4x4x10x2x2 = 16,384,000 different states.
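
Such a composite state has to be packed into a single table index. The following sketch shows one possible mixed-radix encoding using the value counts listed above; the parameter order and names are assumptions.

// Illustrative mixed-radix packing of the detailed state into one index
// in the range [0, 16,384,000). Each argument is already discretized.
public static int encodeState(int heading,      // 0..3
                              int gunHeading,   // 0..3
                              int radarHeading, // 0..3
                              int area,         // 0..39
                              int myEnergy,     // 0..9
                              int enemyAngle,   // 0..3
                              int enemyDist,    // 0..3
                              int enemyEnergy,  // 0..9
                              int wasHit,       // 0..1
                              int hitWall) {    // 0..1
    int[] values = {heading, gunHeading, radarHeading, area, myEnergy,
                    enemyAngle, enemyDist, enemyEnergy, wasHit, hitWall};
    int[] sizes  = {4, 4, 4, 40, 10, 4, 4, 10, 2, 2};
    int index = 0;
    for (int i = 0; i < values.length; i++) {
        index = index * sizes[i] + values[i];   // mixed-radix accumulation
    }
    return index;   // 4*4*4*40*10*4*4*10*2*2 = 16,384,000 distinct indices
}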
Actions
The following list contains the possible actions:
Move, direction – 2 possible values {forward / backwards}.
Move, distance – 5 possible values {0 / 10 / 20 / 40 / 80}.
Turn – change the angle of the robot. 4 possible values {up / right /
down / left}.
Turn gun - change the angle of the gun. 4 possible values {up / right /
down / left}.
Turn radar - change the angle of the radar. 4 possible values {up / right /
down / left}.
Fire power – 5 possible values {0 / 0.5 / 1 / 2 / 3}. 0 means don't shoot.
Summing up, in this approach there are 2x5x4x4x4x5 = 3200 different
possible actions.
Reward
The objectives of the robot are defined by the reward for every action.
The reward is derived from the change in the robot's energy and the
enemy's energy. It can be calculated as:
Reward = ΔE_me − ΔE_enemy
where ΔE_me is the change in the robot's energy (current energy − last turn's energy) and ΔE_enemy is the change in the enemy's energy.
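
A small sketch of this calculation (the previous-turn energy values are assumed to be stored by the robot between turns):

// Sketch only: Reward = dE_me - dE_enemy, computed from per-turn energy snapshots.
public static double reward(double myEnergy, double lastMyEnergy,
                            double enemyEnergy, double lastEnemyEnergy) {
    double dMe = myEnergy - lastMyEnergy;          // change in the robot's energy
    double dEnemy = enemyEnergy - lastEnemyEnergy; // change in the enemy's energy
    return dMe - dEnemy;                           // positive when the enemy loses more energy than we do
}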
Problem
The problem with this formulation is the amount of space required to store all this information. The Q-function table must contain a rating for every (State, Action) pair, resulting in 16,384,000 x 3,200 = 52,428,800,000 values. Assuming every value is represented as a float variable and consumes 4 bytes, this amount of values demands about 195 gigabytes. A database this big can't be held in memory at runtime, and therefore this level of detail isn't practical.
The next attempt reduces the level of detail in order to maintain a small database, small enough to be read before the battle begins under RoboCode's environment limitations (robots must finish their initialization phase within a certain time constraint).
Second attempt – reducing space
States
The robot's state is derived only from the ratio r = (enemy's energy) / (robot's energy), according to the following table:

State            Priority   Ratio
Very Dangerous   6          r > 1.5
Dangerous        5          1.5 ≥ r > 1.2
Unsafe           4          1.2 ≥ r > 1
Normal           3          1 ≥ r > 0.7
Safe             2          0.7 ≥ r > 0.5
Very Safe        1          0.5 ≥ r
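
A direct Java rendering of this table could look as follows (the method name is illustrative):

// Illustrative mapping of the energy ratio r = enemyEnergy / myEnergy
// to the six states of the table above (returned as the state's priority, 6..1).
public static int stateFromRatio(double r) {
    if (r > 1.5) return 6;   // Very Dangerous
    if (r > 1.2) return 5;   // Dangerous
    if (r > 1.0) return 4;   // Unsafe
    if (r > 0.7) return 3;   // Normal
    if (r > 0.5) return 2;   // Safe
    return 1;                // Very Safe
}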
Actions
A predefined set of actions is available to the robot, and the desired action for each state is described in the following table:

Priority   Action           Appropriate for state
6          Very defensive   Dangerous, Very Dangerous
5          Defensive        Unsafe, Dangerous
4          Normal           Normal
3          Strong           Safe, Normal
2          Offensive        Safe
1          Very offensive   Very Safe, Safe
The robot has a basic behavior, and the chosen action determines a few of the parameters of this behavior in each turn.
Each turn the robot performs the following actions: move, turn, fire and scan for other robots.
In safer states the robot allows itself to take more aggressive actions, meaning firing with more power and moving more frantically.
On the other hand, in more dangerous states the robot takes more defensive actions, including reducing its fire power and calming its movement.
When using more firepower the robot suffers a bigger penalty for misses (reflected in energy reduction).
When moving more frantically the robot improves its bullet dodging but increases the chance of hitting walls and other robots.
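
One way to picture how the chosen action could set these per-turn parameters is sketched below; the action names follow the table above, but the numeric values are assumptions, not the settings used by the project's robots.

// Illustration only: how a chosen action could set this turn's behavior
// parameters. Returns {fire power, move distance in pixels}.
public static double[] behaviorFor(String action) {
    switch (action) {
        case "Very defensive": return new double[] {0.5, 20};
        case "Defensive":      return new double[] {1.0, 30};
        case "Normal":         return new double[] {1.5, 40};
        case "Strong":         return new double[] {2.0, 60};
        case "Offensive":      return new double[] {2.5, 80};
        default:               return new double[] {3.0, 100}; // Very offensive
    }
}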
Reward
The objectives of the robot are defined by the reward for every action.
In this design I chose to reward the robot for its actions according to the following equation:
Reward = ΔE_me − ΔE_enemy
where ΔE_me is the change in the robot's energy (current energy − last turn's energy) and ΔE_enemy is the change in the enemy's energy.
In this approach the size of the database is 6 x 6 x 4 bytes = 144 bytes.
System architecture
In this project 4 robots were implemented.
Basic robot
The basic robot doesn't use any learning mechanism and has a fixed behavior.
This robot's actions are pre-determined and will remain so, no matter what
occurs in the battlefield.
QLearningBot
This robot uses a Q-function in order to determine which action is the best for
every state. In this project the Q-function is implemented with a state-action
table. The value in cell (S, A) represents the rating of action A for state S.
This robot algorithm is described in the following pseudo code:
1. Initialize Q-function table.
2. Find starting state S.
3. Repeat
3.1. Select best action A for state S.
3.2. Perform action A.
3.3. Get reward R.
3.4. Find new state S'.
3.5. Select best action A' for state S'.
3.6. Update the Q-function according to the following equation:
Q(S, A) ← (1 − α)·Q(S, A) + α·[R + γ·Q(S', A')]
3.7. Step to the next state: S ← S'.
4. Until battle ends.
TDLearningBot
This robot uses a Q-function in order to determine which action is the best for
every state. In this project the Q-function is implemented with a state-action
table. The value in cell (S, A) represents the rating of action A for state S.
Unlike QLearningBot this robot uses the TD algorithm. Therefore each (state,
action) pair value is derived from the reward given for a few more future steps,
not just the next step.
This robot algorithm is described in the following pseudo code:
1. Initialize Q-function table.
2. Find starting state S.
3. Repeat
3.1. Select best action A for state S.
3.2. Perform action A.
3.3. Get reward R.
3.4. Find new state S'.
3.5. Select best action A' for state S'.
3.6. Update the Q-function according to the following equation:
Q(S_i, A_i) ← (1 − α)·Q(S_i, A_i) + α·[R + Σ_{k=1}^{n} γ^k·Q(S_{i+k}, A_{i+k})]
3.7. Step to the next state: S ← S'.
4. Until battle ends.
Note regarding TD algorithm:
The algorithm implemented by TDLearningBot is a variation on the real TD
algorithm.
The real TD(λ) algorithm is as follows:
1. Initialize Q-function table.
2. Find starting state S.
3. Repeat for each episode and each step in episode:
3.1. Take action a, observe reward r and next state s'
3.2. Choose a' from s' using some policy involving Q
3.3. δ ← r + γ·Q(s', a') − Q(s, a)
3.4. e(s, a) ← e(s, a) + 1
3.5. For all s, a:
3.5.1. Q(s, a) ← Q(s, a) + α·δ·e(s, a)
3.5.2. e(s, a) ← γ·λ·e(s, a)
3.6. s=s'; a=a'
4. Until battle ends.
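
For completeness, a compact Java sketch of the eligibility-trace update from the pseudo code above (the tables and parameter names are assumed):

// Sketch of one step of the real TD(lambda) update with eligibility traces,
// following steps 3.3 to 3.5.2 above. q[][] and e[][] are the Q and trace tables.
public static void tdLambdaStep(double[][] q, double[][] e,
                                int s, int a, double r, int s2, int a2,
                                double alpha, double gamma, double lambda) {
    double delta = r + gamma * q[s2][a2] - q[s][a];   // 3.3: temporal-difference error
    e[s][a] += 1.0;                                   // 3.4: bump the trace of the visited pair
    for (int i = 0; i < q.length; i++) {              // 3.5: sweep all (state, action) pairs
        for (int j = 0; j < q[i].length; j++) {
            q[i][j] += alpha * delta * e[i][j];       // 3.5.1: credit assignment via the trace
            e[i][j] *= gamma * lambda;                // 3.5.2: decay the trace
        }
    }
}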
ExplorerBot
This robot is similar to TDLearningBot, except for the way it chooses the next
action. While TDLearningBot simply chooses the best action, ExplorerBot
chooses one of the top possible actions using some randomization factor.
Exploring mechanism
Every learning robot has an Exploration-Rate parameter. If, for a given state, more than Exploration-Rate actions have a rating of zero (meaning these actions haven't been chosen for this state yet), a random action is chosen in order to explore new (state, action) pairs. This technique is used to avoid getting stuck in local maxima and to keep searching for the global maximum.
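
A sketch of this selection rule (names and table layout assumed):

import java.util.Random;

// Sketch of the exploration rule described above: if more than explorationRate
// actions are still unrated (zero) for this state, pick a random action;
// otherwise pick the best-rated one. qRow is the Q-table row for the current state.
public static int chooseAction(double[] qRow, int explorationRate, Random rnd) {
    int unrated = 0;
    int best = 0;
    for (int a = 0; a < qRow.length; a++) {
        if (qRow[a] == 0.0) unrated++;
        if (qRow[a] > qRow[best]) best = a;
    }
    if (unrated > explorationRate) {
        return rnd.nextInt(qRow.length);   // explore an unrated (state, action) pair
    }
    return best;                           // exploit the current knowledge
}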
Simulation results
This chapter describes the simulations that were run in the project.
In each simulation 2 robots battled for 100 rounds, where each round included 10 battles. These values were chosen after observing that every match-up stabilized after no more than 80 rounds.
In order to maintain consistency, the following parameters were used for all the
battles:
The battlefield's dimensions = 800x600 (width x height).
Gun cooling rate = 0.1.
Inactivity time = 450.
The robots participating in the simulations are:
 QLearning – This robot implements the Q-learning algorithm, meaning
it fills its Q-function table according to the immediate reward for the
action.
 TD1 – This robot implements the TD algorithm with n=1, meaning it fills its Q-function table according to the reward for the 2 most recent actions.
 TD2 – This robot implements the TD algorithm with n=2, meaning it fills its Q-function table according to the reward for the 3 most recent actions.
 Explorer – This robot implements the TD algorithm with n=1, meaning it fills its Q-function table according to the reward for the 2 most recent actions. Unlike the rest of the learning robots, this robot doesn't pick the best-rated action for a given state, but chooses one of the top-rated possible actions.
 SampleBot – This robot represents an example robot with no learning capabilities. It is used as a benchmark for the learning ability of the learning robots.
 BaseBot – The learning robots are based on this robot. This robot is also
used as a benchmark for the learning capability of the learning robots.
Different scenarios were examined and simulated. The graphs in appendix 1 summarize the collected results. The graphs present the moving average of the winning rate over the last 10 rounds: for every round number in the range [10...100] the average number of wins in the last 10 rounds was calculated. The graphs present these 90 values instead of the actual winning rate in every round, in order to reduce noise.
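
The smoothing used for the graphs is a plain moving average; a sketch of the calculation (array names assumed):

// Sketch: moving average of the winning rate over the last `window` rounds.
// wins[i] is the number of wins in round i (out of 10 battles per round).
public static double[] movingAverage(int[] wins, int window) {
    double[] avg = new double[wins.length - window + 1];
    for (int i = window - 1; i < wins.length; i++) {
        int sum = 0;
        for (int j = i - window + 1; j <= i; j++) sum += wins[j];
        avg[i - window + 1] = sum / (double) window;
    }
    return avg;
}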
Graph 1 presents the performance of the learning robots against SampleBot. The learning curve is easily seen in the graph. The robots begin with a winning rate of 20% and end up with a winning rate of 80%.
One can notice that the final winning rate of Explorer is a bit higher than that of the other robots (84% against 78%).
It's also possible to see the different learning rates. TD2 learns faster than the other robots, while Explorer's learning takes more time than the rest.
Graph 2 presents the robots' performance against BaseBot.
As in graph 1, graph 2 shows that the learning rate of TD2 is faster than the others', while it takes Explorer more time than the others to reach stability. Again, Explorer achieves better performance than the others (65% compared to 60%).
Unlike graph 1, graph 2 does not show a clear learning curve. The robots' winning rate at the start is around 40% and at the end around 60%, but this is still not the result one would expect from a qualitative learning agent.
After the simulations above were performed, a few more experiments were
carried out.
The results of the 3rd simulation are presented in the following table (rows: the robot the learner trained against; columns: the robot it then played against):

Learning from \ Playing against   BaseBot   SampleBot
BaseBot                           60%       50%
SampleBot                         40%       80%
First, 2 copies of QLearning were instantiated. One (QLearning1) was battled 1000 times against BaseBot, and the other copy (QLearning2) was battled 1000 times against SampleBot.
Then each of the copies was battled against the other non-learning robot with its learning mechanism turned off, meaning QLearning1 was battled against SampleBot and QLearning2 was battled against BaseBot, while neither of them changed its Q-function table.
The only unambiguous conclusion from this experiment is that a robot performs at its best against the robot from which it learned to play.
The last experiment included a fight between 2 learning robots that learned from different robots. In order to carry out this experiment 2 copies of TD1 were instantiated. The first copy (TD1A) was trained for 1000 battles against BaseBot, while the second copy (TD1B) was trained for 1000 battles against SampleBot.
After the training phase was completed, the two copies were battled for 100 battles with their learning mechanisms turned off.
The results of this experiment can be seen in appendix 1. It is clearly seen that the robot that learned from BaseBot defeats the robot that learned from SampleBot, averaging a winning rate of 65%.
Summary
This chapter presents the conclusions resulting from the simulations, together with a few suggestions for future research and development.
Conclusions
From the overall performance of the robots it can be seen that the robots do improve their playing capabilities after learning by playing against other robots. This result is encouraging and puts the effort that was invested in the project in a positive light.
As we can see, in some cases – mainly the confrontation with BaseBot – the
learning ability of the robots is limited, apparently because of the superficiality
of the problem formulation.
By comparing the learning curves of the different robots we can conclude that Explorer reaches better performance, but also has a slower learning rate. This conclusion makes sense, as exploring algorithms tend to explore the Q-function more thoroughly and therefore have a chance to find higher maxima and avoid lower local maxima. Understandably, the extensive exploration takes longer than with the non-exploring algorithms.
Another observed phenomenon is the effect of the parameter n on the TD algorithm. The hypothesis that a bigger value of n enhances the learning rate of the algorithm was confirmed.
The last experiment (confronting TDLearningBot against itself, after learning from BaseBot and from SampleBot) shows that learning from a superior opponent leads to better performance.
Future work
This section describes some ideas for future work that could be done as a
continuation for this project.
 Reformulating the problem – by using a more detailed problem formulation the RL algorithms could recognize and learn more subtle patterns. This should let them improve their learning capabilities by creating a Q-function with better granularity.
 Usage of Artificial Neural Networks (ANN) – the chapter about problem formulation demonstrated the problem of extending the size of the Q-function table. By using an ANN it could be possible to learn such a Q-function without the exaggerated memory consumption. ANNs aim to learn patterns by modifying the weights of the connections between their neurons.
 Performing more simulations – it would be interesting to investigate the
robots' performance against different robots, and in different scenarios.
 Implementing EA algorithms – in this project only RL algorithms were
implemented. Implementing algorithms that search for solution in the space
of policies (contrary to the space of value function) could be another step in
this research.
 Extending the project to multi-participant battles – this project only deals with 1-on-1 battles. Battles with more than 2 robots would be a much more complicated problem to solve.
RoboCode user manual
Setting up the environment
The installer file can be downloaded from RoboCode's project site on
SourceForge - http://sourceforge.net/projects/robocode/
The command line to install RoboCode is java -jar robocode-setup.jar
After the installation a shortcut to run the environment will be created.
Creating a battle
Select Battle -> New (Ctrl+N).
Add the robots you want from the list.
Define the Number of Rounds to be played in this battle.
In the other tabs it is possible to modify the battlefield's dimensions and the
values of Gun Cooling Rate and Inactivity Time.
Click Start Battle.
While the robots are battling it is possible to click on the name of each robot in the menu on the right; a new window will open, displaying the output of that specific robot (including its output and error streams).
At the end of the battle a new window will pop up with a summary of the battle's results.
The RoboCode environment supplies an editor in which one can write new robots, but it is recommended to edit the robots in a more professional environment, such as Eclipse.
Adding robots to RoboCode
In order to add new robots to RoboCode, their class files should be copied to <RoboCode home-dir>\robots\<robots' package name>.
The robots in this project belong to the intelligence package and therefore should be copied to <RoboCode home-dir>\robots\intelligence.
After placing the robots' *.class files in the right path, the new robots will appear in the robots list in the New Battle screen.
Bibliography
1. Richard Sutton, Andrew G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
2. Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore (1996). Reinforcement Learning: A Survey.
3. David E. Moriarty, Alan C. Schultz, John J. Grefenstette (1999). Evolutionary Algorithms for Reinforcement Learning.
4. Richard Sutton (1988). Learning to Predict by the Methods of Temporal Differences.
5. http://robocode.sourceforge.net/ - RoboCode's homepage on SourceForge.
6. http://robocoderepository.com/ - alternative homepage for RoboCode.
7. http://robowiki.net/ - wiki for RoboCode.
8. http://www.cs.xu.edu/csci170/02s/robocode/javadoc/index.html - Javadoc API for RoboCode.
9. http://www.eclipse.org/ - Eclipse's homepage.
10. http://www-128.ibm.com/developerworks/java/library/j-robocode/ - Introduction to RoboCode, part 1.
11. http://www-128.ibm.com/developerworks/java/library/j-robocode2/ - Introduction to RoboCode, part 2.
12. Christos Stergiou, Dimitrios Siganos (1996). Introduction to Neural Networks.
13. http://www.jooneworld.com/ - JOONE, Java Object Oriented Neural Engine.