Relating Ant Colony Optimisation
and Reinforcement Learning
Interim Report
<Name>
May 2007
MSc in Machine Learning and Data Mining
Supervisor: Tim Kovacs
1 INTRODUCTION
2 ANT COLONY OPTIMISATION
  2.1 Foraging Behaviour of Real Ants
  2.2 The Design of Artificial Ants
  2.3 The Ant Colony Optimisation Metaheuristic
  2.4 ACO Problem
  2.5 Ant Colony Optimisation Algorithms
    2.5.1 Ant System
    2.5.2 Ant Colony System
    2.5.3 Hyper-Cube Framework for Ant Colony Optimisation
3 REINFORCEMENT LEARNING
  3.1 The Reinforcement Learning Problem
    3.1.1 Optimality of Behaviour
    3.1.2 The Markov Property
    3.1.3 Bellman Optimality Equation
  3.2 Solution Methods to the Reinforcement Learning Problem
    3.2.1 Dynamic Programming
    3.2.2 Monte Carlo Methods
    3.2.3 Temporal-Difference Learning
  3.3 Exploration in Reinforcement Learning
  3.4 Multi-Agent Reinforcement Learning
4 PREVIOUS WORK ON RELATING ACO and RL
  4.1 Ant-Q
  4.2 Pheromone Based Q-Learning (Phe-Q)
  4.3 AntHocNet Routing Algorithm
5 CONCLUSIONS
6 REFERENCES
1 INTRODUCTION
This report summarizes the background research for our project 'Relating Ant Colony Optimisation and Reinforcement Learning'. In order to investigate the relationship between Ant Colony Optimisation (ACO) and Reinforcement Learning (RL) algorithms, we decided to first study both fields independently, so the second and third chapters of this report correspond to our survey of the ACO and RL fields respectively. Although both fields branch in many directions, we tried to focus on the fundamental ideas and the studies we believed would be useful for our project. Since our project is research-oriented, we did not know exactly what we were looking for in the two areas when we started our background research; this is why this report, and especially the next two chapters, may seem quite general. Finally, the last chapter corresponds to our survey of previous work in the same area; there we describe some studies that develop better algorithms by combining ACO and RL.
2 ANT COLONY OPTIMISATION
Ant Colonies are known to accomplish some difficult tasks, which are simply beyond the
capabilities of a single ant, by collaborating with each other. Ant colonies’ remarkable
achievements that emerge from their cooperation inspired the development of a new family
of computer algorithms called ‘Ant Algorithms’. One class of these algorithms is Ant Colony
Optimisation Algorithms (ACO) and we present a brief survey of them in this chapter. ACO
algorithms are specifically based on the foraging behaviour of some ant species. The foraging behaviour of ants can be described, in general, as their ability to find the shortest paths between food sources and their nest. This cooperation among the ants in a colony led
researchers to develop novel algorithms to tackle problems whose solutions can be
formulated as a least cost path between an origin and a destination. Since most optimisation
problems have this property, they are very good candidates for being targeted by these
algorithms. The first ACO algorithm, Ant System, was proposed by Dorigo as a multi-agent
approximate approach that can produce good solutions in a reasonable time for combinatorial
optimisation problems [5]. He demonstrated the performance of this algorithm on the Travelling Salesman Problem (TSP), and since then the TSP has become the most heavily used benchmark problem for testing the performance of the ACO algorithms proposed later. However, ACO algorithms are not limited to optimisation problems; there also exist ACO algorithms that solve some machine learning problems.
2.1 Foraging Behaviour of Real Ants
Although many ant species are almost blind, they can still communicate with the environment and with each other by means of chemicals they release. In particular, some ant species, such as Lasius niger, use a special kind of chemical called pheromone to reinforce the optimal paths between the food sources and their nest. To be more specific, these ants lay pheromone trails
on the paths they take and these pheromone trails act as stimuli since the ants are attracted to
follow the paths that have relatively more trail. Consequently an ant that has decided to
follow a path due to the pheromone trail on that path reinforces it more by laying its own
pheromone too. This process can be thought of as a self-reinforcing process since the more
ants follow a specific path, the more likely that path becomes to be followed by the ants in
the colony [8], [5], [9].
Figure 2.1: Double Bridge Experiments, adapted from [8]. (a) The two paths between nest and food source are of equal length. (b) The lower path is twice as long as the other one.
The double-bridge experiments designed by Deneubourg and Goss as described in [8]
illustrate the foraging behaviour of the real ants very well. In these experiments, the nest and
the food source are connected via two different paths, and the behaviour of the ants is examined while the ratio between the lengths of the two paths is varied, as shown in figure 2.1 (a) and figure 2.1 (b).
In the first experiment, the two paths were of equal length as seen in figure 2.1 (a). They
observed that initially the ants chose between the two paths randomly since there was no
pheromone on either of the paths yet. However after a while, one of the two paths was
followed by a few more ants due to random fluctuations, and as a result more pheromone accumulated on that path, reinforcing it much more strongly. Eventually, they observed that the
whole colony converged to follow that same path as well. They ran the same experiment
many times and they showed that the convergence to either of the paths is actually equally
likely when the two branches are of equal length [9]. In the second experiment, they set the
ratio between the paths to 2, meaning that one of the branches was 2 times as long as the
other one. Initially, the ants again chose either of the two paths randomly. However, the ants that had chosen the short path arrived at the food source faster and could start their return to the nest earlier. Consequently, pheromone accumulated faster on the shorter
path reinforcing it more as time passed and they observed that a great majority of the ants
converged to the short path.
To see what happens if a shorter path is added after the ants get conditioned to use a longer
one, Goss et al. designed another experiment in which the colony was presented with only
one path at the beginning and a shorter alternative was offered after convergence [8]. They
observed that the majority of the ants kept following the longer branch, reinforcing it further, and that the shorter branch was never discovered. The reason for this stagnation is that pheromone evaporates too slowly compared to the rate at which it accumulates, so the real ants cannot forget the suboptimal path in order to discover the shorter one. This is obviously an undesirable characteristic, so we need to address this issue in artificial ant colonies, as we will describe in the next section.
Furthermore, as described above, real ants lay pheromone both on their forward and
backward paths. It is discussed in [8] that the ants can’t find the shortest paths if they lay the
pheromones only on their forward paths or only on their backward paths. So it is crucial for
real ant colonies to release pheromone during both finding the food source and returning to
the nest. However as we will see in the next section, the artificial ants do not have to deposit
hormones in both directions.
2.2 The Design of Artificial Ants
The experiments with real ants showed that ant colonies are able to find the shortest paths
between the nest and the food source by making probabilistic decisions based on the
pheromone trails they detect in the environment. ACO algorithms use ant-like agents called artificial ants, which inherit most of the properties of real ants. The characteristics of artificial ants inherited from real ants, as described in [4] and [7], are:
- High quality solutions emerge from the cooperation among the ants in the colony. Similarly, artificial ants construct their solutions collaboratively by sharing their experience of the quality of the solutions they have generated so far.
- Both artificial and real ants modify the environment by depositing pheromones along the paths they follow. These pheromone trails form the basis of the communication between the individual members of the colony, so they play a leading role in the utilisation of collective experience.
- All the members of an artificial ant colony pursue the same goal as their natural counterparts do. In ACO, this goal is, in general, finding the shortest path from an origin to a goal state.
- Both artificial and real ants build their solutions iteratively. They don't jump; they can only move to one of the adjacent states according to a local stochastic transition policy.
However, artificial ants need to be equipped with additional capabilities not found in real ants, in order to increase their efficiency and effectiveness in solving the difficult problems in engineering and computer science that we intend to apply ACO to, as described in [5], [4] and [7]:
- Artificial ants have a memory to store the path they have followed while constructing their solution. Thanks to this memory, artificial ants do not have to deposit any pheromone until they have completed constructing their solution. They determine the amount of pheromone according to the quality of their solution and retrace the path in their memory to deposit it. This means that the paths belonging to better solutions receive more pheromone.
- The local stochastic transition policy they apply when deciding on the path to follow uses not only pheromone trail information but also problem-specific heuristic information. Real ants are almost blind and can only sense the pheromone deposits in the environment.
- Artificial ants make their decisions and transitions to their next state in discrete time steps.
- Exploration is encouraged with the help of pheromone evaporation, to prevent the colony from getting stuck in a suboptimal solution, whereas in real ant colonies pheromone evaporation is too slow to be a significant part of the search mechanism.
2.3 The Ant Colony Optimisation Metaheuristic
Most of the time it is straightforward to state combinatorial optimisation problems in formal
terms, yet we cannot suggest any algorithm that is able to solve them to optimality in
polynomial computational time. As a result, some approximate methods driven by problem-specific heuristics have lately become popular, since they have been shown to give reasonably good solutions in a short time. For example, some constructive algorithms can
employ a greedy heuristic where the solution is built up gradually by taking the action that, at
the time, best improves the partial solution under construction. However these kinds of naïve
heuristic methods have some disadvantages. The most problematic issues about them are that
the variety of solutions they produce can be very limited and that they can easily get stuck at
local optima. Metaheuristics have been proposed to address these problems: a metaheuristic is a general framework that guides a problem-specific heuristic.
Simulated annealing is an example of a metaheuristic that guides local search heuristics to
escape local optima [8].
Ant Colony Optimisation is a metaheuristic proposed to solve hard optimisation problems. It
offers a framework in which a colony of ants construct solutions for a specific optimisation
problem concurrently, evaluate their solutions according to the problem definition and share
their experience by depositing pheromone along the paths they have taken. The most
important feature of the ACO metaheuristic is the accumulation of the collective experience of
the whole colony in the form of pheromone trails on the paths and better solutions become
achievable as a result of this cooperation.
procedure ACOMetaheuristic
  ScheduleActivities
    ConstructAntsSolutions
    UpdatePheromones
    DaemonActions    // optional
  End-ScheduleActivities
End-procedure
Figure 2.2: The ACO metaheuristic in a high-level pseudo-code taken from [8].
The ACO metaheuristic is composed of three main stages [8]: ConstructAntsSolutions, UpdatePheromones and DaemonActions. Each stage can be briefly described as follows:

ConstructAntsSolutions is where the artificial ants construct their solutions. This procedure implements a stochastic transition policy, which is a function of the pheromone trails, a problem-specific heuristic and the constraints defined by the problem. It controls the ants' moves to one of the adjacent states allowed in their vicinity by applying this policy. Once the ants have completed their solutions, they calculate the quality of their solutions, which will be used by the UpdatePheromones procedure.
UpdatePheromones is where the pheromone trails are adjusted based on the latest experience of the colony. Two different kinds of update are done here. The first is decreasing the trails as a result of pheromone evaporation; this encourages exploration and prevents stagnation. The second is increasing the trails by depositing new pheromone on the paths used in the solutions constructed by the ants in the previous stage. The amount of pheromone to deposit is decided according to the quality of the particular solutions that each path belongs to. Paths that don't belong to any solution don't receive any pheromone, while paths used in many solutions and/or in better solutions receive more pheromone. However, the variation of the pheromone trails and how strongly the search is biased towards the best solution found so far are implementation decisions.
DaemonActions is an optional stage where extra enhancements to the original solutions
constructed by the ants could be done. One example of a daemon action is applying a local
search starting from the solution constructed by each ant. Another daemon action might be to
allow the ant that constructed the best solution to deposit additional pheromone on its path.
We will see a similar strategy used in Elitist ACO algorithm in section 2.5.
It is an implementation decision how to order and schedule these three stages and the activities within each of them. For example, the ants can construct their solutions completely in parallel or in a sequential manner, or DaemonActions can be executed before UpdatePheromones. ScheduleActivities in the pseudo-code above refers to these kinds of arrangements, which are entirely up to the designer.
2.4 ACO Problem
Since the artificial ants build the solutions constructively by adding one component at a time,
the problems best tackled by ACO algorithms are the ones for which we can define a
constructive heuristic and/or a constructive procedure to build up solutions. For this reason
combinatorial optimisation problems are very good candidates to be formulated as ACO
problems and solved by ACO algorithms [8].
The following are the characteristics that should be defined for a problem to be an ACO
problem as presented in [8] and [4]:
- There exists a finite set of components C = {c1, c2, c3, ..., cn}.
- There is a set of constraints Ω defined for the problem to be solved.
- The states of the problem can be described in the form of finite-length sequences of components, q = ⟨ci, cj, ..., cu, ...⟩. Let Q denote the set of all sequences q, and let Q̃ denote the set of feasible sequences in Q satisfying the constraints Ω.
- There exists a neighbourhood structure defined between the states. For q2 to be a neighbour of q1, q2 should be reachable from q1 in one valid transition, i.e. via a transition from the last component of q1 to the first component of q2.
- There exists a set of candidate solutions S such that S ⊆ Q (and S ⊆ Q̃ if we do not allow infeasible solutions to be constructed at all).
- There exists a non-empty set of optimal solutions S* such that S* ⊆ Q̃ and S* ⊆ S.
- There exists an objective function f(s) to evaluate the cost of each solution in S.
- There may also be an objective function associated with states, so that the cost of partial solutions under construction can be calculated.
With the above characteristics, an ACO problem can be represented as a graph G = (C, E)
where C is the set of nodes and E is the set of edges connecting these nodes. The states in S
correspond to the paths in G. The edges in E correspond to valid transitions between the
states in S. We may associate transition costs to the edges explicitly. Pheromone trails are
associated to either the components or the edges, yet the latter is more common [4].
The behaviour of an artificial ant on this graph can be described as follows. Each ant in the colony, searching for the minimum cost path, starts from a component c in G. Each ant has a memory to remember its path: it keeps the sequence of components it has followed so far, and this path determines the ant's current state. It gradually builds its solution by making transitions from its current state to one of the states in its neighbourhood according to its stochastic transition policy. As mentioned earlier, the stochastic transition policy is a function of the pheromone trails, the problem-specific heuristic and the constraints defined by the problem. When the solution is complete, the artificial ant uses its memory to evaluate its solution via the objective function and to retrace its path to deposit pheromone.
2.5 Ant Colony Optimisation Algorithms
In this section, a selection of ACO algorithms will be described. We will start with the first ACO algorithm, Ant System, and then describe its extensions, pointing out the problematic issues that arise and how they are better dealt with by these extensions.
2.5.1 Ant System
Ant System (AS) is the first ACO algorithm proposed by Dorigo et al. in [5]. In their study,
they used the travelling salesman problem (TSP) as a benchmark problem. So we will use
TSP terminology while describing the algorithm, yet the concepts described here are
applicable to any ACO problem. As a matter of fact, Dorigo et al. showed in [5] that the ACO framework is flexible and robust by applying it to the quadratic assignment problem (QAP) and the job shop scheduling problem (JSP).
AS is a very simple ACO algorithm. It implements the two main stages of the general
framework shown in figure 2.2: ConstructAntsSolutions and UpdatePheromones.
Solution Construction
In AS, there are m artificial ants, initially located at m randomly chosen cities. It is empirically shown in [5] that the best performance is achieved when m ≈ n, where n is the number of cities. Each ant applies a stochastic transition policy to decide on its next move: ant k at city i chooses to move to city j with the following probability:
$$p_{ij}^k = \frac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}, \qquad \text{if } j \in N_i^k \qquad (2.1)$$
In the formula above, $\eta_{ij}$ stands for the heuristic value, specified according to the problem to be solved; in the TSP case, $\eta_{ij}$ is equal to $1/d_{ij}$, where $d_{ij}$ is the distance between cities i and j. $N_i^k$ is the feasible neighbourhood of ant k at city i, and $\tau_{ij}$ is the quantity of pheromone on the path between cities i and j. $\alpha$ and $\beta$ are parameters used to set the relative importance of the pheromone trail and the heuristic value. As $\alpha$ approaches 0, the pheromone trails become less important, the ants tend to choose the closest cities, and the search becomes very much like a greedy search. As $\beta$ approaches 0, the heuristic values are ignored and the ants consider only the trails when deciding which path to follow; this leads to stagnation at the first good solutions found by the colony.
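To make the rule concrete, here is a minimal Python sketch of an ant sampling its next city according to equation 2.1. The names (tau, dist, unvisited) and the roulette-wheel helper are illustrative assumptions, not taken from [5]:

```python
import random

def choose_next_city(i, unvisited, tau, dist, alpha=1.0, beta=2.0):
    """Sample the next city j for an ant at city i according to
    the AS transition probabilities of equation (2.1)."""
    # heuristic value eta_ij = 1 / d_ij for the TSP
    weights = [(tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
               for j in unvisited]
    total = sum(weights)
    # roulette-wheel selection proportional to the weights
    r = random.uniform(0.0, total)
    acc = 0.0
    for j, w in zip(unvisited, weights):
        acc += w
        if acc >= r:
            return j
    return unvisited[-1]  # numerical safeguard
```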
Pheromone Update
As mentioned before, pheromone updates are done in two steps in ACO. In the first step, all the pheromone trails are decreased by a constant rate $\rho$, where $0 < \rho \le 1$, due to evaporation. Equation 2.2 below implements pheromone evaporation in AS:

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij}, \qquad \forall (i, j) \in E \qquad (2.2)$$
In the second step, the pheromone quantities $\Delta\tau_{ij}$ to be laid on each edge (i, j) are calculated according to equation 2.3. In this equation, $\Delta\tau_{ij}^k$ corresponds to the quantity of pheromone deposited by ant k, and $L_k$ stands for the cost of the solution found by ant k.

$$\Delta\tau_{ij} = \sum_{k=1}^{m} \Delta\tau_{ij}^k, \quad \text{where } \Delta\tau_{ij}^k = \begin{cases} 1/L_k, & \text{if edge } (i,j) \text{ is in the solution of ant } k;\\ 0, & \text{otherwise.} \end{cases} \qquad (2.3)$$
As we can see, the edges used in shorter tours and/or used by more ants receive more pheromone. We can finally combine the pheromone evaporation and pheromone deposit steps into one equation as follows:

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \Delta\tau_{ij}, \qquad \forall (i, j) \in E \qquad (2.4)$$
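A brief sketch of the combined AS update of equation 2.4, assuming a symmetric TSP in which each completed tour is a list of city indices and tour_length is a user-supplied cost function (illustrative names only):

```python
def update_pheromones(tau, tours, tour_length, rho=0.5):
    """Apply AS pheromone evaporation and deposit (equation 2.4)."""
    n = len(tau)
    # evaporation on every edge (equation 2.2)
    for i in range(n):
        for j in range(n):
            tau[i][j] *= (1.0 - rho)
    # deposit 1/L_k on every edge of every ant's tour (equation 2.3)
    for tour in tours:
        deposit = 1.0 / tour_length(tour)
        for a, b in zip(tour, tour[1:] + tour[:1]):  # closed tour
            tau[a][b] += deposit
            tau[b][a] += deposit  # symmetric TSP
```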
Now we will briefly describe some direct improvements to Ant System: Elitist AS, Rank-Based AS, Max-Min AS and Best-Worst AS.

Elitist AS and Rank-Based AS adapt some concepts used in Genetic Algorithms, namely elitism and rank-based selection. In Elitist AS, we keep the best tour found so far and, while the pheromone updates are executed for each arc (i, j), the arcs belonging to the best-so-far tour get extra pheromone. This can be done by adding the term $e\,\Delta\tau_{ij}^{bs}$ to the summation in equation 2.4, where $e$ is the weight given to the best-so-far tour and $\Delta\tau_{ij}^{bs}$ is equal to the $\Delta\tau_{ij}^k$ of equation 2.3 when k is the ant that constructed the best-so-far tour. We can consider this additional pheromone deposited on the best-so-far tour to be a daemon action in the pseudo-code of Figure 2.2. In Rank-Based AS, the solutions constructed by the ants are sorted in increasing order of cost, and only the edges belonging to the $l - 1$ best solutions of the last iteration and to the best-so-far solution (i.e. the best solution found since the start of the algorithm, not necessarily in the last iteration) receive a new pheromone deposit, where $l$ is a user-defined parameter of the algorithm. The pheromone deposits are weighted by the rank of the solution: the quantity of pheromone to deposit for the best-so-far solution becomes $l \times 1/L^{bs}$, and the quantity for the solution of rank r becomes $(l - r) \times 1/L_r$.
In Max-Min AS, pheromone evaporation occurs on all edges after all the ants have constructed their solutions; however, the pheromone deposit is only done for the edges of the best-so-far solution or the best solution of the last iteration. To prevent the search from stagnating, it implements two other modifications of AS. 1) It constrains the quantity of pheromone on each edge to lie in the range $[\tau_{min}, \tau_{max}]$. The best value for $\tau_{max}$ is agreed to be $1/(\rho L^{bs})$, as explained in [8], so $\tau_{max}$ increases as shorter paths are found by the colony, and $\tau_{min}$ is set to $\tau_{max}/a$, where $a$ is a parameter. 2) The pheromone trails are initially set to $\tau_{max}$ on all edges, so the ants behave more exploratively in the beginning. After a while, the search becomes biased towards the shortest tours found by the ants so far. However, if the colony cannot find any improved solution in a predetermined number of iterations, the pheromone trails are reinitialised.
Best-Worst AS (BWAS) was proposed by Cordon et al. in [3]. BWAS deposits extra pheromone on the edges belonging to the best-so-far solution as a daemon action, like ACS, but it also keeps the worst solution in order to apply extra evaporation on its edges. The pheromone trails are reinitialised to their initial value when the colony converges and stops producing improved solutions. To encourage exploration further, BWAS also makes use of mutation, a concept from Evolutionary Computing: the pheromone trail on each edge can be mutated with a probability $P_m$.
2.5.2 Ant Colony System
Ant Colony System (ACS) differs from AS in three ways [6]. 1) It uses a transition rule that gives a higher priority to exploitation over exploration. 2) Pheromone update (including both deposit and evaporation) is only applied to the best-so-far tour. 3) Another kind of update, which can be called a 'local pheromone update', is done after each transition over an edge.
Solution Construction
ACS differs from AS in solution construction by its transition rule, which is given below
where q is a random variable uniformly distributed in [0, 1].
$$s = \begin{cases} \arg\max_{j \in N_i^k} \left\{ [\tau_{ij}]\,[\eta_{ij}]^{\beta} \right\}, & \text{if } q \le q_0 \ \text{(exploitation)} \\ S, & \text{otherwise (exploration)} \end{cases} \qquad (2.5)$$
According to this transition rule, with probability $q_0$ ant k at city i chooses to move to the city j reached via the best edge in terms of pheromone trail and heuristic value, therefore exploiting the current knowledge of the colony; otherwise it selects a random city in its neighbourhood, therefore exploring. Another difference of this transition rule from that of AS is that there is no explicit weight parameter $\alpha$ for $\tau_{ij}$: it is in fact set to the constant 1, since the best value of $\alpha$ in AS is shown to be 1 in [5]. Moreover, we should note that this transition policy is very similar to the $\varepsilon$-greedy policy of RL, which we will describe in Chapter 3.
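A sketch of this pseudo-random proportional rule, reusing the hypothetical choose_next_city helper from the Ant System sketch for the exploratory branch (interpreting the random choice as the AS-style probabilistic choice with α = 1, which is one common reading of [6]):

```python
import random

def acs_next_city(i, unvisited, tau, dist, beta=2.0, q0=0.9):
    """ACS pseudo-random proportional rule (equation 2.5)."""
    if random.random() <= q0:
        # exploitation: take the best edge by pheromone * heuristic
        return max(unvisited,
                   key=lambda j: tau[i][j] * (1.0 / dist[i][j]) ** beta)
    # exploration: fall back to a probabilistic choice (alpha = 1)
    return choose_next_city(i, unvisited, tau, dist, alpha=1.0, beta=beta)
```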
Global Pheromone Update
After all the ants in the colony construct their solutions, only the tour corresponding to the
best solution so far receives a pheromone update. As a result, the search is biased towards the
vicinity of the best-so-far solution. Equation 2.6 implements the global pheromone update:

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \rho\,\Delta\tau_{ij}, \qquad \forall (i, j) \in T^{bs}, \ \text{where } \Delta\tau_{ij} = 1/L^{bs} \qquad (2.6)$$
In the equation above, $T^{bs}$ stands for the best-so-far solution and $L^{bs}$ for the cost of that solution. The parameter $\rho$ can again be thought of as the pheromone evaporation rate of AS, but in ACS it also weights the pheromone to be deposited; in other words, the new pheromone trail becomes a weighted average of the old pheromone trail and the pheromone to be deposited [8]. Moreover, this update rule imposes a dynamic limit on the pheromone trails, meaning that they cannot exceed $1/L^{bs}$, a bound which rises during the execution as shorter paths are discovered. One advantage of ACS over AS is that this limit, along with the local update rule we will describe next, prevents the search from stagnating, and the ants can continue to look for better paths for as long as the algorithm is run [8], [6]. Another advantage is that the pheromone update becomes less expensive, since only the best solution is retraced [8].
Local Pheromone Update
In ACS, the ants do instant updates to the pheromone trails of each edge they visit while
constructing their solutions according to the following rule given in [8]:
$$\tau_{ij} \leftarrow (1 - \delta)\,\tau_{ij} + \delta\,\tau_0 \qquad (2.7)$$

Due to the local update, every time an ant visits a specific edge its pheromone trail is decreased, so the edge becomes less desirable for the colony. This pushes the ants towards edges not used recently, motivates exploration and increases the variety of the solutions produced by the colony. In equation 2.7, $\tau_0$ stands for the initial value of the pheromone trails, and the trails cannot be decreased below this value. The experiments in [6] suggest that the best value of $\tau_0$ for the TSP is $1/(n L^{nn})$, where n is the number of cities and $L^{nn}$ is the length of the tour found by the nearest-neighbour heuristic.
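The two ACS update rules can be sketched as follows; the bookkeeping of τ0 and of the best-so-far tour is assumed to be done by the caller (an illustration, not the reference implementation of [6]):

```python
def acs_local_update(tau, i, j, tau0, delta=0.1):
    """ACS local pheromone update (equation 2.7), applied as an ant
    crosses edge (i, j)."""
    tau[i][j] = (1.0 - delta) * tau[i][j] + delta * tau0
    tau[j][i] = tau[i][j]  # symmetric TSP

def acs_global_update(tau, best_tour, best_length, rho=0.1):
    """ACS global pheromone update (equation 2.6), applied only to the
    edges of the best-so-far tour."""
    deposit = 1.0 / best_length
    for a, b in zip(best_tour, best_tour[1:] + best_tour[:1]):
        tau[a][b] = (1.0 - rho) * tau[a][b] + rho * deposit
        tau[b][a] = tau[a][b]
```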
2.5.3 Hyper-Cube Framework for Ant Colony Optimisation
The Hyper-Cube Framework for ACO limits the pheromone quantities to the interval [0, 1]. Dorigo et al. discuss in [1] that this framework has two advantages. 1) When it is applied to the same problem but with rescaled objective function values, it behaves in the same way, which is not the case for the classical ACO algorithms. 2) It makes the theoretical analysis of ACO algorithms easier.
A decision variable taking values in {0, 1} is associated with each edge (or with each component, depending on the problem definition) of the graph G on which the colony performs its search. If we imagine a d-dimensional hypercube, where d is the total number of edges (and therefore the total number of decision variables), then each corner of this cube corresponds to a solution construction for the particular problem, although not all of them are feasible solutions. The set of feasible solutions is a subset of the corners of the hyper-cube, as shown in the example in Figure 2.3. When we limit the pheromone trails to the range [0, 1], the pheromone vector $\vec{\tau} = (\tau_1, \tau_2, \ldots, \tau_d)$ moves in the shaded area of Figure 2.3 during the search of the colony [8], [1].
Figure 2.3: Adapted from [1]. The corners (0, 0, 0), (0, 1, 1) and (1, 1, 0) of the hyper-cube are feasible solutions; the pheromone vector moves in the shaded area spanned by them.
Pheromone Update
In the Hyper-Cube framework, the pheromone updates are done by taking a weighted average of the old pheromone trail and the new pheromone deposit, just as in ACS. Equation 2.8 gives the update rule used by the Hyper-Cube framework:

$$\tau_{ij} \leftarrow (1 - \rho)\,\tau_{ij} + \rho \sum_{k=1}^{m} \Delta\tau_{ij}^k \qquad (2.8)$$
The update rule in equation 2.8 enables us to keep the pheromone trails in a specific range, but we also have to change the way we calculate $\Delta\tau_{ij}^k$ in order to keep the trails in the range [0, 1], as in equation 2.9 [8]:

$$\Delta\tau_{ij}^k = \begin{cases} \dfrac{1/L_k}{\sum_{h=1}^{m} (1/L_h)}, & \text{if edge } (i,j) \text{ is in the solution of ant } k;\\[2mm] 0, & \text{otherwise.} \end{cases} \qquad (2.9)$$

As seen above, in order to normalise the quantity of pheromone deposited on each edge by ant k, the inverse cost of the tour found by ant k is divided by the sum of the inverse costs of the tours found by the whole colony.
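A short sketch of the normalised deposit of equations 2.8 and 2.9 for a symmetric TSP (illustrative names; tour_length is assumed to return the cost L_k of a tour):

```python
def hypercube_update(tau, tours, tour_length, rho=0.1):
    """Hyper-cube pheromone update: a weighted average that keeps every
    trail in [0, 1] (equations 2.8 and 2.9)."""
    inv_costs = [1.0 / tour_length(t) for t in tours]
    total = sum(inv_costs)
    n = len(tau)
    # accumulate the normalised deposits of the whole colony
    delta = [[0.0] * n for _ in range(n)]
    for tour, inv in zip(tours, inv_costs):
        share = inv / total
        for a, b in zip(tour, tour[1:] + tour[:1]):
            delta[a][b] += share
            delta[b][a] += share
    for i in range(n):
        for j in range(n):
            tau[i][j] = (1.0 - rho) * tau[i][j] + rho * delta[i][j]
```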
3 REINFORCEMENT LEARNING
Reinforcement Learning (RL) is a learning approach without a supervisor, in which an agent learns how to reach a goal by interacting with an environment; the environment evaluates the quality of the agent's actions and feeds this evaluation back to the agent in the form of a numerical reward. This kind of feedback is called evaluative feedback, and it does not tell the agent which action is the best to take in a certain situation, as supervised feedback would. The agent has to try the possible actions and learn for itself which action yields the most reward.
RL agents can sense the characteristics of their environment, which denote the state of the
agent, and they can choose actions influencing their environment, which will lead them to
their next state. They pursue an explicit goal, so their actions should eventually take them to a
goal state. Every action is rewarded by the environment according to whether or not it leads the agent towards a goal state. However, in most cases the goal state can be attained only after a sequence of actions, thus delaying the reward, so an RL agent should be able to handle delayed rewards by employing a mechanism that propagates the reward back to the actions that preceded it.
Another problem an RL agent faces is the trade-off between exploration and exploitation. As
mentioned above, the evaluative feedback coming from the environment does not inform the
agent about the best action to take in a certain state, so the agent should explore possible
actions and exploit the one that yields relatively more reward. Since it cannot do both at the same time, it needs a way of deciding when to explore and when to exploit.
Furthermore balancing exploration and exploitation should be dealt with differently in
different contexts. For example, if the environment is dynamic, the agent should ensure its
exploration rate does not decay over time; otherwise it can explore more at the beginning and
might explore very little or not at all later.
3.1 The Reinforcement Learning Problem
As stated in [18], any method that can provide a solution to the Reinforcement Learning
problem is accepted to be a Reinforcement Learning method; hence the definition of RL
problem is central to the RL Framework.
The RL problem can be defined as the problem of an agent interacting with a complex
environment trying to maximise its long-run reward over a sequence of discrete time steps as
shown in Figure 3.1.
Figure 3.1: The Reinforcement Learning framework, from [18]. At time step t the agent in state s_t takes action a_t; at time step t+1 the environment returns reward r_{t+1} and the next state s_{t+1}.
Formally the RL problem is composed of a set of environment states S, a set of agent’s
actions A, and a set of numerical reward signals R.
States and state transitions are the elements of the environment and the agent has no direct
control on them. It can only affect them indirectly via the actions it takes; different actions
result in different states. The agent can sense some or all aspects of its environment, and the
state of an agent can be considered as the representation of these aspects. The state
information is also desired to summarise all past experiences to a certain degree so that the
agent can determine its next action based on its current state confidently. (This property of
the states will be further discussed under the title ‘The Markov Property’ below.) Each state,
except for the terminal states (i.e. goal state and failure state), is a decision point for the agent
to select one of the possible actions allowed in that state. According to the action $a_t \in A(s_t)$ (where $A(s_t)$ is the set of possible actions at state $s_t$) taken by the agent and the current state $s_t \in S$ at time step t, the next state $s_{t+1} \in S$ is determined by the environment. The state transition function of the environment does not have to be deterministic; it can be probabilistic, meaning that a certain action in a certain state can lead the agent to different next states probabilistically.
Rewards are the numerical evaluation of the agent’s actions by the environment. Like states,
the agent has no direct control on the rewards. The rewards should be determined carefully in
the environment. The actions of the agent should be rewarded only if they are directly related
to achieving the goal state. Otherwise the agent might learn to maximise its cumulative
reward in the long run without reaching the goal state at all [18]. Moreover, rewards can be
discouraging as well as encouraging. The environment can give a negative reward to an action if that action causes the agent to fail to reach the goal state, so that the agent can consider avoiding the same action in the same state the next time. As a final note, if the
problem is non-stationary, meaning the problem changes over time, then the probability
distribution of the rewards can change over time too. In such a case, it is wiser to weight the
recent rewards more when calculating the value of a state (State values will be introduced in
‘Bellman Optimality Equation’ below and we will see the solution methods which weight the
recent rewards more in the next section).
3.1.1 Optimality of Behaviour
The agent follows a policy to decide on its action according to its current state. This policy has the form of a stochastic function $\pi_t(s, a)$ that gives the probability of selecting action a in state s. The policy makes use of the agent's past experience, and the agent has a way of changing its policy according to new experience so as to achieve the optimal behaviour that maximises its expected cumulative reward over time. However, it is not obvious how to define the cumulative reward, which we will call the return hereafter.
One way of defining the return is simply as the sum of the rewards up to a final time step T, as shown in equation 3.1:

$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T = \sum_{i=0}^{T-t-1} r_{t+i+1} \qquad (3.1)$$
This model of the return is called the finite-horizon model, and it works well when there is a specific final step, as in episodic tasks. In continuing tasks, however, there is no final step and this sum can grow to infinity. To prevent this, a discount factor is introduced into equation 3.1. According to this new approach, called the infinite-horizon discounted model, the agent gives more importance to its immediate rewards than to rewards far away in the future, as depicted in equation 3.2:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i+1}, \qquad 0 \le \gamma \le 1 \qquad (3.2)$$

When $\gamma$ is equal to 0, the agent considers only its immediate reward; when it is 1, equation 3.2 reduces to equation 3.1. Since equation 3.2 is more flexible, in that the parameter $\gamma$ can be varied according to the problem and the method, the solution methods introduced here use equation 3.2 to calculate the expected return from the states at each time step.
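For instance, a minimal sketch of computing the discounted return of equation 3.2 from a finite sample of rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Return R_t = sum_i gamma^i * r_{t+i+1} for a finite reward sample."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# example: zero reward until a final reward of 1 five steps ahead
print(discounted_return([0, 0, 0, 0, 1], gamma=0.9))  # 0.9**4 = 0.6561
```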
3.1.2 The Markov Property
In a sequential decision process, the probability of a transition to state $s_{t+1}$ producing reward $r_{t+1}$ from state $s_t$ under action $a_t$ comes from the probability distribution shown in 3.3, defined for all possible $s'$, $r$ and all possible past events:

$$\Pr\left(s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\right) \qquad (3.3)$$

If the states of this sequential decision process have the Markov property, then $s'$ and $r$ depend only on the current state $s_t$ and the current action $a_t$. In this case we can express the probability distribution generating the transitions in the decision process as follows:

$$\Pr\left(s_{t+1} = s',\, r_{t+1} = r \mid s_t, a_t\right) \qquad (3.4)$$

Decision processes whose states satisfy the Markov property are called Markov Decision Processes (MDPs). In an MDP, the probability of a transition to the next state $s'$ given the current state $s$ and the current action $a$ is $P_{ss'}^a = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, and the expected reward resulting from this transition is $R_{ss'}^a = E(r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s')$.
3.1.3 Bellman Optimality Equation
The value of a state, $V^\pi(s)$, is defined as the expected return achieved when starting from state s and following the agent's current policy $\pi$. In the same vein, the value of a state-action pair, $Q^\pi(s, a)$, is defined as the expected return achieved when starting from state s with action a and following the current policy thereafter. We can formally define $V^\pi(s)$ and $Q^\pi(s, a)$ in an MDP, where $E_\pi$ denotes the expected value when following policy $\pi$:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_t = s\right\} \qquad (3.5)$$

and

$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_t = s, a_t = a\right\} \qquad (3.6)$$
As shown above, the value functions $V^\pi(s)$ and $Q^\pi(s, a)$ depend on the policy the agent follows. In the RL problem the agent tries to find the optimal policy $\pi^*$ that maximises the value functions. Bellman showed that in MDPs an optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and any $\pi$, so we can define the optimal value functions as $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in S$.
These $V^*(s)$ can also be calculated with the Bellman optimality equation for $V^*$, which is derived in [18] as shown in 3.7:

$$\begin{aligned}
V^*(s) &= \max_{a \in A(s)} Q^{\pi^*}(s, a) = \max_a E_{\pi^*}\left\{\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\Big|\, s_t = s, a_t = a\right\} \\
&= \max_a E_{\pi^*}\left\{r_{t+1} + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+i+2} \,\Big|\, s_t = s, a_t = a\right\} \\
&= \max_a E\left\{r_{t+1} + \gamma V^*(s_{t+1}) \,\Big|\, s_t = s, a_t = a\right\} \\
&= \max_a \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V^*(s')\right]
\end{aligned} \qquad (3.7)$$
Similarly, we can show that the Bellman optimality equation for $Q^*$ is:

$$Q^*(s, a) = \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma \max_{a'} Q^*(s', a')\right] \qquad (3.8)$$
The Bellman optimality equation actually represents a system of equations, one for each state $s \in S$, and it can be solved exactly, although this is computationally expensive when S is large. In addition, there are algorithmic methods that find approximations to the actual values iteratively.
3.2 Solution Methods to the Reinforcement Learning Problem
In this section, we will introduce three different solution methods for the RL problem. These
methods can be divided into two categories: the model-based methods that require the
complete model of the environment and the model-free methods that can learn from online
experience via trial and error search. The first method we will introduce is Dynamic
Programming and it is a model-based method. Then we will describe two model-free
methods, which are Monte Carlo and Temporal Difference methods.
3.2.1 Dynamic Programming
Dynamic Programming (DP) is a method for finding the optimal values of the states discussed in the previous section, given a complete Markovian model of the environment. In general, DP methods iteratively estimate the values of the states by making use of the Bellman optimality equations shown in equations 3.7 and 3.8, and over time the estimated values approach the actual optimal values satisfying these equations. Although DP methods are not flexible enough to be used in many real-world problems, due to their unrealistic requirement of a complete Markovian model of the environment, they establish the basis for other, more flexible methods. Here we will describe two popular DP methods for solving MDPs, Policy Iteration and Value Iteration, as given in [18]. Note that we will assume the policy is deterministic in the description of these methods.
Policy Iteration
Policy Iteration is composed of two phases: policy evaluation and policy improvement. In the
first phase, the value of each state following the current policy is calculated. In the latter one,
the policy is improved based on the present state values.
In order to evaluate the current policy, we can write the Bellman equation for an arbitrary policy $\pi$ as the update rule shown in equation 3.9, which can be derived in the same way as the Bellman optimality equation. In the policy evaluation phase, this rule is used to update the values of the states iteratively until no significant change occurs in the value of any state:

$$V^\pi(s) \leftarrow \sum_{s'} P_{ss'}^{\pi(s)} \left[R_{ss'}^{\pi(s)} + \gamma V^\pi(s')\right] \qquad (3.9)$$
In the policy improvement phase, the present policy is revised, using the state values calculated in the previous phase, by applying the rule shown in equation 3.10. If the policy after this modification is the same as the previous policy, the algorithm stops and returns the last policy; otherwise it goes back to the policy evaluation phase with the new policy and keeps alternating policy evaluation and policy improvement until convergence.

$$\pi'(s) \leftarrow \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V^\pi(s')\right] \qquad (3.10)$$
Value Iteration
Value Iteration combines the policy evaluation and policy improvement phases into a single update rule:

$$V(s) \leftarrow \max_a \sum_{s'} P_{ss'}^a \left[R_{ss'}^a + \gamma V(s')\right] \qquad (3.11)$$

The state values are modified by this rule until convergence, as in the policy evaluation phase of Policy Iteration. After convergence to the optimal state values, rule 3.10 is executed once to generate the optimal policy associated with these optimal state values.
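A compact sketch of value iteration under these definitions; the data structure P[s][a], a list of (probability, next state, reward) triples, and the actions(s) helper are assumptions made for illustration:

```python
def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Value iteration (equation 3.11) over a tabular MDP.
    P[s][a] is a list of (prob, next_state, reward) triples."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in actions(s)]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # greedy policy extraction (rule 3.10)
    policy = {s: max(actions(s),
                     key=lambda a: sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a]))
              for s in states}
    return V, policy
```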
As a final remark, two aspects of Dynamic Programming are worth mentioning. The first is that it uses full backups during its updates, meaning that it considers all state transitions from each state, since it knows the complete model of the environment. Later we will introduce sample backups, as opposed to full backups, for problems where the full model is not available. The second interesting aspect of Dynamic Programming is that the value estimates are computed on the basis of other estimates: as seen in equations 3.9, 3.10 and 3.11, the value estimate for the current state is updated based on the values of the successor states. This is called bootstrapping, and it is used in Temporal-Difference learning as well.
3.2.2 Monte Carlo Methods
Unlike Dynamic Programming, Monte Carlo methods do not require a complete model of the environment. They can learn from online experience, where only a sample of state transitions and a sample of rewards from the environment are available; therefore Monte Carlo methods use sample backups. Another feature distinguishing Monte Carlo methods from Dynamic Programming is that they do not bootstrap. Instead, they update the value of each state visited with the actual return received after that state, so they need to wait until the final state before doing any update, in order to determine the actual returns. This makes them robust to the problems caused by non-Markovian environments. However, Monte Carlo methods are only suitable for episodic tasks, because they require the existence of a final state.
Monte Carlo methods need to learn Q*(s, a) values in order to learn the optimal policy.
Otherwise the optimal policy cannot be derived from only V*(s) values since the model of the
environment might not be available. (As can be seen in equation 3.10, the transition probabilities and expected rewards need to be known.) Like policy iteration in dynamic programming, Monte Carlo methods alternate between computing Q(s, a) values for the current policy and improving the policy, until the policy converges to the optimal policy.
The update rule of a general Monte Carlo method for finding $Q^\pi(s, a)$ values can be described as follows. The returns determined at the end of each episode are kept in a separate list for each state-action pair, and the value of each state-action pair is calculated as the average of all the returns in the corresponding list. The return can generally be calculated as in equation 3.2 with $\gamma = 1$ (since the task is episodic), so it simply becomes the sum of the rewards received until the end of the episode. After updating the values, the policy can be improved, as seen in the pseudocode of the basic algorithm from [18]. To make sure that every state-action pair is visited, the method uses exploring starts, and if a sufficient number of episodes is sampled, this algorithm can converge to the optimal policy.
Randomly initialise Q(s, a) and the current policy
Returns(s, a) ← empty list
While the termination condition is not met
  Generate an episode using exploring starts and the current policy
  For each pair (s, a) appearing in the episode
    Calculate the return R following the pair (s, a)
    Append R to Returns(s, a)
    Q(s, a) ← average(Returns(s, a))
  For each s in the episode
    π(s) ← argmax_a Q(s, a)   (i.e. the greedy policy)

Figure 3.2: The pseudocode for a basic Monte Carlo algorithm, adapted from [18]
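A runnable sketch of this scheme (an every-visit variant for brevity; generate_episode and actions are hypothetical helpers supplied by the environment, not part of [18]):

```python
from collections import defaultdict

def mc_control(generate_episode, actions, n_episodes=10000):
    """An every-visit Monte Carlo control sketch (cf. Figure 3.2).
    generate_episode(policy) must return a list of (s, a, r) steps,
    where r is the reward received after taking a in s."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {}
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        G = 0.0
        # walk the episode backwards so G is the return following each step
        for s, a, r in reversed(episode):
            G = r + G                      # gamma = 1 for episodic tasks
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            policy[s] = max(actions(s), key=lambda b: Q[(s, b)])
    return Q, policy
```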
3.2.3 Temporal-Difference Learning
Temporal-Difference (TD) Learning methods combine some properties of Dynamic
Programming and Monte Carlo Methods: They can learn from online experience without
requiring the model of the environment like Monte Carlo methods and they update value
estimates of the states based on the estimates of successor states (i.e. bootstrapping) as in
Dynamic Programming methods [18]. TD methods use the difference in temporally
successive value predictions on the same state as an error in the prediction and update the
state values so as to remove this error over time, whereas traditional methods in Machine
Learning use the difference between the actual outcome and the prediction [17]. The TD-error and the update rule for an arbitrary state s are expressed in equation 3.12. In the update rule, $\alpha$ denotes the learning rate (a.k.a. the step-size parameter), in the range [0, 1]. The learning rate
should be tuned such that it lets the estimates smoothly converge to the actual values without
making drastic changes in the estimates, which will lead to an unstable policy.
$$\text{TDError}(s) = \underbrace{r + \gamma V(s')}_{\text{new prediction}} - \underbrace{V(s)}_{\text{old prediction}}, \qquad V(s) \leftarrow V(s) + \alpha \cdot \text{TDError}(s) \qquad (3.12)$$
TD methods can be divided into two groups: single-step TD (a.k.a. TD(0)) and TD($\lambda$). In single-step TD, the state values are updated based on the values of the states just one step ahead; for example, the TD-error shown in equation 3.12 is a single-step error. TD($\lambda$) updates, on the other hand, can be based on states and rewards that are more than one step away. (The update rule of TD($\lambda$) methods is more complicated and will be detailed later.) As a matter of fact, single-step TD methods are a special case of multi-step TD methods with $\lambda = 0$, as we will see later. Before discussing the TD($\lambda$) method, we will demonstrate two one-step TD methods: Sarsa and Q-learning.
Sarsa
Sarsa is an on-policy TD method, meaning that it updates the values for the policy it is actually using at the time of the update. Like Monte Carlo methods, Sarsa has to learn Q(s, a) values, since it does not have a model from which to derive the policy from V(s) values alone. Figure 3.3 shows the pseudocode for the Sarsa algorithm.
Randomly initialise Q(s, a)
Repeat for each episode
  Initialise the current state s
  Choose action a at the current state s according to the policy
  While s is not a terminal state
    Take action a, observe reward r and next state s′
    Choose next action a′ at the next state s′ according to the policy (e.g. ε-greedy)
    Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]
    s ← s′; a ← a′

Figure 3.3: The pseudocode for Sarsa, adapted from [18]
Q-Learning
Q-Learning is an off-policy method, meaning that it updates the values based on the action that gives the maximum value (i.e. it directly learns Q* instead of learning $Q^\pi$ first), ignoring the current policy. To illustrate this, consider the pseudocode for the algorithm given in Figure 3.4. As can be seen there, the agent follows an ε-greedy policy (i.e. it selects the greedy action with probability 1 − ε and a random action with probability ε), but it updates the current value estimate based on the action that gives the maximum value at the successor state, instead of the action that will actually be taken at the successor state by following the current policy.
Randomly initialise Q(s, a)
Repeat for each episode
  Initialise the current state s
  While s is not a terminal state
    Choose action a at the current state s according to the policy (e.g. ε-greedy)
    Take action a, observe reward r and next state s′
    Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
    s ← s′

Figure 3.4: The pseudocode for Q-learning, adapted from [18]
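A tabular Q-learning loop in Python under the same kind of assumptions; env.reset(), env.step() and env.actions() are hypothetical environment methods used only for illustration:

```python
from collections import defaultdict
import random

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning as in Figure 3.4, with an epsilon-greedy policy."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(acts)                     # explore
            else:
                a = max(acts, key=lambda b: Q[(s, b)])      # exploit
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, b)]
                                             for b in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```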
TD( λ ) and Eligibility Traces
TD($\lambda$) methods update the estimated values based on a weighted average of k-step returns, where k varies between 1 and n (n being the number of steps from the current state to the final state). Equation 3.13 shows the k-step return $R_t^k$ at time t for an arbitrary k:

$$R_t^k = \sum_{i=1}^{k} \gamma^{i-1} r_{t+i} + \gamma^k V(s_{t+k}) \qquad (3.13)$$
TD($\lambda$) distributes the weights among the returns in such a way that closer returns get more weight. Each k-step return except the last one, which is the actual return of the episode, is weighted by $(1 - \lambda)\lambda^{k-1}$, and the last one is weighted by $\lambda^{n-1}$; note that this sequence of weights sums to 1 for any n. Equation 3.14 shows the averaged return $R_t^\lambda$ used in TD($\lambda$):

$$R_t^\lambda = \sum_{k=1}^{n-1} (1 - \lambda)\lambda^{k-1} R_t^k + \lambda^{n-1} R_t \qquad (3.14)$$
TD($\lambda$) then updates the state values as shown in equation 3.15:

$$V(s_t) \leftarrow V(s_t) + \alpha\left(R_t^\lambda - V(s_t)\right) \qquad (3.15)$$
Equations 3.13, 3.14 and 3.15 describe the forward view of the algorithm, and they cannot be implemented directly, since they require the agent to know future rewards. In the implementation, each state has a variable called its eligibility trace, denoted by $e_t(s)$ and defined in equation 3.16. When a TD-error at time t (denoted by $\delta_t$) is obtained, it is propagated back through all the previously visited states, and their values are updated according to their eligibility traces and the TD-error, as shown in equation 3.17. This is called the backward view of TD($\lambda$), and it is proved to be approximately equivalent to the forward view in [18].

$$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1, & \text{if } s \text{ is the current state};\\ \gamma\lambda\, e_{t-1}(s), & \text{otherwise.} \end{cases} \qquad (3.16)$$

$$V(s) \leftarrow V(s) + \alpha\, \delta_t\, e_t(s) \qquad (3.17)$$
Eligibility traces can also be used to find the optimal policy by learning Q(s, a) values. The most straightforward way of doing this is to make the Sarsa algorithm use eligibility traces; this version of Sarsa is called Sarsa($\lambda$), and its pseudocode is given in Figure 3.5. A Q($\lambda$) method, however, is problematic, since Q-learning updates the values based on the maximum of the next state's values: if the agent does not follow a greedy policy, it cannot guarantee that an n-step backup with n > 1 corresponds to the maximum possible return [20]. One solution, called Watkins's Q($\lambda$), suggests resetting the eligibility traces to 0 whenever a non-greedy action is selected, but then we do not get much benefit from using eligibility traces.
Randomly initialise Q(s, a)
Initialise e(s, a) to 0 for all s, a
Repeat for each episode
  Initialise the current state s
  Choose action a at the current state s according to the policy
  While s is not a terminal state
    Take action a, observe reward r and next state s′
    Choose next action a′ at the next state s′ according to the policy (e.g. ε-greedy)
    δ ← r + γ Q(s′, a′) − Q(s, a)
    e(s, a) ← e(s, a) + 1
    For all s, a:
      Q(s, a) ← Q(s, a) + α δ e(s, a)
      e(s, a) ← γ λ e(s, a)
    s ← s′; a ← a′

Figure 3.5: The pseudocode for Sarsa(λ), adapted from [18]
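An illustrative Python version of this update loop, using accumulating traces and the same hypothetical env interface as in the Q-learning sketch above:

```python
from collections import defaultdict
import random

def sarsa_lambda(env, n_episodes=500, alpha=0.1, gamma=0.95,
                 lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces, as in Figure 3.5."""
    Q = defaultdict(float)

    def eps_greedy(s):
        acts = env.actions(s)
        if random.random() < epsilon:
            return random.choice(acts)
        return max(acts, key=lambda b: Q[(s, b)])

    for _ in range(n_episodes):
        e = defaultdict(float)          # eligibility traces, reset each episode
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = None if done else eps_greedy(s_next)
            target = 0.0 if done else Q[(s_next, a_next)]
            delta = r + gamma * target - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e):
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s_next, a_next
    return Q
```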
3.3 Exploration in Reinforcement Learning
As mentioned earlier, there is no supervisor in RL; an RL agent must explore its environment in order to discover which actions are better than others [18], [20]. However, how to explore is not straightforward. Here we briefly introduce some commonly used methods from the RL literature; they are mostly based on heuristics, and there is no single best way of exploring for every RL problem.
1. ε-greedy action selection: This method selects a random action (i.e. exploration) with probability ε and the greedy action according to the current action-value estimates (i.e. exploitation) with probability 1 − ε.

2. Soft-max action selection: This method is similar to the ε-greedy method, but it selects the explorative action according to the Boltzmann distribution shown in equation 3.18 instead of selecting it uniformly at random:

$$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b=1}^{n} e^{Q(b)/\tau}} \qquad (3.18)$$

In equation 3.18, the probability of selecting action a is calculated from the Boltzmann distribution, where $\tau$ is a user-defined parameter called the temperature [18]. (A sketch of both selection rules is given after this list.)
3. Optimistic initial values: In this method, the action values are initialised to optimistic values. When the agent does not get what it expects from the actions it has tried, the values of those actions decrease and the agent is tempted to explore the untried actions more. However, with this method the exploration rate drops over time, so it does not perform well in dynamic environments [18].

4. Frequency-based exploration: This method keeps track of how many times each state has been visited. Based on this statistic, it biases the explorative action selection towards the action that will take the agent to the least frequently visited state [2].

5. Recency-based exploration: This method keeps track of, for each state, how much time has passed since that state was last visited. It can then bias the explorative action selection towards the action that will take the agent to the least recently visited state [2].
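The sketch referred to above: ε-greedy and soft-max selection over a dictionary of action-value estimates (a simple illustration, not taken from [18]):

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping action -> estimated value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))            # explore
    return max(q_values, key=q_values.get)              # exploit

def softmax(q_values, temperature=1.0):
    """Boltzmann (soft-max) action selection, equation 3.18."""
    actions = list(q_values)
    prefs = [math.exp(q_values[a] / temperature) for a in actions]
    total = sum(prefs)
    r = random.uniform(0.0, total)
    acc = 0.0
    for a, p in zip(actions, prefs):
        acc += p
        if acc >= r:
            return a
    return actions[-1]
```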
3.4 Multi-Agent Reinforcement Learning
Although we have so far discussed RL as a single-agent framework, its nature shows potential for application to multi-agent systems. Here we will focus on the part of multi-agent RL in which the agents cooperate with each other, because cooperative multi-agent learning is more relevant to the subject of this dissertation; the general form of multi-agent learning also includes agents acting against each other.
Extending RL to multi-agent RL has some difficulties [2]. One such difficulty is that, for
any individual agent, the environment will not be Markovian, because any other agent that
learns and changes its behaviour over time will be an element of the state of that
environment. However, it can be argued that the Markovian environment assumption of RL is
already hard to maintain for realistic problems in the single-agent case too. Another difficulty
is that the state space grows larger as the number of agents increases. One possible
approach to overcome the huge state-space problem is function approximation. We might
also consider ways of formulating the problem where the agents are not an aspect of the state
definition; in other words, we can try to disregard the presence of other agents.
The formal model of multi-agent RL is based on the Markov Game model [2]. The framework of
Markov Games is more flexible than Markov Decision Processes since it allows multiple
adaptive agents with mutual or competing goals [13]. In a Markov game, the agents have a
joint policy h=(h1,…, hn), where n is the number of agents and hi denotes the policy of
agent i. At each time step t, the agents take a joint action ut=[u1,t,…, un,t] at the state st, where
ui,t is the action taken by agent i following the policy hi. The environment then changes its
state from st to st+1 with probability P(st+1 | ut, st), and each agent i receives a reward ri,t+1
with probability P(ri,t+1 | st+1, ut, st). The value of a state s for an agent i under the joint policy
h is defined as the expected return to that agent when the joint policy h is followed by the
agents in the environment, as shown in equation 3.19. We can write the value of a state and
joint action pair in a similar manner, as shown in equation 3.20.
V_i^h(s) = E^h{ Σ_{k=0}^{∞} γ^k r_{i,t+k+1} | s_t = s }    (3.19)

Q_i^h(s, u) = E^h{ Σ_{k=0}^{∞} γ^k r_{i,t+k+1} | s_t = s, u_t = u }    (3.20)
If the agents share the same goal, and therefore the same reward function, then they can cooperate
to maximise the total reward. Such a cooperative Markov Game model is called a multi-agent
Markov Decision Process (MMDP) [13]. We can solve an MMDP by treating the multi-agent
system as a single agent, as in the Team-Q algorithm given in [13].
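The following minimal Python sketch illustrates this single-agent view of an MMDP: the team's joint action space is the Cartesian product of the individual action sets, and one shared table is kept over (state, joint action) pairs. The two agents and their action sets are hypothetical, chosen only for illustration.

from itertools import product

# Hypothetical individual action sets for a two-agent team.
agent_actions = [("up", "down"), ("left", "right")]
# The joint action space is the Cartesian product of the individual sets,
# e.g. ("up", "left"), ("up", "right"), ...
joint_actions = list(product(*agent_actions))

# One shared table over (state, joint action) pairs for the whole team.
Q = {}

def greedy_joint_action(state):
    """Pick the joint action with the highest team value (unseen pairs default to 0)."""
    return max(joint_actions, key=lambda u: Q.get((state, u), 0.0))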
The study by Tan in [19] presents the benefits of cooperation in multi-agent RL systems. In
this study, three cases of cooperation among the agents are analysed. The experiments use a
grid-world environment in which some prey wander, and the task of the RL agents is to
catch this prey. Each RL agent can sense the rectangular field around itself, whose
dimensions are pre-specified (i.e. the state of the environment perceived by the agent). We will
briefly describe these cases in the following paragraph.
In the first case discussed in [19], the agents cooperate by sharing their sensations. It uses
two types of RL agent: the hunter and the scout. The scout always tells the hunter its action and
its state, so if there is a prey in the vicinity of the scout, the hunter can determine the
location of the prey relative to itself based on the location of the scout relative to itself and
of the prey relative to the scout. Tan shows that if the visual field of the scout is large enough, the hunter
and the scout catch a prey in fewer steps than independent agents do. In the second
case, the agents accomplish the tasks themselves but they share their episodes or policies by
(1) using and updating the same decision policy or (2) exchanging their individual policies at
regular intervals. Tan shows that independent agents and cooperating agents converge to the
same optimum number of steps to catch the prey, but the cooperating agents converge faster
than the independent agents. The last case Tan studied is the joint-task situation, where all the
hunters must catch the prey together. In this case it is shown that if the RL agents share
sensations, they can learn to catch a prey in fewer steps than independent agents do.
As a final remark, there are some studies in the literature that use a population of RL agents in
which the agents act independently of each other and learn their own policies; afterwards, their
policies and experiences are utilised by other strategies to find the optimum policy. For
example, the algorithm suggested in [14] uses a population of RL agents learning
independently and a technique similar to Genetic Algorithms to utilise the solutions found by
the different RL agents.
4 PREVIOUS WORK ON RELATING ACO and RL
There are some studies based on the relationship between ACO and RL methods, such as the
ones presented in [11], [15], [10], [16] and [12]. In this chapter, we will briefly describe the Ant-Q algorithm studied in [11], Phe-Q studied in [15] and the AntHocNet algorithm studied in [10]
in the following sections. In [16], methods for improving Ant Colony System (ACS) by
making use of the RL update are discussed and an algorithm called Q-ACS is proposed. However,
this algorithm is the same as the Ant-Q algorithm, which was suggested earlier than ACS by
Gambardella and Dorigo in [11], so this study is not actually novel, although it contains useful
comments on multi-agent reinforcement learning. In [12], an ant algorithm called ARML-TDE is proposed as an
alternative to Ant-Q and ACS based on the TD-error concept of Q-learning, but we think the way
it uses the TD error is confusing and conflicts with the way it is used in Q-learning. Equation 4.1
shows how the TD error is used in the ARML-TDE Q-value updates:
Q(s, a) ← (1 − α)·Q(s, a) + α·TDError,  where  TDError = r + γ·max_{a′} Q(s′, a′) − Q(s, a)    (4.1)
4.1 Ant-Q
Ant-Q is important since it is the first attempt to enhance an ACO algorithm by making use
of the commonalities between the Ant System and Q-learning.
The algorithm was proposed by Gambardella and Dorigo in [11] to improve the performance of
Ant System. The major difference of Ant-Q from Ant System is its pheromone update rule, given
in equation 4.2 and adopted from Q-learning.
τ_{ij} ← (1 − α)·τ_{ij} + α·(Δτ_{ij} + γ·max_{l∈N_j^k} τ_{jl}),    ∀(i, j) ∈ E    (4.2)
This update rule resembles the update rule of Q-learning in two aspects. First, it updates the
pheromone value of the transition (i, j) based on the pheromone value of the successive
transition (j, l). Second, it uses a TD error to adjust the pheromone amount associated with the
current edge, with the learning rate α and discount rate γ. As described earlier, Δτ_{ij} is
calculated based on the solution quality, so the value of Δτ_{ij} for all i and j will be 0 while the
ants apply update rule 4.2 during their solution construction process. To compensate for
this, update rule 4.2 is applied once more when the solution construction is complete, but
this time the value of the next transition (i.e. max_{l∈N_j^k} τ_{jl}) is considered to be 0. As another
difference from Ant System, in the implementation given in [11], Δτ_{ij} is calculated only for
the best tour (i.e. iteration best or global best), so the second update is only applied to the
edges of the best tour.
The experiments described in [11] showed that the Ant-Q algorithm can find better solutions than
the Ant System algorithm and that it does not converge to a single tour, so the ants can keep looking
for better solutions. Despite these promising results, the algorithm was abandoned when Ant
Colony System (ACS) was developed (see section 2.5.2) [8]. In fact, Ant-Q can be thought of as
an ACS algorithm where τ_0 = max_{l∈N_j^k} τ_{jl}, since the update applied by the ants during path
construction, described above, corresponds to the local update of ACS and the update after the
construction corresponds to the global update of ACS. Moreover, ACS was shown to be
simpler and more efficient, since we get the same performance as in Ant-Q by setting
τ_0 to a very small value in ACS [8].
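As a rough sketch (not the authors' reference implementation), the two phases of the Ant-Q update described above could be coded as follows in Python. The layout of the pheromone structure, the quality term W / tour_length and the parameter values are assumptions made for illustration.

def ant_q_local_update(tau, i, j, neighbours_j, alpha=0.1, gamma=0.3):
    """One application of update rule 4.2 during solution construction:
    delta_tau is still 0 at this point, so only the discounted estimate of
    the best successor transition is blended in. `tau` is assumed to be
    indexable by (i, j) pairs (e.g. a NumPy matrix or a dict)."""
    delta_tau = 0.0
    best_next = max(tau[j, l] for l in neighbours_j) if neighbours_j else 0.0
    tau[i, j] = (1 - alpha) * tau[i, j] + alpha * (delta_tau + gamma * best_next)

def ant_q_delayed_update(tau, best_tour, tour_length, alpha=0.1, W=10.0):
    """Second application of rule 4.2 after construction: delta_tau is derived
    from the quality of the best tour (here W / tour_length, an assumption of
    this sketch) and the successor term is taken to be 0, as described in the
    text. `best_tour` is a list of city indices forming a closed tour."""
    delta_tau = W / tour_length
    for i, j in zip(best_tour, best_tour[1:] + best_tour[:1]):
        tau[i, j] = (1 - alpha) * tau[i, j] + alpha * delta_tau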
4.2 Pheromone Based Q-Learning (Phe-Q)
Phe-Q learning, proposed by Monekosso et al. in [15], is a multi-agent Q-learning algorithm
with an additional term called the belief factor in its update rule. Phe-Q learning combines RL
and ant behaviour, since the belief factor is a function of the pheromone trail in the states and it
expresses how strongly the agent believes in the trail.
In Phe-Q, the update rule tries to maximise both the belief factor and the state-action pair values.
The belief factor of a state-action pair is defined as the ratio between the actual total pheromone
trail at the possible neighbouring states when that action is chosen and the maximum
possible total, as given in equation 4.3.
B(s, a) = Σ_{s′∈N_{s,a}} φ(s′) / Σ_{s′∈N_{s,a}} φ_max(s′)    (4.3)
In equation 4.3, N_{s,a} denotes the set of successor states given that action a is chosen at
state s, φ(s) denotes the pheromone trail at state s and φ_max(s) denotes the maximum
pheromone trail state s can hold2. The values of φ(s) are updated based on aggregation,
evaporation and diffusion. Aggregation (via new pheromone deposits) and evaporation are
used in the ACO algorithms we discussed in Chapter 2, but diffusion is an extra factor used in
the pheromone updates of the states in Phe-Q. The pheromone at a state diffuses to its
neighbouring states at a rate φ_d that is inversely proportional to the distance between that
state and the specific neighbour [15].
Equation 4.4 gives the update rule used by the Phe-Q algorithm:
Q(s_t, a) ← (1 − α)·Q(s_t, a) + α·[r_t + γ·max_{a′}(Q(s_{t+1}, a′) + ξ·B(s_{t+1}, a′))]    (4.4)
As seen above, the only difference between the Phe-Q update rule and the Q-learning update rule is
the belief factor, denoted by B(s, a). ξ is a sigmoid function of time whose value increases with
the number of agents who successfully complete the task. A detailed analysis of how to determine
the optimal parameters of this algorithm is given in [15].
The grid-world experiments in [15] show that communicating Phe-Q agents with optimal
parameter settings outperform non-communicating Q-learning agents in the speed of
learning.
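A minimal Python sketch of equations 4.3 and 4.4 is given below. The dictionary-based data structures, the fixed ξ (the paper uses a sigmoid of time) and the parameter values are assumptions of this sketch, not the implementation used in [15].

def belief_factor(phe, phe_max, successors, s, a):
    """Belief factor of equation 4.3: pheromone actually present at the
    successor states of (s, a) divided by the maximum they could hold.
    phe and phe_max map states to pheromone values; successors maps
    (state, action) pairs to lists of successor states."""
    succ = successors[(s, a)]
    total = sum(phe[s2] for s2 in succ)
    max_total = sum(phe_max[s2] for s2 in succ)
    return total / max_total if max_total > 0 else 0.0

def phe_q_update(Q, s, a, r, s_next, actions, B, alpha=0.1, gamma=0.9, xi=0.5):
    """One Phe-Q backup (equation 4.4): the usual Q-learning target is
    augmented by the belief factor of the successor pair, weighted by xi."""
    best = max(Q[(s_next, a2)] + xi * B(s_next, a2) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)

# Example wiring of the belief factor into the update:
# B = lambda s, a: belief_factor(phe, phe_max, successors, s, a)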
2 The pheromone trail values (φ) in Phe-Q are restricted to a user-defined range [T_min, T_max], so φ_max must be T_max, although this is not explicitly stated in the paper [15].
4.3 AntHocNet Routing Algorithm
AntHocNet is an algorithm proposed for mobile ad hoc networks to update the routing
tables in the network with the most cost-efficient route information [10]. Since the nodes can
move, join and leave at any time, the links are very dynamic, so the algorithm has to be very
efficient at adapting to changes in the environment. It is therefore argued in [10] that
AntHocNet is inspired by the ACO algorithms developed for wired networks, but is also
equipped with some additional mechanisms to overcome the difficulties that can arise in such a
dynamic environment.
The AntHocNet algorithm integrates the bootstrapping mechanism of RL methods in a novel way.
The algorithm operates in two ways: a reactive way and a proactive way. Its reactive behaviour
is such that a node starts to collect information if a destination is requested at that node but it
has no corresponding routing information. Its proactive behaviour is such that it tries to keep the
routing information up to date with the changes in the network as long as the communication
between the origin and the destination goes on.
The reactive part uses ACO agents to produce a good initial path. Each node keeps the
routing information in pheromone tables. An entry T_nd^i in the table of node i contains a
pheromone value which estimates the goodness of reaching destination d via neighbour n.
The ants apply the rule given in equation 4.5 to decide on the neighbouring node to visit
next during the route construction to the destination d.
P_{nd} = (T_{nd}^i)^β / Σ_{j∈N_d^i} (T_{jd}^i)^β,    β ≥ 1    (4.5)
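As an illustration, the next-hop choice of equation 4.5 could be sketched in Python as below. The way the pheromone table is represented here (a mapping from destinations to per-neighbour values at node i) is an assumption of this sketch, not AntHocNet's actual data structures.

import random

def choose_next_hop(pheromone_i, d, beta=2.0):
    """Stochastic next-hop choice of equation 4.5: neighbour n is chosen with
    probability proportional to (T_nd^i)^beta. `pheromone_i` maps a destination
    d to a dict {neighbour: pheromone value} held at node i."""
    entries = pheromone_i[d]
    neighbours = list(entries)
    weights = [entries[n] ** beta for n in neighbours]
    return random.choices(neighbours, weights=weights, k=1)[0]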
After the construction, the cost of a route can be calculated based on many measures such as
number of hops, signal quality, congestion, etc. The pheromone update rule is very similar to
the global update rule of ACS given in the equation 2.6.
In the proactive part, the nodes periodically send proactive forward ants that act like
reactive ants to explore better paths and/or to check the validity of existing paths. However,
the ants may not be able to keep up with the speed of the changes in the network, since they
cannot update any pheromone until they have constructed a complete route (note that the ants'
learning is very much like Monte Carlo path sampling). To counter this issue, the
proactive part contains another process called pheromone diffusion and bootstrapping.
Pheromone diffusion is realised by making the nodes broadcast periodic messages that
contain information about their pheromone tables, including the best pheromone value to each
destination d they know about. A node i that has received such a message from its
neighbour j makes an estimate of the cost of the route to each destination d listed in the
message over its neighbour j, based on the best pheromone value reported by j. This
bootstrapped estimate replaces the estimate in the pheromone table if an entry exists for it. If
such an entry does not exist, this means that the corresponding path for the bootstrapped
estimate has not been explicitly sampled by the ants yet. In this case, the pheromone table is
not updated, because the path might not be safe for sending real data before it has actually been
checked. Nevertheless, the estimated value is kept separately, because proactive forward ants
consider both bootstrapped and regular pheromone values during their route constructions.
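A rough Python sketch of this bootstrapping step is given below. The Node class, the treatment of pheromone as the inverse of the estimated route cost, and the way the hop cost is combined with the neighbour's reported value are all assumptions of the sketch; the report does not specify the exact combination formula.

class Node:
    """Minimal stand-in for a network node: regular and virtual (bootstrapped)
    pheromone tables keyed by (neighbour, destination). Purely illustrative."""
    def __init__(self):
        self.pheromone = {}
        self.virtual_pheromone = {}

def process_diffusion_message(node, j, message, hop_cost):
    """Handle a periodic diffusion message from neighbour j. `message` maps
    each destination d to the best pheromone value j reported for d; pheromone
    is treated here as the inverse of the estimated route cost (an assumption)."""
    for d, best_phero_j in message.items():
        # bootstrapped estimate of the goodness of reaching d via j
        estimated_cost = hop_cost + 1.0 / best_phero_j
        bootstrapped = 1.0 / estimated_cost
        if (j, d) in node.pheromone:
            # path already sampled by ants: replace the regular estimate
            node.pheromone[(j, d)] = bootstrapped
        else:
            # not yet sampled: keep it separately, only forward ants use it
            node.virtual_pheromone[(j, d)] = bootstrapped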
25
5 CONCLUSIONS
We have observed that the overall search mechanisms of ACO and RL algorithms are very
similar. As described in this report, ACO algorithms employ a colony of artificial ant-like
agents, each of which constructs a candidate solution to the optimisation problem at hand;
after the solution construction is completed, each ant lays some amount of pheromone on the
path it has taken. The amount of pheromone to be deposited by each ant is determined by
how good its solution is. The collective experience of the colony is built up in the
form of pheromone trails and is utilised by the individual ants while they construct
new solutions in the following iterations of the algorithm. Similarly to ACO, RL algorithms
employ one or more RL agents that try to learn the optimal policy that will maximise the overall
reward in the long run. The experiences of the agents are built up as state/state-action values.
These values indicate how much reward in total the agent expects to receive until the final step
based on its previous experience, and in each iteration of the algorithm the policy is
improved based on these values. Moreover, we observed that both kinds of algorithm face an
exploration-exploitation trade-off due to the nature of their search mechanisms.
We have also observed differences between ACO and RL. One of them is that RL
theoretically requires states to have the Markov property, whereas ACO has no such
requirement. This mainly comes from the fact that state/state-action values in RL are
estimates of the total reward that will be received afterwards, and in order to reach a
good approximation of the actual values by applying the simple Bellman update equation, we
need to ensure that the Markov property holds. In ACO, by contrast, the pheromone trail
associated with a transition indicates how likely it is that the transition belongs to the optimum
tour, independent of the transitions made earlier or later. This is why transitions
used by many ants and/or belonging to less costly tours receive more pheromone. Another
difference is that ACO updates the pheromone values after it constructs a solution, whereas RL
methods (except for Monte Carlo methods) can update state/state-action values based on the
values of the next states before actually observing the overall reward (i.e. bootstrapping).
Since Monte Carlo methods, like ACO methods, do not bootstrap, they are less affected by
violations of the Markov property. Furthermore, RL algorithms can be applied to a variety of
different problems that can be formulated as an RL problem, whereas most studies
of ACO in the literature concern its application to optimisation problems. Finally, RL
methods are originally single-agent methods that have later been extended to multi-agent
methods as well, whereas ACO methods need to be multi-agent due to their biological
inspiration that good solutions to complex problems can emerge from cooperation between
simple agents.
6 REFERENCES
[1] Blum, C., and Dorigo, M. The Hypercube framework for Ant Colony Optimization.
In: IEEE Transactions on Systems, Man, and Cybernetics, Vol. 34, No. 2, April 2004.
[2] Buşoniu, L., Schutter, B., and Babuska R. Learning and Coordination in Dynamic
Multiagent Systems. Technical Report submitted to Delft University of Technology, Delft
Center for Systems and Control, Netherlands, October 2005.
[3] Cordon, O., Herrera, F., Fernandez de Viana, I., and Moreno, L. A New ACO Model
Integrating Evolutionary Computation Concepts: The Best-Worst Ant System. In: Proc. of
ANTS’2000. From Ant Colonies to Artificial Ants: Second International Workshop on Ant
Algorithms, Brussels, Belgium, September 7-9, pp. 22-29, 2000.
[4] Cordon, O., Herrera, F., and Stützle, T. A Review on The Ant Colony Optimization
Metaheuristic: Basis, Models and New Trends. In: Mathware and Soft Computing. Vol. 9,
pp.141-175, 2002.
[5] Dorigo, M., Maniezzo, V., and Colorni, A. Ant System: Optimization by a Colony of
Cooperating Agents. In: IEEE Transactions on Systems, Man, and Cybernetics. Vol. 26, No.
1, 1996.
[6] Dorigo, M., and Gambardella, L. M. Ant Colony System: A Cooperative Learning
Approach to the Travelling Salesman Problem. In: IEEE Transactions on Evolutionary
Computation, Vol.1 No.1, pp.53-66, April 1997.
[7] Dorigo, M., Di Caro, G., and Gambardella, L. M. Ant algorithms for discrete optimisation.
In: Artificial Life 5(2), pp.137-172, April 1999.
[8] Dorigo, M., and Stützle, T. Ant Colony Optimization, MIT Press, Cambridge, MA, 2004.
[9] Dorigo, M., Birattari, M., and Stützle, T. Ant Colony Optimization: Artificial Ants as a
Computational Intelligence Technique. In: IEEE Computational Intelligence Magazine,
November 2006.
[10] Ducatelle, F., Di Caro, G., and Gambardella, L. M. An Analysis of the Different
Components of the AntHocNet Routing Algorithm, In: Proc. Of ANTS’2006. From Ant
Colonies to Artificial Ants: Fifth International Workshop on Ant Algorithms, Brussels,
Belgium, September 4-7, pp.37-48, 2006.
[11] Gambardella, L. M. and Dorigo, M. Ant-Q: A Reinforcement Learning Approach to the
Traveling Salesman Problem. In: Proc. ML-95, 12th Int. Conf. Machine Learning. Palo Alto,
CA: Morgan Kaufmann, pp. 252–260, 1995.
[12] Lee, S. Multiagent Reinforcement Learning Algorithm Using Temporal Difference
Error,
In: Advances in Neural Networks, Lecture Notes in Computer Science, Springer Berlin, 2005.
[13] Littman, M. L., Value-function Reinforcement Learning in Markov Games. In: Journal
of Cognitive Systems Research, Vol.2, pp.55-66, 2001.
[14] Miagkikh, V. and Punch, W. F. An Approach to Solving Combinatorial Optimization
Problems Using a Population of Reinforcement Learning Agents. In: Genetic and
Evolutionary Computation Conference, pp. 1358-1365, 1999.
[15] Monekosso, N., and Remagnino, P., The Analysis and Performance Evaluation of the
Pheromone-Q-learning Algorithm. In: Expert Systems, Vol.21, No.2, pp.80-91, May 2004.
[16] Sun, R., Tatsumi, S., and Zhao, G. Multiagent Reinforcement Learning with an Improved
Ant Colony System. In: IEEE Transactions on Systems, Man, and Cybernetics, Vol.3,
pp.1612-1617, 2001.
[17] Sutton, R. S. Learning to Predict by the Methods of Temporal Differences. In: Machine
Learning, Vol.3, No.1, pp. 9-44, August 1988.
[18] Sutton, R. S., and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press,
Cambridge, MA, 1998.
[19] Tan, M. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In:
Readings in Agents, M. N. Huhns and M. P. Singh, Eds. Morgan Kaufmann Publishers, San
Francisco, CA, pp.487-494, 1998.
[20] Wyatt, J. Reinforcement Learning: A Brief Overview. In: Perspectives on Adaptivity
and Learning, eds. I.O. Stamatescu, pp. 243–264. Springer, New York, 2002.
Relating Ant Colony Optimisation
and Reinforcement Learning
Project Plan
May 2007
MSc in Machine Learning and Data Mining
Supervisor: Tim Kovacs
1 Introduction
Based on the conclusions from our background research, which can be found in the
'Conclusions' section of the interim report, we have reviewed the plan given in our proposal. This
report describes our latest plan for how we intend to carry the project to completion over the
summer. As stated in our proposal, in this project we want to determine in what ways Ant
Colony Optimisation (ACO) and Reinforcement Learning (RL) relate to each other and to use
our findings to create better algorithms. In the section 'Aims and Objectives', we state the
objectives we have set in order to achieve our main goals. In the section 'Initial Design and
Specification', we first describe what we have done so far and divide the remaining work
into smaller tasks. We then give a brief specification and an estimate of the completion time in
weeks for each task.
2 Aims and Objectives
The primary aim of this project is to study the relationship between ACO and RL in terms of
their similarities and differences, both theoretically and experimentally, on one or more chosen
benchmark optimisation problems. We anticipate that realising our primary aim will help us
analyse the strengths and weaknesses of both methods better, so that, as our secondary aim, we
will be able to develop a hybrid algorithm as an improvement over the existing algorithms for the
chosen benchmark problem(s). Note that our secondary aim also entails a discussion of the
quality of these improvements, such as how effective they will be in the solution of real-life
problems akin to the chosen benchmark problem.
Our specific objectives to accomplish these aims are as follows:
1. Study ACO and RL methods closely and identify the significant studies in the
literature related to the relationship between ACO and RL.
2. Designate one or more benchmark optimisation problems suitable for illustrating and
comparing the properties of both methods, pick a few existing ACO and RL
algorithms that are most relevant to the problem, and implement them.
3. Set up experiments to test the performance of the algorithms implemented in 2.
Analyse and discuss the results.
4. Based on the results of 3, look for ways of improvement and implement them.
5. Set up experiments to test the performance of the algorithms implemented in 4.
Analyse and discuss the quality of the improvements.
6. Write up the dissertation based on our findings in 1-5 and prepare for the
presentation.
By the successful completion of the first three objectives listed above, we will have
accomplished the primary aim of this project. The most desirable outcome for our secondary
aim is to introduce an improved algorithm based on the ideas inherited from RL and ACO
methods. Since there are some successful studies such as [7] and [3] that suggest such hybrid
algorithms, we believe this is not an unrealistic target, but it is risky. Therefore, even if we
cannot propose a significant improvement in time, we will try to discuss why the suggested
improvements fail to outperform existing methods. Although this is a less desirable outcome,
this will be reasonably sufficient to fulfil our secondary aim.
3 Initial Design and Specification
3.1 Progress Made
We have so far completed surveying the literature about RL and ACO methods and we have
examined some existing studies related to the relationship between the two areas. We have
documented our survey in the interim report. This corresponds to our first objective listed in
the ‘Aims and Objectives’ section above. Furthermore, as a result of this survey, we have
determined some requirements for the benchmark problem we will use in our tests:
1. Our basic requirement is that we should be able to formulate it as an ACO and an RL
problem.
2. We want it to be a good representative of the optimisation problems that need to be
dealt with in real life.
3. We want it to be a fresh problem that has not been studied extensively by the two
areas so that there will be room for improvement.
4. We would like to compare different RL and ACO methods especially in terms of how
quickly and efficiently they can utilise the experience to find good solutions and how
they balance exploration and exploitation to avoid getting stuck in suboptimal
solutions. Therefore we want the benchmark problem to be suitable for illustrating
and comparing these properties of the algorithms.
According to these requirements, we have decided to choose a dynamic TSP problem as our
benchmark problem. Since there are many ACO algorithms and some RL algorithms
proposed for static TSP problems, there already exist TSP formulations for both ACO and RL
that we can start with. So our first requirement is satisfied. Most of the optimisation problems
encountered in real life are dynamic meaning that the solution to the problem changes over
time. For this reason, choosing a dynamic optimisation problem will satisfy our second
requirement, as well. Furthermore, applying ACO methods to solve dynamic optimisation
problems is a new trend in the ACO area [2]. Since existing ACO algorithms cannot
effectively adapt to the changes in dynamic environments and often need to be
restarted, ACO researchers have lately proposed some improvements, as seen in [1], [5], and [4].
On the other hand, applying RL methods to a dynamic optimisation problem, especially
dynamic TSP, has not been studied much either. Therefore dynamic TSP problem satisfies
our third requirement, too. Moreover, dynamic environments are very suitable for testing
whether the algorithms explore enough to avoid getting stuck in a suboptimal solution when
the optimal solution changes and how quickly they can adapt to the new environment. Finally
it satisfies our fourth requirement. Dynamic TSP also has the following advantages of being a
TSP problem:
1. TSP problems are easy to understand and formulate although they are difficult to
solve.
2. There is no benchmark data for dynamic optimisation problems, but there is a
significant amount of static TSP benchmark data that can be used to create our
benchmark dynamic TSP, as discussed in [9].
3.2 Work to Be Done
The rest of the work can be divided into the following tasks:
1. Find or create Dynamic TSP problems from available benchmarked TSP problems.
2. Research existing RL solutions to Dynamic TSP and implement a
representative selection of them.
3. Research existing ACO solutions to Dynamic TSP and implement a
representative selection of them.
4. Run experiments and do comparative evaluations of the algorithms implemented in
2 and 3.
5. Develop one or more new algorithms for Dynamic TSP combining ideas from ACO and RL,
and implement them.
6. Run experiments and do comparative evaluations of the algorithms implemented
in 2, 3, and 5.
7. Write up the dissertation.
Here follows a brief explanation of the tasks 1-6:
Task-1: Finding or Creating Dynamic TSP problems
Dynamic optimisation problems include a scenario that determines the sequence of events
changing the environment over time, and there is a difficulty in identifying the most suitable
scenario for one's particular testing purposes [9]. Since the scenarios are mostly
dependent on the problem at hand, it is hard to find a ready-made benchmark. However, there are
some studies on Dynamic TSP in the literature that we can borrow ideas from. The following are
some basic methods proposed recently for generating a Dynamic TSP benchmark:
1. Keeping a spare pool of cities and exchanging cities between the problem and the
spare pool at specific intervals. [5] uses this method.
2. Changing the cost of edges at specific intervals while making sure that the change will
affect the global solution. [4] uses this method.
As discussed in [9], the second method is harder to apply effectively than the first, because
totally random changes might be insignificant, meaning that they do not change the local
optimum; in that case we cannot demonstrate the adaptability of the algorithm, since the path it
learned before the change will still be good after the change. So, if we decide to employ the
second method, we have to make sure that the changes are significant, as done in [4].
Moreover, a generalised benchmark generation algorithm, which can be thought of as a
combination of several available methods including methods 1 and 2 above, is
proposed in [9]. We believe this might be useful as well in creating our benchmark problem.
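As a rough illustration of the first method above, the following Python sketch generates a schedule of city swaps between the active problem and a spare pool. The data layout and parameter values are assumptions for illustration, not the benchmark generator of [5] or [9].

import random

def make_dynamic_tsp(active_cities, spare_pool, swap_every, horizon, seed=0):
    """Method 1 sketch: every `swap_every` iterations, exchange one active
    city with one drawn from the spare pool. Returns a list of
    (time, active city set) snapshots describing the dynamic instance."""
    rng = random.Random(seed)
    active = list(active_cities)
    spare = list(spare_pool)
    schedule = [(0, list(active))]
    for t in range(swap_every, horizon + 1, swap_every):
        i = rng.randrange(len(active))
        j = rng.randrange(len(spare))
        active[i], spare[j] = spare[j], active[i]   # swap one city in, one out
        schedule.append((t, list(active)))
    return schedule

# Example: 50 active cities, a spare pool of 10, one swap every 100 iterations.
schedule = make_dynamic_tsp(list(range(50)), list(range(50, 60)),
                            swap_every=100, horizon=1000)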
We expect this task to last for 1 week.
Task-2: Implementing RL solution for Dynamic TSP
There are no existing RL solutions to Dynamic TSP that we know of. However, there are
some applications of RL to the static TSP, such as the one described in [6]. At this stage we need
to do further research on RL solutions to optimisation problems, especially to the TSP, and we
will base our implementation on the most suitable one we can find. We expect this task to take
1.5 weeks.
Task-3: Implementing ACO solution for Dynamic TSP
There are many ACO solutions to the static TSP in the literature and they can easily be applied to
Dynamic TSP as well. There are also some improved ACO solutions for Dynamic TSP, like
the ones given in [5] and [4]. We will implement a selection of such algorithms in this task.
We expect this task to take 1.5 weeks.
Task-4: Tests and Comparison of ACO and RL solutions on Dynamic TSP
This task consists of the preparation of some tests and the evaluation of the performance and
adaptability of the ACO and RL solutions implemented in tasks 2 and 3 above. We expect
this to last for 2 weeks. However, there is a risk that a fault in the previous tasks 1, 2 and 3
is revealed during the tests. We have therefore kept the implementation of the algorithms rather
short and have given this task one extra week in case we need to go back. As a result, we plan
for this task to take 3 weeks.
Task-5: Developing a novel algorithm combining ideas from ACO and RL
In this task, we will look for ways of developing a better algorithm based on the results of the
previous tasks. As mentioned in the aims and objectives section, there is a risk of not being
able to introduce significant improvements at the end of this task and the next task; in that
case we will try to determine the reasons for it. We expect this task to take 2 weeks.
Task-6: Tests and Evaluation of the novel algorithm
In this task we will run tests similar to the ones in task 4. Based on the results of the tests, we
may need to go back to task 5 and review our algorithm. Moreover, if time allows, we may
implement one of the hybrid algorithms from the literature (assuming it is applicable to
Dynamic TSP) and compare the performance of our algorithm against that algorithm as well.
Including the revision of the algorithm, we expect this task to take 3 weeks.
3.3 Previous Work
We have described some of the previous work in the interim report. Two of these studies,
Pheromone-Q in [7] and AntHocNet routing in [3], seem very promising. However, in
Pheromone-Q the tests were done in a grid-world environment with no dynamism,
so we do not know how this algorithm will perform in a dynamic environment
like Dynamic TSP. AntHocNet was developed for finding efficient routes in a continuously
changing wireless ad hoc network, and the algorithm is very specific to that problem, as it
includes simultaneous message exchanges between the devices in the network. So it
might not be possible to apply this algorithm to the Dynamic TSP problem as is, although it
is closer to what we intend to do. Moreover, the previous MSc thesis from
2004 [8] proposes a novel hybrid algorithm, but that algorithm was developed for a static TSP
problem and it also lacks a detailed analysis of the results.
There are also some ACO algorithms adapted specifically for Dynamic TSP in the literature,
such as the ones proposed in [1], [4] and [5], but none of them is based on a combination of RL
and ACO.
4 Timescale
The project timescale is laid out in weekly columns from 11/06 to 01/10 and covers the following tasks:
1) Finding or creating dynamic TSP problems
2) Implementing RL solution
3) Implementing ACO solution
4.a) Test & comparison of ACO and RL
4.b) Revision of ACO and RL implementations
5) Developing a novel algorithm
6.1) Test & evaluation of novel algorithm
6.2) Revision of novel algorithm
7) Write dissertation
5 References
[1] Angus, D. and Hendtlass, T. Dynamic Ant Colony Optimisation. In: Applied Intelligence
Vol.23, No.1, pp.33-38, July 2005.
[2] Angus, D. The Current State of Ant Colony Optimisation Applied to Dynamic Problems,
Technical Report submitted to Faculty of Information & Communication Technologies
Swinburne University of Technology, Melbourne, Australia, 2006.
[3] Ducatelle, F., Di Caro, G., and Gambardella, L. M. An Analysis of the Different
Components of the AntHocNet Routing Algorithm, In: Proc. Of ANTS’2006. From Ant
Colonies to Artificial Ants: Fifth International Workshop on Ant Algorithms, Brussels,
Belgium, September 4-7, pp.37-48, 2006.
[4] Eyckelhof, C. J., Snoek, M. Ant Systems for a Dynamic TSP: Ants Caught in a Traffic
Jam, In: Proc. of ANTS’2002. Brussels, Belgium, pp.88-99, 2002.
[5] Guntsch, M., Middendorf, M., Schmeck, H. An Ant Colony Optimization Approach to
Dynamic TSP. In: L. Spector et al. (eds.) Proceedings of the Genetic and Evolutionary
Computation Conference, San Francisco, CA: Morgan Kaufmann Publishers, pp.860-867,
2001.
[6] Miagkikh, V. and Punch, W. F. An Approach to Solving Combinatorial Optimization
Problems Using a Population of Reinforcement Learning Agents. In: Genetic and
Evolutionary Computation Conference, pp.1358-1365, 1999.
[7] Monekosso, N., and Remagnino, P. The Analysis and Performance Evaluation of the
Pheromone-Q-learning Algorithm. In: Expert Systems, Vol.21, No.2, pp.80-91, May 2004.
[8] Pournos, P. A mapping of Ant Algorithms to the Reinforcement Learning Framework,
MSc Thesis submitted to Department of Computer Science, University of Bristol, 2004.
[9] Younes, A., Calamai, P., and Basir, O. Generalized benchmark generation for dynamic
combinatorial problems. In: Proceedings of the 2005 Workshops on Genetic and
Evolutionary Computation (Washington, D.C., June 25 - 26, 2005). GECCO '05. ACM Press,
New York, NY, pp. 25-31, 2005.