
A New Learning Technique for Planning in Cooperative Multi-agent Systems

Walid Gomaa a,b, Mohamed A. Khamis a,∗

a Cyber-Physical Systems Lab, Egypt-Japan University of Science and Technology (E-JUST), New Borg El-Arab City, Alexandria 21934, Egypt
b Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt
Abstract
One important class of problems in multi-agent systems is that of planning or sequential decision making. Planning is the
process of constructing an optimal policy with the objective of reaching some terminal goal state. The key aspect of multi-agent planning is coordinating the actions of the individual agents. There are three major approaches for coordination:
communication, pre-imposed conventions, and learning. This paper addresses multi-agent planning through the learning
approach. To facilitate this study a new taxonomy of multi-agent systems is proposed that divides them into two main
classes: cooperative multi-agent systems and competitive multi-agent systems. This paper focuses on planning in the
cooperative setting with some extensions to the competitive setting. A new reinforcement learning-based algorithm is
proposed for learning a joint optimal plan in cooperative multi-agent systems. This algorithm is based on the decomposition
of the global planning problem into multiple local matrix games planning problems. The algorithm assumes that all agents
are rational. Experimental studies on some grid games show the convergence of this algorithm to an optimal joint plan. This
algorithm is further extended to planning in weakly competitive multi-agent systems (a subclass of competitive multi-agent
systems). This new version of the algorithm also assumes rational behavior of the agents in all problem stages except in
some situations where it is believed that a deviation from the rational behavior gives better results. These two versions of the
algorithm assume the use of a lookup table to hold the problem data. This representation is infeasible in practical applications
where the problem size is prohibitive. So, a third version of the algorithm is proposed which uses a two-layer feed-forward
Artificial Neural Network to represent the problem data. The learning rule of this version is a combination of reinforcement
learning and BACKPROPAGATION (a learning technique for feed-forward artificial neural networks).
Keywords:
Multi-agent systems, reinforcement learning, feed-forward Artificial Neural Network, cooperative multi-agent systems,
competitive multi-agent systems.
1. Introduction
MAS research aims at creating systems that interconnect separately developed agents, thus enabling the ensemble to function beyond the capabilities of any single agent in the set-up [28]. The field of MASs has gained much interest in recent years because of the increasing complexity, openness, and distributivity of current and future systems. Research in MASs dates back to the late 1970s, when the field was mostly exploratory in nature. The main research issues were: distributed interpretation of sensor data, organizational structuring, and generic negotiation protocols [30]. Like the rest of AI, the field has matured and is now focusing on formal theories of negotiation, distributed reasoning, multi-agent learning, communication languages [30], and multi-agent planning. The first MAS applications appeared in the mid-1980s and increasingly cover a variety of domains, ranging from manufacturing and air traffic control to information management. Some representative examples of these applications are: the Distributed Vehicle Monitoring Testbed (DVMT), ARCHON, OASIS, and WARREN [37].
The key aspect of multi-agent planning is coordinating the actions of the individual agents. Two major
approaches have been used to attack the coordination problem: communication and pre-imposed conventions.
Research in Distributed Artificial Intelligence (DAI) has been mainly concerned with these approaches. DAI
assumes that the agents cooperate to achieve some common goal or accomplish some task. Unfortunately,
communication and pre-imposed conventions are not sufficient to meet the demands of today's and future MAS applications,
which have the following unique characteristics: (1) The relation between the agents is not just fully cooperative
as assumed by DAI. A broader spectrum of MASs exists covering the range from fully cooperative to fully
competitive agents. (2) MASs must deal with dynamic, open, non-stationary, unmodeled, and highly distributed
environments and must do that in an efficient way. Even the number of agents and their capabilities need not be
held fixed. (3) Communication bandwidth is a scarce resource. Communication delays may also affect the
system effectiveness especially in real-time systems. So the goal is to minimize the communication between the
agents. (4) As the number of agents in a MAS increases, the search space grows exponentially making the
requirements of storage and computational power infeasible. So a more compact representation of the search
space is required so that similar states are represented by just one state.
The only feasible way to handle such challenges is through Machine Learning (ML) techniques and in
particular Reinforcement Learning (RL). The RL paradigm may be the best way to address the above challenges
since it has the following distinguishing features: (1) RL algorithms do not require a pre-knowledge of the
environment's model. (2) RL provides continual adaptive online learning algorithms where learning is
interleaved with operation. Contrast this with other forms of learning such as supervised and unsupervised
learning where learning is done offline, that is the learning phase is completely separated from the operation
phase of the system. (3) The main class of problems best handled by RL is the class of multi-stage problems.
The planning problem fits completely within this class. (4) RL has been rigorously studied in the domain of
single-agent systems during the past ten years. This provides a strong base for extension to the MASs domain.
(5) RL provides algorithms that learn in an incremental step-by-step manner. This implies efficiency both in
terms of required storage and computational power. (6) RL can be combined with supervised learning to provide
generalization techniques which are necessary for compact representation of the search space. This combination
is sometimes referred to as neurodynamic programming.
The most popular RL algorithm that has been used in single-agent systems and as a basis for RL extension
to MASs is Q-learning. So far in practice, most people still use single-agent Q-learning for learning in MASs
[18]. However, during the past few years some multi-agent Q-learning algorithms have been proposed. For
example, [21] extends Q-learning to zero-sum MASs. [17] further extends the work of [21] to the case of
general-sum MASs that have certain restrictions. [39] uses single-agent Q-learning in the multi-agent setting
giving the agents the facility to communicate. The purpose of communication is to share sensation, experience,
and learned knowledge. [4] conjectures that single-agent Q-learning will lead to equilibrium strategies in the
fully cooperative MASs.
The work in this paper represents another attempt to extend the Q-learning algorithm to the multi-agent
setting. To do this Q-learning is combined with work in Game Theory (GT). The major objectives of this paper
are: (1) To provide a taxonomy of MASs. MASs form a complex research domain; the only feasible way to study it
is to classify MASs according to some criteria. (2) To define the planning problem in MASs and identify its
distinguishing features from planning in single-agent systems. (3) To show the lack of convergence guarantee of
single-agent Q-learning when used in the multi-agent setting. This justifies the need for a new formulation of the
multi-agent setting and the necessity of modifying the single-agent Q-learning. (4) To provide a mathematical
formulation of the multi-agent planning problem (Multi-Agent Markov Decision Process, MMDP). (5) To
provide formal definitions of all subclasses of matrix games (a special type of MASs). (6) To formulate the
planning problem in cooperative MASs (a subclass of MASs) based on the decomposition of the global planning
problem into multiple local matrix games planning problems (Cooperative Multi-agent Markov Decision
Process, CMMDP). (7) To propose a new learning algorithm for solving the planning problem in cooperative
MASs. The new algorithm is an extension of Q-learning that is based on CMMDP. (8) To perform experiments
to examine the CMMDP as a new model of cooperative multi-agent planning and to verify the convergence of
the new algorithm. (9) To address the problem of combinatorial explosion of the problem size in practical
applications by combining the new Q-learning based algorithm for cooperative MASs with generalization
techniques from supervised learning. (10) To provide new concepts such as the taxonomy of MASs, new models
such as CMMDP, and new algorithms that may be of importance to the Game Theory community. (11) To give
a review of the RL paradigm and its basic elements.
Preliminary results of this work have been published in [43]. In this paper, we provide a more detailed description and a detailed survey of the state-of-the-art. This paper is organized in seven Sections as
follows: Section 1 begins with a general introduction to MASs, followed by explaining the motivation behind
the research done in this paper. Section 2 reviews RL. A precise and simple mathematical formulation of the RL
problem, the Markov Decision Process (MDP), is given. Section 3 discusses MASs. It begins by defining MASs
and the research motivations behind it. The preference for Machine Learning techniques, and in particular RL, to
solve the coordination problem is justified. Section 4 extends the single-agent Q-learning algorithm (introduced
in Section 2) to the cooperative multi-agent setting. Section 5 presents the experimental work done in this paper.
Three main experiments are performed: grid game 1, grid game 2, and grid game 3. These games represent,
through different settings, instances of strongly cooperative, weakly cooperative, and weakly competitive
MASs. The main objective of these experiments is the verification of the theoretical results concluded in Sections 3 and 4. Section 6 presents generalization, where a function approximator, a parameterized functional form, is
used to represent a joint action. It proposes using a feedforward Artificial Neural Network to implement the
function approximator with gradient descent as the technique for learning the parameter vector. In Section 7,
some concluding remarks and suggestions for future work are presented.
2. Reinforcement Learning
2.1 Basic definitions
Definition 1: Artificial Intelligence: Artificial Intelligence (AI) may be defined as the branch of computer
science that is concerned with the automation of intelligent behavior [23].
Definition 2: Machine Learning: Machine Learning studies methods for developing computer systems that can
learn from experience. These methods are appropriate in situations where it is impossible to anticipate at
design time exactly how a computer system should behave. The ultimate behavior of the system is determined
both by the framework provided by the programmer and the subsequent input/output experiences of the program
[11].
Definition 3: Supervised Learning: In a supervised learning system, also called learning by a teacher, the
learner is provided with a set of example pairs <input,output> for training. The second component of this pair
represents the desired output for the particular input given by the first component.
Definition 4: Unsupervised Learning: In an unsupervised learning system, also called learning without a
teacher, the learner is provided only with a set of instances or inputs as training examples. The system is not
given any external indication as to what the correct responses should be nor whether the generated responses
are right or wrong. Statistical clustering methods, without knowledge of the number of clusters, are examples of
unsupervised learning [3].
2.2 Reinforcement Learning
Definition 5: Reinforcement Learning: Reinforcement learning is learning what to do, how to map situations
to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most
forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the
most interesting and challenging cases, actions may affect not only the immediate reward, but also the next
situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and
delayed reward, are the two most important distinguishing features of reinforcement learning [35].
2.3 The MDP Framework
Definition 6: Markov Decision Process: A Markov decision process (MDP) is a 4-tuple (S, A, T, R) where
i. S is a discrete state space (a finite set of environment states).
ii. A is the agent's action space (the set of actions available to the agent).
iii. T is the state transition function T: S × A × S → [0,1]. It defines a probability distribution over next states as a function of the current state and the agent's action. T can also be defined as T: S × A → PD(S), where PD(S) is the set of discrete probability distributions over the set S.
iv. R is the reward function R: S × A → ℝ (ℝ is the set of real numbers). It defines the expected immediate reward received when selecting an action from the given state.
Solving MDPs consists of finding a stationary policy or plan π: S → PD(A), where PD(A) is the set of discrete probability distributions over the set A. This mapping is such that the agent can achieve his goal. A
detailed discussion of the basic elements of the MDP framework is given in the following subsections. At each
of a sequence of discrete time steps (or successive stages of a sequential decision-making process [34]), t =
0,1,2,3,…., the agent perceives the state of the environment st, and selects an action at. In response to the action,
the environment changes to a new state st+1, and emits a scalar reward rt+1. The dynamics of the environment
are stationary and Markovian, but are otherwise unconstrained [34]. Stationarity means that the model of the
environment, that is the transition and reward functions, do not change over time. The environment is
Markovian if the state transitions are independent of any previous environment states or agent's actions. In a
formal way, the environment is Markovian if the following equation is satisfied at every time step t:
Prst 1  s'| st , at , st 1 , at 1 ,s0 , a0   Prst 1  s' | st , at  (1)
At each time step, the agent implements a mapping from states to probabilities of selecting each possible
action. This mapping is called the agent's policy or plan. The agent's goal is to maximize the total amount of
rewards he receives over the long run. Let the sequence of rewards received after time step t be rt+1, rt+2, rt+3,….
In general, the goal is to maximize the return Rt which is defined as some specific function of the reward
sequence. In the simplest case the return is just the sum of the rewards:
R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T    (2)
, where T is a final time step [35].
According to this approach, the agent tries to select actions so that the sum of the discounted rewards he receives over the future is maximized. In particular, he chooses action a_t to maximize the discounted return:
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}    (3)
where γ ∈ [0,1] is called the discount factor or the discount rate.
2.4 Value Functions
2.4.1 State-value function
A policy π is a mapping from states s ∈ S and actions a ∈ A to the probability π(s,a) of taking action a in state s. Informally, the value of a state s under a policy π, denoted V^π(s), is the expected return when starting in state s and following policy π thereafter. For MDPs, V^π(s) can be defined formally as:
V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}    (4)
, where E_π{·} denotes the expected value given that the agent follows policy π, and t is any time step. Note that the value of the terminal state, if any, is always zero. The function V^π is called the state-value function for policy π [35].
2.4.2 Action-value function
Similarly, the value of taking action a in state s under a policy π, denoted Q^π(s,a), can be defined as the expected return if starting from state s, taking action a, and thereafter following policy π:
Q^\pi(s,a) = E\{r_{t+1} \mid s_t = s, a_t = a\} + \gamma E_\pi\{R_{t+1} \mid s_t = s, a_t = a\}
           = R(s,a) + E_\pi\{\sum_{k=1}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}    (5)
The function Q^π is called the action-value function for policy π [35].
2.4.3 Bellman equations
A fundamental property of value functions is that they satisfy particular recursive relationships. For any policy π and any state s, the following condition holds between the value of state s and the value of its possible successor states:
V^\pi(s) = \sum_{a \in A} \pi(s,a) \left[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^\pi(s') \right]    (6)
Eq. 6 is called the Bellman equation for V^π. It expresses a relationship between the value of a state and the values of its successor states [35]. See Appendix A.1 for the derivation of Eq. 6. Similarly, given a policy π the following equation holds between the value of any state-action pair (s,a) and the values of possible successor states:
Q^\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^\pi(s'), \quad V^\pi(s') = \sum_{a' \in A} \pi(s',a') Q^\pi(s',a')    (7)
Eq. 7 is the Bellman equation for Q^π.
2.5 Optimal Value Functions
Solving a reinforcement learning (an MDP) task means, roughly, finding a policy that achieves a lot of reward
over the long run. A precise definition of an optimal policy can be as follows. Value functions define a partial
ordering over policies. A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states [35]. Formally:
\pi \geq \pi' \iff V^\pi(s) \geq V^{\pi'}(s) \quad \forall s \in S    (8)
, where ≥ denotes the relation better than or equal to. There is always at least one policy that is better than or equal to all other policies [35]. This is an optimal policy. Although there may be more than one, all optimal policies are denoted π*. They share the same state-value function, called the optimal state-value function, denoted V*, defined as follows:
V^*(s) = \max_{\pi \in \Pi} V^\pi(s)    (9)
, where Π is the space of all possible policies. Optimal policies also share the same optimal action-value function, denoted Q*, defined as follows:
Q^*(s,a) = \max_{\pi \in \Pi} Q^\pi(s,a)    (10)
For any state-action pair (s,a), this function gives the expected return for taking action a in state s and thereafter
following an optimal policy [35]. Because V* is the state-value function for a policy, it must satisfy the
condition given by Bellman Eq. 6. Because it is the optimal state-value function, however, V* can be written in a
special form without reference to any specific policy. This is the Bellman equation for V* or the Bellman
optimality equation. This equation expresses the fact that the value of a state under an optimal policy must equal
the expected return of the best action from that state:
V^*(s) = \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^*(s') \right]    (11)
Eq. 11 is the Bellman optimality equation for V* (see Appendix A.2 for its derivation). The Bellman optimality equation for Q* is:
Q^*(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V^*(s'), \quad V^*(s') = \max_{a' \in A} Q^*(s',a')    (12)
2.6 Approaches for solving the RL problem
2.6.1 Dynamic programming
Dynamic programming refers to the class of algorithms that can be used to compute optimal policies given the
model of the environment. The environment's model is fully described by giving the state transition function T
and the reward function R. In such cases where the model is known, learning can be performed offline on a
simulated environment. Classical DP algorithms are of limited utility in reinforcement learning both because of
their assumption of a known model and because of their computational expense, but they are still very important
theoretically [35]. The key idea of DP algorithms is to turn the Bellman optimality equations (Equations 11 and
12) into assignment statements, that is, into update rules for improving approximations of the desired value
functions. The next subsection gives an example of a DP algorithm.
Initialize V(s) arbitrarily, e.g., V(s) = 0 ∀s ∈ S
Repeat
    Δ ← 0
    For each s ∈ S do:
        v ← V(s)
        V(s) ← \max_{a \in A} \left[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V(s') \right]    (13)
        Δ ← max(Δ, |v − V(s)|)
Until Δ < θ (a small positive number)
Output a deterministic policy π such that:
    \pi(s) = \arg\max_{a \in A} \left[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s') V(s') \right]    (14)
Algorithm 1: Value iteration
One way to find an optimal policy is to find the optimal value function. It can be determined by a simple
iterative algorithm called value iteration. Algorithm 1 gives the complete value iteration algorithm. In this
version of the algorithm iterations are stopped once the value function changes by only a small amount in a
sweep. Note the use of Bellman optimality equation, Eq. 11, as an assignment statement, Eq. 13, to improve the
current approximation of the optimal value function. Value iteration can be shown to converge to the optimal
value function V* given that each state is visited infinitely often. Updates of the value function given in
Algorithm 1 are known as full backups since they make use of information from all possible successor states.
This is contrasted with sample backups, which are critical to the operation of model-free methods such as Q-learning, which will be discussed shortly.
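The following Python sketch implements Algorithm 1 for a known model. The representation of T and R as nested dictionaries is an assumption made only for this illustration; the paper does not prescribe a data structure.

```python
def value_iteration(S, A, T, R, gamma=0.9, theta=1e-6):
    """Algorithm 1 (value iteration) for a known model.
    T[s][a] is a dict {s': probability}; R[s][a] is the expected immediate reward."""
    V = {s: 0.0 for s in S}                       # initialize V(s) arbitrarily (here to zero)
    while True:
        delta = 0.0
        for s in S:
            v = V[s]
            # Bellman optimality backup used as an assignment (Eq. 13)
            V[s] = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                       for a in A)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:                         # stop when the value function is stable
            break
    # Extract a deterministic greedy policy (Eq. 14)
    pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in S}
    return V, pi
```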
2.6.2 Monte Carlo methods
Unlike the DP approach, Monte Carlo methods do not assume knowledge of the environment's model. They
require only experience (sample sequences of states, actions, and rewards from actual or simulated interaction
with the environment). At each interaction an experience tuple is formed (s,a,s',r), where st = s, at = a, st+1 = s',
and rt+1 = r. It is assumed that the sequence of these tuples is divided into subsequences called episodes or
trials. Each episode terminates in a final goal state at which the total return is computed. Monte Carlo methods
are ways of solving the RL problem based on averaging sample returns. It is only upon the completion of an
episode that value estimates and policies are changed. Monte Carlo methods are thus incremental in an episode-by-episode sense rather than in a step-by-step sense [35].
Initialize, ∀s ∈ S, ∀a ∈ A:
    Q(s,a) ← arbitrary
    π(s) ← arbitrary
    Returns(s,a) ← empty list
Repeat
    Policy Evaluation:
        Generate an episode using the current policy π and an appropriate exploration strategy.
        For each pair (s,a) appearing in the episode:
            R ← r_{t+1} + r_{t+2} + \cdots + r_T, where t is the first time step at which s_t = s and a_t = a,    (15)
            and T is the final time step of the episode at which the goal state is reached.
            Append R to Returns(s,a)
            Q(s,a) ← average(Returns(s,a))    (16)
    Policy Improvement:
        For each state s in the episode:
            \pi(s) = \arg\max_{a \in A} Q(s,a)    (17)
Until termination
Algorithm 2: MC policy iteration
In MC policy iteration one maintains both an approximate policy and an approximate value function. The
value function is repeatedly altered to more closely approximate the value function of the current policy, and the
policy is repeatedly improved with respect to the current value function. These two phases, called policy
evaluation and policy improvement, are continually alternating until reaching a fixed policy. Thus the sequence
of monotonically improving policies and value functions can be obtained as shown in Fig. 1, where E means policy evaluation and I means policy improvement.
Figure 1: The operation of MC policy iteration
Once a policy π has been improved using its action-value function Q^π to yield a better policy π', Q^{π'} can be computed and used again to yield an even better policy π''. This constitutes a sequence of monotonically improving policies and action-value functions.
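A sketch of the episode-by-episode update of Algorithm 2 in Python follows. An episode is assumed to arrive as a list of (state, action, reward) triples ending at the goal state; a first-visit convention is used, which is one common reading of Eq. 15. The containers and names are illustrative, not taken from the paper.

```python
from collections import defaultdict

def mc_policy_iteration_step(episode, Q, returns, policy, actions):
    """One evaluation/improvement sweep of Algorithm 2.
    episode: list of (s, a, r) triples; returns[(s, a)] is the list Returns(s, a)."""
    # Policy evaluation: average the sampled (undiscounted) returns, Eqs. 15-16.
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    for (s, a), t in first_visit.items():
        R = sum(r for (_, _, r) in episode[t:])   # return from the first visit to the goal
        returns[(s, a)].append(R)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    # Policy improvement: act greedily with respect to the current Q, Eq. 17.
    for (s, _, _) in episode:
        policy[s] = max(actions, key=lambda a: Q[(s, a)])

# Typical containers for repeated calls:
Q = defaultdict(float)
returns = defaultdict(list)
policy = {}
```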
2.6.3 Temporal difference learning
Temporal difference (TD) learning is the most important approach for solving the RL planning problem. It gives
the RL technique its wide applicability in real practical problems. TD methods have two main features: (1)
Learning is done directly from raw experience without a model of the environment's dynamics. (2) Updating
estimates is based in part on other learned estimates, without waiting for a final outcome as MC methods do.
Updates are done on a step-by-step incremental basis.
TD methods use raw experience to estimate the value function. If a state st is visited at time t, a TD method
updates its estimate V(st) based on what happens after that visit at the next time step t+1. At time t+1 it
immediately forms a target and makes a useful update using the observed reward rt+1 and the estimate V(st+1).
The simplest TD method, known as TD(0), is
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]    (18)
, where α ∈ [0,1] is the learning rate. The update rule in Eq. 18 is known as a sample backup rule since it is based on a single sample successor state rather than on a complete distribution of all possible successor states.
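The sample backup of Eq. 18 amounts to a single assignment. A minimal sketch, assuming V is a plain dictionary and the experience (s, r, s') comes from one interaction step (names are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0) sample backup (Eq. 18): move V(s) toward the target r + gamma * V(s')."""
    V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0))
```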
One of the most important breakthroughs in reinforcement learning was the development of the TD algorithm
known as Q-learning [41] (cited in [20]). It is defined by the following update rule:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_{s_t,a_t} \left[ r_{t+1} + \gamma \max_{a \in A} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \right]    (19)
In this case the learned action-value function, Q, approximates Q*, the optimal action-value function, independently of the policy being followed. Although the policy has an effect in that it determines which state-action pairs are visited and updated, all that is required for convergence is that all pairs continue to be updated sufficiently often. The learning rate α should be decayed appropriately, so it is in general a function of the state-action pair (s_t, a_t). Algorithm 3 shows Q-learning in procedural form [35]. The Q-learning algorithm is guaranteed to converge to the optimal Q-values with probability 1 if the following conditions hold [29]: (1) The environment is stationary and Markovian, that is, the environment can be modeled as a Markov Decision Process. (2) A lookup table is used to store the Q-values. (3) Every state-action pair continues to be visited (the issue of exploration). (4) The learning rate is decreased appropriately over time.
Initialize Q(s,a) arbitrarily
Repeat
    From the current state s, select an action a, using a policy derived from the current approximation of Q, or just explore a random action.
    Take action a, and observe the environment's response r and s'.
    Update Q(s,a) based on this experience as follows:
        Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]    (20)
    s ← s'
Until termination
Algorithm 3: Q-learning
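A compact tabular implementation of Algorithm 3 in Python is sketched below. The `env.reset()`/`env.step(a)` interface and the constant exploration and learning-rate values are assumptions made only for this sketch; in particular, the convergence conditions listed above require the learning rate to be decayed over time.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular single-agent Q-learning (Algorithm 3, update rule of Eq. 20)."""
    Q = defaultdict(float)                             # lookup-table representation of Q(s, a)
    for _ in range(episodes):
        s = env.reset()                                # hypothetical environment interface
        done = False
        while not done:
            # Epsilon-greedy: explore a random action or exploit the current Q estimate.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)              # observe reward r and next state s'
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # sample backup of Eq. 20
            s = s_next
    return Q
```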
Section 4 extends the single-agent Q-learning algorithm to the cooperative multi-agent setting; the new algorithm will be called Extended-Q. In Section 5 Extended-Q is further modified to manipulate some weakly
competitive MASs as well as cooperative ones.
3. Multi-agent Systems
3.1 Definitions
Definition 7: Agent: An agent can be defined as an entity, such as a robot, a component of a robot, or a
software program, with sensation, actions, and goals, situated in a dynamic, stationary and Markovian
environment, and is capable of acting in a self-interested autonomous fashion in this environment. The way he
acts is called his behavior.
Definition 8: Multi-agent Systems: Multi-agent systems (MASs) is the emerging sub-field of AI that views
intelligence and the appearance of intelligent behavior as the result of a complex structural arrangement of
several autonomous independent interacting agents [2].
Definition 9: Agent Policy or Plan: A policy is a set of decision rules, one for every state the agent may encounter. The policy is said to be non-deterministic if the decision rules are probability distributions over the
agent's actions. If all the probability distributions assign a probability one to just one action then the policy is
said to be deterministic.
Definition 10: Rationality: An agent in a multi-agent system is said to behave rationally if he always follows
an optimal policy given the stationary joint policy of the other agents. (A stationary policy is a policy that does
not change over time.)
Definition 11: Ideal Optimality: The ideal optimal value of an agent in a multi-agent system is the maximum possible payoff the agent can gain. An agent in a multi-agent system is said to behave ideally optimally, or achieve ideal optimality, if, when interacting for an infinite amount of time (repeated initiation and termination), he can always gain the ideal optimal value or, in other words, he can always reach the terminal goal state and reaches it in the minimum possible (finite) number of steps.
The notion of a state in Definition 9 means the state of the external environment as defined in Section 2.
Precise definitions of what is meant by an agent, an environment, the interface between them, and agent's policy
and goal were given in Section 2. The current paper assumes that all agents in a MAS are rational. The
property of being rational will be elaborated in terms of the Nash equilibrium concept in Section 4. The current
paper also assumes that any sort of communication or negotiation among the agents is forbidden. An important
note from Definition 11 is that gaining maximum payoff is equivalent to reaching the terminal goal state in the
minimum number of steps. MASs can be classified according to their joint ideal optimality and joint policy.
This classification is shown in Fig. 2.
Figure 2: A taxonomy of MASs
The main characteristic of cooperative MASs is that all agents can together achieve ideal optimality if and
only if they are rational, and they can do that by following a stationary joint deterministic policy. This implies
that rationality is compatible with ideal optimality in cooperative MASs. Strongly cooperative MASs are
cooperative MASs which have the following property: if at least one agent is not rational, then none of the
agents in the system can achieve his ideal optimality. In other words, if agent i is achieving his ideal optimality,
this implies that all other agents are also achieving their ideal optimality. They all have the same payoff
function, or at least the payoff functions are directly proportional. Mostly, the agents in this class work as a team
to achieve a single global goal. This class of MASs naturally arises in task distribution [4]. For example, a user
might assign some number of autonomous mobile robots, or perhaps software agents, to some task, all of which
should share the same payoff function (namely that of the user) [4]. A company or an organization may also be
modeled as a strongly cooperative MAS. Another example is the soccer team where all agents have the shared
goal of winning the game.
Weakly cooperative MASs are cooperative MASs which have the following property: there exists a joint
policy by which some agents can achieve their ideal optimality while the others cannot. All agents are strongly
cooperative only over a portion of their joint action space, while there may be conflicts between two or more
agents over other portions. Said another way, the payoff functions are identical or at least directly proportional
only over a portion of the agents' joint action space. Through this portion all agents can achieve ideal optimality.
Mostly each agent in this class has his own independent goal to achieve but there is no conflict between the
goals of different agents. An example is the traffic system where each vehicle is an agent with the goal of
reaching his destination safely.
The main characteristic of competitive MASs is that, given that all agents are rational, one or more agents cannot achieve ideal optimality because of the inherent conflict between some or all of the agents in this class. For agent i to behave ideally optimally, the cost will be paid by other agents who must sacrifice their payoffs, and this cannot happen since the agents are rational. This means that rationality is not compatible with ideal optimality in competitive MASs. So all the agents can do is achieve real optimality.
Definition 12: Real Optimality: The real optimal value of an agent in a multi-agent system is the minimum
level of expected payoff the agent guarantees given that all agents are rational. The real optimal value of the
agent is less than or equal to his ideal optimal value. An agent in a multi-agent system is said to achieve real optimality if, when interacting for an infinite amount of time (repeated initiation and termination) and given that all other agents are rational, he can always, by acting rationally, guarantee a minimum level of expected payoff or, in other words, he guarantees reaching the terminal goal state a minimum expected number of times and reaches it in the minimum expected number of steps.
Weakly competitive MASs are competitive MASs which have the following property: all agents can
achieve real optimality by following a stationary deterministic joint policy. Examples are soccer or chess games
that end in draw between the two teams or players. Strongly competitive MASs are competitive MASs which
have the following property: all agents can achieve real optimality only by following a stationary non-deterministic joint policy. An example is a basketball game, a tennis game, or a ping pong game, where one of
two teams must win. Another example is the soccer world cup championship where only one of many teams
wins the cup while the others lose it. This paper mainly studies cooperative MASs which will be defined
formally in Section 4. In Section 5 some aspects of weakly competitive MASs are investigated.
3.2 Planning in MASs
One important class of problems in MASs is that of multi-agent planning (or multi-agent sequential decision
making).
Definition 13: Single-Agent Planning: Planning for a single agent (single-agent sequential decision making)
is a process of constructing an optimal policy with the objective of reaching some terminal goal state given the
agent's capabilities and the environmental constraints [37,31].
Definition 14: Multi-agent Planning: Multi-agent planning (multi-agent sequential decision making) consists
of multiple single-agent planning problems; however, each agent must also consider the constraints that the
other agents' activities place on an agent's choice of actions, the constraints that an agent's commitments to
others place on his own choice of actions, and the unpredictable evolution of the environment caused by other
unmodeled agents.
The key aspect of multi-agent planning is coordinating the actions of the individual agents so that the goals
are achieved efficiently [4]. Without coordination each agent ignores the existence of other agents and as a
result will plan and act separately as if it were a single-agent system. The resulting joint plan in this case may be
suboptimal or even diverge from the desired goals. Since all agents are rational, they are interested in this
coordination since jointly optimal action is individually optimal for each agent [4]. Section 2 introduced
reinforcement learning for single agent systems. It described the Q-learning algorithm. Section 4 extends this
algorithm to the cooperative multi-agent setting. Section 5 further extends it to manipulate some weakly
competitive MASs as well as cooperative MASs.
4. Extended-Q Algorithm
4.1 Single-Agent MDP in MASs
One of the conditions that must hold for the single-agent Q-learning to converge is that the environment can be
modeled as a Markov Decision Process. This implies that the environment dynamics must be stationary. That is
the transition and reward functions do not change over time. These functions are functions of the environment
state as perceived by the agent, for example his position within the environment, and the agent's action, for
example the movement in any one of the four directions east, west, north, and south. This condition is very well
satisfied in single-agent systems. In a multi-agent system, however, each agent is an active, learning, adaptable entity that can affect the environment in an arbitrary way. So the transition and reward functions become functions of an environment state that includes the positions of all agents within the environment and the joint action of all agents. This new fact turns
the environment non-stationary to each agent. To recover stationarity each agent must be aware of the existence
of other agents and coordinate his action with them. Hence the single-agent Q-learning algorithm (henceforth called Normal-Q) must be modified.
There exist multiple levels of awareness an agent may have about the other agents. This paper proposes four
levels of awareness where each level includes all the preceding ones plus an additional feature of its own. These
levels are as follows: Level-1: Each agent considers the other agents as part of the environment state. For the
example given above, the environment state will consist of all agents' positions within the environment. Level-2: Each agent considers the payoff functions of the other agents. This implies that each agent considers the
actions available to the other agents. Level-3: Each agent models the behaviors of the other agents. This means
that each agent tries to model or predict the learned plans of the others. Level-4: Each agent tries to model the
internal state of the other agents.
The absence of awareness will be referred to as level-0. Level-4 may be needed when the agents within the
same MAS use different learning algorithms. The work in this paper assumes that all agents use the same
learning algorithm so level-4 is ignored. Level-3 may be needed in two cases: (1) some agents are not rational.
In such case the best response of the remaining agents may not be the rational behavior. However, to identify
this irrationality, agents' behaviors must be modeled. (2) All agents are rational, but there may exist multiple
optimal joint plans to follow. Assuming no communication, there must be a way so that all agents could
coordinate on the same joint plan. The only way to do this is through modeling each other's behavior.
Behavior modeling is called fictitious play in the terminology of Game Theory [6,9,26,12]. This paper
assumes that all agents are rational (with little deviation in Section 5) so the first case is irrelevant. The problem
that appears in the second case is solved by using a pre-imposed lexicographic convention so that when there are
multiple plans each agent knows exactly (assuming cooperative MASs and some weakly competitive MASs
which are investigated in Section 5) which plan to choose without the need to model others' behaviors. So level-3 is ignored. The work done in this paper considers only level-2 of awareness.
Given the above discussion, it is conjectured that: Normal-Q algorithm, which assumes level-0 awareness,
is not guaranteed to converge in multi-agent systems so it should not be used in such setting. This conjecture is
further justified through the following example.
4.1.1 An illustrative example
Consider a MAS with two agents A and B. Each uses the Normal-Q algorithm to learn his optimal policy. This
means that each agent is unaware of the other one (level-0 awareness). Consider the situation depicted in Fig. 3.
At a certain moment in time the environment becomes in state s. Agent A decides to take action a1 while B takes
action b1. This joint choice of actions turns out to be beneficial (as immediate and future consequence) to agent
A. So A updates his Q-entry for the state-action pair (s,a1) accordingly. At a later moment in time the
environment returns back to state s, agent A decides to exploit a1, which previously proved to be good, while B
decides to explore a new action b2. Unfortunately, this joint choice of actions proves (as immediate and future
consequence) to be very harmful to A: he receives high penalties. Agent A responds by changing the same Q-entry for (s,a1) accordingly. This oscillation of Q(s,a1) continues as A is not aware of what B is doing. The
overall consequence is that the Q-function will oscillate leading to the oscillation of the agents' policies and
hence no convergence occurs.
Figure 3: Agent A is confused about the significance of his action a1
As mentioned before the essential difference between planning in MASs and in single-agent systems is the
issue of coordination. Without coordination, the agents may converge to sub-optimal plans or even diverge.
Coordination requires that each agent has at least level-2 of awareness, that is be aware of the actions of others.
Normal-Q algorithm lacks this level of awareness.
4.2 Multi-agent Markov Decision Process (MMDP)
Naturally, the first thing to start with is to try to formulate the multi-agent planning problem. To accomplish this, the MDP model of single-agent systems is extended to the multi-agent setting. The extended model is called
Multi-agent Markov Decision Process (MMDP). It is defined as follows:
Definition 15: Multi-agent Markov Decision Process: A multi-agent Markov decision process (MMDP) is a 5-tuple (n, S, A, T, R) where n is the number of agents. S is a discrete state space (a finite set of environment states). A is the agents' joint action space, A = A_1 × A_2 × … × A_n, where A_i is the set of actions (also called strategies) available to agent i. A may also be called the set of joint action (strategy) profiles. T is the state transition function T: S × A × S → [0,1]. It defines a probability distribution over next states as a function of the current state and the agents' joint action. T can also be defined as T: S × A → PD(S), where PD(S) is the set of discrete probability distributions over the state space S. R is a set of reward functions {R_i}_{i=1}^{n} for all agents, where R_i: S × A → ℝ (ℝ is the set of real numbers). It defines the expected immediate reward received by agent i when all agents select a joint action from the given state. R can also be defined as one global reward function R: S × A → ℝ^n, where the i-th component of the reward vector defines the expected immediate reward received by agent i.
The agents-environment interaction is depicted in Fig. 4. At time t the agents sense the environment state s_t, and accordingly take the joint action (a_t^1, a_t^2, …, a_t^i, …, a_t^n). The environment responds with the new state s_{t+1} and the reward vector (r_{t+1}^1, r_{t+1}^2, …, r_{t+1}^i, …, r_{t+1}^n).
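For concreteness, the 5-tuple of Definition 15 can be written down directly as a container; the field layout below is only one possible encoding, chosen for this sketch (the paper itself does not prescribe a representation):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

@dataclass
class MMDP:
    """Multi-agent Markov Decision Process (n, S, A, T, R) of Definition 15."""
    n: int                                            # number of agents
    S: Sequence                                       # finite state space
    A: Sequence[Sequence]                             # A[i] is the action set of agent i; the joint space is their product
    T: Callable[[object, Tuple], Dict]                # T(s, joint_action) -> {s': probability}
    R: Callable[[object, Tuple], Tuple[float, ...]]   # R(s, joint_action) -> reward vector, one entry per agent
```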
4.3 Solving Multi-agent Markov Decision Process
Solving MMDPs consists of finding a joint stationary policy or plan π: S → PD(A), where PD(A) is the set of
discrete probability distributions over the joint action space A. This mapping is such that each agent can achieve
his goal.
Figure 4: The agents-environment interaction
The solution concept of the MMDP planning problem assumed in this paper combines ideas from
two fields: (1) Reinforcement Learning: where the multi-agent planning problem is a multi-step model-free problem, so incremental experience-based learning is required. (2) Game Theory: where the concept
of matrix games is used to reformulate the multi-agent planning problem and to provide a new meaning
to the Q-function. The Nash equilibrium solution concept of matrix games provides the essence of the
agents' joint action selection at each state. RL has already been introduced in Section 2. The next section
introduces matrix games.
4.4 Matrix Games (MGs)
Definition 16: Matrix Game: A matrix game is a 3-tuple (n, A_{1…n}, P_{1…n}) where n is the number of agents (players). A_i is the action space of agent i (the set of actions, also called strategies, available to agent i). A = A_1 × A_2 × … × A_n is the joint action space. P_i is agent i's payoff function P_i: A → ℝ (where ℝ is the set of real numbers). It represents the utility of agent i given all agents' actions. It is assumed that all agents act simultaneously.
The agents select actions from their available sets with the goal of maximizing their payoffs which depend on
all agents' actions. These are often called matrix games since the Pi functions can be written as n-dimensional
matrices [6]. A matrix game can be modeled as an MMDP with the state space S containing only one state and
the reward function Ri of agent i corresponding to the payoff function Pi of the same agent. (If the game is
played repeatedly, then maximizing the immediate reward is the same as maximizing the total cumulative
discounted reward since the single state makes transitions only to itself.) So matrix games are single-state
MASs, therefore the taxonomy of MASs can be applied to them. The ideal optimal value for agent i in a matrix
game is defined as follows:
P_i^* = \max_{a \in A} P_i(a)    (21)
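Eq. 21 can be evaluated by simple enumeration once the payoff functions are tabulated. The sketch below does this for an arbitrary finite matrix game and checks it on the coin matching game of the next subsection (the helper names are illustrative):

```python
from itertools import product

def ideal_optimal_values(action_sets, payoff):
    """Eq. 21: P_i* = max over joint actions a in A of P_i(a).
    payoff(a) returns the tuple of payoffs, one entry per agent, for joint action a."""
    joint_actions = list(product(*action_sets))
    return [max(payoff(a)[i] for a in joint_actions) for i in range(len(action_sets))]

# Coin matching game (Section 4.4.1): agent 1 wins a dollar on a match, loses one otherwise.
coin_payoff = lambda a: (1, -1) if a[0] == a[1] else (-1, 1)
print(ideal_optimal_values([['H', 'T'], ['H', 'T']], coin_payoff))   # [1, 1]
```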
The following subsections give representative examples for the different classes of matrix games.
4.4.1 The coin matching game
There are two agents (n = 2), each has two available actions: head or tail, that is A1 = A2 = {H,T}. If both agents
make the same choice, (H,H) or (T,T), then agent 1 wins a dollar from agent 2. If they are different then agent 1
loses a dollar to agent 2. The payoff matrices for both agents are shown in Fig. 5. The first component of the
ordered pairs represents the payoff of agent 1 (the row agent) while the second component represents the payoff
of agent 2 (the column agent) [6]. This game is an example of a competitive matrix game and specifically
strongly competitive as will be shown later (strongly competitive single-state MAS). The ideal optimal value of
the game for each agent, as defined by Eq. 21, is 1 but there is no joint action profile (a joint deterministic
policy) that can achieve this ideal optimal value for both agents.
Figure 5: Agents' payoffs in the coin matching game
4.4.2 The prisoner's dilemma
Two well-known crooks are captured and separated into different rooms. The district attorney knows he does
not have enough evidence to convict on the serious charge of his choice, but offers each prisoner a deal. If just
one of them will turn state's evidence (i.e. rat on his confederate), then the one who confesses will be set free,
and the other sent to jail for the maximum sentence. If both confess, they are both sent to jail for the minimum
sentence. If both exercise their right to remain silent, then the district attorney can still convict them both on a
very minor charge. Fig. 6.a shows the payoff matrix of this situation.
Figure 6: The prisoner's dilemma matrix game, a) a particular instantiation, b) the general form
Here n = 2 and A_1 = A_2 = {¬C, C} (for not confess and confess, respectively). In the payoff matrix in Fig. 6.a, the
payoffs are taken such that the most disagreeable outcome (the maximum sentence) has value 0, and the next
most disagreeable outcome (minimum sentence) has value 1. Then being convicted on a minor charge has value
3, and being set free has value 4. The prisoner's dilemma game belongs to a general subclass of matrix games
that have the payoff structure shown in Fig. 6.b where y > x > v > u [29]. All matrix games in this subclass are
competitive and specifically weakly competitive as will be shown later (weakly competitive single-state MAS).
The ideal optimal value of the game for each agent, as defined by Eq. 21, is y but there is no joint action profile
(a joint deterministic policy) that can achieve this optimal value for both agents.
An interesting property about the prisoner's dilemma subclass of matrix games is that the action a2 for the
row agent and b2 for the column agent are dominating actions (y > x and v > u), so each agent may choose to
play this action without thinking about what the other agent will do or what his payoffs are. The resulting joint
action (a2,b2) will yield a payoff v to each agent. Later in this Section this joint action profile is shown to
represent the only possible rational play for both agents.
4.4.3 Cooperative games
Consider the matrix games shown in Fig. 7. In both games n = 2, A1 = {a1,a2}, and A2 = {b1,b2}. Both agents can
together achieve their ideal optimal value 5 in both games by adopting the deterministic joint strategy profile
(a1,b1). This strategy profile is rational as will be shown later. So these two examples represent cooperative
matrix games.
Figure 7: Cooperative games, a) strongly cooperative, b) weakly cooperative
Both agents' payoff functions shown in Fig. 7.a are directly proportional over the whole joint action space
(actually they are the same), so it is a strongly cooperative matrix game. This is not the case in Fig. 7.b, which
represents a weakly cooperative matrix game.
4.5 Nash Equilibrium Solution of Matrix Games
Solving a matrix game means the computation of optimal strategies for all agents. An agent's strategy is called a
pure strategy if he chooses his action deterministically, whereas an agent's strategy is called a mixed strategy if
he selects his action non-deterministically according to some probability distribution over his action space. The
Nash equilibrium concept is a formal definition of the assumption followed throughout this paper that all agents
in a MAS are rational. In this section the definition is given for a special type of MASs: matrix games. In the
next section the Nash equilibrium concept is extended to cooperative MASs by decomposing them into
interdependent matrix games.
Definition 17: The Best-Response Function: For a matrix game, the best-response function for agent i, BR_i(σ_{-i}), is the set of all, possibly mixed, strategies for agent i that are optimal given that the other agents choose the possibly mixed joint strategy σ_{-i} [6].
Definition 18: Nash Equilibrium: A Nash equilibrium (best-response equilibrium) is a collection of strategies (σ_1, …, σ_i, …, σ_n), called a strategy profile, for all n agents with:
\forall i: \sigma_i \in BR_i(\sigma_{-i})    (22)
, that is, each agent's strategy is a best response to the others' joint strategy.
Nash equilibrium can also be defined more elaborately in terms of the payoff functions as follows:
Definition 19: Pure Nash Equilibrium: An n-tuple of pure strategies (σ_1^*, …, σ_i^*, …, σ_n^*) is a pure Nash equilibrium strategy profile if:
\forall i, \forall \sigma_i \in A_i: \quad P_i(\sigma_1^*, \ldots, \sigma_i^*, \ldots, \sigma_n^*) \geq P_i(\sigma_1^*, \ldots, \sigma_i, \ldots, \sigma_n^*)    (23)
Definition 20: Mixed Nash Equilibrium: An n-tuple of mixed strategies (σ_1^*, …, σ_i^*, …, σ_n^*) is a mixed Nash equilibrium strategy profile if:
\forall i, \forall \sigma_i \in \Sigma_i: \quad \sum_{a^1 \in A_1} \cdots \sum_{a^n \in A_n} \sigma_1^*(a^1) \cdots \sigma_i^*(a^i) \cdots \sigma_n^*(a^n) \, P_i(a^1, \ldots, a^n) \;\geq\; \sum_{a^1 \in A_1} \cdots \sum_{a^n \in A_n} \sigma_1^*(a^1) \cdots \sigma_i(a^i) \cdots \sigma_n^*(a^n) \, P_i(a^1, \ldots, a^n)    (24)
, where Σ_i is the set of all mixed strategies of agent i, that is, the set of all discrete probability distributions over the action space of agent i.
So a Nash equilibrium is a collection of strategies for all agents such that each agent's strategy is a best-response
to the other agents' strategies. So no agent can do better by changing his strategy given that the other agents
continue to follow the equilibrium strategy. What makes the notion of equilibrium compelling is that all matrix
games have a Nash equilibrium, although there may be more than one.
Theorem 4.1: Given any finite n-player matrix game, there exists at least one Nash equilibrium strategy profile
(Nash 1951 Cited in [31]).
Finite games are games where the action spaces of all agents are finite. It is very important to note that a pure
Nash equilibrium can be thought of as a mixed Nash equilibrium where each one of the probability distributions
that make up the joint strategy profile assigns a probability 1 to just one action. But in this paper a clear
distinction between the notions of pure and mixed (or deterministic and non-deterministic) is crucial. This
means that a mixed profile is strictly non-deterministic. Unless otherwise stated, for the rest of this Section
assume that for any two-agent matrix game the row agent is called agent A and the column agent is called agent
B. The following subsections identify the nature of the Nash equilibrium solution concept in each subclass of
matrix games by applying it to the examples given in Section 4.4.
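Definition 19 immediately suggests a brute-force test for pure equilibria in small finite games. The helper below is such a sketch (it is not an algorithm from the paper) and reproduces, for the payoffs of Fig. 6.a, the prisoner's dilemma result discussed in the next subsection:

```python
from itertools import product

def pure_nash_equilibria(action_sets, payoff):
    """Enumerate all pure Nash equilibria of a finite matrix game (Definition 19).
    payoff(a) returns the payoff vector for joint action a."""
    equilibria = []
    for a in product(*action_sets):
        stable = True
        for i, A_i in enumerate(action_sets):
            # Agent i's best payoff among unilateral deviations from a.
            best = max(payoff(a[:i] + (alt,) + a[i + 1:])[i] for alt in A_i)
            if payoff(a)[i] < best:               # a profitable deviation exists
                stable = False
                break
        if stable:
            equilibria.append(a)
    return equilibria

# Prisoner's dilemma with the payoffs of Fig. 6.a: the only pure equilibrium is (confess, confess).
pd = {('NC', 'NC'): (3, 3), ('NC', 'C'): (0, 4), ('C', 'NC'): (4, 0), ('C', 'C'): (1, 1)}
print(pure_nash_equilibria([['NC', 'C'], ['NC', 'C']], lambda a: pd[a]))   # [('C', 'C')]
```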
4.5.1 Nash equilibrium in competitive games
Fig. 8.a shows the payoff matrix of the coin matching game, a competitive matrix game introduced above.
Clearly, there is no pure Nash equilibrium solution to this game. But according to Theorem 4.1, there must be at
least one mixed Nash equilibrium profile. One such profile is shown in Fig. 8.a which is ((0.5,0.5),(0.5,0.5)). To
verify that this profile is equilibrium assume that both agents adopt the joint profile ((p,1-p),(0.5,0.5)). Agent A
wants to find the best value for p so as to maximize his expected payoff. The expected payoff of A given this
joint profile is:
EP( A)  0.5 p  0.5 p  0.5(1  p)  0.5(1  p)  0
So whatever the value of p is, agent A can achieve his maximum expected value of zero, so A's mixed
strategy A = (0.5,0.5) is a best response to B's mixed strategy B = (0.5,0.5). In the same way B = (0.5,0.5) can
be shown to be a best response to A = (0.5,0.5). So the mixed joint strategy profile ((0.5,0.5),(0.5,0.5)) is a
Nash equilibrium profile.
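The best-response argument above can also be checked numerically; the snippet below evaluates EP(A) for several values of p against σ_B = (0.5, 0.5) and shows it is always zero (the sampled values of p are chosen only for illustration):

```python
def coin_matching_ep_A(p, q=0.5):
    """Expected payoff of agent A when A plays (p, 1-p) and B plays (q, 1-q).
    Payoffs to A: +1 on (H,H) and (T,T), -1 on (H,T) and (T,H)."""
    return p * q * 1 + p * (1 - q) * (-1) + (1 - p) * q * (-1) + (1 - p) * (1 - q) * 1

for p in (0.0, 0.25, 0.5, 1.0):
    print(p, coin_matching_ep_A(p))    # EP(A) is 0.0 for every p when q = 0.5
```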
Figure 8: Nash equilibrium in competitive games, a) the coin matching, b) the subclass of prisoner's dilemma.
For this game, although both agents are rational they cannot achieve their ideal optimality. So the coin
matching game is a competitive matrix game. All each agent can do is achieve a real optimal value (a minimum expected level) of 0, and he can do it only by following a mixed equilibrium profile (there are no pure equilibrium profiles), so it is a strongly competitive matrix game. Fig. 8.b shows the general payoff matrix of the
prisoner's dilemma game (y > x > v > u). There is only one pure Nash equilibrium solution: (a2,b2). This rational
profile cannot achieve the ideal optimal value y to any agent. So this game is a competitive matrix game.
However, through this pure equilibrium profile each agent guarantees a real optimal value of v. So the subclass
of the prisoner's dilemma matrix games is weakly competitive. An interesting question is whether the prisoner's
dilemma matrix game has a mixed Nash equilibrium strategy profile. To examine this, assume that both agents
adopt the mixed profile ((p,1-p),(q,1-q)). Agent A wants to find the best value for p so as to maximize his
expected payoff. The expected payoff of A given this joint profile is:
EP(A) = pqx + p(1-q)u + q(1-p)y + (1-p)(1-q)v
      = pq(x - u - y + v) + p(u - v) + q(y - v) + v
      = p(qx - qu - qy + qv + u - v) + q(y - v) + v
      = p[q(x - y) + (1-q)u - (1-q)v] + q(y - v) + v
      = p[q(x - y) + (1-q)(u - v)] + q(y - v) + v    (25)
The only term where p appears in Eq. 25 is the first term where it is a factor. This term is negative since y >
x and v > u. Agent A's goal is to maximize this term by playing with p. Since p ∈ [0,1], the best value for p is clearly zero, which eliminates the negative effect of this term. Following the same line of reasoning it can be shown that, given the profile ((p,1-p),(q,1-q)), the best value that agent B can assign to q is also zero. The above results lead to the following conclusions: (1) There do not exist any mixed Nash equilibrium strategy profiles in the subclass of the prisoner's dilemma matrix games. (2) Any agent can always do better by playing his second action (a2 for agent A and b2 for agent B) regardless of the strategy of the other agent. This is self-justified since, as pointed out above, the second action for each agent is a dominating action.
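Both conclusions can be checked numerically for a concrete instantiation with y > x > v > u, here the values of Fig. 6.a (y = 4, x = 3, v = 1, u = 0); the grid search below is purely illustrative:

```python
def pd_ep_A(p, q, x=3, u=0, y=4, v=1):
    """Expected payoff of the row agent in the prisoner's dilemma (Eq. 25)."""
    return p * q * x + p * (1 - q) * u + (1 - p) * q * y + (1 - p) * (1 - q) * v

# For any fixed strategy q of agent B, EP(A) is maximized at p = 0, i.e. A plays a2 with probability 1.
for q in (0.0, 0.3, 0.7, 1.0):
    best_p = max((p / 10 for p in range(11)), key=lambda p: pd_ep_A(p, q))
    print(q, best_p)    # best_p is always 0.0
```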
Some interesting questions arise about the class of competitive matrix games or competitive MASs in
general. For example, if in a competitive matrix game there are multiple Nash equilibrium strategy profiles,
which profile(s) is (are) the best in terms of giving the highest real optimal value for all agents? How can such profile(s) be found? And how can all agents coordinate on the same one of these profiles? Another important question is:
is the Nash equilibrium concept (rational agents) the best solution concept for solving competitive matrix
games? The prisoner's dilemma subclass of games shows a weakness in this concept. The non-Nash equilibrium
profile (a1,b1) gives a better real optimal value x for both agents. Actually, such profiles as (a1,b1) are called
Pareto Optimal. Pareto optimality is an evaluative principle according to which: the whole community of agents
becomes better off if one agent becomes better off and none worse off. Other interesting questions are
specifically about the weak competitive subclass: can there be any mixed Nash equilibrium strategy profile in a
weakly competitive matrix game? As shown above, the prisoner's dilemma subclass of games does not have
mixed equilibria. In Section 5 another example of a weakly competitive game is shown to have only pure
equilibria. If there is any mixed equilibrium in weakly competitive games, may it give a better real optimal
value for all agents? All of these questions need investigation.
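To make the notions of pure Nash equilibrium and Pareto optimality used above concrete, the following minimal Python sketch (not part of the paper's algorithms) enumerates both sets for a two-agent matrix game. The payoff values used (y = 5, x = 3, v = 1, u = 0) are hypothetical numbers chosen to satisfy the prisoner's dilemma ordering y > x > v > u:

PA = [[3, 0],   # PA[i][j]: payoff of agent A when A plays a(i+1) and B plays b(j+1)
      [5, 1]]
PB = [[3, 5],   # PB[i][j]: payoff of agent B for the same joint action
      [0, 1]]

def pure_nash(PA, PB):
    # joint actions where neither agent can gain by a unilateral deviation
    eq = []
    for i in range(2):
        for j in range(2):
            a_best = all(PA[i][j] >= PA[k][j] for k in range(2))
            b_best = all(PB[i][j] >= PB[i][l] for l in range(2))
            if a_best and b_best:
                eq.append((i, j))
    return eq

def pareto_optimal(PA, PB):
    # joint actions not dominated by a profile that is better for one agent and no worse for the other
    opt = []
    for i in range(2):
        for j in range(2):
            dominated = any(PA[k][l] >= PA[i][j] and PB[k][l] >= PB[i][j]
                            and (PA[k][l] > PA[i][j] or PB[k][l] > PB[i][j])
                            for k in range(2) for l in range(2))
            if not dominated:
                opt.append((i, j))
    return opt

print(pure_nash(PA, PB))       # [(1, 1)]  -> (a2,b2), the only pure equilibrium
print(pareto_optimal(PA, PB))  # includes (0, 0) -> (a1,b1), which is not an equilibrium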
4.5.2 Nash equilibrium in cooperative games
Fig. 9 shows Nash equilibrium solutions in a strongly cooperative matrix game. As shown in Fig. 9.a, there are
multiple pure strategy profiles that can easily be verified to be Nash equilibria; these are: (a1,b1), (a2,b2), and
(a3,b3). The first and third profiles give the ideal optimal value of the game, 5, to both agents, while the second
profile gives the real optimal value of 1 to both agents. On the other hand, Fig. 9.b shows a mixed equilibrium
profile: ((0.5,0.0,0.5),(0.5,0.0,0.5)). To verify that this profile is an equilibrium, assume that both agents adopt the
profile ((p1,p2,p3),(0.5,0.0,0.5)). Agent A wants to find the best values for p1, p2, and p3 so as to maximize his
expected payoff. The expected payoff of A given this joint profile is:
EP(A) = 2.5(p1 + p3)
(26)
Figure 9: Equilibrium solutions in a strongly cooperative matrix game, a) pure equilibria, b) mixed equilibrium.
So for A to maximize his expected payoff, p1 + p3 must equal its maximum value of 1. So A's mixed
profile π_A = (p,0.0,1-p) is a best response to B's mixed profile π_B = (0.5,0.0,0.5). Hence π_A = (0.5,0.0,0.5) is a
best response to B's mixed profile π_B = (0.5,0.0,0.5). In the same way, π_B = (0.5,0.0,0.5) can be shown to be a
best response to π_A = (0.5,0.0,0.5). Hence, the joint mixed strategy profile ((0.5,0.0,0.5),(0.5,0.0,0.5)) is a Nash
equilibrium with value 2.5, which is a real optimal value.
Note that both agents in the game shown in Fig. 9 can achieve their ideal optimality if and only if they are
rational, and they can do so by following a joint pure equilibrium profile. So it is a cooperative matrix game. If
one agent is not rational, that is, he does not follow any equilibrium profile, then neither agent can achieve his ideal
optimality (all ideal optimal profiles for both agents are equilibrium profiles). So it is a strongly cooperative
matrix game.
To complete the discussion of cooperative matrix games, an example of a weakly cooperative game is
shown in Fig. 10. Fig. 10.a shows two pure equilibrium strategy profiles: (a1,b1) and (a3,b3), which can easily be
verified. Both of these pure equilibrium profiles are ideal optimal for both agents with value 5. Fig. 10.b shows
the mixed equilibrium profile ((0.5,0.0,0.5),(0.5,0.0,0.5)). To verify that this profile is an equilibrium, assume that
both agents adopt the profile ((p1,p2,p3),(0.5,0.0,0.5)). Agent A wants to find the best values for p1, p2, and p3 so
as to maximize his expected payoff. The expected payoff of A given this joint profile is:
Figure 10: Equilibrium solutions in a weakly cooperative matrix game, a) pure equilibria, b) mixed equilibrium.
EP(A) = 2.5p1 - 2.5p2 + 2.5p3 = 2.5(p1 - p2 + p3)
      = 2.5[p1 - (1 - p1 - p3) + p3]
      = 2.5[2(p1 + p3) - 1]
(27)
So for A to maximize his expected payoff, p1 + p3 must equal its maximum value of 1. So A's mixed
profile π_A = (p,0.0,1-p) is a best response to B's mixed profile π_B = (0.5,0.0,0.5). Hence π_A = (0.5,0.0,0.5) is a
best response to B's mixed profile π_B = (0.5,0.0,0.5). In the same way, π_B = (0.5,0.0,0.5) can be shown to be a
best response to π_A = (0.5,0.0,0.5). Hence, the joint mixed strategy profile ((0.5,0.0,0.5),(0.5,0.0,0.5)) is a Nash
equilibrium with real optimal value 2.5. Note that both agents in the game shown in Fig. 10 can achieve their
ideal optimality if and only if they are rational. And they can do it by following a joint pure equilibrium profile.
So it is a cooperative matrix game. However, some ideal optimal profiles of one agent are not ideal optimal for
the other. For example, the joint profile (a1,b2) is ideal optimal for agent A only. These profiles are not
equilibria. So it is a weakly cooperative matrix game. It can be concluded that the above two cooperative
matrix games have multiple pure and mixed Nash equilibrium solutions, some of them are ideal optimal for both
agents and the others are real optimal for both agents.
4.6 MGs Taxonomy: A Formal Definition
Since matrix games are a special type of MASs (single-state MASs), the MASs taxonomy given in Section 3 can
be applied to them. In this section each subclass of MGs is defined formally according to the notions given in
Section 3 and the above discussion. Let n be the number of agents involved in a matrix game. Each
agent is identified by an id ∈ L = {1,2,…,n}. Let
∀ i ∈ L: Ti* = { ā ∈ A : ā = argmax_{ā' ∈ A} Pi(ā') }
(28)
that is, Ti* is the set of all ideal optimal joint actions for agent i: the joint actions that enable agent i to receive
his ideal optimal value of the game. Let D be the set of pure Nash equilibrium strategy profiles. Then a
definition of cooperative matrix game can be formulated as follows.
Definition 21: Cooperative Matrix Game: a cooperative matrix game is a matrix game with the following
property:
∃ ā1* ∈ T1*, ∃ ā2* ∈ T2*, …, ∃ ān* ∈ Tn*, ∃ b̄* ∈ D : ā1* = ā2* = … = ān* = b̄*
(29)
that is there is a pure Nash equilibrium strategy profile by which all agents can receive their ideal optimal
values.
Definition 22: Strongly Cooperative Matrix Game: A strongly cooperative matrix game is a cooperative
matrix game with the following property:
30
T1  T2    Tn  T  D
that is all agents have the same ideal optimal strategy profiles. These profiles belong to the set of pure Nash
equilibrium strategy profiles.
Definition 23: Weakly Cooperative Matrix Game: A weakly cooperative matrix game is a cooperative matrix
game with the following property:
∃ i : ∃ j : (i ≠ j) ∧ (Ti* ≠ Tj*)
(31)
that is there are some ideal optimal profiles for at least one agent that are not ideal optimal for at least one of
the other agents.
Definition 24: Competitive Matrix Game: A competitive matrix game is a matrix game with the following
property:
∄ ā ∈ A : (ā ∈ T1*) ∧ (ā ∈ T2*) ∧ … ∧ (ā ∈ Tn*)
(32)
that is there does not exist a common ideal optimal profile by which all agents can achieve their ideal
optimality.
Definition 25: Weakly Competitive Matrix Game: A weakly competitive matrix game is a competitive matrix
game with the following property:
D ≠ ∅
(33)
that is there exists at least one pure Nash equilibrium strategy profile.
Definition 26: Strongly Competitive Matrix Game: A strongly competitive matrix game is a competitive
matrix game with the following property:
D = ∅
(34)
that is there does not exist any pure Nash equilibrium profile. The matrix game has only mixed equilibria.
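Definitions 21-26 translate directly into a mechanical test once the ideal optimal sets Ti* and the set D of pure Nash equilibria are available. The following two-agent Python sketch is an illustrative reading of the taxonomy, not an algorithm from the paper; the payoff matrices PA and PB are assumed to be given as nested lists indexed by the joint action (i, j):

def ideal_set(P):
    # Ti* (Eq. 28): joint actions that give agent i its ideal optimal (maximum) payoff
    best = max(P[i][j] for i in range(len(P)) for j in range(len(P[0])))
    return {(i, j) for i in range(len(P)) for j in range(len(P[0])) if P[i][j] == best}

def pure_equilibria(PA, PB):
    # D: joint actions from which no agent gains by a unilateral deviation
    rows, cols = len(PA), len(PA[0])
    return {(i, j) for i in range(rows) for j in range(cols)
            if all(PA[i][j] >= PA[k][j] for k in range(rows))
            and all(PB[i][j] >= PB[i][l] for l in range(cols))}

def classify(PA, PB):
    T_A, T_B = ideal_set(PA), ideal_set(PB)
    D = pure_equilibria(PA, PB)
    if T_A & T_B & D:                         # Definition 21 (Eq. 29)
        if T_A == T_B and T_A <= D:           # Definition 22 (Eq. 30)
            return "strongly cooperative"
        return "weakly cooperative"           # Definition 23 (Eq. 31)
    if not (T_A & T_B):                       # Definition 24 (Eq. 32)
        return "weakly competitive" if D else "strongly competitive"   # Definitions 25 / 26
    return "common ideal profile exists but is not an equilibrium"

# Example: the coin matching game of Fig. 8.a
PA = [[1, -1], [-1, 1]]
PB = [[-1, 1], [1, -1]]
print(classify(PA, PB))   # -> strongly competitive

The final branch only covers games that fall outside the strict wording of Definitions 21 and 24; it does not occur in the examples discussed here.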
4.7 Solving Cooperative MGs
Solving a cooperative matrix game means finding one and the same ideal optimal pure Nash equilibrium
strategy profile. But there may exist many such profiles, causing the problem of equilibrium selection. All agents
must coordinate on the same equilibrium profile; otherwise the resulting joint action may not be an equilibrium.
For example, in Fig. 9.a if agent A coordinates on the first ideal optimal pure equilibrium (a1,b1), while B
coordinates on the second one (a3,b3), then the actual joint action taken will be (a1,b3), which is not an equilibrium
and each agent receives only 0. The problem of equilibrium selection can be solved by modeling the behavior
(assuming repeated play of the matrix game) of other agents (fictitious play), that is each agent has level-3
awareness. But as mentioned before, this paper assumes that all agents have only level-2 awareness so fictitious
play is not possible. So the problem of equilibrium selection is solved by adopting a pre-imposed convention
called the lexicographic convention [4] which is described in the following subsection.
4.7.1 The lexicographic convention
This is a simple convention that can be applied quite generally. The basis of this convention is that the system
designer gives all agents the ability to identify each other. Three assumptions allow this convention to be
imposed: (1) The set of agents is ordered. (2) The set of actions available to each agent i is ordered. (3) These
orderings are known by all agents.
Given this information, the lexicographic convention works as follows so that all agents can coordinate on
the same pure equilibrium strategy profile: Each agent i extracts all ideal optimal pure equilibrium strategy
profiles, i.e., those profiles that satisfy Eq. 29. In each of these profiles the actions of the lower-order agents come first.
Then each agent i sorts these profiles lexicographically; that is, starting from the actions of the lower-order agents,
the profiles with the lower-order actions come first. Finally, each agent i adopts the ith component of the first
profile as his action. Algorithm 4 outlines the procedure for solving a cooperative matrix game for agent i.
Extract all ideal optimal pure equilibrium strategy profiles (profiles that satisfy Eq. 29).
Sort these profiles lexicographically according to the pre-imposed convention by the system designer.
Adopt the ith component of the first profile, ā1[i], as agent i action decision.
Play action ā1[i].
Algorithm 4: Solving a cooperative MG for agent i
For example, in the cooperative matrix game shown in Fig. 9.a each agent extracts the ideal optimal pure
equilibrium profiles (a3,b3) and (a1,b1), where agent A precedes agent B in his order. Then each agent sorts these
profiles lexicographically yielding the ordered list (a1,b1) and (a3,b3), where action a1 precedes a3 in its order.
Both agents then coordinate on the first equilibrium profile (a1,b1). Agent A adopts a1 as his decision and B
adopts b1 as his decision. Algorithm 4 is extended in Section 5 to handle some weakly competitive matrix games
as well as cooperative ones.
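As a small illustration of Algorithm 4, the following Python sketch (an illustrative reconstruction, not the paper's code) applies the lexicographic convention for two agents, assuming the ideal optimal pure equilibrium profiles have already been extracted and that agents and actions are encoded by their pre-imposed order indices:

def lexicographic_choice(profiles, agent_index):
    # Sort the ideal optimal pure equilibrium profiles lexicographically
    # (lower-order agents first, lower-order actions first) and return
    # this agent's component of the first profile.
    first = sorted(profiles)[0]        # Python tuples already compare lexicographically
    return first[agent_index]

# Fig. 9.a example: ideal optimal pure equilibria (a3,b3) and (a1,b1),
# encoded with 0-based action indices under the ordering a1 < a2 < a3 and b1 < b2 < b3.
profiles = [(2, 2), (0, 0)]
print(lexicographic_choice(profiles, 0))   # 0 -> agent A adopts a1
print(lexicographic_choice(profiles, 1))   # 0 -> agent B adopts b1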
4.8 Cooperative Multi-agent Markov Decision Process (CMMDP)
It is mentioned above that a matrix game can be modeled as a single-state MMDP. This idea can be reversed,
that is a state in an MMDP can be modeled as a matrix game. So an MMDP can be redefined as a set of
interdependent matrix games. This new view of an MMDP is formalized for cooperative MASs as follows.
Definition 27: Cooperative Multi-agent Markov Decision Process (CMMDP): A Cooperative Multi-agent
Markov Decision Process (CMMDP) is a 5-tuple (n, G, A, T, R) where n is the number of agents. G = {Mi}_{i=1}^{m}
is a set of interdependent cooperative matrix games as defined by Eq. 29 (m corresponds to the number of
states, that is |S|). A is the agents' joint action space, A = A1 × A2 × … × An, where Aj is the set of actions
available to agent j. T is the probability transition function between the matrix games, T: G × A × G → [0,1]. It
defines a probability distribution over next matrix games to be played as a function of the current matrix game
and the agents' joint action. R is the joint reward function, R: G × A → ℝ^n. It defines a vector of expected
immediate rewards for all agents when they take a particular joint action in a particular matrix game. Each
matrix game Mi ∈ G is a 3-tuple (n, A_{1…n}, P^i_{1…n}), where Pj^i is agent j's payoff function in MG Mi,
Pj^i : A → ℝ, defined as follows:
Pj^i(ā) = Rj(ā) + γ Σ_{k=1}^{m} T(Mi, ā, Mk) Oj(Mk)
(35)
Oj(Mk) = max_{ā ∈ A} Pj^k(ā)
(36)
γ is the discount factor. Eq. 36 defines the ideal optimal value of matrix game Mk for agent j.
In CMMDP the environment's dynamics or the environment's model consists of: (1) the agents' payoff
functions of the matrix games and (2) the transition probabilities between the matrix games. Both of these
components are assumed to be unknown in the cooperative multi-agent planning problem. The Bellman
optimality equation for Q* has to be modified to satisfy level-2 awareness for cooperative MASs. The new
cooperative Bellman optimality equation for Q* is as follows:
Qj*(si, ā) = Rj(si, ā) + γ Σ_{k=1}^{m} T(si, ā, sk) Vj*(sk)
(37)
Vj*(sk) = max_{ā' ∈ A} Qj*(sk, ā')
(38)
where Qj* is the optimal action-value function of agent j, and Vj* is the optimal state-value function of agent j.
The first parameter of Q* is the environment's state which must now express the environment from the
viewpoint of all agents. The second parameter ā is a vector of all agents' actions. An important remark to make
here is about the similarity between the two equation pairs (35,36) and (37,38). Qj* now has a new
interpretation: it estimates the payoff functions of agent j in the different matrix games of the planning problem.
Likewise, Vj* estimates the values of these matrix games.
4.9 Solving CMMDP
The above definition of CMMDP represents a mathematical framework for cooperative MASs. In cooperative
MASs all agents can achieve their ideal optimality. This class is divided into two subclasses: strongly cooperative MASs
and weakly cooperative MASs. In strongly cooperative MASs all agents have the same, or at least directly proportional,
payoff functions. Mostly, the agents in this class work as a team to achieve a single global goal. This subclass of
MASs naturally arises in task distribution [4]. For example, a user might assign some number of autonomous
mobile robots, or perhaps software agents, to some task, all of which should share the same payoff function
(namely that of the user) [4]. A company or an organization may also be modeled as a strongly cooperative
MAS. Another example is the soccer team where all agents have the shared goal of winning the game. In weakly
cooperative MASs all agents are strongly cooperative only over a portion of their joint action space, while there
may be conflicts between two or more agents over other portions. Said another way, the payoff functions are
identical or at least directly proportional only over a portion of the agents' joint action space. Through this
portion all agents can achieve ideal optimality. Mostly each agent in this subclass has his own independent goal
to achieve, but there is no conflict between the goals of different agents. An example is the traffic system where
each vehicle is an agent with the goal of reaching his destination completely safely.
Solving a CMMDP means finding a global joint policy that is ideal optimal for all agents, that is a joint
policy by which all agents can achieve their ideal optimality. The proposed solution exploits the previous
definition of CMMDP as follows: the global task of finding an ideal optimal joint policy can be decomposed
into finding ideal optimal solutions (ideal optimal joint actions) to the individual local cooperative matrix
games that comprise the CMMDP. Each cooperative matrix game is solved independently (see Section 4.7 for
how to solve cooperative MGs). This greedy approach of solving the joint planning problem in cooperative
MASs is self-justified since the agents' payoffs in any particular matrix game, as defined by Equations 35 and
36, take into account the consequences of future matrix games, that is they make the dependencies between the
different matrix games locally and immediately available. So it is conjectured that the global deterministic joint
policy that is formed by merging the local ideal optimal deterministic joint policies (resulting from solving the
individual local cooperative matrix games according to the Nash equilibrium concept) is ideal optimal for all
agents and at the same time is a global Nash equilibrium. An extended form of Q-learning (Equations 37 and
38) is used to estimate the payoff functions of all agents in the cooperative MGs that comprise the CMMDP.
For every agent j initialize Qj(s,ā) arbitrarily (ā is ordered according to Assumption 1 in the lexicographic
convention)
Repeat
Based on the current state s decide the next action ai by taking one of the following two options:
a) Exploit the current learned estimation of the optimal Q-functions of all agents and solve the
cooperative matrix game M corresponding to the state s. Adopt the ith component of the
solution profile as agent i's action decision.
b) Explore a random action.
Take action ai, and observe the environment's response: next state s' and all agents' rewards rj's.
Update the Q-functions of all agents based on this experience as follows:
∀j: Qj(s, ā) ← Qj(s, ā) + α[ rj + γ Vj(s') - Qj(s, ā) ]
(39)
Vj(s') = max_{ā' ∈ A} Qj(s', ā')
(40)
s ← s'
Until termination
Algorithm 5: The Extended-Q algorithm for agent i
Algorithm 5 outlines the above procedure for solving the planning problem in cooperative MASs with
respect to agent i. This algorithm is called Extended-Q. Note that in strongly cooperative MASs, the payoff (Q-)
functions are the same or at least directly proportional over the whole joint state-action space. As a result, each
agent can estimate only his payoff (Q-) function saving a great deal of storage. In this case each agent observes
only his reward.
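For concreteness, the update step of Algorithm 5 (Equations 39 and 40) can be written for a two-agent lookup-table implementation as in the following sketch; the names (Q, alpha, gamma, extended_q_update) and the dictionary-based tables are illustrative choices, not the paper's code:

from collections import defaultdict

alpha, gamma = 0.999, 0.999
n_agents = 2
# Q[j][(s, a_bar)]: agent j's estimated payoff for joint action a_bar in state s
Q = [defaultdict(float) for _ in range(n_agents)]

def extended_q_update(s, a_bar, rewards, s_next, joint_actions):
    # Apply Eq. 39 and Eq. 40 to every agent after observing one joint transition.
    for j in range(n_agents):
        # Eq. 40: Vj(s') = max over joint actions of Qj(s', a')
        v_next = max(Q[j][(s_next, a)] for a in joint_actions)
        # Eq. 39: Qj(s, a_bar) <- Qj(s, a_bar) + alpha * (rj + gamma * Vj(s') - Qj(s, a_bar))
        td = rewards[j] + gamma * v_next - Q[j][(s, a_bar)]
        Q[j][(s, a_bar)] += alpha * td

# One hypothetical transition in grid game 1: both agents move at the initial state.
joint_actions = [(bA, bR) for bA in "EWNS" for bR in "EWNS"]
extended_q_update(((0, 2), (2, 2)), ("E", "N"), (0.0, 0.0), ((1, 2), (2, 3)), joint_actions)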
5. Experimental Results
5.1 Grid Game 1
5.1.1 Rules of the game
Fig. 11 depicts the game board and agents' possible actions. Two agents, the blue agent B and the red agent R,
start from the lower left and lower right corners, trying to reach their goal cells. The goal cells are at the upper
right and upper left corners. An agent can move only one cell at a time and in four possible directions: AB = AR
= {east, west, north, south}. If the two agents attempt to move into the same cell or try to switch their cells, a
collision occurs and they are bounced back to their cells. The game ends either when at least one agent reaches
his goal cell or when a specified number of time steps expires and no agent succeeds in reaching it. The game
has two settings. The strong cooperation setting where both agents act as a team and gain positive rewards only
when they reach the goal cells simultaneously. And the weak cooperation setting where any agent gains his
positive reward when he reaches his goal cell regardless of the status of the other agent.
Figure 11: Grid game 1
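The transition rule described above can be sketched as follows (weak cooperation setting); the board size, goal-cell locations, axis orientation, and the treatment of moves off the board are illustrative assumptions, since only the collision rule and the rewards are fixed by the description above:

MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, -1), "S": (0, 1)}   # assumed axis orientation

def step(state, joint_action, goals, size=3):
    # state = (lB, lR) with li = (xi, yi); goals = (goal cell of B, goal cell of R)
    (lB, lR), (aB, aR) = state, joint_action

    def move(loc, a):
        x, y = loc[0] + MOVES[a][0], loc[1] + MOVES[a][1]
        return (x, y) if 0 <= x < size and 0 <= y < size else loc   # off-board moves are ignored (assumption)

    nB, nR = move(lB, aB), move(lR, aR)
    if nB == nR or (nB == lR and nR == lB):      # same target cell, or the agents try to switch cells
        return (lB, lR), (-1.0, -1.0), False     # collision: both bounced back and penalized
    rewards = (100.0 if nB == goals[0] else 0.0, 100.0 if nR == goals[1] else 0.0)
    done = nB == goals[0] or nR == goals[1]      # the game ends when at least one agent reaches his goal
    return (nB, nR), rewards, done

# Example call from the initial state ((0,2),(2,2)) with hypothetical goal cells:
# step(((0, 2), (2, 2)), ("E", "N"), goals=((2, 0), (0, 0)))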
5.1.2 The state space
The state space S is defined as a subset of S' (according to the rules mentioned above):
S' = { s | s = (lB, lR), li = (xi, yi), i ∈ {B, R} }
(41)
where li defines the x- and y-coordinates of agent i with respect to the Cartesian coordinate system defined by
the game board itself as shown in Fig. 11. The state transitions are deterministic which means that the current
state and agents' joint action will uniquely determine the next state. This definition of the state space is applied
to the remaining grid game experiments.
5.1.3 The lexicographic convention
All grid game experiments obey the same pre-imposed lexicographic convention which is as follows: (1)
Agents: (B, R). (2) Actions: (East, West, North, South). (3) The above orderings are known by both agents.
These orderings are also applied to the remaining grid game experiments.
5.1.4 Parameters
5.1.4.1 Agents' rewards
• Strong cooperation setting: each agent gets 100 only when both agents reach their goal cells
simultaneously.
• Weak cooperation setting: an agent gets 100 as soon as he reaches his goal cell regardless of
the status of the other agent.
• Each agent gets 0 if reaching other positions without colliding.
• Each agent gets -1 in case of collision.
The reward function as defined above is chosen to be similar to the reward function used in similar experiments
performed in [18].
5.1.4.2 Learning parameters
• Learning rate = 0.999
• Discount factor = 0.999
• Exploration strategy: Simple with exploration probability 0.4
• Initialization: ∀s ∀ā: Q(s,ā) ← 0.0
The learning rate is chosen this high so as to speed up the learning process, and it remains constant during the
whole learning process. Note that keeping the learning rate fixed does not affect the convergence of the Q-function
since the environment is completely deterministic, unlike the case of grid game 3 where a monotonically
decreasing learning rate is necessary to prevent oscillation of the Q-function in a non-deterministic environment.
Generally in planning problems the discount factor is chosen high to reflect the main interest of the agent to
reach a terminal future goal state. So typical values for the discount factor are greater than or equal to 0.8. Here
0.999 is chosen and proved successful for converging to an ideal optimal plan. But it has to be noted that large
discount factors make the distinctions between nearly equal states very minute, and for limited precision these
distinctions may be ignored, causing the agent to get confused and possibly deviate from converging to the ideal
optimal plan. Also an important feature for the success of the generalization techniques (see Section 6) is that
the distinctions between the states are magnified so as to reduce the effect of the induced errors of the function
approximator on the correct estimation of the states values [40]. So the discount factor has to be carefully
chosen.
A simple exploration strategy is chosen with exploration probability 0.4, which proved good for this
configuration of the game. Small values for the initialization of the Q-function are typically preferred, so 0 is
chosen as the initial value for all the experiments in this Section.
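The exploration strategy referred to as "Simple" is understood here as an ε-greedy rule over the exploitation step of Algorithm 5; the following one-function Python sketch (an interpretation, not the paper's code) makes that reading explicit:

import random

def choose_joint_action(solve_matrix_game, joint_actions, state, explore_prob=0.4):
    # With probability explore_prob explore a random joint action,
    # otherwise exploit the greedy solution of the local matrix game at this state.
    if random.random() < explore_prob:
        return random.choice(joint_actions)
    return solve_matrix_game(state)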
5.1.4.3 Experiment parameters
To gain statistically reliable results, convergence is tested over a number of trials where each trial involves two
phases: a learning phase and a test phase. In the learning phase the agents play a specified number of games for
training. Each game is initialized to the configuration shown in Fig. 11. Then a test game is played to verify the
convergence of the algorithm, that is both agents reach their goal cells successfully and in an optimal way.
• Number of trials = 30
• Number of training games in the learning phase of each trial = 5000
• Maximum length of game = 50 (this limit is reached when no agent succeeds in reaching his
goal cell)
Various values for the number of training games and the maximum game length are tried; 5000 and 50,
respectively, are found to be the smallest values that meet the following criterion: both agents converge to the
same ideal optimal plan in all the 30 trials (known through the test games). The number of trials is chosen to be
30 to guarantee that the results obtained are reliable (hinted by theorems in probability theory, in particular
The Central Limit Theorem).
5.1.5 Game solution
Some of the solutions to grid game 1 are shown in Fig. 12. These solutions are applicable to both strong and
weak cooperation settings. Note that the agents' strategies are deterministic and optimal, they reach their goal
cells only in four steps which is the minimum possible number of steps given the agents' capabilities. This
multiplicity of solutions indicates the existence of multiple pure ideal optimal global Nash equilibrium strategies.
In this experiment, in both settings, the agents converge successfully to the joint plan shown in Fig. 12.a in all
trials. This particular solution is chosen due to the nature of the pre-imposed lexicographic convention.
Figure 12: Some solutions to grid game 1
5.1.6 Payoffs at the initial state
To get a better understanding of grid game 1 and its emergent solution, the matrix game corresponding to the
initial state is investigated in both settings of the game.
5.1.6.1 Strong cooperation
Fig. 13 shows the payoff matrix for both agents in the strong cooperative setting at the initial state of the game
((0,2),(2,2)). The ideal optimal value for both agents is 99.7. This value indicates that the ideal optimal path to
win the game for both agents takes only four steps (99.7 = 100 × 0.999^3). All the entries enclosed in rectangles are the pure
Nash equilibrium strategy profiles. There are three ideal optimal equilibrium profiles; these are (sorted
lexicographically): (E,N), (N,W), and (N,N). These profiles correspond to the initial movements shown in Fig.
12. Due to the lexicographic convention, both agents coordinate on the first profile (E,N).
Since the non-equilibrium profile (E,W) leads to a collision which is penalized, it has the lowest value 98.6.
This value is interpreted as follows: 98.6 = -1 + 100 × 0.999^4, where -1 is the penalty of collision and the exponent 4 indicates
that the game will take at least five steps to be won: one step caused by the collision delay and four necessary
steps to reach the goal cells from the current positions of the agents. All other non-equilibrium profiles and non-ideal optimal equilibrium ones lead to a one-step time delay and have value 99.6 = 100 × 0.999^4.
Figure 13: Initial state payoff matrix (strong cooperation setting)
An important note to make here is that the non-equilibrium profiles (except (E,W)) and the non-ideal
optimal equilibrium profiles have the same value 99.6. But there is an important difference between these two
groups of profiles: the first group causes one agent to be ahead of the other while the second group causes the
same delay for both agents. The excess step caused by the first group of profiles is irrelevant since the game in
this setting is strongly cooperative (both agents have the same payoff functions and the same ideal optimal
strategy profiles): both agents act as one team to achieve a common goal, so if one agent is ahead of the other,
sooner or later he will have to delay himself so that both agents arrive at their goal cells simultaneously;
otherwise the game ends with both agents getting nothing.
5.1.6.2 Weak cooperation
Fig. 14 shows the payoff matrix for both agents in the weak cooperation setting. The ideal optimal value for
both agents is 99.7. In this case there are only three pure Nash equilibrium strategy profiles: (E,N), (N,W), and
(N,N). These profiles are also ideal optimal for both agents. Due to the lexicographic convention, both agents
coordinate on the first profile (E,N). The profile (E,W) still has the lowest value 98.6 since it leads to a collision
which is penalized. The ideal optimal profiles shared by both agents are the only pure Nash equilibrium profiles.
Other profiles (except (E,W)) either lead to a delay for both agents or to one agent being one step ahead of the other.
The second case has a value of 99.7 for one agent and 99.6 for the other; that is, it is ideal optimal for only one
agent. This is because the game in this setting is weakly cooperative (both agents' ideal optimal profiles are not
the same): each agent has his own independent goal to achieve (reaching his goal cell) regardless of the status of
the other agent. So if one agent is one step ahead he will go straight to his cell (acting in a self-interested manner),
winning the game, which in turn is lost by the other agent. An analogy to this situation can be drawn from the
traffic system. Suppose the light is red for vehicle A and green for vehicle B, and that vehicle A moves while
B does not; then no crash occurs and A reaches his destination safely. Of course this is not an equilibrium
strategy since there is no guarantee that B will behave so nonsensically.
Figure 14: Initial state payoff matrix (weak cooperation setting)
5.2 Grid Game 2
5.2.1 Rules of the game
Fig. 15 depicts the game board and agents' possible actions.
Figure 15: Grid game 2
Again two agents, the blue B and the red R, are trying to reach their goal cells. A collision occurs if both
agents either try to occupy the same cell or try to switch their cells. In this case, both agents are penalized and
bounced back to their cells. This definition of a collision is central to the level of difficulty of this game. The
gray cells are forbidden cells, that is the agents cannot move into them (if an agent tries to move into a forbidden
cell, he will be bounced back to his original cell). The game ends either when at least one agent reaches his goal
cell or when a specified number of time steps expires and no agent succeeds in reaching it. The game has two
settings. In the first setting, both agents act as a team and gain positive rewards only when they reach their goal
cells simultaneously. This is the strong cooperation version. In the second setting, any agent gains his positive
reward as soon as he reaches his goal cell regardless of the status of the other agent. This will turn out to be a
weak competitive version of the game.
5.2.2 Parameters
5.2.2.1 Agents' rewards
• Strong cooperation setting: each agent gets 100 only when both agents reach their goal cells
simultaneously.
• Weak competition setting: an agent gets 100 as soon as he reaches his goal cell regardless of
the status of the other agent.
• Each agent gets 0 if reaching other positions without colliding.
• Each agent gets -1 in case of collision.
The reward function as defined above is chosen to be similar to the reward function used in similar experiments
performed in [18].
5.2.2.2 Learning parameters
Same as in grid game 1 except the following:
• Exploration strategy: Simple with exploration probability 0.8
This particular choice of the exploration probability is justified in Section 5.2.3.1.
5.2.2.3 Experiment parameters
Same as in grid game 1 except the following:
• Number of training games in the learning phase of each trial = 30000
• Maximum length of game = 200 (this limit is reached when no agent succeeds in reaching
his goal cell)
The choice of these particular values is justified in Section 5.2.3.1.
5.2.3 Game solution
5.2.3.1 Strong cooperation
In the strong cooperation setting both agents act as a team where they get their positive rewards only by
reaching their goal cells simultaneously. In all trials both agents succeed in learning an optimal joint coordinated
plan to reach their goal cells. Fig. 16 depicts this plan. Both agents take ten steps to reach their goal cells, which
is the minimum possible number of steps for this game. The game structure implies that both agents have very
limited choice of good actions. In nearly all cells most of the actions lead to a collision either with the board
boundary, the forbidden cells, or the other agent. This results in both agents being trapped in a small region
of the game board, so many random moves are required to get out of this trap and discover the way to the goal
cells. This explains the high exploration probability of 0.8; this value is the smallest value tried that led to
convergence in all trials. It also explains the huge number of training games required (30,000).
Figure 16: Successful joint plan in grid game 2 (strong cooperation setting)
5.2.3.2 Weak competition
In the second setting of grid game 2, any agent gains his positive reward only when he reaches his goal cell
regardless of the status of the other agent. In all trials of the game both agents converge to the situation shown in
Fig. 17: each agent moves just one step to the horizontally neighboring cell and then remains in this cell
indefinitely. The situation depicted in Fig. 17 will be called a draw since the game ends with both agents getting zero
(neither one reaches his goal cell). If any agent x reaches his goal cell (gaining 100), then x is said to win the
game. If any agent x does not reach his goal cell while the other wins the game, then x is said to lose the game.
These three notions, draw, win, and loss, are used only for the sake of the following discussion, they do not
imply anything about the game structure and solution (for example, both agents may win the game by making a
pre-play agreement).
Figure 17: Weak competition, failure in reaching a successful joint plan
Before analyzing the behavior shown in Fig. 17, it is very instructive to begin by identifying the class of
MASs to which this game belongs. Clearly, the only possible solution to this game is that one agent x moves up
to the cell (2,3) to allow the other agent y to pass through (remember that switching cells and trying to occupy
the same cell result in a collision and both agents are bounced back to their original cells). But agent x who
makes this voluntary move becomes one step behind y who, given that he is rational, has no motivation to wait
(unlike the strong cooperation setting where both agents act as a team) and so he moves straight to his goal
cell (his terminal state), winning the game, which in turn is lost by the other. This implies that both agents cannot
win together (given that communication and negotiation are not allowed). So given that both agents are rational,
at least one of them will not be able to reach his goal cell, that is at least one agent will not be able to achieve his
ideal optimality, so definitely this game is competitive.
But to what subclass of competitive MASs does this game belong? Obviously, the solution of this game
depends on which agent makes the voluntary move, so it may be modeled by the simple matrix game shown in
Fig. 18. The payoffs are the actual rewards received by both agents when the game terminates. V and ¬V stand
for volunteer and not volunteer, respectively. If agent x volunteers and y does not, then y, since he is rational,
eventually wins and receives 100 while x loses and receives 0. If both agents do not volunteer (the emergent
case in the experiment as shown in Fig. 17), then the game is a draw and both agents receive 0. If both try to
volunteer, then a collision occurs and both agents receive -1.
There are three pure Nash equilibrium strategy profiles: (¬V,¬V), (V,¬V), and (¬V,V). It may seem
surprising that the last two profiles are equilibria since they mean that one agent will always win and the other
will always lose. But this can be explained as follows. Assume agent x adopts the strategy not to volunteer; then
the best response from agent y will be either to volunteer or not, it does not matter, since in both cases he does
not win and receives 0 (an important remark to mention here is that the reward taken by the winner is not paid
by the loser; the loser does not receive a negative payoff, he simply gets nothing). But the interesting thing about
this game is that the action ¬V is a dominating action (0 ≥ 0 and 100 ≥ -1) for both agents, so by reducing this
matrix game, allowing only the dominating actions, the resulting matrix game will have only one entry, (¬V,
¬V), which is actually the pure equilibrium joint profile adopted by both agents in the experiment as shown in
Fig. 17. This profile can be interpreted as follows: both agents choose draw all the time. From this model it is
clear that this game is weakly competitive.
Figure 18: Modeling the second version of grid game 2 by a simple matrix game.
To see whether the game model in Fig. 18 has any mixed equilibrium strategy profile, assume that both
agents adopt the mixed joint profile ((p,1-p),(q,1-q)). Then agent B's goal is to find the best value for p to
maximize his expected payoff. The expected payoff of B given this mixed profile is
EP(B) = 100p(1-q) - (1-p)(1-q)
      = 100p - 100pq - 1 + q + p - pq
      = 101p - 101pq + q - 1
      = 101p(1-q) + q - 1
(42)
Clearly the best value for p is 1. By the same argument it can be shown that the best value for q is also 1.
This gives the dominating pure equilibrium profile (¬V,¬V). So there is no mixed Nash equilibrium profile for
this game. But what if both agents adopt the non-equilibrium mixed profile ((0.5,0.5),(0.5,0.5)), that is, each
agent tries to volunteer half the time, or volunteers with probability 0.5? This means that, given that the game is
played infinitely often, each agent will win 25% of the games, lose 25% of the games, draw 25% of the games,
and collide with the other agent in 25% of the games. Then the expected payoffs for both agents are as follows:
EP(B) = EP(R) = 0.25(100) + 0.25(-1) + 0.5(0) = 24.75 (43)
It is conjectured that this result is better than the 100% draw of the adopted pure equilibrium profile
(because of action domination) through which both agents receive zero and also better than the other two pure
equilibrium profiles in which only one agent wins 100% of the games. Unfortunately, this profile is not
an equilibrium. This is another example of the weakness of the Nash equilibrium solution concept in some weakly
competitive MASs (the first example was the prisoner's dilemma subclass of matrix games).
Figure 19: Payoff matrix at state ((1,4),(3,4))
Given the theoretical analysis above (using the simplified game model in Fig. 18), it is time to investigate
the experimental results. It is instructive to investigate the local matrix games at the critical states. First the
matrix game at state ((1,4),(3,4)) is shown in Fig. 19. The ideal optimal value for both agents is 99.3, but there is
no joint strategy profile that can achieve this value for both agents, so this matrix game is competitive. All the
joint profiles enclosed in rectangles can easily be verified to be Nash equilibria, so this matrix game is clearly weakly
competitive. These pure equilibrium profiles can be divided into three groups as follows: The first group
includes profiles: (E,N), (E,S), (N,W), and (S,W). In this group the game always ends in win for one and the
same agent. The second group includes profiles: (W,E), (N,N), (N,S), (S,N), and (S,S). In this group the game
always ends in draw. The third group includes profiles: (W,N), (W,S), (N,E), and (S,E). In this group the game
always ends in win for one and the same agent.
Concerning the first group, one agent x always volunteers by moving to cell (2,4), while the other agent y
remains still by moving north or south, hitting the border or a forbidden cell. But what happens after that?
Assume that agent R makes this voluntary move leading to the state ((1,4),(2,4)), the matrix game
corresponding to this state is shown in Fig. 20.
Figure 20: Payoff matrix at state ((1,4),(2,4))
The ideal optimal value for B is 99.4 (he needs 7 steps to reach his goal cell, 99.4 = 0.999^6 × 100) and for R
is 99.3 (he needs 8 steps to reach his goal cell, 99.3 = 0.999^7 × 100). There is an ideal optimal profile (E,N) so
this matrix game is cooperative and specifically weakly cooperative since the profile (E,E) is ideal optimal for
only agent R. Clearly both agents adopt the profile (E,N) which clears the way for agent B to reach his goal cell
and win. Both agents then move straightly until they reach the state ((4,1),(0,2)) which has the corresponding
matrix game shown in Fig. 21.
The ideal optimal value for B is 100 (he needs one step to reach his goal cell) while for R is 99.9 (he needs
two steps to reach his goal cell, 99.9 = 0.999 * 100). There is no ideal optimal joint profile but there are pure
Nash equilibrium profiles, so this matrix game is weakly competitive. The north action for B is a dominating
action so by playing this action agent B, regardless of the action of R (actually there is no need for
coordination), will reach his goal cell and win the game.
Figure 21: Payoff matrix at state ((4,1),(0,2))
So, concerning the first group of pure equilibrium profiles of Fig. 19, the agent who makes the voluntary
move to cell (2,4) will always lose the game and the other agent will always win; the winning agent takes only 9
steps. So the game will never end in draw. This group corresponds to the equilibrium profiles (V,¬V) and
(¬V,V) of the simplified game model shown in Fig. 18. Concerning the second group, the last four profiles,
(N,N), (N,S), (S,N), and (S,S), lead to both agents remaining fixed in their cells by continuously hitting the
game board or the forbidden cells. The first profile (W,E) returns the agents back to their initial situation
((0,4),(4,4)). The matrix game corresponding to this state is shown in Fig. 22. The ideal optimal value for both
agents is 99.2 (each agent needs 9 steps to reach his goal cell, 99.2 = 0.999^8 × 100) and there is an ideal optimal
joint profile (E,W), so this matrix game is cooperative and specifically weakly cooperative since there exist
some profiles which are ideal optimal for only one agent like (E,E) and (W,W). This joint ideal optimal profile
will return the agents back to state ((1,4),(3,4)). As a result both agents keep moving back and forth between this
state and the initial state which ultimately leads to a draw. So the game, given these equilibrium profiles, will
always end in draw. This group of pure equilibrium profiles corresponds to the equilibrium profile (¬V,¬V) of
the simplified game model shown in Fig. 18.
The actual behavior shown by the agents in Fig. 17 resulted from coordinating on equilibrium profiles
belonging to the second group. This can be explained as follows. By carefully investigating the payoff matrix
shown in Fig. 19, it is clear that the actions north and south for both agents are dominating actions. So by
following the lexicographic convention each agent plays north which leads to coordination on the pure
equilibrium strategy profile (N,N).
Figure 22: Payoff matrix at state ((0,4),(4,4))
The third group of equilibrium profiles shows a surprising aspect. Take for example the profile (W,N). This
profile leads to state ((0,4),(3,4)) which has the corresponding matrix game shown in Fig. 23. The ideal optimal
value for B is 99.3 and for R is 99.2 and there is an ideal optimal joint profile (E,W) so this matrix game is
cooperative and specifically weakly cooperative since some profiles are ideal optimal for only one agent like
(E,E). Following this profile both agents are moved to state ((1,4),(2,4)), which has the corresponding matrix
game shown in Fig. 20. Given the discussion above agent B will ultimately win this game. So the equilibrium
profiles in this group lead to a win for one and the same agent, the game will never end in draw. However the
winning agent takes ten steps unlike the nine steps taken by the winning agent following the equilibrium profiles
in the first group. However, winning the game using the equilibrium profiles in the third group is done in a
strange way. It seems that winning is done by deception rather than by a voluntary move from the loser agent.
The loser agent always chooses to remain fixed in his cell when in state ((1,4),(3,4)) by playing north or south
implying that he has no intention to volunteer.
Figure 23: Payoff matrix at state ((0,4),(3,4))
The third group of equilibrium profiles is unparalleled in the simplified game model shown in Fig. 18. Both
agents do not volunteer but the game ends in win. This may be due to one or both of the following reasons: (1)
The simplified game model in Fig. 18 is just an approximate model. It does not fully represent the structure of
the real game. (2) Decomposing the real game, which was shown to be weakly competitive, into local
cooperative and competitive matrix games in the same way as CMMDP decomposes cooperative MASs into
local cooperative matrix games is just an approximation to the real problem. Specifically this decomposition
may add noise such as that represented by the third group of equilibrium profiles. Even if one or both reasons are
true, it is believed that both models are good representations of the game and, in general, of weakly competitive MASs.
The discussion above can be summarized as follows. The pure equilibrium strategy profiles at the critical
state ((1,4),(3,4)) will lead to one of the following: (1) 100% of the games end in draw (the second group). This
result is favored by the agents because of the domination of actions that lead to a draw. (2) 100% of the games
end in win for one and the same agent, 100% loss by the other agent (the first and third groups).
It is believed that the solutions provided by the pure Nash equilibrium profiles are not satisfactory. So a
mixed equilibrium solution of the matrix game at state ((1,4),(3,4)) is required. To test for the existence of mixed Nash
equilibria for this matrix game, assume that both agents adopt the mixed profile ((p1,p2,p3,1-p1-p2-p3),(q1,1-q1-q2-q3,q3,q2)). So agent B's goal is to find the best values for p1, p2, and p3 so as to maximize his expected
payoff. The expected payoff of agent B given this profile is
EP(B) = 99.1q1 + 99.2(q2 + q3) + 99.3(1 - q1 - q2 - q3)(1 - p1 - p2)
      + 99.2p2(1 - q1 - q2 - q3) + 98.2p1(1 - q1 - q2 - q3)
      = 99.3 - 0.1(2q1 + q2 + q3) - 0.1(11p1 + p2)(1 - q1 - q2 - q3)
(44)
Obviously the maximum value of the last term is zero, which is achieved when p1 = p2 = 0. So the best
response of agent B is the mixed profile π_B = (0,0,p,1-p). Similarly, it can be shown that the best response of
agent R is π_R = (0,0,q,1-q). So there are infinitely many mixed equilibrium profiles ((0,0,p,1-p),(0,0,q,1-q)).
These mixed equilibrium profiles give both agents a real optimal value of 99.2. Unfortunately, these mixed
equilibria also belong to the second group which will always lead to a draw, and this is expected because of the
domination of the north and south actions for both agents. This verifies the inadequacy of the Nash equilibrium
solution concept in some competitive MASs. This is also a verification of the same result drawn from the
simplified game model shown in Fig. 18.
From the discussion above it is clear that rational play at the critical state ((1,4),(3,4)) will lead to 100%
draw and both agents receive zero. In searching for a better real optimal value a non-equilibrium mixed profile
((0.5,0.0,0.5,0.0),(0.0,0.5,0.5,0.0)) at the state ((1,4),(3,4)) is investigated. Note that this profile is similar to the
non-equilibrium mixed profile ((0.5,0.5),(0.5,0.5)) in the simplified game model shown in Fig. 18, where each
agent volunteers 50% of the time. For the following analysis assume that the maximum length allowed for the
game is k. The new mixed profile reduces the original matrix game to that shown in Fig. 24. The expected
payoff of both agents is
EP(B) = EP(R) = 0.25(98.2 + 99.2 + 99.3 + 99.2) = 98.975
(45)
Figure 24: The reduced matrix game at state ((1,4),(3,4))
To calculate the probability that agent x wins the game:
Pr{agent x wins the game}
= Pr{agent x reaches his goal cell in a maximum of k steps}
= Pr{agent x passes the critical state ((1,4),(3,4)) on his side in at most k - 8 steps (the agent needs 7 steps after passing the critical state plus an additional initial step)}
= Pr{agent x passes the critical state at the first trial} + Pr{agent x passes the critical state at the second trial | both agents failed at the first trial}·Pr{both agents fail at the first trial} + … + Pr{agent x passes the critical state at the (k-8)th trial | both agents failed at the first k - 9 trials}·Pr{both agents fail at the first k - 9 trials}

Pr{agent x passes the critical state at the ith trial | both agents fail at the first i - 1 trials}
= Pr{x remains fixed by playing north and the other agent y moves to cell (2,4) at the ith trial | both agents fail at the first i - 1 trials}
= 0.5 × 0.5 = 0.25

Pr{both agents fail at the first j trials}
= Pr{either (N,N) or (E,W) is played at each of the first j trials}
= 0.5^j

Pr{agent x wins the game}
= 0.25 + 0.25 × 0.5 + 0.25 × 0.5^2 + … + 0.25 × 0.5^{k-9}
= 0.25(1 + 0.5 + 0.5^2 + … + 0.5^{k-9})    (geometric series)
= 0.5(1 - 0.5^{k-8})

⇒ Pr{agent x wins the game} = 0.5(1 - 0.5^{k-8}),  k ≥ 8
To calculate the probability of a collision:
Pr{agent x makes at least one collision}
= 1 - Pr{no collision occurs at all}

Pr{no collision occurs at all}
= Pr{no collision | k-1 trials at the critical state and the last trial fails}·Pr{k-1 trials at the critical state and the last trial fails}
+ Pr{no collision | k-1 trials at the critical state and the last trial succeeds}·Pr{k-1 trials at the critical state and the last trial succeeds}
+ Pr{no collision | k-2 trials at the critical state}·Pr{k-2 trials at the critical state}
+ … + Pr{no collision | 2 trials at the critical state}·Pr{2 trials at the critical state}
+ Pr{no collision | 1 trial at the critical state}·Pr{1 trial at the critical state}

Pr{no collision | i < k-1 trials at the critical state}
= Pr{(N,N) is played at the first i-1 steps at the critical state | i < k-1 trials at the critical state}
= 0.5^{i-1}

Pr{no collision | k-1 trials at the critical state and the last trial fails} = 0.5^{k-1}
Pr{no collision | k-1 trials at the critical state and the last trial succeeds} = 0.5^{k-2}

Pr{i < k-1 trials at the critical state}
= Pr{i-1 failures at the first i-1 trials}·Pr{success at the ith trial}
= 0.5^i

Pr{k-1 trials at the critical state and the last trial fails} = 0.5^{k-1}
Pr{k-1 trials at the critical state and the last trial succeeds} = 0.5^{k-1}

Pr{no collision occurs at all}
= 0.5^{k-1} × 0.5^{k-1} + 0.5^{k-2} × 0.5^{k-1} + 0.5^{k-3} × 0.5^{k-2} + … + 0.5 × 0.5^2 + 0.5
= 0.25^{k-1} + 0.5 × 0.25^{k-2} + 0.5 × 0.25^{k-3} + … + 0.5 × 0.25 + 0.5
= 0.25^{k-1} + 0.5(0.25^{k-2} + 0.25^{k-3} + … + 0.25 + 1)
= (1/3)(2 + 0.25^{k-1})
(46)

⇒ Pr{agent x makes at least one collision} = (1/3)(1 - 0.25^{k-1}),  k ≥ 1
(47)
Let Z be a random variable denoting the number of trials performed at the critical state ((1,4),(3,4)) until the
first successful pass. Z is a geometric random variable with the following distribution
P(Z = j) = p(1-p)^{j-1},   j = 1, 2, 3, …
(48)
where p is the probability of success; here it is the probability that one of the two profiles (E,N) and (N,W) is
played, so p = 0.5. The expectation of Z is
E[Z] = 1/p = 1/0.5 = 2
(49)
Let L be a random variable denoting the length of a game; then its expectation is
E[L] = min(k, 8 + E[Z]) = min(k, 10),   k ≥ 1
(50)
Assume that C is a random variable denoting the number of collisions in a game, and let Y be a
random variable denoting the number of failures at the critical state. Then the expected number of collisions in
a game can be calculated as follows.
E[C] = E[E[C|Y]] = Σ_{j=1}^{k-1} Pr{Y = j}·E[C | Y = j]

Pr{Y = j} = 0.5^{j+1},   0 ≤ j ≤ k-2
          = 0.5^{k-1},   j = k-1
(so that Σ_{j=0}^{k-1} Pr{Y = j} = 1)

E[C | Y = j] = 0.5j
(the conditional distribution of C given Y = j is a binomial distribution with parameters j and 0.5,
so E[C | Y = j] = 0.5j)

E[C] = Σ_{j=1}^{k-2} 0.5^{j+1}·(0.5j) + 0.5^{k-1}·0.5(k-1)
     = 0.5 - 0.5^k,   k ≥ 1
(51)
Then the expected payoff of an agent x in a game of maximum length k can be calculated as follows.
EP(x) = Pr{agent x wins the game}·(reward of winning) + (expected number of collisions in a game)·(penalty of a collision)
      = 100 × 0.5(1 - 0.5^{k-8}) - (0.5 - 0.5^k)
EP(x) = 49.5 - 12799(0.5^k),   k ≥ 8
(52)
Given that k is sufficiently large, Table 1 compares the rational equilibrium joint profile (N,N) that was
actually played by the agents in the experiment with the newly suggested non-equilibrium mixed joint profile at
the critical state ((1,4),(3,4)).
Performance index                    | Equilibrium Joint Profile (N,N)  | Non-equilibrium Joint Profile
Game outcome                         | 100% of the games end in draw    | 100% of the games end in win
Chance of winning (per agent)        | 0%                               | 50%
Expected game length                 | k >> 10                          | 10
Probability of collision occurrence  | 0                                | 1/3
Expected number of collisions        | 0                                | 0.5
Expected payoff for each agent       | 0.0                              | 49.5
Table 1: Performance comparison between the equilibrium and non-equilibrium profiles
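The figures in Table 1 for the non-equilibrium profile can also be checked by a direct Monte Carlo simulation of the critical-state dynamics; the Python sketch below (not the paper's experiment) uses the same abstraction as the derivation above: one initial step, repeated trials at the critical state where B mixes over {E, N} and R over {W, N} with probability 0.5 each, and 7 further steps for the winner.

import random

def play_one_game(k=200):
    steps, collisions, winner = 1, 0, None      # one initial step to reach the critical state
    while steps < k - 7:                        # the winner still needs 7 steps after passing
        b, r = random.choice("EN"), random.choice("WN")
        steps += 1
        if (b, r) == ("E", "W"):                # both volunteer: collision, both penalized
            collisions += 1
        elif (b, r) == ("E", "N"):              # B volunteers, R goes on to win
            winner = "R"; break
        elif (b, r) == ("N", "W"):              # R volunteers, B goes on to win
            winner = "B"; break
        # (N, N): both agents stay put and try again
    length = steps + 7 if winner else k
    return winner, collisions, length

N = 100000
games = [play_one_game() for _ in range(N)]
print("P(win)        ", sum(w is not None for w, _, _ in games) / N)                 # about 1.0
print("P(B wins)     ", sum(w == "B" for w, _, _ in games) / N)                      # about 0.5
print("P(collision)  ", sum(c > 0 for _, c, _ in games) / N)                         # about 1/3
print("E[#collisions]", sum(c for _, c, _ in games) / N)                             # about 0.5
print("E[game length]", sum(l for _, _, l in games) / N)                             # about 10
print("E[payoff of B]", sum((100 if w == "B" else 0) - c for w, c, _ in games) / N)  # about 49.5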
An experiment is performed to test the correctness of the above theoretical analysis for large k. A modified
version of the Extended-Q algorithm that deals with weakly competitive MASs as well as cooperative MASs,
Algorithm 6, is used. In the experiment data are gathered to compute the performance indices shown in Table 1.
The experiment results, averaged over ten trials, are as follows:
• The probability that the game ends in win = 1.0
• The probability that agent B wins the game = 0.4988
• The probability that a collision occurs in a game = 0.3385
• The expected number of collisions in a game = 0.5136
• The expected length of a game = 10.0287
• The expected payoff of agent B = 49.3664
Comparing these results with those shown in Table 1, it is clear that the experimental results are compatible
with the theoretical results. From Table 1 and the experiment results it is clear that the non-equilibrium mixed
strategy profile ((0.5,0.0,0.5,0.0),(0.0,0.5,0.5,0.0)) at the critical state ((1,4),(3,4)) outperforms the pure
equilibrium profile (N,N) at the same state. It is important to note that the expected payoff of each agent given
the non-equilibrium profile, 49.5, is different from both the value predicted by Eq. 43, 24.75, and the value
predicted by Eq. 45, 98.975 (actually it is greater than that because of discounting). This indicates that:
• The simplified game model shown in Fig. 18 is just an approximation of the real multi-agent
planning problem. It ignores the local details of the game (it views it as a single-stage
problem), but its global perspective is good enough that it gives, in a very simple way, the main
features of the game.
• Decomposing the multi-agent planning problem in weakly competitive MASs into local matrix
games planning problems using the CMMDP framework is also an approximation of the real
problem. The local matrix games cannot accurately predict the expected final outcome of the
game for each agent. Note that this problem does not exist in cooperative MASs, where this
decomposition is believed to represent the real planning problem in an exact way. But still the
local matrix games can give the global relative importance of the different local pure, mixed,
equilibrium, and non-equilibrium strategy profiles at each stage of the problem. This
decomposition also greatly simplifies the planning problem and fits naturally with
reinforcement learning.
From the discussion above the following important conclusions can be drawn:
• CMMDP can be used to model weakly competitive MASs as well as cooperative ones.
However, the local matrix games in the new model may in general be either cooperative or weakly
competitive.
• Given the definition of CMMDP (with the extension mentioned above), the Nash equilibrium
solution concept is not the best to use in some weakly competitive MASs. A non-equilibrium
mixed profile was shown to give much better results in the second setting of grid game 2.
Algorithm 6 shows the exploitation step (solving the local matrix game) of a two-agent modified version of
the Extended-Q algorithm; the remaining steps are the same as in the original algorithm. This modified version
manipulates weakly competitive as well as cooperative MASs. It provides the starting step towards extending
the Extended-Q algorithm to fully handle weakly competitive and then strongly competitive MASs as well as
cooperative MASs.
Using the current learned estimation of the optimal joint Q-function, find all pure Nash
equilibrium strategy profiles and sort them lexicographically. Let N be the ordered list of
these profiles. Let ā ∈ N be the first profile in N.
Let Ij be the ideal optimal value of agent j, and Ik be the ideal optimal value of agent k.
Find the first profile ājk ∈ N : Pj(ājk) = Ij and Pk(ājk) = Ik.
Find the first profile āj ∈ N : Pj(āj) = Ij; find the first profile āk ∈ N : Pk(āk) = Ik.
Do one of the following:
• If N = ∅, then it is a strongly competitive matrix game; exit.
• If ājk ≠ nil, then it is a cooperative matrix game; play ājk; exit.
• If exactly one of āj or āk is nil, then it is a weakly competitive matrix game; play the
non-nil profile; exit.
• If āj = āk = nil, then it is a weakly competitive matrix game; play ā; exit.
• If āj ≠ āk, then it is a weakly competitive matrix game; play each of them with
probability 0.5; exit.
Algorithm 6: The exploitation step of a two-agent modified version of Extended-Q to manipulate weakly
competitive as well as cooperative MASs
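Read procedurally, the exploitation step of Algorithm 6 can be sketched as the following Python function; it assumes the sorted list N of pure equilibria, the payoff estimates Pj and Pk, and the ideal optimal values Ij and Ik have already been computed from the current Q estimates (all names are illustrative, not the paper's code):

import random

def exploitation_step(N, Pj, Pk, Ij, Ik):
    if not N:
        return None                               # strongly competitive: no pure equilibrium
    a_bar = N[0]                                  # first equilibrium in lexicographic order
    a_jk = next((a for a in N if Pj(a) == Ij and Pk(a) == Ik), None)
    a_j  = next((a for a in N if Pj(a) == Ij), None)
    a_k  = next((a for a in N if Pk(a) == Ik), None)
    if a_jk is not None:
        return a_jk                               # cooperative: shared ideal optimal equilibrium
    if (a_j is None) != (a_k is None):
        return a_j if a_j is not None else a_k    # weakly competitive: play the non-nil profile
    if a_j is None and a_k is None:
        return a_bar                              # weakly competitive: play the first equilibrium
    return random.choice([a_j, a_k])              # weakly competitive: each of the two with probability 0.5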
5.3 Grid Game 3
5.3.1 Rules of the game
The game is shown in Fig. 25. In this game, the blue and red agents are trying to reach a shared goal cell G. In
one version of the game a collision occurs if both agents either try to occupy the same cell (except the goal cell)
or try to switch cells. In this case, each one is penalized and bounced back to his cell. This definition of collision
increases the difficulty of the game. In the other simpler version of the game, agents are allowed to occupy the
same cell or to switch cells. The game ends either when at least one agent reaches the goal cell or when a
specified amount of time expires and no agent succeeds in reaching it. All agents' movements are deterministic
except north movements from the lower left and lower right cells, where the agent moves up with probability p
and remains fixed in his cell with probability 1 - p. The game has two settings. In the first setting, both agents
act as a team and gain positive payoffs only when they reach the goal cell simultaneously. This represents strong
cooperation. In the second setting, any agent gains his positive payoff when he reaches the goal cell regardless
of the status of the other agent. This represents weak cooperation in the simpler version of the game and weak
competition in the harder one where collisions are defined.
Figure 25: Grid game 3
5.3.2 Parameters
5.3.2.1 Agents' rewards
• The first setting: each agent gets 100 only when both agents reach the goal cell
simultaneously.
• The second setting: an agent gets 100 as soon as he reaches the goal cell regardless of the
status of the other agent.
• Each agent gets 0 when reaching any other position (without colliding, in the harder
version).
• Each agent gets -1 in case of collision in the harder version.
The reward function as defined above is chosen similar to the reward function in similar experiments performed
in [18].
5.3.2.2 Learning parameters
• Discount factor = 0.999
• Exploration strategy: Simple, with exploration probability 0.8
Several values between 0.0 and 1.0 were tried for the exploration probability; 0.8 proved to be the best in terms of successful convergence to a joint ideal optimal plan in all trials.
5.3.2.2.1 Learning rate
In the presentation of the Normal-Q learning algorithm it was mentioned that one of the conditions that should
hold to guarantee convergence is that the learning rate is decreased slowly during the training period. In both
grid game 1 and grid game 2 the learning rate was held constant during the whole training period and
convergence was reached. This is due to the deterministic nature of the environment model, that is, the state transition and reward functions are deterministic. But in the current game the model of the system is non-deterministic, and hence continual learning (a constant learning rate) causes oscillation in the Q-function and hence oscillation in the agents' adopted joint strategy profile.
Experimenting with a constant learning rate in grid game 3 showed that this oscillation led the agents, in most cases, to adopt real optimal non-equilibrium strategy profiles, and in the remaining cases to oscillate between different equilibrium profiles. Converging to a stationary optimal Q-function, and hence to a stationary ideal equilibrium joint strategy profile, is strongly preferred. So the learning rate is chosen to decay in inverse proportion to the frequency of encountering the state-action pair (s, ā); that is, the learning rate used when updating Q(s, ā) depends on the frequency of visiting this particular state-action pair.
$$\alpha(s,\bar{a}) = \begin{cases} 1, & 0 \le f(s,\bar{a}) \le 1 \\ \dfrac{1}{\log_2 f(s,\bar{a})}, & 2 \le f(s,\bar{a}) \le c \\ \dfrac{1}{f(s,\bar{a}) - (c - \lceil \log_2 c \rceil - 1)}, & c + 1 \le f(s,\bar{a}) \end{cases} \qquad (53)$$
where f(s, ā) is the update frequency of the state-action pair (s, ā), ⌈·⌉ denotes the usual ceiling function, and c is some threshold value. The first and third parts of Eq. 53 were first tried following [18], but the decay rate of the learning factor was too rapid to permit enough learning and useful use of the encountered experience; that is, the learning period was too short. So the logarithmic part of Eq. 53 is inserted to slow down the decay of the learning factor and hence allow a learning period long enough to make use of the experience the agents encounter. The parameter c controls the length of this period. Different values were tried and finally c = 2000 proved to be the best. The third part is defined so that the learning rate continues just after the stopping point of the second part.
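As an illustration, this decay schedule can be written as a small Python function. It is a sketch of the reconstruction of Eq. 53 given above (with the default c = 2000 mentioned in the text), not the authors' code.

import math

def learning_rate(f, c=2000):
    """Decaying learning rate of Eq. 53, where f is the number of times the
    state/joint-action pair (s, a-bar) has been updated so far."""
    if f <= 1:
        return 1.0                                   # first part: full learning rate
    if f <= c:
        return 1.0 / math.log2(f)                    # second part: logarithmic decay
    # third part: continues just after the stopping point of the second part
    return 1.0 / (f - (c - math.ceil(math.log2(c)) - 1))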
5.3.2.3 Experiment parameters
The probability p of a successful north movement from the lower left or lower right cells is varied from 0.0 to 1.0 in steps of 0.1, making 11 sub-experiments, each with the parameters described in the following subsection.
5.3.2.3.1 Sub-experiment parameters
In each trial of the sub-experiment the agents play a specified number of games for training. Each game is
initialized to the configuration shown in Fig. 25. Then a number of test games are played to investigate the type
of joint plan reached by the agents. Results are averaged over all trials.
• Number of trials = 30
• Number of training games in the learning phase of each trial = 15000
• Number of test games in the test phase of each trial = 10000
• Maximum length of a game = 50 (this limit is reached when no agent succeeds in reaching his goal cell)
Various values for the number of training games and the maximum game length were tried; 15000 and 50, respectively, were found to be the smallest values for which both agents converge to a joint ideal optimal plan in all 30 trials (as determined from the test games). The number of trials is chosen to be 30 to guarantee that the results obtained are reliable (as suggested by theorems in probability theory, in particular the Central Limit Theorem). The same reason applies to the number of test games (note that the cost of playing this large number of test games is small).
5.3.3 Game solution
5.3.3.1 Strong cooperation
In the strong cooperation setting both agents receive their positive rewards if and only if they reach the goal cell
simultaneously. That is they act as one team. The next two subsections show convergence results for both
versions of the game.
5.3.3.1.1 The simple version
The simple version of the game does not include the definition of collisions; therefore, both agents can freely occupy the same cell or switch their cells. In all sub-experiments, that is, for all values of p, both agents
converge to the same joint Nash equilibrium plan which is depicted in Fig. 26. Both agents take only three steps
to reach the goal cell. This is the minimum possible number of steps. In reaching this joint plan the agents
always avoid choosing the non-deterministic action north from their initial cells since this choice may increase
the number of required steps to reach the goal cell.
Figure 26: Joint plan in the simple version of grid game 3, strong cooperation setting
5.3.3.1.2 The hard version
In the hard version of the game the notion of collision is defined: the agents can neither occupy the same cell nor switch their cells.
Figure 27: An instance of the first class of learned policies, a) when north movement succeeds, b) when north
movement fails
By investigating the joint policies learned over the whole range of p, it is observed that there are two main
classes of learned policies. The first class is depicted in Fig. 27. In this class an agent occupies the lower middle
cell while the other tries to move up. If the north movement succeeds, then the agents go on as shown in Fig.
27.a taking only three steps. If the north movement fails, then the agent making it gives up and both agents try
another path as shown in Fig. 27.b. This last path takes five steps. The expected number of movements in a
game in the first class of policies, E1(p), can be computed as follows:
$$E_1(p) = 3p + 5(1 - p) = 5 - 2p \qquad (54)$$
The second class of policies is depicted in Fig. 28. In this class of policies the agent choosing the north
action keeps trying it until the movement is successful. Meanwhile, the other agent waits in the lower middle cell by repeatedly moving south against the game board boundary. After the north movement succeeds, both agents
move straight to the goal cell. To calculate the expected number of steps in a game using the second class of
policies, let M be a random variable denoting the number of north tries made by an agent until success. M is a
geometric random variable which has the following distribution:
$$\Pr\{M = m\} = p(1 - p)^{m-1} \qquad (55)$$
Figure 28: An instance of the second class of learned policies
The expected number of steps in a game using the second class of policies, E2(p), can then be computed as follows:
$$E_2(p) = 2 + E[M] = 2 + \frac{1}{p} \qquad (56)$$
The actual class of policies to which both agents converge was found to depend on p. The optimal policy
was found to satisfy the following equation:
$$\pi^*(p) = \arg\min_{\Pi}\{E_1(p), E_2(p)\} \qquad (57)$$
, where Π is the set of all possible joint policies. This equation can be verified from Table 2. The first row shows
the values of p, the probability of success of a north movement from the lower left and lower right cells. The
second row shows Av(p), the measured average number of steps in a game. The third row shows E1(p), and the
last row shows E2(p).
p      | 0.0 | 0.1  | 0.2 | 0.3  | 0.4  | 0.5 | 0.6  | 0.7  | 0.8  | 0.9  | 1.0
Av(p)  | 5.0 | 4.8  | 4.6 | 4.4  | 4.23 | 4.0 | 3.71 | 3.44 | 3.25 | 3.11 | 3.0
E1(p)  | 5.0 | 4.8  | 4.6 | 4.4  | 4.2  | 4.0 | 3.8  | 3.6  | 3.4  | 3.2  | 3.0
E2(p)  | ∞   | 12.0 | 7.0 | 5.33 | 4.5  | 4.0 | 3.67 | 3.43 | 3.25 | 3.11 | 3.0
Table 2: Effect of p on the class of learned policies
From Table 2 it can be seen that for 0.0 ≤ p ≤ 0.4 the agents converge to a policy from the first class, where E1(p) < E2(p), and for 0.6 ≤ p ≤ 0.9 the agents converge to a policy from the second class, where E2(p) < E1(p). At p = 0.5, it is found that in half of the sub-experiments the agents converge to a policy from the first class, and in the other half to a policy from the second class; this is to be expected since E1(p) = E2(p) = 4.0. At p = 1.0 the two classes are equivalent, since no failure can occur when trying to move north from the initial cells; both formulas give E1(p) = E2(p) = 3.0.
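The comparison of the two policy classes can be reproduced with a few lines of Python. This is a hypothetical check of Eqs. 54, 56 and 57 against the averages reported in Table 2, not part of the original experiments.

def expected_steps(p):
    """Expected game length under the two learned policy classes."""
    e1 = 5 - 2 * p                                 # Eq. 54: give up after one failed north move
    e2 = float('inf') if p == 0 else 2 + 1 / p     # Eq. 56: retry north until it succeeds
    return e1, e2

for p in [i / 10 for i in range(11)]:
    e1, e2 = expected_steps(p)
    best_class = 1 if e1 <= e2 else 2              # Eq. 57: adopt the class with the smaller expectation
    print(f"p={p:.1f}  E1={e1:.2f}  E2={e2:.2f}  class={best_class}")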
5.3.3.2 Weak cooperation
In the weak cooperation setting each agent receives a positive reward as soon as he reaches the goal cell
regardless of the status of the other agent. The game does not include the notion of collision, that is, both agents can freely occupy the same cell or switch their cells; otherwise it becomes a weakly competitive game. In all sub-experiments, that is, for all values of p, both agents converge to the joint plan shown in Fig. 26 (the same joint plan as in the simple version of the strong cooperation setting). This plan is natural since no agent has any motivation to try to move up from his initial cell and hence take the risk of losing the game.
5.3.3.3 Weak competition
This setting is similar to the weak cooperation setting except that the game in this case defines the notion of
collision. The agents can neither occupy the same cell nor switch their cells; if they try to do so, they are penalized and bounced back to their original cells. Using the first version of Extended-Q, both agents converge
to the situation depicted in Fig. 29. There is an inherent conflict around the cell (1,2), since the agent who
occupies this cell guarantees reaching the goal cell, while the other possible way through the non-deterministic
north movement is not guaranteed to succeed and so the agent who makes it may fail to reach the goal cell.
Figure 29: Irresolvable situation in the weak competition configuration
Fig. 30 shows the matrix game corresponding to the initial state ((0,2),(2,2)). The ideal optimal value for
both agents is 99.8 (for p<1: 99.8 > yB = yR). Assuming 0 < p < 1, there are only two pure Nash equilibrium
profiles: (E,N) and (N,W). The first profile is ideal optimal for agent B only, and the second one is ideal optimal
for agent R only. Clearly, this matrix game is weakly competitive. Actually, by following the same line of
reasoning of Section 5.2.3.2, it can be shown that grid game 3 in this setting is a weakly competitive MAS.
Figure 30: Payoff matrix at state ((0,2),(2,2))
What actually occurred that led to the situation depicted in Fig. 29 is that each agent plays his own ideal
optimal equilibrium profile. This results in the non-equilibrium joint profile (E,W) that causes this indefinite
competition around the cell (1,2). So the game always ends in a draw and both agents receive nothing. Even if
both agents coordinate on the same equilibrium profile, then, assuming the game is played repeatedly, one agent
wins 100% of the games and the other one wins only 100p% of the games (note that the notions of draw, win,
and loss have the same meaning as defined in Section 5.2.3.2); this is unfair.
Using the second version of Extended-Q as defined in Section 5.2.3.2 yields the mixed joint strategy profile
((0.5,0.0,0.5,0.0),(0.0,0.5,0.5,0.0)). Given this profile the probability that an agent wins the game can be
calculated as follows.
Pr{agent x wins the game}
= 1 - Pr{agent x loses the game}
= 1 - Pr{agent x moves north and his movement fails while agent y moves to
cell (1,2)}
= 1 - 0.5(1 - p)
= 0.5(1 + p)
It is conjectured that this result is better than the 100% draw of the actual profile played by the agents and
also better than the unfair pure equilibrium profiles.
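The comparison between the mixed profile and a coordinated pure equilibrium can be made explicit with a short Python sketch; the function below simply evaluates the closed-form expressions derived above and is illustrative only.

def win_rates(p):
    """Win probabilities in the weak-competition setting of grid game 3.

    Under the mixed profile of Section 5.2.3.2 each agent wins with
    probability 0.5 * (1 + p); under a coordinated pure equilibrium one agent
    always wins while the other wins only when its north move succeeds."""
    mixed = 0.5 * (1 + p)
    pure = (1.0, p)
    return mixed, pure

print(win_rates(0.5))   # e.g. p = 0.5: each agent wins 75% of the games under the mixed profile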
5.4 Discussion
Section 5 describes experimental work done on some MASs. Three main experiments are performed: grid game
1, grid game 2, and grid game 3. These games represent, through different settings, instances of strongly
cooperative, weakly cooperative, and weakly competitive MASs. The main results and conclusions of these
experiments are:
• Using the Extended-Q algorithm, defined in Section 4, agents in a cooperative MAS converge to an ideal optimal rational joint plan. This is confirmed for deterministic environments (grid game 1 and grid game 2) and for a non-deterministic environment (grid game 3).
• This convergence of the Extended-Q algorithm provides evidence for the correctness of CMMDP as a mathematical model for cooperative MASs.
• A smooth transition is made from cooperative MASs to weakly competitive ones in grid game 2 and grid game 3. The Nash equilibrium solution concept (rational play) was shown to be a weak concept in the weakly competitive MASs presented in this Section. It was shown that the proposed CMMDP for cooperative MASs can also be applied to competitive MASs with one modification: the constituent local matrix games may be weakly competitive and/or cooperative instead of just cooperative. Based on this, an extension of Extended-Q was proposed to handle weakly competitive MASs as well as cooperative ones. This new version assumes rational play in all cases of local matrix games except one where rational play is weak. This case appears when the agents face a weakly competitive matrix game with the following property: each agent has an ideal optimal pure equilibrium profile, but there is no ideal optimal pure equilibrium profile shared by all agents. The new version of Extended-Q was shown (analytically and experimentally) to give better results than rational play in the weakly competitive settings (grid game 2 and grid game 3).
• A decaying learning rate is necessary for the convergence of the Extended-Q algorithm in non-deterministic environments (grid game 3). A constant learning rate in such environments leads to continual oscillation of the Q-function and hence oscillation of the adopted joint policy.
• The exploration strategy followed in the experiments of this Section is very simple: at each time step explore a random action with probability ε and exploit the current estimation of the Q-function with probability 1 - ε. Despite its simplicity it was effective in converging to an ideal optimal plan.
• The multi-agent planning problem may be simplified to a matrix game planning problem (grid game 2). This simplification, although not practically useful in solving the planning problem, is important for theoretically identifying the nature of the planning problem and the class to which the MAS belongs.
6. Generalization
6.1 The need for generalization
So far it has been assumed that it is possible to enumerate the state and joint action spaces and store them as
lookup tables. Let m be the space required by this representation in a MAS; then m can be calculated as follows:
$$m = k \cdot t^n \qquad (58)$$
, where k is the number of environment's states, t is the number of actions available to each agent (assuming all
agents have the same number of actions), and n is the number of agents. Except in very small environments, this
means impractical space requirements. Additionally, the time and data needed to accurately fill them may be
prohibitively large.
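The growth of this table is easy to illustrate. The following Python snippet evaluates Eq. 58 for a few hypothetical problem sizes; the numbers of states and actions used here are illustrative, not taken from the experiments.

def table_entries(k, t, n):
    """Number of lookup-table entries required by Eq. 58: m = k * t**n."""
    return k * t ** n

# Hypothetical example: 10,000 environment states, 4 actions per agent.
for n_agents in (2, 3, 5, 8):
    print(n_agents, table_entries(k=10_000, t=4, n=n_agents))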
In large smooth state spaces similar states are expected to have similar values and similar optimal joint
actions; said another way, similar matrix games have similar payoffs and equilibrium solutions. Therefore, there
should be some more compact representation than a table. Most problems will have continuous or large discrete
state spaces; some will have large or continuous action spaces [20]; and some involve a large number of agents.
The problem of learning in large spaces is addressed through generalization techniques, which allow experience
with a limited subset of these spaces to be usefully generalized to produce a good approximation over a much
larger subset.
According to Eq. 58 the space requirement grows polynomially with the size of the state space and the size
of the action space of the individual agent and it grows exponentially with the number of agents. In this paper
the problem of generalizing over large state spaces is addressed. Generalization over the action space may be
handled in similar ways. The most severe problem is the exponential growth with the number of agents. This
may be handled in a hierarchical manner. A team of agents may be split into sub-teams, each of which has a leader or controller. Agents within a sub-team coordinate their actions only within this sub-team; they do not need to know anything about the agents within the other sub-teams. At the top level the leaders of the sub-teams coordinate their actions. This idea can be applied recursively, with the sub-teams themselves split into other sub-teams and so on.
Generalization from examples has already been extensively studied in supervised learning. What is needed
is to combine the Extended-Q algorithm with one of the existing generalization methods. This kind of
generalization is called function approximation because it takes examples from a desired function, here the
action-value function, and attempts to generalize from them to construct an approximation of the entire function.
6.2 Joint action evaluation with function approximation
Using function approximation to predict the value of joint actions means that the action-value function at time t, Qt(s, ā), is represented not as a table, but as a parameterized functional form with parameter vector w̄t. So Qt(s, ā) depends totally on w̄t, varying from time step to time step only as w̄t varies. For example, Qt(s, ā) might be the function computed by an artificial neural network, with w̄t being the vector of connection weights. By adjusting the weights, any of a wide range of different functions Qt(s, ā) can be implemented by the network. Typically, the number of parameters (the number of components of w̄t) is much less than the number of states or state-action pairs, and changing one parameter changes the estimated value of many state-action pairs.
Consequently, when a single state-action pair is backed up, the change generalizes from that state-action pair to
affect the values of many other state-action pairs.
As mentioned above this paper addresses only generalization over the state space so it is assumed that each
agent has a small finite number of actions and the number of agents is small. This assumption is exploited by
assigning one function approximator per joint action ā. This reduces the interference between the values of
different joint actions and hence reduces the noise induced by the function approximator on the output
predictions. From now on Qā(s) is used to denote the output of the function approximator for the joint action ā.
In ordinary supervised learning the learner is given constant (Input,Output) examples, in this case
(s,Qā*(s)) where s is the environment's state and the target value Qā*(s) is the optimal action-value function of
the joint action ā (represented by the function approximator ā) at state s. Unfortunately, Qā* is not known so it
has to be replaced by its estimate Qā depending on the current experience tuple (s,ā,r,s'):
$$Q_{\bar{a}}(s) = r + \gamma \max_{\bar{u} \in A} Q_{\bar{u}}(s') \qquad (59)$$
, where Qū(s') is an estimate and may come from a different function approximator than the one producing the Qā(s) estimate. As an agent gains more experience his estimate Qā(s) improves, meaning that the examples presented to the function approximators change over time, which complicates the problem of estimating the optimal action-value function.
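A minimal Python sketch of how such a moving target could be formed from the experience tuple (s, ā, r, s') is shown below. The dictionary q_approximators, mapping each joint action to a callable approximator, is an assumption of this sketch rather than a structure defined in the paper.

def td_target(reward, next_state, q_approximators, gamma=0.999):
    """Training target of Eq. 59 for the approximator of the taken joint action:
    r + gamma * max over joint actions u-bar of Q_u(s')."""
    return reward + gamma * max(q(next_state) for q in q_approximators.values())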
As Sutton pointed out [33], the purpose of both backpropagation (a function approximation algorithm for
Artificial Neural Networks) and TD methods (Extended-Q is a TD method) is accurate credit-assignment (action
evaluation based on long-term consequences). Backpropagation decides which part(s) of a neural network to
change so as to influence the network's prediction and thus to reduce its overall error, whereas TD methods
decide how each prediction of a temporal sequence of predictions should be changed. Backpropagation (or in
general any function approximator) addresses a structural credit-assignment issue whereas TD methods address
a temporal credit-assignment issue.
6.3 The gradient descent methods
Gradient descent methods are among the most widely used of all function approximation methods and are
particularly well-suited to reinforcement learning [35]. In gradient descent methods the parameter vector is a column vector with a fixed number of real-valued components, w̄t = (w1t, w2t, ..., wmt)T (T denotes transpose), and Qāt(s) is a smooth differentiable function of w̄t. On each time step t the function approximator is fed with a new training example. The ideal training example is (st, Qā*(st)), but unfortunately Qā* is not known, so a first-level estimate is used based on the current joint policy πt, providing the training example (st, Qā^πt(st)). Unfortunately, even Qā^πt is not exactly known, so a second level of estimation is used, based on the current interaction with the environment and the current estimate of Qāt(st+1), providing the actual training example (st, Qāt(st)), where Qāt(st) is calculated as follows (same as Eq. 59 but in time notation):
$$Q_{\bar{a}}^t(s_t) = r_{t+1} + \gamma \max_{\bar{u} \in A} Q_{\bar{u}}^t(s_{t+1}) \qquad (60)$$
The key idea behind the gradient descent methods is to search the space of possible weight vectors to find
the weights that best fit the temporal training examples. In order to derive a weight learning rule, that is a weight
update rule, an error measure of a particular weight vector must be specified. Although there are many ways to
define this error, one common measure that will be used in the following analysis is as follows:
$$E_t(\bar{w}_t) = \frac{1}{2}\left[Q_{\bar{a}}^t(s_t) - q_{\bar{a}}^t(s_t)\right]^2 \qquad (61)$$
, where qāt is the output of the function approximator ā at time t.
Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary initial
weight vector, then repeatedly modifying it in small steps. At each step, the weight vector is altered in the
direction that produces the steepest descent along the error surface [25]. This process continues until some local
minimum error is reached. This direction can be found by computing the partial derivative of E with respect to
each component of the weight vector w̄. This vector derivative, written ∇E(w̄), is called the gradient of E with respect to w̄; it is calculated as follows:
$$\nabla E(\bar{w}) = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_m}\right)^T \qquad (62)$$
The gradient is itself a vector whose components are the partial derivatives of E with respect to each of the
wi. When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest
increase in E. The negative of this vector therefore gives the direction of steepest decrease. Then the learning
rule for gradient descent is:
$$\bar{w} \leftarrow \bar{w} + \Delta\bar{w}, \qquad \Delta\bar{w} = -\eta\, \nabla E(\bar{w}) \qquad (63)$$
, where η is a positive constant called the learning rate, which determines the step size in the gradient descent search. The negative sign is present to move the weight vector in the direction that decreases E. This learning rule can also be written in component form as follows:
$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta \frac{\partial E}{\partial w_i} \qquad (64)$$
Given the particular definition of the error function in Eq. 61, the weight update rule specializes to the following rule:
$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta \left[Q_{\bar{a}}^t(s_t) - q_{\bar{a}}^t(s_t)\right] \frac{\partial q_{\bar{a}}^t(s_t)}{\partial w_i} \qquad (65)$$
The only remaining task is the determination of the approximating function qāt(st). The most important requirement on this function is that it be differentiable with respect to the weight vector w̄. In this paper an Artificial Neural Network is proposed as an implementation of qāt(st).
Figure 31: ANN Architecture used in function approximation
6.4 Neural network implementation
In this paper the function approximators are implemented using feedforward multilayer ANNs. An artificial
neural network can be defined as follows:
Definition 28: Artificial Neural Network: A neural network is an interconnected assembly of simple
processing elements, units or nodes, whose functionality is loosely based on the animal neuron. The processing
ability of the network is stored in the inter-unit connection strengths, or weights, obtained by a process of
adaptation to, or learning from, a set of training patterns [14].
The architecture of the ANN proposed in this paper is depicted in Fig. 31. The input pattern is a vector of
np+1 components. The first component is a threshold and is always 1. The last np components describe the state
of the environment. Each arrow in the figure represents an interconnection between either an input node and a
hidden layer unit or a hidden layer unit and the output unit. A weight or strength w is associated with each
interconnection. The set of these weights comprises the weight vector w̄, which controls the nature of the
approximating function. Each unit i in the hidden and output layers is a simple processing element which first
computes a weighted linear combination of its inputs. This linear sum is called the net input neti to unit i. The
unit then produces its output by applying a function ƒ, called the unit's activation function, to neti. The output
unit assumes the simplest case where ƒ is the identity function, and the unit's output is just its net input. This is
called a linear unit. For each hidden unit, ƒ is the sigmoid function which is defined as follows:
$$f(net) = \frac{1}{1 + e^{-k \cdot net}} \qquad (66)$$
, where k is some positive constant. Hence, each hidden unit is called a sigmoid unit, as illustrated in Fig. 32.
Figure 32: The sigmoid unit
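A minimal Python/NumPy sketch of the forward pass through this architecture is given below, assuming the weights are stored as a matrix W_hidden (one row per hidden unit) and a vector w_out (one component per hidden unit); these names, and the absence of a separate output-unit threshold, are assumptions of the sketch rather than details fixed by Fig. 31.

import numpy as np

def sigmoid(net, k=1.0):
    """Sigmoid activation of Eq. 66 with slope parameter k."""
    return 1.0 / (1.0 + np.exp(-k * net))

def forward(x, W_hidden, w_out):
    """Forward pass of the network in Fig. 31: sigmoid hidden units feeding a
    single linear output unit. x is the input pattern, whose first component
    is the constant threshold 1."""
    hidden_out = sigmoid(W_hidden @ x)   # outputs o_i of the hidden units
    output = float(w_out @ hidden_out)   # linear output unit: o equals its net input
    return output, hidden_out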
6.5 Derivation of the weight update rule
This section presents the derivation of the weight-tuning rule for the ANN shown in Fig. 31. It begins by giving
the definitions of the symbols used in the derivation. Then it gives the derivation of the learning rule for the
output unit weights. Finally, it gives the derivation of the learning rule for any hidden unit weights.
6.5.1 Notation
The derivation of the weight update rule uses the following notation:
• xj = the jth input to the output unit (the output of hidden unit j)
• wj = the weight associated with the jth input to the output unit (i.e., with the output of hidden unit j)
• xji = the jth input to hidden unit i
• wji = the weight associated with the jth input to hidden unit i
• l = Σj wj xj (the net input of the output unit)
• li = Σk wki xki (the net input of hidden unit i)
• ƒ = the sigmoid function
• o = ƒ(l) (the neural network output, the output computed by the output unit)
• oi = ƒ(li) (the output computed by hidden unit i)
• o' = ∂o/∂l (the partial derivative of o with respect to l)
• oi' = ∂oi/∂li (the partial derivative of oi with respect to li)
• t = the target output of the neural network (the target output of the output unit)
• ti = the target output of hidden unit i
• η = the learning rate
• E(w̄) = the error function
6.5.2 Learning rule for the output unit weights
As seen before the error function was defined as follows:
$$E(\bar{w}) = \frac{1}{2}(t - o)^2$$
Then according to the gradient descent rule:
$$\Delta w_j = -\eta \frac{\partial E}{\partial w_j} = -\eta \frac{\partial E}{\partial o}\frac{\partial o}{\partial w_j} = -\eta \frac{\partial E}{\partial o}\frac{\partial o}{\partial l}\frac{\partial l}{\partial w_j}$$
with
$$\frac{\partial E}{\partial o} = \frac{\partial}{\partial o}\left[\frac{1}{2}(t-o)^2\right] = -(t-o), \qquad \frac{\partial o}{\partial l} = o', \qquad \frac{\partial l}{\partial w_j} = \frac{\partial}{\partial w_j}\left(\sum_i w_i x_i\right) = x_j$$
so that
$$\Delta w_j = \eta\,(t-o)\,o'\,x_j = \eta\,\delta\,x_j \qquad (67)$$
, where δ = (t - o)o' is the error term of the output unit. Eq. 67 gives the weight update rule of a general output unit. For a linear output unit, o' = 1 (since o = ƒ(l) = l); hence the weight update rule for the linear output unit is:
$$\Delta w_j = \eta\,(t-o)\,x_j = \eta\,\delta\,x_j \qquad (68)$$
, where δ = (t - o) is the error term of the output unit.
6.5.3 Learning rule for the hidden unit weights
According to the gradient descent rule:
$$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta \frac{\partial E}{\partial o}\frac{\partial o}{\partial w_{ji}} = \eta\,(t-o)\frac{\partial o}{\partial l}\frac{\partial l}{\partial w_{ji}} = \eta\,(t-o)\,o'\frac{\partial l}{\partial o_i}\frac{\partial o_i}{\partial w_{ji}} = \eta\,(t-o)\,o'\frac{\partial l}{\partial o_i}\frac{\partial o_i}{\partial l_i}\frac{\partial l_i}{\partial w_{ji}}$$
with
$$\frac{\partial l}{\partial o_i} = \frac{\partial}{\partial o_i}\left(\sum_j w_j o_j\right) = w_i, \qquad \frac{\partial l_i}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\left(\sum_k w_{ki} x_{ki}\right) = x_{ji}$$
so that
$$\Delta w_{ji} = \eta\,(t-o)\,o'\,o_i'\,w_i\,x_{ji} = \eta\,\delta_i\,x_{ji} \qquad (69)$$
, where δi = (t - o) o' oi' wi is the error term of hidden unit i. Eq. 69 represents the weight update rule of a general hidden unit, which also assumes a general output unit. For a sigmoid hidden unit and a linear output unit the derivatives o' and oi' become o' = 1 and oi' = oi(1 - oi). The weight update rule then becomes:
$$\Delta w_{ji} = \eta\,(t-o)\,o_i(1-o_i)\,w_i\,x_{ji} = \eta\,\delta_i\,x_{ji} \qquad (70)$$
, where δi = (t - o) oi (1 - oi) wi is the error term of hidden unit i.
6.6 Neuro-Extended-Q
Algorithm 7 defines a new version of the Extended-Q algorithm where the Q-function is represented using a set
of ANNs. Each ANN has the architecture defined in Section 6.4. Algorithm 7 shows only the learning step of the algorithm; the remaining steps are the same as in Algorithm 6. This means that Neuro-Extended-Q handles
cooperative as well as weakly competitive MASs.
For each agent j of which agent i is aware, perform the following steps:
Given the experience tuple (s, ā, s', rj), update the Q-function based on this experience as follows:
1. Calculate a new estimate Qāj(s) as follows:
$$Q_{\bar{a}}^j(s) = r_j + \gamma \max_{\bar{u} \in A} Q_{\bar{u}}^j(s') \qquad (71)$$
2. Let x̄ be a vector that describes the environment's state s. Then the new training example for the ANN concerned with the joint action ā will be <x̄, Qāj(s)>.
3. Propagate the input forward through the network by presenting the input vector x̄ to the network and computing the output ok of every unit k in the network.
4. Propagate the error backward through the network:
• For the output unit, calculate its error term δ:
$$\delta = Q_{\bar{a}}^j(s) - o \qquad (72)$$
• For each hidden unit h, calculate its error term δh:
$$\delta_h = o_h (1 - o_h)\, w_h\, \delta \qquad (73)$$
• Update each weight to the output unit as follows:
$$w_j \leftarrow w_j + \eta\, \delta\, x_j \qquad (74)$$
• Update each weight to hidden unit h as follows:
$$w_{jh} \leftarrow w_{jh} + \eta\, \delta_h\, x_{jh} \qquad (75)$$
Algorithm 7: Neuro-Extended-Q algorithm for agent i
Algorithm 7 is actually a combination of Extended-Q with the BACKPROPAGATION algorithm. The
BACKPROPAGATION algorithm learns the weights for a multilayer ANN, given a fixed set of units and
interconnections. It employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs [25].
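To illustrate how the pieces fit together, the following Python sketch performs one learning step of Algorithm 7 for a single experience tuple, reusing the sigmoid/forward helpers sketched in Section 6.4. The container nets (mapping each joint action to its network weights), the learning rate eta, and the in-place NumPy updates are assumptions of this sketch, not the authors' implementation.

import numpy as np

def neuro_extended_q_step(x, x_next, reward, nets, a_bar, gamma=0.999, eta=0.1):
    """One learning step of Algorithm 7 for one agent.

    nets maps each joint action to a pair (W_hidden, w_out) of weight arrays
    for the network of Fig. 31; a_bar is the joint action actually taken;
    x and x_next describe states s and s' (leading component = 1)."""
    # Eq. 71: new estimate of Q_a(s) from the experience tuple (s, a-bar, s', r)
    target = reward + gamma * max(forward(x_next, W, w)[0] for W, w in nets.values())
    W_hidden, w_out = nets[a_bar]
    # Step 3: propagate the input forward through the network of joint action a-bar
    o, hidden_out = forward(x, W_hidden, w_out)
    # Eq. 72: error term of the linear output unit
    delta = target - o
    # Eq. 73: error terms of the sigmoid hidden units (assumes k = 1 in Eq. 66)
    delta_hidden = hidden_out * (1.0 - hidden_out) * w_out * delta
    # Eqs. 74-75: gradient-descent weight updates, applied in place
    w_out += eta * delta * hidden_out
    W_hidden += eta * np.outer(delta_hidden, x)
    return target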
7. Conclusions and Future work
7.1 Conclusions
The major contributions provided by this paper are:
• A review of the RL paradigm and its basic elements is given.
• A review of MASs is given and a new taxonomy of them is proposed. There are two main classes: cooperative MASs and competitive MASs. Within each class there are two subclasses: weak and strong. This taxonomy is based on the following notions: rationality, ideal optimality, real optimality, and the type of action profiles (deterministic versus non-deterministic).
• Single-agent Q-learning is shown to lack a convergence guarantee when used in the multi-agent setting. The key aspect of successful planning in MASs is that each agent becomes aware of the remaining agents. Four degrees of awareness are proposed in this paper. The research done here assumes level-2 awareness.
• A mathematical formulation of MASs, the Multi-Agent Markov Decision Process (MMDP), is provided. All subclasses of matrix games (a special type of MASs), according to the proposed MASs taxonomy, are formally defined. CMMDP (Cooperative Multi-agent Markov Decision Process) is defined. It is a mathematical formulation of the planning problem in cooperative MASs that greatly facilitates the task of finding a global joint optimal plan. CMMDP is based on the decomposition of the global planning problem into local matrix games planning problems. CMMDP is also shown, with little modification, to have the ability to formulate the planning problem in at least a subclass of weakly competitive MASs.
• A new learning algorithm, called Extended-Q, is proposed to solve CMMDPs. Each matrix game is solved using the Nash equilibrium concept. It is shown that, in cooperative MASs, rational play (action choice according to the concept of Nash equilibrium) is completely compatible with ideal optimality for all agents involved in the MAS. It is conjectured that the optimal equilibrium local solutions of the matrix games yield a globally optimal solution, that is, a globally optimal joint plan.
• Another version of Extended-Q is proposed that can handle weakly competitive MASs as well as cooperative ones. It is conjectured that, for the new version to give optimal results, the agents should ignore the equilibrium solutions (that is, the rational play) in some local situations of the weakly competitive MASs.
• Experiments are performed on some grid games, as examples of MASs. These experiments show: (1) the convergence of the first version of Extended-Q to an ideal optimal joint plan in cooperative MASs for both deterministic and non-deterministic environments, (2) the weakness of the Nash equilibrium solution concept in some weakly competitive MASs, (3) the convergence of the second version of Extended-Q in weakly competitive MASs, and (4) that a decaying learning rate is necessary for the convergence of Extended-Q in non-deterministic environments.
• Both versions of Extended-Q mentioned above assume using a lookup table for the representation of the Q-function. Unfortunately, this representation is infeasible for large problems in terms of the space required for storage and the time and data required for accurate estimation. It is shown that in MASs there are three sources of this problem: (1) a continuous or large discrete state space, (2) a continuous or large discrete action space, and (3) a large number of agents. A new version of Extended-Q, called Neuro-Extended-Q, is proposed to handle the first of these, the large state space. This algorithm assumes using an ANN as a function approximator for each joint action. It combines Extended-Q with BACKPROPAGATION, which is a gradient descent learning technique for feedforward ANNs.
• The research done in this paper may also be of importance to the Game Theory community; it provides: (1) a taxonomy of MASs or games, (2) formal frameworks for MASs or games, (3) algorithms for general classes of MASs or games, (4) experiments and their thorough analysis for some interesting games, (5) a critique of the Nash equilibrium solution concept, and (6) a generalization technique for a class of MASs or games.
7.2 Future Work
Some suggestions for future work are:
• In this paper new notions were defined: ideal optimality and real optimality, and a new MAS taxonomy was proposed based on these notions. These new concepts should be examined thoroughly to clarify their importance in the study of MASs.
• It is required to analytically prove the convergence of the first version of Extended-Q to an ideal optimal rational joint plan in cooperative MASs. Such an analytical proof would also validate the correctness of CMMDP as a framework for cooperative multi-agent planning.
• The CMMDP framework should be extended to formalize the planning problem in competitive MASs as well as cooperative ones. An attempt was made in this paper to extend it to the weakly competitive setting (the second version of Extended-Q). Algorithms based on this extension of CMMDP (essentially the update rule of the Q-function) should also be developed.
• In this paper it was shown that rational play (playing Nash equilibrium profiles) may not be the best choice in competitive MASs; an informal alternative play was proposed in weakly competitive MASs that was shown to give better results for repeated play (repeatedly facing the planning problem). This proposal should be formally investigated. Other alternatives for action choice in competitive MASs should also be investigated. Game Theory plays a crucial role in this investigation.
• The work in this paper assumes that the environment is completely observable, that is, each agent can perfectly sense the complete state of the environment. This is not always true in real applications, where the environment is partially observable. Each agent then must have the ability to predict the current environmental state. Some work has been done for partially observable single-agent systems; this work needs to be extended to the multi-agent setting.
• A pre-imposed lexicographic convention was proposed to overcome the problem of equilibrium selection. Another possible method to overcome this problem is through fictitious play, where each agent models the behaviors of others (level-3 awareness). Extended-Q may be integrated with fictitious play and the resulting algorithm tested.
• An experimental study should be done to test the performance of Neuro-Extended-Q.
• Neuro-Extended-Q handles only large state spaces, assuming that the size of the joint action space is manageable. Extended-Q should be combined with generalization techniques over large joint action spaces as well as large state spaces.
• All versions of the Extended-Q algorithm developed in this paper assume offline learning. Exploration techniques that support online learning should be developed and integrated with Extended-Q.
Acknowledgments
This work is supported in part by E-JUST Research Fellowship awarded to Dr. Mohamed A. Khamis.
References
1. Aha D. W., "Machine Learning," A tutorial presented at the Fifth International Workshop on Artificial Intelligence & Statistics, 1995.
2. Benda M., Jagannathan V., and Dodhiawala R., "On Optimal Cooperation of Knowledge Sources - An Empirical Investigation," Technical Report BCS--G2010--28, Boeing Advanced Technology Center, Boeing Computing Services, Seattle, Washington, 1986.
3. Bose N. K. and Liang P., "Neural Network Fundamentals with Graphs, Algorithms, and Applications," McGraw-Hill International Editions, 1996.
4. Boutilier C., "Planning, Learning and Coordination in Multiagent Decision Processes," In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195-210, Amsterdam, 1996.
5. Boutilier C., Dean T., and Hanks S., "Decision Theoretic Planning: Structural Assumptions and Computational Leverage," Journal of Artificial Intelligence Research, Vol. 11, pp. 1-94, 1999.
6. Bowling M. and Veloso M., "An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning," Technical Report CMU-CS-00-165, 2000.
7. Cao Y. U., Fukunaga A. S., and Kahng A. B., "Cooperative Mobile Robotics: Antecedents and Directions," Autonomous Robots, Vol. 4, pp. 7-27, 1997.
8. Charniak E. and McDermott D., "Introduction to Artificial Intelligence," Addison-Wesley Publishing Company, 1985.
9. Claus C. and Boutilier C., "The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems," In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.
10. Dautenhahn K., "Getting to Know Each Other - Artificial Social Intelligence for Autonomous Robots," Robotics and Autonomous Systems, Vol. 16, pp. 333-356, 1995.
11. Dietterich T. G., "Machine Learning," ACM Computing Surveys, Vol. 28A, No. 4, December 1996.
12. Garcia A., Reaume D., and Smith R. L., "Fictitious Play for Finding System Optimal Routings in Dynamic Traffic Networks," Transportation Research Part B: Methodological, Vol. 34, No. 2, pp. 147-156, 2000.
13. Garrido L. and Sycara K., "Multiagent Meeting Scheduling: Preliminary Experimental Results," In Proceedings of the Second International Conference on Multiagent Systems, pp. 95-102, Menlo Park, Calif.: AAAI Press, 1996.
14. Gurney K., "An Introduction to Neural Networks," UCL Press, UK, 1996.
15. Harmon M. E. and Harmon S. S., "Reinforcement Learning: A Tutorial," URL: http://citeseer.nj.nec.com/harmon96reinforcement.html, 1996.
16. Haykin S., "Neural Networks - A Comprehensive Foundation," Second Edition, Prentice-Hall, Inc, 1999.
17. Hu J. and Wellman M. P., "Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm," In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 242-250, San Francisco, CA, 1998.
18. Hu J. and Wellman M. P., "Experimental Results on Q-Learning for General-Sum Stochastic Games," In Proceedings of the 17th International Conference on Machine Learning, pp. 407-414, 2000.
19. Jung D. and Zelinsky A., "Grounded Symbolic Communication between Heterogeneous Cooperating Robots," Autonomous Robots, 2000.
20. Kaelbling L. P., Littman M. L., and Moore A. W., "Reinforcement Learning: A Survey," Journal of AI Research, Vol. 4, pp. 237-258, 1996.
21. Littman M. L., "Markov Games as a Framework for Multi-agent Reinforcement Learning," In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157-163, New Brunswick, NJ, 1994.
22. Ljunberg M. and Lucas A., "The OASIS Air-Traffic Management System," In Proceedings of the
Second Pacific Rim International Conference on AI (PRICAI-92), Seoul, Korea, 1992.
23. Luger G. F. and Stubblefield W. A., "Artificial Intelligence Structures and Strategies for Complex
Problem Solving," Third Edition, Addison Wesley Longman, Inc, 1998.
24. McCarthy J., "What is Artificial Intelligence," URL: http://wwwformal.stanford.edu/jmc/whatisai.html, 2001.
25. Mitchell T. M., "Machine Learning," McGraw-Hill International Edition, 1997.
26. Moder J. J. and Elmaghraby S. E., Editors, "Handbook of Operations Research - Foundations and
Fundamentals," Van Nostrand Reinhold Company, 1978.
27. Nwana H. S., "Software Agents: An Overview," Knowledge Engineering Review, Vol. 11, No. 3,
pp. 1-40, Cambridge University Press, 1996.
28. Nwana H. S. and Ndumu D. T., "A Perspective on Software Agents Research," Knowledge
Engineering Review, Vol. 14, No. 2, pp. 125-142, 1999.
29. Sandholm T. W. and Crites R. H., "Multiagent Reinforcement Learning in the Iterated Prisoner's
Dilemma," Biosystems Vol. 37, pp. 147-166, 1995.
30. Sen S., "Multiagent Systems: Milestones and New Horizons," Trends in Cognitive Science, Vol. 1,
No. 9, pp. 334-339, 1997.
31. Sheppard J. W., "Multi-Agent Reinforcement Learning in Markov Games," PhD Dissertation, Johns Hopkins University, 1997.
32. Stone P. and Veloso M., "Multiagent Systems: A Survey from a Machine Learning Perspective,"
Autonomous Robots, Vol. 8, No. 3, July 2000.
33. Sutton R. S., "Learning to Predict by the Methods of Temporal Differences," Machine Learning Vol.
3, pp. 9-44, 1988.
34. Sutton R. S., "On the Significance of Markov Decision Processes," In Proceedings of ICANN97, pp.
273-282, Springer, 1997.
35. Sutton R. S. and Barto A. G., "Reinforcement Learning: An Introduction," MIT Press, Cambridge,
MA, 1998.
36. Sycara K., Decker K., Pannu A., Williamson M., and Zeng D., "Distributed Intelligent Agents,"
IEEE Expert Vol. 11, No. 6, pp. 36-46, 1996.
37. Sycara K., "Multiagent Systems," AI magazine, Vol. 19, No. 2, pp. 79-92, 1998.
38. Sycara K., Decker K., and Zeng D., "Intelligent Agents in Portfolio Management," In Agent
Technology: Foundations, Applications, and Markets, eds, Wooldridge M. and Jennings N. R., pp.
267-283. Berlin: Springer, 1998.
39. Tan M., "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents," In
Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.
40. Thrun S. and Schwartz A., "Issues in Using Function Approximation for Reinforcement Learning,"
In Proceedings of the Fourth Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale,
NJ, 1993.
41. Watkins C. J., "Learning from Delayed Rewards," PhD Thesis, Cambridge University, Cambridge, England, 1989.
42. Winston P. H., "Artificial Intelligence," Second Edition, Addison-Wesley Publishing Company,
1984.
43. Walid E. Gomaa, Amani A. Saad, and Mohamed A. Ismail, Learning Joint Coordinated Plans in
Multi-agent Systems, P.W.H. Chung, C.J. Hinde, M. Ali (Eds.): IEA/AIE 2003, LNAI 2718, pp.
154–165, 2003. Springer-Verlag Berlin Heidelberg 2003.
Appendix A
1. Bellman equation for Vπ:

$$\begin{aligned}
V^{\pi}(s) &= E_{\pi}\left[R_t \mid s_t = s\right] \\
&= E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\right] \\
&= E_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s\right] \\
&= E_{\pi}\left[r_{t+1} \mid s_t = s\right] + \gamma E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s\right] \\
&= \sum_{a \in A} \pi(s,a)\, E\left[r_{t+1} \mid s_t = s, a_t = a\right] + \gamma \sum_{a \in A} \pi(s,a)\, E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s, a_t = a\right] \\
&= \sum_{a \in A} \pi(s,a)\, R(s,a) + \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} T(s,a,s')\, E_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s'\right] \\
&= \sum_{a \in A} \pi(s,a)\, R(s,a) + \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} T(s,a,s')\, E_{\pi}\left[R_{t+1} \mid s_{t+1} = s'\right] \\
&= \sum_{a \in A} \pi(s,a)\, R(s,a) + \gamma \sum_{a \in A} \pi(s,a) \sum_{s' \in S} T(s,a,s')\, V^{\pi}(s') \\
&= \sum_{a \in A} \pi(s,a) \left[R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^{\pi}(s')\right]
\end{aligned}$$
2. Bellman optimality equation for V*:

$$\begin{aligned}
V^*(s) &= \max_{a \in A} Q^*(s,a) \\
&= \max_{a \in A} E_{\pi^*}\left[R_t \mid s_t = s, a_t = a\right] \\
&= \max_{a \in A} E_{\pi^*}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right] \\
&= \max_{a \in A} E_{\pi^*}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s, a_t = a\right] \\
&= \max_{a \in A} \left( E\left[r_{t+1} \mid s_t = s, a_t = a\right] + \gamma\, E_{\pi^*}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s, a_t = a\right] \right) \\
&= \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, E_{\pi^*}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s'\right] \right) \\
&= \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, E_{\pi^*}\left[R_{t+1} \mid s_{t+1} = s'\right] \right) \\
&= \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \right)
\end{aligned}$$