A New Learning Technique for Planning in Cooperative Multi-agent Systems
Walid Gomaa a,b, Mohamed A. Khamis a,∗
a Cyber-Physical Systems Lab, Egypt-Japan University of Science and Technology (E-JUST), New Borg El-Arab City, Alexandria 21934, Egypt
b Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Egypt

Abstract
One important class of problems in multi-agent systems is that of planning, or sequential decision making. Planning is the process of constructing an optimal policy with the objective of reaching some terminal goal state. The key aspect of multi-agent planning is coordinating the actions of the individual agents. There are three major approaches to coordination: communication, pre-imposed conventions, and learning. This paper addresses multi-agent planning through the learning approach. To facilitate this study, a new taxonomy of multi-agent systems is proposed that divides them into two main classes: cooperative multi-agent systems and competitive multi-agent systems. This paper focuses on planning in the cooperative setting, with some extensions to the competitive setting. A new reinforcement learning-based algorithm is proposed for learning a joint optimal plan in cooperative multi-agent systems. This algorithm is based on the decomposition of the global planning problem into multiple local matrix-game planning problems. The algorithm assumes that all agents are rational. Experimental studies on some grid games show the convergence of this algorithm to an optimal joint plan. The algorithm is further extended to planning in weakly competitive multi-agent systems (a subclass of competitive multi-agent systems). This new version of the algorithm also assumes rational behavior of the agents in all problem stages, except in some situations where it is believed that a deviation from rational behavior gives better results. These two versions of the algorithm assume the use of a lookup table to hold the problem data.
This representation is infeasible in practical applications where the problem size is prohibitive. So, a third version of the algorithm is proposed which uses a two-layer feed-forward Artificial Neural Network to represent the problem data. The learning rule of this version is a combination of reinforcement learning and BACKPROPAGATION (a learning technique for feed-forward artificial neural networks).

Keywords: Multi-agent systems, reinforcement learning, feed-forward Artificial Neural Network, cooperative multi-agent systems, competitive multi-agent systems.

1. Introduction
MASs aim at creating systems that interconnect separately developed agents, thus enabling the ensemble to function beyond the capabilities of any single agent in the setup [28]. The field of MASs has gained much interest in recent years because of the increasing complexity, openness, and distributivity of current and future systems. Research in MASs dates back to the late 1970s, when the field was mostly exploratory in nature. The main research issues were: distributed interpretation of sensor data, organizational structuring, and generic negotiation protocols [30]. Like the rest of AI, the field has matured and is now focusing on formal theories of negotiation, distributed reasoning, multi-agent learning, communication languages [30], and multi-agent planning. The first MAS applications appeared in the mid-1980s and increasingly cover a variety of domains, ranging from manufacturing to air traffic control and information management. Some representative examples of these applications are: Distributed Vehicle Monitoring (DVMT), ARCHON, OASIS, and WARREN [37]. The key aspect of multi-agent planning is coordinating the actions of the individual agents. Two major approaches have been used to attack the coordination problem: communication and pre-imposed conventions. Research in Distributed Artificial Intelligence (DAI) has been mainly concerned with these approaches.
DAI assumes that the agents cooperate to achieve some common goal or accomplish some task. Unfortunately, communication and pre-imposed conventions are not sufficient for today's and future MAS applications, which have the following unique characteristics:
(1) The relation between the agents is not just fully cooperative as assumed by DAI. A broader spectrum of MASs exists, covering the range from fully cooperative to fully competitive agents.
(2) MASs must deal with dynamic, open, non-stationary, unmodeled, and highly distributed environments, and must do so in an efficient way. Even the number of agents and their capabilities need not be held fixed.
(3) Communication bandwidth is a scarce resource. Communication delays may also affect the system's effectiveness, especially in real-time systems. So the goal is to minimize the communication between the agents.
(4) As the number of agents in a MAS increases, the search space grows exponentially, making the requirements of storage and computational power infeasible. So a more compact representation of the search space is required, such that similar states are represented by just one state.
The only feasible way to handle such challenges is through Machine Learning (ML) techniques, and in particular Reinforcement Learning (RL). The RL paradigm may be the best way to address the above challenges since it has the following distinguishing features:
(1) RL algorithms do not require pre-knowledge of the environment's model.
(2) RL provides continual, adaptive, online learning algorithms where learning is interleaved with operation. Contrast this with other forms of learning, such as supervised and unsupervised learning, where learning is done offline, that is, the learning phase is completely separated from the operation phase of the system.
(3) The main class of problems best handled by RL is the class of multi-stage problems. The planning problem fits completely within this class.
(4) RL has been rigorously studied in the domain of single-agent systems during the past ten years. This provides a strong base for extension to the MAS domain.
(5) RL provides algorithms that learn in an incremental, step-by-step manner. This implies efficiency both in terms of required storage and computational power.
(6) RL can be combined with supervised learning to provide generalization techniques, which are necessary for a compact representation of the search space. This combination is sometimes referred to as neurodynamic programming.
The most popular RL algorithm that has been used in single-agent systems, and as a basis for RL extensions to MASs, is Q-learning. So far in practice, most people still use single-agent Q-learning for learning in MASs [18]. However, during the past few years some multi-agent Q-learning algorithms have been proposed. For example, [21] extends Q-learning to zero-sum MASs. [17] further extends the work of [21] to the case of general-sum MASs that have certain restrictions. [39] uses single-agent Q-learning in the multi-agent setting, giving the agents the facility to communicate. The purpose of communication is to share sensation, experience, and learned knowledge. [4] conjectures that single-agent Q-learning will lead to equilibrium strategies in fully cooperative MASs. The work in this paper represents another attempt to extend the Q-learning algorithm to the multi-agent setting. To do this, Q-learning is combined with work in Game Theory (GT). The major objectives of this paper are:
(1) To provide a taxonomy of MASs. MASs are a complex research domain; the only feasible way to study them is to classify MASs according to some criteria.
(2) To define the planning problem in MASs and identify its distinguishing features from planning in single-agent systems.
(3) To show the lack of a convergence guarantee of single-agent Q-learning when used in the multi-agent setting.
This justifies the need for a new formulation of the multi-agent setting and the necessity of modifying single-agent Q-learning.
(4) To provide a mathematical formulation of the multi-agent planning problem (Multi-Agent Markov Decision Process, MMDP).
(5) To provide formal definitions of all subclasses of matrix games (a special type of MASs).
(6) To formulate the planning problem in cooperative MASs (a subclass of MASs) based on the decomposition of the global planning problem into multiple local matrix-game planning problems (Cooperative Multi-agent Markov Decision Process, CMMDP).
(7) To propose a new learning algorithm for solving the planning problem in cooperative MASs. The new algorithm is an extension of Q-learning that is based on the CMMDP.
(8) To perform experiments to examine the CMMDP as a new model of cooperative multi-agent planning and to verify the convergence of the new algorithm.
(9) To address the problem of combinatorial explosion of the problem size in practical applications by combining the new Q-learning-based algorithm for cooperative MASs with generalization techniques from supervised learning.
(10) To provide new concepts such as the taxonomy of MASs, new models such as the CMMDP, and new algorithms that may be of importance to the Game Theory community.
(11) To give a review of the RL paradigm and its basic elements.
Preliminary results of this work have been published in [43]. In this paper, we provide a more detailed description and a detailed survey of the state-of-the-art work. This paper is organized in seven sections as follows: Section 1 begins with a general introduction to MASs, followed by the motivation behind the research done in this paper. Section 2 reviews RL. A precise and simple mathematical formulation of the RL problem, the Markov Decision Process (MDP), is given. Section 3 discusses MASs. It begins by defining MASs and the research motivations behind them.
The preference for Machine Learning techniques, and in particular RL, to solve the coordination problem is justified. Section 4 extends the single-agent Q-learning algorithm (introduced in Section 2) to the cooperative multi-agent setting. Section 5 presents the experimental work done in this paper. Three main experiments are performed: grid game 1, grid game 2, and grid game 3. These games represent, through different settings, instances of strongly cooperative, weakly cooperative, and weakly competitive MASs. The main objective of these experiments is the verification of the theoretical results concluded in Sections 3 and 4. Section 6 presents generalization, where a function approximator, a parameterized functional form, is used to represent a joint action. It proposes using a feed-forward Artificial Neural Network to implement the function approximator, with gradient descent as the technique for learning the parameter vector. In Section 7, some concluding remarks and suggestions for future work are presented.

2. Reinforcement Learning
2.1 Basic definitions
Definition 1: Artificial Intelligence: Artificial Intelligence (AI) may be defined as the branch of computer science that is concerned with the automation of intelligent behavior [23].
Definition 2: Machine Learning: Machine Learning studies methods for developing computer systems that can learn from experience. These methods are appropriate in situations where it is impossible to anticipate at design time exactly how a computer system should behave. The ultimate behavior of the system is determined both by the framework provided by the programmer and the subsequent input/output experiences of the program [11].
Definition 3: Supervised Learning: In a supervised learning system, also called learning by a teacher, the learner is provided with a set of example pairs <input, output> for training. The second component of this pair represents the desired output for the particular input given by the first component.
Definition 4: Unsupervised Learning: In an unsupervised learning system, also called learning without a teacher, the learner is provided only with a set of instances or inputs as training examples. The system is not given any external indication as to what the correct responses should be, nor whether the generated responses are right or wrong. Statistical clustering methods, without knowledge of the number of clusters, are examples of unsupervised learning [3].

2.2 Reinforcement Learning
Definition 5: Reinforcement Learning: Reinforcement learning is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning [35].

2.3 The MDP Framework
Definition 6: Markov Decision Process: A Markov decision process (MDP) is a 4-tuple (S, A, T, R) where
i. S is a discrete state space (a finite set of environment states).
ii. A is the agent's action space (the set of actions available to the agent).
iii. T is the state transition function T: S × A × S → [0,1]. It defines a probability distribution over next states as a function of the current state and the agent's action. T can also be defined as T: S × A → PD(S), where PD(S) is the set of discrete probability distributions over the set S.
iv. R is the reward function R: S × A → ℝ (ℝ is the set of real numbers). It defines the expected immediate reward received when selecting an action from the given state.
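Definition 6 can be made concrete in a few lines of code. The sketch below encodes a hypothetical two-state, two-action MDP (all state names, action names, and numbers are illustrative, not taken from this paper), using the T: S × A → PD(S) form of the transition function:

```python
# The 4-tuple (S, A, T, R) of Definition 6 for a hypothetical
# two-state, two-action MDP (illustrative values only).
S = ["s0", "s1"]        # state space
A = ["stay", "move"]    # action space

# T[s][a] is a discrete probability distribution over next states,
# i.e., the T: S x A -> PD(S) form of the transition function.
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}

# Sanity check: every T[s][a] must be a valid probability distribution.
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```

The same dictionaries are reused in the later sketches in this section, so that the algorithms can be compared on one model.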
Solving an MDP consists of finding a stationary policy or plan π: S → PD(A), where PD(A) is the set of discrete probability distributions over the set A. This mapping is such that the agent can achieve his goal. A detailed discussion of the basic elements of the MDP framework is given in the following subsections.

At each of a sequence of discrete time steps (or successive stages of a sequential decision-making process [34]), t = 0, 1, 2, 3, ..., the agent perceives the state of the environment s_t and selects an action a_t. In response to the action, the environment changes to a new state s_{t+1} and emits a scalar reward r_{t+1}. The dynamics of the environment are stationary and Markovian, but are otherwise unconstrained [34]. Stationarity means that the model of the environment, that is, the transition and reward functions, does not change over time. The environment is Markovian if the state transitions are independent of any previous environment states or agent's actions. Formally, the environment is Markovian if the following equation is satisfied at every time step t:

Pr{s_{t+1} = s' | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0} = Pr{s_{t+1} = s' | s_t, a_t}    (1)

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy or plan. The agent's goal is to maximize the total amount of reward he receives over the long run. Let the sequence of rewards received after time step t be r_{t+1}, r_{t+2}, r_{t+3}, .... In general, the goal is to maximize the return R_t, which is defined as some specific function of the reward sequence. In the simplest case the return is just the sum of the rewards:

R_t = r_{t+1} + r_{t+2} + r_{t+3} + ... + r_T    (2)

, where T is a final time step [35]. According to the discounted approach, the agent tries to select actions so that the sum of the discounted rewards he receives over the future is maximized. In particular, he chooses action a_t to maximize the discounted return:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1}    (3)

, where γ ∈ [0,1] is called the discount factor or the discount rate.

2.4 Value Functions
2.4.1 State-value function
A policy π is a mapping from states s ∈ S and actions a ∈ A to the probability π(s,a) of taking action a in state s. Informally, the value of a state s under a policy π, denoted V^π(s), is the expected return when starting in state s and following policy π thereafter. For MDPs, V^π(s) can be defined formally as:

V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }    (4)

, where E_π{·} denotes the expected value given that the agent follows policy π, and t is any time step. Note that the value of the terminal state, if any, is always zero. The function V^π is called the state-value function for policy π [35].

2.4.2 Action-value function
Similarly, the value of taking action a in state s under a policy π, denoted Q^π(s,a), can be defined as the expected return when starting from state s, taking action a, and thereafter following policy π:

Q^π(s,a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }    (5)

The function Q^π is called the action-value function for policy π [35].

2.4.3 Bellman equations
A fundamental property of value functions is that they satisfy particular recursive relationships. For any policy π and any state s, the following condition holds between the value of state s and the values of its possible successor states:

V^π(s) = Σ_{a∈A} π(s,a) [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V^π(s') ]    (6)

Eq. 6 is called the Bellman equation for V^π. It expresses a relationship between the value of a state and the values of its successor states [35]. See Appendix A.1 for the derivation of Eq. 6. Similarly, given a policy π, the following equation holds between the value of any state-action pair (s,a) and the values of possible successor states:

Q^π(s,a) = R(s,a) + γ Σ_{s'∈S} T(s,a,s') V^π(s') , where V^π(s') = Σ_{a'∈A} π(s',a') Q^π(s',a')    (7)

Eq. 7 is the Bellman equation for Q^π.
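The recursion in Eq. 6 can be checked numerically: since γ < 1 makes the right-hand side a contraction, repeatedly applying it to an arbitrary initial V converges to V^π. The sketch below does this for a fixed policy on a hypothetical two-state MDP (all names and numbers are illustrative assumptions, not from the paper):

```python
# Iterative policy evaluation: apply the right-hand side of Eq. 6 until
# V stops changing. The MDP (T, R) and the policy pi are hypothetical.
gamma = 0.9
S = ["s0", "s1"]
A = ["stay", "move"]
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}
pi = {  # pi[s][a] = probability of taking action a in state s
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 1.0, "move": 0.0},
}

def bellman_rhs(V, s):
    # Right-hand side of Eq. 6 for state s.
    return sum(pi[s][a] * (R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in S))
               for a in A)

V = {s: 0.0 for s in S}
for _ in range(1000):   # gamma < 1 makes this update a contraction
    V = {s: bellman_rhs(V, s) for s in S}

# At the fixed point, V satisfies Eq. 6 in every state.
for s in S:
    assert abs(V[s] - bellman_rhs(V, s)) < 1e-9
```

For this particular model and policy, the fixed point works out to V^π(s0) = V^π(s1) = 5.0, which the final assertion confirms is self-consistent under Eq. 6.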
2.5 Optimal Value Functions
Solving a reinforcement learning (an MDP) task means, roughly, finding a policy that achieves a lot of reward over the long run. A precise definition of an optimal policy can be given as follows. Value functions define a partial ordering over policies. A policy π is defined to be better than or equal to a policy π' if its expected return is greater than or equal to that of π' for all states [35]. Formally:

π ≥ π'  ⟺  V^π(s) ≥ V^π'(s)  for all s ∈ S    (8)

, where ≥ denotes the relation better than or equal to. There is always at least one policy that is better than or equal to all other policies [35]. This is an optimal policy. Although there may be more than one, all optimal policies are denoted π*. They share the same state-value function, called the optimal state-value function, denoted V*, defined as follows:

V*(s) = max_{π∈Π} V^π(s)    (9)

, where Π is the space of all possible policies. Optimal policies also share the same optimal action-value function, denoted Q*, defined as follows:

Q*(s,a) = max_{π∈Π} Q^π(s,a)    (10)

For any state-action pair (s,a), this function gives the expected return for taking action a in state s and thereafter following an optimal policy [35]. Because V* is the state-value function for a policy, it must satisfy the condition given by Bellman Eq. 6. Because it is the optimal state-value function, however, V* can be written in a special form without reference to any specific policy. This is the Bellman equation for V*, or the Bellman optimality equation. This equation expresses the fact that the value of a state under an optimal policy must equal the expected return of the best action from that state:

V*(s) = max_{a∈A} [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') ]    (11)

Eq. 11 is the Bellman optimality equation for V* (see Appendix A.2 for its derivation).
The Bellman optimality equation for Q* is:

Q*(s,a) = R(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') , where V*(s') = max_{a'∈A} Q*(s',a')    (12)

2.6 Approaches for solving the RL problem
2.6.1 Dynamic programming
Dynamic programming (DP) refers to the class of algorithms that can be used to compute optimal policies given the model of the environment. The environment's model is fully described by giving the state transition function T and the reward function R. In such cases where the model is known, learning can be performed offline on a simulated environment. Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a known model and because of their computational expense, but they are still very important theoretically [35]. The key idea of DP algorithms is to turn the Bellman optimality equations (Equations 11 and 12) into assignment statements, that is, into update rules for improving approximations of the desired value functions. The next subsection gives an example of a DP algorithm.

Initialize V(s) arbitrarily, e.g., V(s) = 0 for all s ∈ S
Repeat
    Δ ← 0
    For each s ∈ S do:
        v ← V(s)
        V(s) ← max_{a∈A} [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s') ]    (13)
        Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)
Output a deterministic policy π such that:
    π(s) = argmax_{a∈A} [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s') ]    (14)
Algorithm 1: Value iteration

One way to find an optimal policy is to find the optimal value function. It can be determined by a simple iterative algorithm called value iteration. Algorithm 1 gives the complete value iteration algorithm. In this version of the algorithm, iterations are stopped once the value function changes by only a small amount in a sweep. Note the use of the Bellman optimality equation, Eq. 11, as an assignment statement, Eq. 13, to improve the current approximation of the optimal value function. Value iteration can be shown to converge to the optimal value function V* given that each state is visited infinitely often.
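Algorithm 1 is short enough to transcribe directly. The sketch below runs it on a hypothetical two-state MDP (states, actions, and rewards are illustrative assumptions only) and then extracts the greedy policy of Eq. 14:

```python
# Value iteration (Algorithm 1) on a hypothetical two-state MDP.
gamma, theta = 0.9, 1e-10
S = ["s0", "s1"]
A = ["stay", "move"]
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.8, "s1": 0.2}},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}

def backup(V, s, a):
    # The bracketed expression of Eq. 13: R(s,a) + gamma * sum_s' T(s,a,s') V(s')
    return R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in S)

V = {s: 0.0 for s in S}
while True:
    delta = 0.0
    for s in S:
        v = V[s]
        V[s] = max(backup(V, s, a) for a in A)   # Eq. 13 as an assignment
        delta = max(delta, abs(v - V[s]))
    if delta < theta:      # stop once a sweep barely changes V
        break

# Eq. 14: a deterministic greedy policy with respect to the converged V.
policy = {s: max(A, key=lambda a: backup(V, s, a)) for s in S}
```

Under these illustrative numbers the iteration settles on choosing "move" in s0 and "stay" in s1, with V*(s1) = 5.0.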
Updates of the value function given in Algorithm 1 are known as full backups, since they make use of information from all possible successor states. This is contrasted with sample backups, which are critical to the operation of model-free methods such as Q-learning, which will be discussed shortly.

2.6.2 Monte Carlo methods
Unlike the DP approach, Monte Carlo methods do not assume knowledge of the environment's model. They require only experience (sample sequences of states, actions, and rewards from actual or simulated interaction with the environment). At each interaction an experience tuple (s, a, s', r) is formed, where s_t = s, a_t = a, s_{t+1} = s', and r_{t+1} = r. It is assumed that the sequence of these tuples is divided into subsequences called episodes or trials. Each episode terminates in a final goal state at which the total return is computed. Monte Carlo methods are ways of solving the RL problem based on averaging sample returns. It is only upon the completion of an episode that value estimates and policies are changed. Monte Carlo methods are thus incremental in an episode-by-episode sense rather than in a step-by-step sense [35].

Initialize, for all s ∈ S, a ∈ A:
    Q(s,a) ← arbitrary
    π(s) ← arbitrary
    Returns(s,a) ← empty list
Repeat
    Policy Evaluation:
        Generate an episode using the current policy π and an appropriate exploration strategy.
        For each pair (s,a) appearing in the episode:
            R ← r_{t+1} + r_{t+2} + ... + r_T, where t is the first time step at which s_t = s and a_t = a    (15)
            (T is the final time step of the episode, at which the goal state is reached.)
            Append R to Returns(s,a)
            Q(s,a) ← average(Returns(s,a))    (16)
    Policy Improvement:
        For each state s in the episode:
            π(s) ← argmax_{a∈A} Q(s,a)    (17)
Until termination
Algorithm 2: MC policy iteration

In MC policy iteration one maintains both an approximate policy and an approximate value function.
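Algorithm 2 can be sketched in a few dozen lines. The toy task below, a hypothetical three-state corridor with a goal state and a reward of -1 per step (so shorter episodes score higher), is purely illustrative and is not one of the paper's grid games:

```python
import random

# First-visit Monte Carlo policy iteration (Algorithm 2) on a hypothetical
# corridor: states 0, 1, 2, goal state 3, actions move left (-1) or
# right (+1), reward -1 per step. All names and numbers are illustrative.
random.seed(0)
STATES, GOAL, ACTIONS = [0, 1, 2], 3, [-1, +1]

def step(s, a):
    return min(max(s + a, 0), GOAL), -1.0

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
returns = {(s, a): [] for s in STATES for a in ACTIONS}
policy = {s: random.choice(ACTIONS) for s in STATES}

for _ in range(500):
    # Generate an episode with an epsilon-greedy exploration strategy.
    s, episode = 0, []
    while s != GOAL and len(episode) < 100:
        a = random.choice(ACTIONS) if random.random() < 0.3 else policy[s]
        s2, r = step(s, a)
        episode.append((s, a, r))
        s = s2
    # Policy evaluation (Eqs. 15 and 16): first-visit returns, averaged.
    first_visit = {}
    for i, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), i)
    for (s, a), i in first_visit.items():
        G = sum(r for (_, _, r) in episode[i:])
        returns[(s, a)].append(G)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    # Policy improvement (Eq. 17), over the states of the episode.
    for (s, _, _) in episode:
        policy[s] = max(ACTIONS, key=lambda a: Q[(s, a)])

# The learned greedy policy walks straight to the goal in 3 steps.
s, steps = 0, 0
while s != GOAL and steps < 10:
    s, _ = step(s, policy[s])
    steps += 1
assert s == GOAL
```

Note the episode-by-episode structure: Q and π change only after a complete episode, exactly as the text describes.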
The value function is repeatedly altered to more closely approximate the value function of the current policy, and the policy is repeatedly improved with respect to the current value function. These two phases, called policy evaluation and policy improvement, continually alternate until a fixed policy is reached. Thus a sequence of monotonically improving policies and value functions is obtained, as shown in Fig. 1, where E means policy evaluation and I means policy improvement.

Figure 1: The operation of MC policy iteration

Once a policy π has been improved using its action-value function Q^π to yield a better policy π', Q^π' can be computed and used again to yield an even better policy π''. This constitutes a sequence of monotonically improving policies and action-value functions.

2.6.3 Temporal difference learning
Temporal difference (TD) learning is the most important approach for solving the RL planning problem. It gives the RL technique its wide applicability in real practical problems. TD methods have two main features:
(1) Learning is done directly from raw experience without a model of the environment's dynamics.
(2) Updating estimates is based in part on other learned estimates, without waiting for a final outcome as MC methods do. Updates are done on a step-by-step incremental basis.
TD methods use raw experience to estimate the value function. If a state s_t is visited at time t, a TD method updates its estimate V(s_t) based on what happens after that visit at the next time step t+1. At time t+1 it immediately forms a target and makes a useful update using the observed reward r_{t+1} and the estimate V(s_{t+1}). The simplest TD method, known as TD(0), is

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]    (18)

, where α ∈ (0,1] is the learning rate. The update rule in Eq. 18 is known as a sample backup rule, since it is based on a single sample successor state rather than on a complete distribution of all possible successor states.
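The TD(0) rule of Eq. 18 can be demonstrated on a hypothetical corridor task where the true values are known in closed form (γ = 1, a fixed "always move right" policy, and a reward of -1 per step, so V(s) = -(3 - s); all values here are illustrative assumptions):

```python
# TD(0) (Eq. 18) on a hypothetical 4-state corridor: states 0..3,
# state 3 terminal, fixed "always move right" policy, reward -1 per
# step, gamma = 1. The true values are V(s) = -(3 - s).
ALPHA, GAMMA, GOAL = 0.1, 1.0, 3
V = {s: 0.0 for s in range(GOAL + 1)}   # the terminal state's value stays 0

for _ in range(500):                     # 500 episodes of experience
    s = 0
    while s != GOAL:
        s_next, r = s + 1, -1.0          # deterministic transition and reward
        # Eq. 18: move V(s) toward the bootstrapped target r + gamma * V(s').
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

# The estimates approach the true values -3, -2, -1 for states 0, 1, 2.
assert abs(V[0] + 3.0) < 1e-3
```

Each update here is a sample backup in the sense just described: it uses the single observed successor s' rather than an expectation over all successors.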
One of the most important breakthroughs in reinforcement learning was the development of the TD algorithm known as Q-learning [41] (cited in [20]). It is defined by the following update rule:

Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + α_t(s_t, a_t) [ r_{t+1} + γ max_{a∈A} Q_t(s_{t+1}, a) − Q_t(s_t, a_t) ]    (19)

In this case the learned action-value function, Q, approximates Q*, the optimal action-value function, independently of the policy being followed. Although the policy has an effect in that it determines which state-action pairs are visited and updated, all that is required for convergence is that all pairs continue to be updated sufficiently often. The learning rate α should be decayed appropriately, so it is in general a function of the state-action pair (s_t, a_t). Algorithm 3 shows Q-learning in procedural form [35]. The Q-learning algorithm is guaranteed to converge to the optimal Q-values with probability 1 if the following conditions hold [29]:
(1) The environment is stationary and Markovian, that is, the environment can be modeled as a Markov Decision Process.
(2) A lookup table is used to store the Q-values.
(3) Every state-action pair continues to be visited (the issue of exploration).
(4) The learning rate is decreased appropriately over time.

Initialize Q(s,a) arbitrarily
Repeat
    From the current state s, select an action a using a policy derived from the current approximation of Q, or just explore a random action.
    Take action a, and observe the environment's response r and s'.
    Update Q(s,a) based on this experience as follows:
        Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]    (20)
    s ← s'
Until termination
Algorithm 3: Q-learning

Section 4 extends the single-agent Q-learning algorithm to the cooperative multi-agent setting; the new algorithm will be called Extended-Q. In Section 5, Extended-Q is further modified to handle some weakly competitive MASs as well as cooperative ones.

3.
Multi-agent Systems
3.1 Definitions
Definition 7: Agent: An agent can be defined as an entity, such as a robot, a component of a robot, or a software program, with sensation, actions, and goals, situated in a dynamic, stationary and Markovian environment, and capable of acting in a self-interested, autonomous fashion in this environment. The way he acts is called his behavior.
Definition 8: Multi-agent Systems: Multi-agent systems (MASs) are the emerging sub-field of AI that views intelligence and the appearance of intelligent behavior as the result of a complex structural arrangement of several autonomous, independent, interacting agents [2].
Definition 9: Agent Policy or Plan: A policy is a set of decision rules, one for each state the agent may encounter. The policy is said to be non-deterministic if the decision rules are probability distributions over the agent's actions. If all the probability distributions assign probability one to just one action, then the policy is said to be deterministic.
Definition 10: Rationality: An agent in a multi-agent system is said to behave rationally if he always follows an optimal policy given the stationary joint policy of the other agents. (A stationary policy is a policy that does not change over time.)
Definition 11: Ideal Optimality: The ideal optimal value of an agent in a multi-agent system is the maximum possible payoff the agent can gain. An agent in a multi-agent system is said to behave ideally optimally, or achieve ideal optimality, if, when interacting an infinite amount of time (repeated initiation and termination), he can always gain the ideal optimal value; in other words, he can always reach the terminal goal state, and he reaches it with the minimum possible finite number of steps.
The notion of a state in Definition 9 means the state of the external environment as defined in Section 2.
Precise definitions of what is meant by an agent, an environment, the interface between them, and the agent's policy and goal were given in Section 2. The current paper assumes that all agents in a MAS are rational. The property of being rational will be elaborated in terms of the Nash equilibrium concept in Section 4. The current paper also assumes that any sort of communication or negotiation among the agents is forbidden. An important note from Definition 11 is that gaining the maximum payoff is equivalent to reaching the terminal goal state in the minimum number of steps. MASs can be classified according to their joint ideal optimality and joint policy. This classification is shown in Fig. 2.

Figure 2: A taxonomy of MASs

The main characteristic of cooperative MASs is that all agents can together achieve ideal optimality if and only if they are rational, and they can do that by following a stationary joint deterministic policy. This implies that rationality is compatible with ideal optimality in cooperative MASs. Strongly cooperative MASs are cooperative MASs which have the following property: if at least one agent is not rational, then none of the agents in the system can achieve his ideal optimality. In other words, if agent i is achieving his ideal optimality, this implies that all other agents are also achieving their ideal optimality. They all have the same payoff function, or at least the payoff functions are directly proportional. Mostly, the agents in this class work as a team to achieve a single global goal. This class of MASs naturally arises in task distribution [4]. For example, a user might assign some number of autonomous mobile robots, or perhaps software agents, to some task, all of which should share the same payoff function (namely that of the user) [4]. A company or an organization may also be modeled as a strongly cooperative MAS. Another example is a soccer team, where all agents have the shared goal of winning the game.
Weakly cooperative MASs are cooperative MASs which have the following property: there exists a joint policy by which some agents can achieve their ideal optimality while the others cannot. All agents are strongly cooperative only over a portion of their joint action space, while there may be conflicts between two or more agents over other portions. Said another way, the payoff functions are identical, or at least directly proportional, only over a portion of the agents' joint action space. Over this portion all agents can achieve ideal optimality. Mostly, each agent in this class has his own independent goal to achieve, but there is no conflict between the goals of different agents. An example is a traffic system, where each vehicle is an agent with the goal of reaching his destination safely. The main characteristic of competitive MASs is that, given that all agents are rational, one or more agents cannot achieve ideal optimality because of the inherent conflict between some or all of the agents in this class. For agent i to behave ideally optimally, the cost will be paid by other agents who must sacrifice their payoffs, and this cannot happen since the agents are rational. This means that rationality is not compatible with ideal optimality in competitive MASs. So all the agents can do is achieve real optimality.
Definition 12: Real Optimality: The real optimal value of an agent in a multi-agent system is the minimum level of expected payoff the agent can guarantee given that all agents are rational. The real optimal value of the agent is less than or equal to his ideal optimal value.
An agent in a multi-agent system is said to achieve real optimality if, when interacting for an infinite amount of time (repeated initiation and termination) and given that all other agents are rational, he can always, by acting rationally, guarantee a minimum level of expected payoff; in other words, he guarantees reaching the terminal goal state a minimum expected number of times, and he reaches it within a minimum expected number of steps.

Weakly competitive MASs are competitive MASs with the following property: all agents can achieve real optimality by following a stationary deterministic joint policy. Examples are soccer or chess games that end in a draw between the two teams or players. Strongly competitive MASs are competitive MASs with the following property: all agents can achieve real optimality only by following a stationary nondeterministic joint policy. An example is a basketball game, a tennis game, or a ping pong game, where one of the two teams must win. Another example is the soccer world cup championship, where only one of many teams wins the cup while the others lose it.

This paper mainly studies cooperative MASs, which will be defined formally in Section 4. In Section 5 some aspects of weakly competitive MASs are investigated.

3.2 Planning in MASs

One important class of problems in MASs is that of multi-agent planning (or multi-agent sequential decision making).

Definition 13: Single-Agent Planning: Planning for a single agent (single-agent sequential decision making) is the process of constructing an optimal policy with the objective of reaching some terminal goal state given the agent's capabilities and the environmental constraints [37,31].
Definition 14: Multi-agent Planning: Multi-agent planning (multi-agent sequential decision making) consists of multiple single-agent planning problems; however, each agent must also consider the constraints that the other agents' activities place on his choice of actions, the constraints that his commitments to others place on his own choice of actions, and the unpredictable evolution of the environment caused by other unmodeled agents.

The key aspect of multi-agent planning is coordinating the actions of the individual agents so that the goals are achieved efficiently [4]. Without coordination, each agent ignores the existence of other agents and as a result will plan and act separately as if it were a single-agent system. The resulting joint plan in this case may be suboptimal or may even diverge from the desired goals. Since all agents are rational, they are interested in this coordination, since a jointly optimal action is individually optimal for each agent [4].

Section 2 introduced reinforcement learning for single-agent systems and described the Q-learning algorithm. Section 4 extends this algorithm to the cooperative multi-agent setting. Section 5 further extends it to handle some weakly competitive MASs as well as cooperative MASs.

4. Extended-Q Algorithm

4.1 Single-Agent MDP in MASs

One of the conditions that must hold for single-agent Q-learning to converge is that the environment can be modeled as a Markov Decision Process. This implies that the environment dynamics must be stationary; that is, the transition and reward functions do not change over time. These functions are functions of the environment state as perceived by the agent, for example his position within the environment, and of the agent's action, for example a movement in any one of the four directions east, west, north, and south. This condition is very well satisfied in single-agent systems.
In a MAS, however, each agent is an active, learning, adaptable entity that can affect the environment in an arbitrary way. So the transition and reward functions become functions of an environment state that includes the positions of all agents within the environment and of the joint action of all agents. This makes the environment non-stationary from the point of view of each agent. To recover stationarity, each agent must be aware of the existence of the other agents and coordinate his actions with them. Hence Normal-Q must be modified.

There exist multiple levels of awareness an agent may have about the other agents. This paper proposes four levels of awareness, where each level includes all the preceding ones plus an additional feature of its own. These levels are as follows:

Level-1: Each agent considers the other agents as part of the environment state. For the example given above, the environment state consists of all agents' positions within the environment.

Level-2: Each agent considers the payoff functions of the other agents. This implies that each agent considers the actions available to the other agents.

Level-3: Each agent models the behaviors of the other agents. This means that each agent tries to model or predict the learned plans of the others.

Level-4: Each agent tries to model the internal state of the other agents.

The absence of awareness will be referred to as level-0. Level-4 may be needed when the agents within the same MAS use different learning algorithms. The work in this paper assumes that all agents use the same learning algorithm, so level-4 is ignored. Level-3 may be needed in two cases: (1) Some agents are not rational. In this case the best response of the remaining agents may not be the rational behavior; however, to identify this irrationality, the agents' behaviors must be modeled. (2) All agents are rational, but there may exist multiple optimal joint plans to follow. Assuming no communication, there must be a way for all agents to coordinate on the same joint plan.
The only way to do this is by modeling each other's behavior. Behavior modeling is called fictitious play in the terminology of Game Theory [6,9,26,12]. This paper assumes that all agents are rational (with a small deviation in Section 5), so the first case is irrelevant. The problem that appears in the second case is solved by using a pre-imposed lexicographic convention, so that when there are multiple plans each agent knows exactly which plan to choose (assuming cooperative MASs and some weakly competitive MASs, which are investigated in Section 5) without the need to model the others' behaviors. So level-3 is ignored. The work done in this paper considers only level-2 of awareness.

Given the above discussion, it is conjectured that the Normal-Q algorithm, which assumes level-0 awareness, is not guaranteed to converge in multi-agent systems, so it should not be used in such a setting. This conjecture is further justified through the following example.

4.1.1 An illustrative example

Consider a MAS with two agents A and B. Each uses the Normal-Q algorithm to learn his optimal policy. This means that each agent is unaware of the other one (level-0 awareness). Consider the situation depicted in Fig. 3. At a certain moment in time the environment enters state s. Agent A decides to take action a1 while B takes action b1. This joint choice of actions turns out to be beneficial (in its immediate and future consequences) to agent A. So A updates his Q-entry for the state-action pair (s,a1) accordingly. At a later moment in time the environment returns to state s; agent A decides to exploit a1, which previously proved to be good, while B decides to explore a new action b2. Unfortunately, this joint choice of actions proves (in its immediate and future consequences) to be very harmful to A, and he receives high penalties. Agent A responds by changing the same Q-entry for (s,a1) accordingly. This oscillation of Q(s,a1) continues as long as A is not aware of what B is doing.
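This oscillation can be reproduced in a few lines of code. The following sketch is illustrative only: the reward values (+1 when B plays b1, −1 when B plays b2), B's alternating exploration pattern, and the learning rate are hypothetical, and the successor-state term of the Q-learning update is dropped for brevity:

```python
# Agent A runs single-agent Q-learning (level-0 awareness) on his
# Q(s, a1) entry while agent B's unobserved action flips A's reward.
alpha = 0.5      # learning rate (hypothetical)
q_s_a1 = 0.0     # A's estimate of Q(s, a1)
history = []

# B alternates between b1 (reward +1 for A) and b2 (reward -1 for A);
# A cannot see this, so the same entry is pushed in both directions.
for b_action in ["b1", "b2"] * 10:
    reward = 1.0 if b_action == "b1" else -1.0
    q_s_a1 += alpha * (reward - q_s_a1)   # no successor term, for brevity
    history.append(round(q_s_a1, 3))

print(history)  # the estimate keeps swinging and never settles
```

The printed sequence keeps alternating in sign, mirroring the oscillation of Q(s,a1) described above.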
The overall consequence is that the Q-function will oscillate, leading to the oscillation of the agents' policies, and hence no convergence occurs.

Figure 3: Agent A is confused about the significance of his action a1

As mentioned before, the essential difference between planning in MASs and in single-agent systems is the issue of coordination. Without coordination, the agents may converge to sub-optimal plans or even diverge. Coordination requires that each agent has at least level-2 awareness, that is, that he be aware of the actions of the others. The Normal-Q algorithm lacks this level of awareness.

4.2 Multi-agent Markov Decision Process (MMDP)

Naturally, the first thing to start with is to formulate the multi-agent planning problem. To accomplish this, the MDP model of single-agent systems is extended to the multi-agent setting. The extended model is called a Multi-agent Markov Decision Process (MMDP). It is defined as follows:

Definition 15: Multi-agent Markov Decision Process: A multi-agent Markov decision process (MMDP) is a 5-tuple (n, S, A, T, R) where: n is the number of agents; S is a discrete state space (a finite set of environment states); A is the agents' joint action space, A = A1 × A2 × … × An, where Ai is the set of actions (also called strategies) available to agent i (A may also be called the set of joint action (strategy) profiles); T is the state transition function, T: S × A × S → [0,1], which defines a probability distribution over next states as a function of the current state and the agents' joint action (T can also be defined as T: S × A → PD(S), where PD(S) is the set of discrete probability distributions over the state space S); and R is a set of reward functions {Ri}, i = 1, …, n, one per agent, where Ri: S × A → ℝ (ℝ is the set of real numbers). Ri defines the expected immediate reward received by agent i when all agents select a joint action from the given state.
R can also be defined as one global reward function R: S × A → ℝⁿ, where the ith component of the reward vector defines the expected immediate reward received by agent i. The agents-environment interaction is depicted in Fig. 4. At time t the agents sense the environment state s_t and accordingly take the joint action (a_t^1, a_t^2, …, a_t^i, …, a_t^n). The environment responds with the new state s_{t+1} and the reward vector (r_{t+1}^1, r_{t+1}^2, …, r_{t+1}^i, …, r_{t+1}^n).

4.3 Solving Multi-agent Markov Decision Process

Solving an MMDP consists of finding a joint stationary policy or plan π: S → PD(A), where PD(A) is the set of discrete probability distributions over the joint action space A. This mapping is such that each agent can achieve his goal.

Figure 4: The agents-environment interaction

The solution concept of the MMDP planning problem assumed in this paper combines ideas from two fields: (1) Reinforcement Learning: the multi-agent planning problem is a multi-step model-free problem, so incremental experience-based learning is required. (2) Game Theory: the concept of matrix games is used to reformulate the multi-agent planning problem and to provide a new meaning to the Q-function. The Nash equilibrium solution concept of matrix games provides the essence of the agents' joint action selection at each state. RL has already been introduced in Section 2. The next section introduces matrix games.

4.4 Matrix Games (MGs)

Definition 16: Matrix Game: A matrix game is a 3-tuple (n, A_{1…n}, P_{1…n}) where: n is the number of agents (players); Ai is the action space of agent i (the set of actions, also called strategies, available to agent i), with A = A1 × A2 × … × An the joint action space; and Pi is agent i's payoff function, Pi: A → ℝ (where ℝ is the set of real numbers). Pi represents the utility of agent i given all agents' actions. It is assumed that all agents act simultaneously. The agents select actions from their available sets with the goal of maximizing their payoffs, which depend on all agents' actions.
These are often called matrix games since the Pi functions can be written as n-dimensional matrices [6]. A matrix game can be modeled as an MMDP with the state space S containing only one state and the reward function Ri of agent i corresponding to the payoff function Pi of the same agent. (If the game is played repeatedly, then maximizing the immediate reward is the same as maximizing the total cumulative discounted reward, since the single state makes transitions only to itself.) So matrix games are single-state MASs, and therefore the taxonomy of MASs can be applied to them. The ideal optimal value for agent i in a matrix game is defined as follows:

Pi* = max_{ā∈A} Pi(ā)    (21)

The following subsections give representative examples of the different classes of matrix games.

4.4.1 The coin matching game

There are two agents (n = 2), each with two available actions: head or tail, that is A1 = A2 = {H,T}. If both agents make the same choice, (H,H) or (T,T), then agent 1 wins a dollar from agent 2. If their choices differ, then agent 1 loses a dollar to agent 2. The payoff matrices for both agents are shown in Fig. 5. The first component of each ordered pair represents the payoff of agent 1 (the row agent) while the second component represents the payoff of agent 2 (the column agent) [6]. This game is an example of a competitive matrix game, and specifically a strongly competitive one as will be shown later (a strongly competitive single-state MAS). The ideal optimal value of the game for each agent, as defined by Eq. 21, is 1, but there is no joint action profile (a joint deterministic policy) that can achieve this ideal optimal value for both agents.

Figure 5: Agents' payoffs in the coin matching game

4.4.2 The prisoner's dilemma

Two well-known crooks are captured and separated into different rooms. The district attorney knows he does not have enough evidence to convict on the serious charge of his choice, but offers each prisoner a deal.
If just one of them will turn state's evidence (i.e., rat on his confederate), then the one who confesses will be set free, and the other sent to jail for the maximum sentence. If both confess, they are both sent to jail for the minimum sentence. If both exercise their right to remain silent, then the district attorney can still convict them both on a very minor charge. Fig. 6.a shows the payoff matrix of this situation.

Figure 6: The prisoner's dilemma matrix game, a) a particular instantiation, b) the general form

Here n = 2 and A1 = A2 = {C̄, C} (for not confess and confess, respectively). In the payoff matrix in Fig. 6.a, the payoffs are chosen such that the most disagreeable outcome (the maximum sentence) has value 0, and the next most disagreeable outcome (the minimum sentence) has value 1. Being convicted on a minor charge has value 3, and being set free has value 4. The prisoner's dilemma game belongs to a general subclass of matrix games that have the payoff structure shown in Fig. 6.b, where y > x > v > u [29]. All matrix games in this subclass are competitive, and specifically weakly competitive as will be shown later (weakly competitive single-state MASs). The ideal optimal value of the game for each agent, as defined by Eq. 21, is y, but there is no joint action profile (a joint deterministic policy) that can achieve this optimal value for both agents. An interesting property of the prisoner's dilemma subclass of matrix games is that the action a2 for the row agent and b2 for the column agent are dominating actions (y > x and v > u), so each agent may choose to play this action without thinking about what the other agent will do or what his payoffs are. The resulting joint action (a2,b2) yields a payoff of v to each agent. Later in this Section this joint action profile is shown to represent the only possible rational play for both agents.

4.4.3 Cooperative games

Consider the matrix games shown in Fig. 7. In both games n = 2, A1 = {a1,a2}, and A2 = {b1,b2}.
Both agents can together achieve their ideal optimal value of 5 in both games by adopting the deterministic joint strategy profile (a1,b1). This strategy profile is rational, as will be shown later. So these two examples represent cooperative matrix games.

Figure 7: Cooperative games, a) strongly cooperative, b) weakly cooperative

The two agents' payoff functions shown in Fig. 7.a are directly proportional over the whole joint action space (actually they are identical), so it is a strongly cooperative matrix game. This is not the case in Fig. 7.b, which represents a weakly cooperative matrix game.

4.5 Nash Equilibrium Solution of Matrix Games

Solving a matrix game means computing optimal strategies for all agents. An agent's strategy is called a pure strategy if he chooses his action deterministically, whereas it is called a mixed strategy if he selects his action non-deterministically according to some probability distribution over his action space. The Nash equilibrium concept is a formal definition of the assumption, followed throughout this paper, that all agents in a MAS are rational. In this section the definition is given for a special type of MAS: matrix games. In the next section the Nash equilibrium concept is extended to cooperative MASs by decomposing them into interdependent matrix games.

Definition 17: The Best-Response Function: For a matrix game, the best-response function for agent i, BR_i(σ_{-i}), is the set of all, possibly mixed, strategies for agent i that are optimal given that the other agents choose the possibly mixed joint strategy σ_{-i} [6].

Definition 18: Nash Equilibrium: A Nash equilibrium (best-response equilibrium) is a collection of strategies (σ_1, …, σ_i, …, σ_n), called a strategy profile, for all n agents, with:

∀i: σ_i ∈ BR_i(σ_{-i})    (22)

that is, each agent's strategy is a best response to the other agents' joint strategy.
Nash equilibrium can also be defined more elaborately in terms of the payoff functions as follows:

Definition 19: Pure Nash Equilibrium: An n-tuple of pure strategies (σ_1*, …, σ_i*, …, σ_n*) is a pure Nash equilibrium strategy profile if:

∀i: ∀σ_i ∈ A_i: P_i(σ_1*, …, σ_i*, …, σ_n*) ≥ P_i(σ_1*, …, σ_i, …, σ_n*)    (23)

Definition 20: Mixed Nash Equilibrium: An n-tuple of mixed strategies (σ_1*, …, σ_i*, …, σ_n*) is a mixed Nash equilibrium strategy profile if:

∀i: ∀σ_i ∈ Σ_i:
∑_{a_1∈A_1} … ∑_{a_n∈A_n} σ_1*(a_1) … σ_i*(a_i) … σ_n*(a_n) P_i(a_1, …, a_n) ≥ ∑_{a_1∈A_1} … ∑_{a_n∈A_n} σ_1*(a_1) … σ_i(a_i) … σ_n*(a_n) P_i(a_1, …, a_n)    (24)

where Σ_i is the set of all mixed strategies of agent i, that is, the set of all discrete probability distributions over the action space of agent i.

So a Nash equilibrium is a collection of strategies for all agents such that each agent's strategy is a best response to the other agents' strategies. No agent can do better by changing his strategy given that the other agents continue to follow the equilibrium strategies. What makes the notion of equilibrium compelling is that all matrix games have a Nash equilibrium, although there may be more than one.

Theorem 4.1: Given any finite n-player matrix game, there exists at least one Nash equilibrium strategy profile (Nash 1951, cited in [31]).

Finite games are games where the action spaces of all agents are finite. It is very important to note that a pure Nash equilibrium can be thought of as a mixed Nash equilibrium where each of the probability distributions that make up the joint strategy profile assigns probability 1 to exactly one action. But in this paper a clear distinction between the notions of pure and mixed (or deterministic and non-deterministic) is crucial; this means that a mixed profile is taken to be strictly non-deterministic. Unless otherwise stated, for the rest of this Section assume that for any two-agent matrix game the row agent is called agent A and the column agent is called agent B.
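Definition 19 can be checked mechanically for small games by testing every unilateral deviation. The sketch below (function and variable names are illustrative) applies the test to the coin matching game of Section 4.4.1, confirming that it has no pure equilibrium:

```python
# Minimal check of Definition 19: a joint pure action is a Nash
# equilibrium if no agent can gain by unilaterally deviating.
from itertools import product

def is_pure_nash(payoffs, profile, action_sets):
    """payoffs[i](joint) -> payoff of agent i; profile is a tuple of actions."""
    for i, actions in enumerate(action_sets):
        current = payoffs[i](profile)
        for alt in actions:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if payoffs[i](deviated) > current:
                return False   # agent i profits from deviating
    return True

# Coin matching: agent 0 wins 1 on a match and loses 1 on a mismatch;
# agent 1's payoff is the negation (a zero-sum game).
p0 = lambda joint: 1 if joint[0] == joint[1] else -1
p1 = lambda joint: -p0(joint)
actions = [("H", "T"), ("H", "T")]

pure_equilibria = [prof for prof in product(*actions)
                   if is_pure_nash([p0, p1], prof, actions)]
print(pure_equilibria)   # [] -- the coin matching game has no pure equilibrium
```

The empty result matches the observation made next for Fig. 8.a.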
The following subsections identify the nature of the Nash equilibrium solution concept in each subclass of matrix games by applying it to the examples given in Section 4.4.

4.5.1 Nash equilibrium in competitive games

Fig. 8.a shows the payoff matrix of the coin matching game, the competitive matrix game introduced above. Clearly, there is no pure Nash equilibrium solution to this game. But according to Theorem 4.1, there must be at least one mixed Nash equilibrium profile. One such profile, shown in Fig. 8.a, is ((0.5,0.5),(0.5,0.5)). To verify that this profile is an equilibrium, assume that both agents adopt the joint profile ((p,1-p),(0.5,0.5)). Agent A wants to find the best value of p so as to maximize his expected payoff. The expected payoff of A given this joint profile is:

EP(A) = 0.5p − 0.5p − 0.5(1−p) + 0.5(1−p) = 0

So whatever the value of p, agent A achieves his maximum expected value of zero, so A's mixed strategy σ_A = (0.5,0.5) is a best response to B's mixed strategy σ_B = (0.5,0.5). In the same way, σ_B = (0.5,0.5) can be shown to be a best response to σ_A = (0.5,0.5). So the mixed joint strategy profile ((0.5,0.5),(0.5,0.5)) is a Nash equilibrium profile.

Figure 8: Nash equilibrium in competitive games, a) the coin matching game, b) the subclass of prisoner's dilemma games.

In this game, although both agents are rational, they cannot achieve their ideal optimality. So the coin matching game is a competitive matrix game. All each agent can do is achieve a real optimal value (a minimum expected level) of 0, and he can do so only by following a mixed equilibrium profile (there are no pure equilibrium profiles), so it is a strongly competitive matrix game.

Fig. 8.b shows the general payoff matrix of the prisoner's dilemma game (y > x > v > u). There is only one pure Nash equilibrium solution: (a2,b2). This rational profile cannot achieve the ideal optimal value y for either agent. So this game is a competitive matrix game.
However, through this pure equilibrium profile each agent guarantees a real optimal value of v. So the subclass of prisoner's dilemma matrix games is weakly competitive. An interesting question is whether the prisoner's dilemma matrix game has a mixed Nash equilibrium strategy profile. To examine this, assume that both agents adopt the mixed profile ((p,1-p),(q,1-q)). Agent A wants to find the best value of p so as to maximize his expected payoff. The expected payoff of A given this joint profile is:

EP(A) = pqx + p(1−q)u + q(1−p)y + (1−p)(1−q)v
      = pq(x − u − y + v) + p(u − v) + q(y − v) + v
      = p(qx − qu − qy + qv + u − v) + q(y − v) + v
      = p[q(x − y) + (1−q)u − (1−q)v] + q(y − v) + v
      = p[q(x − y) + (1−q)(u − v)] + q(y − v) + v    (25)

The only term in Eq. 25 where p appears is the first term, where it is a factor. The bracketed coefficient is negative since y > x and v > u. Agent A's goal is to maximize this term by playing with p. Since p ∈ [0,1], the best value for p is clearly zero, which eliminates the negative effect of this term. Following the same line of reasoning, it can be shown that given the profile ((p,1-p),(q,1-q)), the best value that agent B can assign to q is also zero. These results lead to the following conclusions: (1) There does not exist any mixed Nash equilibrium strategy profile in the subclass of prisoner's dilemma matrix games. (2) Any agent can always do better by playing his second action (a2 for agent A and b2 for agent B) regardless of the strategy of the other agent. This is self-justified since, as pointed out above, the second action of each agent is a dominating action.

Some interesting questions arise about the class of competitive matrix games, or competitive MASs in general. For example, if in a competitive matrix game there are multiple Nash equilibrium strategy profiles, which profile(s) is (are) best in terms of giving the highest real optimal value for all agents? How can such profile(s) be found? And how can all agents coordinate on the same one of these profiles?
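Before turning to these questions, the conclusion drawn from Eq. 25 can be verified numerically for the particular instantiation of Fig. 6.a (x = 3, u = 0, y = 4, v = 1): for every strategy q of agent B, agent A's expected payoff is maximized at p = 0, i.e., by always playing the dominating action a2. A small sketch:

```python
# Expected payoff of the row agent A under the mixed joint profile
# ((p, 1-p), (q, 1-q)), expanded exactly as in Eq. 25, for the
# prisoner's dilemma payoffs of Fig. 6.a: x = 3, u = 0, y = 4, v = 1.
x, u, y, v = 3.0, 0.0, 4.0, 1.0

def ep_a(p, q):
    return p*q*x + p*(1 - q)*u + q*(1 - p)*y + (1 - p)*(1 - q)*v

grid = [i / 10 for i in range(11)]       # p, q sampled on a coarse grid
for q in grid:
    # For every q, the maximizing p on the grid is 0: a2 dominates a1.
    best_p = max(grid, key=lambda p: ep_a(p, q))
    assert best_p == 0.0
print("p = 0 maximizes EP(A) for every q")
```

The grid search confirms the analytic argument: the coefficient of p in Eq. 25 is negative for every q, so p = 0 is always optimal.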
Another important question is: is the Nash equilibrium concept (rational agents) the best solution concept for solving competitive matrix games? The prisoner's dilemma subclass of games shows a weakness in this concept: the non-equilibrium profile (a1,b1) gives a better value, x, to both agents. Profiles such as (a1,b1) are called Pareto optimal. Pareto optimality is an evaluative principle according to which the whole community of agents becomes better off if at least one agent becomes better off and none becomes worse off. Other interesting questions concern specifically the weakly competitive subclass: can there be any mixed Nash equilibrium strategy profile in a weakly competitive matrix game? As shown above, the prisoner's dilemma subclass of games has no mixed equilibria. In Section 5 another example of a weakly competitive game is shown to have only pure equilibria. If there is a mixed equilibrium in some weakly competitive game, may it give a better real optimal value for all agents? All of these questions need investigation.

4.5.2 Nash equilibrium in cooperative games

Fig. 9 shows Nash equilibrium solutions in a strongly cooperative matrix game. As shown in Fig. 9.a, there are multiple pure strategy profiles that can easily be verified to be Nash equilibria; these are (a1,b1), (a2,b2), and (a3,b3). The first and third profiles give the ideal optimal value of the game, 5, to both agents, while the second profile gives the real optimal value of 1 to both agents. On the other hand, Fig. 9.b shows a mixed equilibrium profile: ((0.5,0.0,0.5),(0.5,0.0,0.5)). To verify that this profile is an equilibrium, assume that both agents adopt the profile ((p1,p2,p3),(0.5,0.0,0.5)). Agent A wants to find the best values of p1, p2, and p3 so as to maximize his expected payoff.
The expected payoff of A given this joint profile is:

EP(A) = 2.5(p1 + p3)    (26)

Figure 9: Equilibrium solutions in a strongly cooperative matrix game, a) pure equilibria, b) mixed equilibrium.

So for A to maximize his expected payoff, p1 + p3 must equal its maximum value of 1. So any mixed profile σ_A = (p, 0.0, 1−p) is a best response to B's mixed profile σ_B = (0.5,0.0,0.5). Hence σ_A = (0.5,0.0,0.5) is a best response to σ_B = (0.5,0.0,0.5). In the same way, σ_B = (0.5,0.0,0.5) can be shown to be a best response to σ_A = (0.5,0.0,0.5). Hence the joint mixed strategy profile ((0.5,0.0,0.5),(0.5,0.0,0.5)) is a Nash equilibrium with value 2.5, which is a real optimal value.

Note that both agents in the game shown in Fig. 9 can achieve their ideal optimality if and only if they are rational, and they can do so by following a joint pure equilibrium profile. So it is a cooperative matrix game. If one agent is not rational, in that he does not follow any equilibrium profile, then neither agent can achieve his ideal optimality (all ideal optimal profiles for both agents are equilibrium profiles). So it is a strongly cooperative matrix game.

To complete the discussion of cooperative matrix games, an example of a weakly cooperative game is shown in Fig. 10. Fig. 10.a shows two pure equilibrium strategy profiles, (a1,b1) and (a3,b3), which can easily be verified. Both of these pure equilibrium profiles are ideal optimal for both agents, with value 5. Fig. 10.b shows the mixed equilibrium profile ((0.5,0.0,0.5),(0.5,0.0,0.5)). To verify that this profile is an equilibrium, assume that both agents adopt the profile ((p1,p2,p3),(0.5,0.0,0.5)). Agent A wants to find the best values of p1, p2, and p3 so as to maximize his expected payoff. The expected payoff of A given this joint profile is:

Figure 10: Equilibrium solutions in a weakly cooperative matrix game, a) pure equilibria, b) mixed equilibrium.
EP(A) = 2.5p1 − 2.5p2 + 2.5p3
      = 2.5(p1 − p2 + p3)
      = 2.5[p1 − (1 − p1 − p3) + p3]
      = 2.5[2(p1 + p3) − 1]    (27)

So for A to maximize his expected payoff, p1 + p3 must equal its maximum value of 1. So any mixed profile σ_A = (p, 0.0, 1−p) is a best response to B's mixed profile σ_B = (0.5,0.0,0.5). Hence σ_A = (0.5,0.0,0.5) is a best response to σ_B = (0.5,0.0,0.5). In the same way, σ_B = (0.5,0.0,0.5) can be shown to be a best response to σ_A = (0.5,0.0,0.5). Hence the joint mixed strategy profile ((0.5,0.0,0.5),(0.5,0.0,0.5)) is a Nash equilibrium with a real optimal value of 2.5.

Note that both agents in the game shown in Fig. 10 can achieve their ideal optimality if and only if they are rational, and they can do so by following a joint pure equilibrium profile. So it is a cooperative matrix game. However, some ideal optimal profiles of one agent are not ideal optimal for the other. For example, the joint profile (a1,b2) is ideal optimal for agent A only. Such profiles are not equilibria. So it is a weakly cooperative matrix game. It can be concluded that the above two cooperative matrix games have multiple pure and mixed Nash equilibrium solutions, some of which are ideal optimal for both agents while the others are real optimal for both agents.

4.6 MGs Taxonomy: A Formal Definition

Since matrix games are a special type of MAS (single-state MASs), the MASs taxonomy given in Section 3 can be applied to them. In this section each subclass of MGs is defined formally according to the notions given in Section 3 and the above discussion. Let n be the number of agents involved in a matrix game. Each agent is identified by an id ∈ L = {1,2,…,n}. Let

∀i ∈ L: Ti = {ā* ∈ A : ā* = argmax_{ā∈A} Pi(ā)}    (28)

that is, Ti is the set of all ideal optimal joint actions for agent i, the joint actions that enable agent i to receive his ideal optimal value of the game. Let D be the set of pure Nash equilibrium strategy profiles.
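For small games, the sets Ti of Eq. 28 and the set D can be computed by enumeration, which makes the taxonomy mechanically checkable. A two-agent sketch (the payoff matrices encode Fig. 5 and the Fig. 6.a instantiation; the classification branches paraphrase the formal definitions of this section and return "unclassified" for games the definitions do not cover):

```python
# Sketch of the matrix-game taxonomy for two agents.
# payoffs[i][a][b] is agent i's payoff for the joint action (a, b).
def classify(payoffs):
    n0, n1 = len(payoffs[0]), len(payoffs[0][0])
    joints = [(a, b) for a in range(n0) for b in range(n1)]
    # T_i: agent i's ideal optimal joint actions (Eq. 28)
    T = []
    for i in range(2):
        best = max(payoffs[i][a][b] for a, b in joints)
        T.append({(a, b) for a, b in joints if payoffs[i][a][b] == best})
    # D: pure Nash equilibria (Definition 19)
    D = {(a, b) for a, b in joints
         if all(payoffs[0][a][b] >= payoffs[0][a2][b] for a2 in range(n0))
         and all(payoffs[1][a][b] >= payoffs[1][a][b2] for b2 in range(n1))}
    if T[0] & T[1] & D:                     # a shared ideal optimal equilibrium
        if T[0] == T[1] and T[0] <= D:      # identical ideal optimal sets
            return "strongly cooperative"
        return "weakly cooperative"
    if not (T[0] & T[1]):                   # no common ideal optimal profile
        return "weakly competitive" if D else "strongly competitive"
    return "unclassified"

# Coin matching (Fig. 5) and the prisoner's dilemma instantiation (Fig. 6.a):
coin = [[[1, -1], [-1, 1]], [[-1, 1], [1, -1]]]
pd   = [[[3, 0], [4, 1]], [[3, 4], [0, 1]]]
print(classify(coin))   # strongly competitive
print(classify(pd))     # weakly competitive
```

The two printed results agree with the analysis of Section 4.5.1.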
Then a definition of a cooperative matrix game can be formulated as follows.

Definition 21: Cooperative Matrix Game: A cooperative matrix game is a matrix game with the following property:

∃ā1* ∈ T1: ∃ā2* ∈ T2: … : ∃ān* ∈ Tn: ∃b̄* ∈ D: ā1* = ā2* = … = ān* = b̄*    (29)

that is, there is a pure Nash equilibrium strategy profile by which all agents can receive their ideal optimal values.

Definition 22: Strongly Cooperative Matrix Game: A strongly cooperative matrix game is a cooperative matrix game with the following property:

T1 = T2 = … = Tn = T ⊆ D    (30)

that is, all agents have the same ideal optimal strategy profiles, and these profiles belong to the set of pure Nash equilibrium strategy profiles.

Definition 23: Weakly Cooperative Matrix Game: A weakly cooperative matrix game is a cooperative matrix game with the following property:

∃i: ∃j: i ≠ j ∧ Ti ≠ Tj    (31)

that is, there are some ideal optimal profiles for at least one agent that are not ideal optimal for at least one of the other agents.

Definition 24: Competitive Matrix Game: A competitive matrix game is a matrix game with the following property:

∀ā ∈ A: ā ∉ T1 ∨ ā ∉ T2 ∨ … ∨ ā ∉ Tn    (32)

that is, there does not exist a common ideal optimal profile by which all agents can achieve their ideal optimality.

Definition 25: Weakly Competitive Matrix Game: A weakly competitive matrix game is a competitive matrix game with the following property:

D ≠ ∅    (33)

that is, there exists at least one pure Nash equilibrium strategy profile.

Definition 26: Strongly Competitive Matrix Game: A strongly competitive matrix game is a competitive matrix game with the following property:

D = ∅    (34)

that is, there does not exist any pure Nash equilibrium profile; the matrix game has only mixed equilibria.

4.7 Solving Cooperative MGs

Solving a cooperative matrix game means finding one and the same ideal optimal pure Nash equilibrium strategy profile. But there may exist many such profiles, causing the problem of equilibrium selection.
All agents must coordinate on the same equilibrium profile, otherwise the resulting joint action may not be an equilibrium. For example, in Fig. 9.a, if agent A coordinates on the first ideal optimal pure equilibrium (a1,b1) while B coordinates on the second one (a3,b3), then the actual joint action taken will be (a1,b3), which is not an equilibrium, and each agent receives only 0. The problem of equilibrium selection can be solved by modeling the behavior of the other agents (fictitious play, assuming repeated play of the matrix game), that is, by each agent having level-3 awareness. But as mentioned before, this paper assumes that all agents have only level-2 awareness, so fictitious play is not possible. Instead, the problem of equilibrium selection is solved by adopting a pre-imposed convention called the lexicographic convention [4], which is described in the following subsection.

4.7.1 The lexicographic convention

This is a simple convention that can be applied quite generally. The basis of this convention is that the system designer gives all agents the ability to identify each other. Three assumptions allow this convention to be imposed: (1) The set of agents is ordered. (2) The set of actions available to each agent i is ordered. (3) These orderings are known by all agents. Given this information, the lexicographic convention works as follows so that all agents coordinate on the same pure equilibrium strategy profile. Each agent i extracts all ideal optimal pure equilibrium strategy profiles, i.e., the profiles that satisfy Eq. 29; in each of these profiles the actions of the lower-order agents come first. Then each agent i sorts these profiles lexicographically, that is, starting from the actions of the lower-order agents, the profiles with the lower-order actions come first. Finally, each agent i adopts the ith component of the first profile as his action. Algorithm 4 outlines the procedure for solving a cooperative matrix game for agent i.
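In code, the convention reduces to a lexicographic sort followed by an index lookup. A minimal sketch (the function name is illustrative, and profiles are assumed to be already encoded as tuples of action indices in the agreed agent and action orderings):

```python
# Sketch of the lexicographic convention: every agent independently
# sorts the ideal optimal pure equilibrium profiles and adopts its own
# component of the first one, so all agents pick the same profile.
def solve_cooperative_mg(ideal_optimal_equilibria, agent_index):
    """ideal_optimal_equilibria: list of joint-action tuples (action
    indices in the pre-imposed orderings); returns agent_index's action."""
    if not ideal_optimal_equilibria:
        raise ValueError("not a cooperative matrix game")
    first = sorted(ideal_optimal_equilibria)[0]   # lexicographic sort
    return first[agent_index]

# Fig. 9.a example: the ideal optimal pure equilibria are (a1,b1) and
# (a3,b3); encoding a1,a2,a3 / b1,b2,b3 as indices 0,1,2:
profiles = [(2, 2), (0, 0)]
print(solve_cooperative_mg(profiles, 0))  # 0 -> agent A plays a1
print(solve_cooperative_mg(profiles, 1))  # 0 -> agent B plays b1
```

Because both agents sort the same list with the same orderings, they necessarily land on the same profile, here (a1,b1).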
(1) Extract all ideal optimal pure equilibrium strategy profiles (profiles that satisfy Eq. 29).
(2) Sort these profiles lexicographically according to the convention pre-imposed by the system designer.
(3) Adopt the ith component of the first profile, ā1[i], as agent i's action decision.
(4) Play action ā1[i].
Algorithm 4: Solving a cooperative MG for agent i

For example, in the cooperative matrix game shown in Fig. 9.a, each agent extracts the ideal optimal pure equilibrium profiles (a3,b3) and (a1,b1), where agent A precedes agent B in the agent ordering. Then each agent sorts these profiles lexicographically, yielding the ordered list (a1,b1), (a3,b3), since action a1 precedes a3 in the action ordering. Both agents then coordinate on the first equilibrium profile (a1,b1): agent A adopts a1 as his decision and B adopts b1 as his decision. Algorithm 4 is extended in Section 5 to handle some weakly competitive matrix games as well as cooperative ones.

4.8 Cooperative Multi-agent Markov Decision Process (CMMDP)

It was mentioned above that a matrix game can be modeled as a single-state MMDP. This idea can be reversed: a state in an MMDP can be modeled as a matrix game, so an MMDP can be redefined as a set of interdependent matrix games. This new view of an MMDP is formalized for cooperative MASs as follows.

Definition 27: Cooperative Multi-agent Markov Decision Process (CMMDP): A Cooperative Multi-agent Markov Decision Process (CMMDP) is a 5-tuple (n, G, A, T, R) where: n is the number of agents. G = {Mi}, i = 1, …, m, is a set of interdependent cooperative matrix games as defined by Eq. 29 (m corresponds to the number of states, that is, |S|). A is the agents' joint action space, A = A1 × A2 × … × An, where Aj is the set of actions available to agent j. T is the probability transition function between the matrix games, T : G × A × G → [0,1]. It defines a probability distribution over the next matrix games to be played as a function of the current matrix game and the agents' joint action.
R is the joint reward function, R : G × A → ℝⁿ. It defines a vector of expected immediate rewards for all agents when they take a particular joint action in a particular matrix game. Each matrix game Mi ∈ G is a 3-tuple (n, A1…n, Pi1…n), where Pij, agent j's payoff function in MG Mi, Pij : A → ℝ, is defined as follows:

Pij(ā) = Rj(ā) + γ Σ_{k=1}^{m} T(Mi, ā, Mk) Oj(Mk) (35)

Oj(Mk) = max_{ā∈A} Pkj(ā) (36)

where γ is the discount factor. Eq. 36 defines the ideal optimal value of matrix game Mk for agent j. In a CMMDP the environment's dynamics, or the environment's model, consists of: (1) the agents' payoff functions of the matrix games and (2) the transition probabilities between the matrix games. Both of these components are assumed to be unknown in the cooperative multi-agent planning problem. The Bellman optimality equation for Q* has to be modified to satisfy level-2 awareness for cooperative MASs. The new cooperative Bellman optimality equation for Q* is as follows:

Q*j(si, ā) = Rj(si, ā) + γ Σ_{k=1}^{m} T(si, ā, sk) V*j(sk) (37)

V*j(sk) = max_{ā'∈A} Q*j(sk, ā') (38)

where Q*j is the optimal action-value function of agent j and V*j is the optimal state-value function of agent j. The first parameter of Q* is the environment's state, which must now express the environment from the viewpoint of all agents. The second parameter ā is a vector of all agents' actions. An important remark to make here is the similarity between the two equation pairs (35, 36) and (37, 38). Q*j now has a new interpretation: it estimates the payoff functions of agent j in the different matrix games of the planning problem, and likewise V*j estimates the values of these matrix games.

4.9 Solving CMMDP

The above definition of CMMDP represents a mathematical framework for cooperative MASs, in which all agents can achieve their ideal optimality. The class is divided into two subclasses: strongly cooperative MASs and weakly cooperative MASs.
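When the model (R, T) is known, Eqs. 37 and 38 can be solved by a straightforward per-agent value iteration. The following is a minimal sketch under assumed array shapes (R[j] an |S| × |A| array of agent j's expected rewards over joint actions, T an |S| × |A| × |S| transition array; both hypothetical, for illustration only):

```python
import numpy as np

def cooperative_value_iteration(R, T, gamma=0.999, iters=1000):
    """Iterate Eqs. 37-38: each agent's Q-table is backed up against that
    same agent's greedy state values, with expectations over joint actions."""
    n = len(R)
    Q = [np.zeros_like(R[j]) for j in range(n)]
    for _ in range(iters):
        for j in range(n):
            V = Q[j].max(axis=1)            # Eq. 38: greedy state values
            Q[j] = R[j] + gamma * (T @ V)   # Eq. 37: expected backup
    return Q
```

In the planning problem considered here the model is unknown, so these equations are instead estimated online by learning, as in Algorithm 5.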
In strongly cooperative MASs all agents have the same, or at least proportional, payoff functions. Mostly, the agents in this class work as a team to achieve a single global goal. This subclass of MASs arises naturally in task distribution [4]. For example, a user might assign some number of autonomous mobile robots, or perhaps software agents, to some task, all of which should share the same payoff function (namely that of the user) [4]. A company or an organization may also be modeled as a strongly cooperative MAS. Another example is a soccer team, where all agents have the shared goal of winning the game. In weakly cooperative MASs all agents are strongly cooperative only over a portion of their joint action space, while there may be conflicts between two or more agents over other portions. Said another way, the payoff functions are identical, or at least directly proportional, only over a portion of the agents' joint action space, and through this portion all agents can achieve ideal optimality. Mostly each agent in this subclass has his own independent goal to achieve, but there is no conflict between the goals of different agents. An example is the traffic system, where each vehicle is an agent with the goal of reaching his destination safely. Solving a CMMDP means finding a global joint policy that is ideal optimal for all agents, that is, a joint policy by which all agents can achieve their ideal optimality. The proposed solution exploits the previous definition of CMMDP as follows: the global task of finding an ideal optimal joint policy is decomposed into finding ideal optimal solutions (ideal optimal joint actions) to the individual local cooperative matrix games that comprise the CMMDP. Each cooperative matrix game is solved independently (see Section 4.7 for how to solve cooperative MGs).
This greedy approach to solving the joint planning problem in cooperative MASs is self-justified: the agents' payoffs in any particular matrix game, as defined by Eqs. 35 and 36, take into account the consequences of future matrix games, that is, they make the dependencies between the different matrix games locally and immediately available. So it is conjectured that the global deterministic joint policy formed by merging the local ideal optimal deterministic joint policies (obtained by solving the individual local cooperative matrix games according to the Nash equilibrium concept) is ideal optimal for all agents and, at the same time, a global Nash equilibrium. An extended form of Q-learning (Eqs. 37 and 38) is used to estimate the payoff functions of all agents in the cooperative MGs that comprise the CMMDP.

For every agent j initialize Qj(s,ā) arbitrarily (ā is ordered according to Assumption 1 of the lexicographic convention)
Repeat
  Based on the current state s decide the next action ai by taking one of the following two options:
  (a) Exploit the current learned estimates of the optimal Q-functions of all agents and solve the cooperative matrix game M corresponding to the state s. Adopt the ith component of the solution profile as agent i's action decision.
  (b) Explore: take a random action.
  Take action ai, and observe the environment's response: the next state s' and all agents' rewards rj.
  Update the Q-functions of all agents based on this experience as follows:
    Qj(s,ā) ← Qj(s,ā) + α [ rj + γ Vj(s') − Qj(s,ā) ] (39)
    Vj(s') = max_{ā'∈A} Qj(s',ā') (40)
  s ← s'
Until termination
Algorithm 5: The Extended-Q algorithm for agent i

Algorithm 5 outlines the above procedure for solving the planning problem in cooperative MASs with respect to agent i. This algorithm is called Extended-Q. Note that in strongly cooperative MASs the payoff (Q-) functions are the same, or at least directly proportional, over the whole joint state-action space.
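One step of the tabular update in Algorithm 5 can be sketched as follows (a minimal sketch; the Q-tables are hypothetical dicts keyed by (state, joint action), and `joint_actions` enumerates A):

```python
def extended_q_step(Q, s, a, rewards, s_next, joint_actions, alpha, gamma):
    """Eqs. 39-40: each agent j's Q-table is moved toward agent j's own
    reward plus the discounted greedy value of the next state."""
    for j, r in enumerate(rewards):
        v_next = max(Q[j].get((s_next, b), 0.0) for b in joint_actions)  # Eq. 40
        old = Q[j].get((s, a), 0.0)
        Q[j][(s, a)] = old + alpha * (r + gamma * v_next - old)          # Eq. 39
```

Note that in the strongly cooperative case the agents' Q-tables coincide (up to a positive scale), so a single shared table suffices.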
As a result, each agent can estimate only his own payoff (Q-) function, saving a great deal of storage. In this case each agent observes only his own reward.

5. Experimental Results

5.1 Grid Game 1

5.1.1 Rules of the game

Fig. 11 depicts the game board and the agents' possible actions. Two agents, the blue agent B and the red agent R, start from the lower left and lower right corners, trying to reach their goal cells. The goal cells are at the upper right and upper left corners. An agent can move only one cell at a time and in four possible directions: AB = AR = {east, west, north, south}. If the two agents attempt to move into the same cell or try to switch their cells, a collision occurs and they are bounced back to their cells. The game ends either when at least one agent reaches his goal cell or when a specified number of time steps expires and no agent succeeds in reaching it. The game has two settings: the strong cooperation setting, where both agents act as a team and gain positive rewards only when they reach the goal cells simultaneously; and the weak cooperation setting, where any agent gains his positive reward when he reaches his goal cell regardless of the status of the other agent.

Figure 11: Grid game 1

5.1.2 The state space

The state space S is defined as a subset of S' (according to the rules mentioned above):

S' = { s | s = (lB, lR), li = (xi, yi), i ∈ {B, R} } (41)

where li defines the x- and y-coordinates of agent i with respect to the Cartesian coordinate system defined by the game board itself, as shown in Fig. 11. The state transitions are deterministic, which means that the current state and the agents' joint action uniquely determine the next state. This definition of the state space applies to the remaining grid game experiments.

5.1.3 The lexicographic convention

All grid game experiments obey the same pre-imposed lexicographic convention, which is as follows: (1) Agents: (B, R). (2) Actions: (East, West, North, South).
(3) The above orderings are known by both agents. These orderings also apply to the remaining grid game experiments.

5.1.4 Parameters

5.1.4.1 Agents' rewards

Strong cooperation setting: each agent gets 100 only when both agents reach their goal cells simultaneously. Weak cooperation setting: an agent gets 100 as soon as he reaches his goal cell regardless of the status of the other agent. Each agent gets 0 when reaching any other position without colliding. Each agent gets -1 in case of collision. The reward function as defined above is chosen to be similar to the reward function used in similar experiments performed in [18].

5.1.4.2 Learning parameters

Learning rate α = 0.999
Discount factor γ = 0.999
Exploration strategy: simple, with exploration probability 0.4
Initialization: Q(s,ā) = 0.0 for all s and ā

The learning rate is chosen this high so as to speed up the learning process, and it remains constant during the whole learning process. Note that this fixation of the learning rate does not affect the convergence of the Q-function, since the environment is completely deterministic, unlike the case of grid game 3, where a monotonically decreasing learning rate is necessary to prevent oscillation of the Q-function in a non-deterministic environment. Generally in planning problems the discount factor is chosen high to reflect the main interest of the agent in reaching a terminal future goal state, so typical values for the discount factor are greater than or equal to 0.8. Here 0.999 is chosen and proved successful for converging to an ideal optimal plan. But it has to be noted that large discount factors make the distinctions between nearly equal states very minute, and with limited precision these distinctions may be ignored, causing the agent to get confused and possibly deviate from converging to the ideal optimal plan.
Also, an important requirement for the success of the generalization techniques (see Section 6) is that the distinctions between the states are magnified, so as to reduce the effect of the induced errors of the function approximator on the correct estimation of the state values [40]. So the discount factor has to be chosen carefully. A simple exploration strategy is chosen with exploration probability 0.4, which proved good for this configuration of the game. Small values for the initialization of the Q-function are typically preferred, so 0 is chosen as the initial value for all the experiments in this section.

5.1.4.3 Experiment parameters

To gain statistically reliable results, convergence is tested over a number of trials, where each trial involves two phases: a learning phase and a test phase. In the learning phase the agents play a specified number of games for training. Each game is initialized to the configuration shown in Fig. 11. Then a test game is played to verify the convergence of the algorithm, that is, that both agents reach their goal cells successfully and in an optimal way.

Number of trials = 30
Number of training games in the learning phase of each trial = 5000
Maximum length of a game = 50 steps (this limit is reached when no agent succeeds in reaching his goal cell)

Various values for the number of training games and the maximum game length were tried; 5000 and 50, respectively, were found to be the smallest values satisfying the criterion that both agents converge to the same ideal optimal plan in all 30 trials (verified through the test games). The number of trials is chosen to be 30 to help ensure that the results obtained are reliable (motivated by theorems of probability theory, in particular the Central Limit Theorem).

5.1.5 Game solution

Some of the solutions to grid game 1 are shown in Fig. 12. These solutions are applicable to both the strong and weak cooperation settings.
Note that the agents' strategies are deterministic and optimal: they reach their goal cells in only four steps, which is the minimum possible number of steps given the agents' capabilities. This multiplicity of solutions indicates the existence of multiple pure ideal optimal global Nash equilibrium strategies. In this experiment, in both settings, the agents converge successfully to the joint plan shown in Fig. 12.a in all trials. This particular solution is chosen due to the nature of the pre-imposed lexicographic convention.

Figure 12: Some solutions to grid game 1

5.1.6 Payoffs at the initial state

To get a better understanding of grid game 1 and its emergent solution, the matrix game corresponding to the initial state is investigated in both settings of the game.

5.1.6.1 Strong cooperation

Fig. 13 shows the payoff matrix for both agents in the strong cooperation setting at the initial state of the game ((0,2),(2,2)). The ideal optimal value for both agents is 99.7. This value indicates that the ideal optimal path to win the game takes both agents only four steps (99.7 ≈ 100 × 0.999³). All the rectangled entries are the pure Nash equilibrium strategy profiles. There are three ideal optimal equilibrium profiles, which are (sorted lexicographically): (E,N), (N,W), and (N,N). These profiles correspond to the initial movements shown in Fig. 12. Due to the lexicographic convention, both agents coordinate on the first profile (E,N). Since the non-equilibrium profile (E,W) leads to a collision, which is penalized, it has the lowest value, 98.6. This value is interpreted as follows: 98.6 ≈ −1 + 100 × 0.999⁴, where −1 is the collision penalty and the exponent 4 indicates that the game now takes at least five steps to be won: one step lost to the collision delay and four steps needed to reach the goal cells from the current positions of the agents.
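These discounted values can be checked directly; a quick sketch with the experiment's discount factor:

```python
gamma = 0.999
ideal = 100 * gamma**3          # four-step win: the reward is discounted 3 steps
collide = -1 + 100 * gamma**4   # collision penalty plus a five-step win
print(round(ideal, 1), round(collide, 1))  # 99.7 98.6
```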
All other non-equilibrium profiles, as well as the non-ideal optimal equilibrium ones, lead to a one-step time delay and have value 99.6 ≈ 100 × 0.999⁴.

Figure 13: Initial state payoff matrix (strong cooperation setting)

An important note to make here is that the non-equilibrium profiles (except (E,W)) and the non-ideal optimal equilibrium profiles have the same value, 99.6. But there is an important difference between these two groups of profiles: the first group causes one agent to be ahead of the other, while the second group causes the same delay for both agents. The excess step caused by the first group of profiles is irrelevant, since the game in this setting is strongly cooperative (both agents have the same payoff functions and the same ideal optimal strategy profiles) and both agents act as one team to achieve a common goal. So if one agent is ahead of the other, sooner or later he will have to delay himself so that both agents arrive at their goal cells simultaneously; otherwise the game ends with both agents getting nothing.

5.1.6.2 Weak cooperation

Fig. 14 shows the payoff matrix for both agents in the weak cooperation setting. The ideal optimal value for both agents is 99.7. In this case there are only three pure Nash equilibrium strategy profiles: (E,N), (N,W), and (N,N). These profiles are also ideal optimal for both agents. Due to the lexicographic convention, both agents coordinate on the first profile (E,N). The profile (E,W) still has the lowest value, 98.6, since it leads to a collision, which is penalized. The ideal optimal profiles shared by both agents are the only pure Nash equilibrium profiles. Other profiles (except (E,W)) either lead to a delay for both agents or to one agent being a step ahead of the other. The second case has a value of 99.7 for one agent and 99.6 for the other, that is, it is ideal optimal for only one agent.
This is because the game in this setting is weakly cooperative (the agents' ideal optimal profiles are not all the same): each agent has his own independent goal to achieve (reaching his goal cell) regardless of the status of the other agent. So if one agent is a step ahead, he will go straight to his cell (acting in a self-interested manner), winning the game, which in turn is lost by the other agent. An analogy to this situation can be drawn from the traffic system. Suppose the light is red for vehicle A and green for vehicle B, and suppose that vehicle A moves while B does not; then no crash occurs and A reaches his destination safely. Of course this is not an equilibrium strategy, since there is no guarantee that B will behave so nonsensically.

Figure 14: Initial state payoff matrix (weak cooperation setting)

5.2 Grid Game 2

5.2.1 Rules of the game

Fig. 15 depicts the game board and the agents' possible actions.

Figure 15: Grid game 2

Again two agents, the blue B and the red R, are trying to reach their goal cells. A collision occurs if both agents either try to occupy the same cell or try to switch their cells. In this case, both agents are penalized and bounced back to their cells. This definition of a collision is central to the level of difficulty of this game. The gray cells are forbidden cells, that is, the agents cannot move into them (if an agent tries to move into a forbidden cell, he is bounced back to his original cell). The game ends either when at least one agent reaches his goal cell or when a specified number of time steps expires and no agent succeeds in reaching it. The game has two settings. In the first setting, both agents act as a team and gain positive rewards only when they reach their goal cells simultaneously. This is the strong cooperation version. In the second setting, any agent gains his positive reward as soon as he reaches his goal cell regardless of the status of the other agent. This will turn out to be a weakly competitive version of the game.
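The collision and bounce-back rules just described can be sketched as a simultaneous-move step function (a minimal, hypothetical sketch: board geometry, coordinate orientation, and goal handling are simplified placeholders, and only agent-agent collisions are flagged as penalized):

```python
MOVES = {'E': (1, 0), 'W': (-1, 0), 'N': (0, 1), 'S': (0, -1)}

def step(pos_b, pos_r, act_b, act_r, width, height, forbidden=frozenset()):
    """Apply both moves simultaneously; an agent bounces back off the board
    boundary or a forbidden cell, and both agents bounce back (with the
    collision flag set) if they meet in one cell or try to swap cells."""
    def target(pos, act):
        x, y = pos[0] + MOVES[act][0], pos[1] + MOVES[act][1]
        if not (0 <= x < width and 0 <= y < height) or (x, y) in forbidden:
            return pos                      # bounce off boundary/forbidden cell
        return (x, y)

    nb, nr = target(pos_b, act_b), target(pos_r, act_r)
    # Same target cell, or a swap, is a penalized collision for both agents.
    if nb == nr or (nb == pos_r and nr == pos_b):
        return pos_b, pos_r, True
    return nb, nr, False
```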
5.2.2 Parameters

5.2.2.1 Agents' rewards

Strong cooperation setting: each agent gets 100 only when both agents reach their goal cells simultaneously. Weak competition setting: an agent gets 100 as soon as he reaches his goal cell regardless of the status of the other agent. Each agent gets 0 when reaching any other position without colliding. Each agent gets -1 in case of collision. The reward function as defined above is chosen to be similar to the reward function used in similar experiments performed in [18].

5.2.2.2 Learning parameters

Same as in grid game 1 except the following: Exploration strategy: simple, with exploration probability 0.8. This particular choice of the exploration probability is justified in Section 5.2.3.1.

5.2.2.3 Experiment parameters

Same as in grid game 1 except the following: Number of training games in the learning phase of each trial = 30000. Maximum length of a game = 200 steps (this limit is reached when no agent succeeds in reaching his goal cell). The choice of these particular values is justified in Section 5.2.3.1.

5.2.3 Game solution

5.2.3.1 Strong cooperation

In the strong cooperation setting both agents act as a team and get their positive rewards only by reaching their goal cells simultaneously. In all trials both agents succeed in learning an optimal joint coordinated plan to reach their goal cells. Fig. 16 depicts this plan. Both agents take ten steps to reach their goal cells, which is the minimum possible number of steps for this game. The game structure implies that both agents have a very limited choice of good actions. In nearly all cells most of the actions lead to a collision, either with the board boundary, the forbidden cells, or the other agent. This results in both agents being trapped in a small region of the game board, so many random moves are required to get out of this trap and discover the way to the goal cells. This explains the high exploration probability of 0.8.
This value is the smallest value tried that led to convergence in all trials. It also explains the huge number of training games required, 30,000.

Figure 16: Successful joint plan in grid game 2 (strong cooperation setting)

5.2.3.2 Weak competition

In the second setting of grid game 2, any agent gains his positive reward when he reaches his goal cell, regardless of the status of the other agent. In all trials of the game both agents converge to the situation shown in Fig. 17: each agent moves just one step to the horizontally neighboring cell and then remains in this cell indefinitely. The situation depicted in Fig. 17 will be called a draw, since the game ends with both agents getting zero (neither one reaches his goal cell). If an agent x reaches his goal cell (gaining 100), then x is said to win the game. If an agent x does not reach his goal cell while the other wins the game, then x is said to lose the game. These three notions, draw, win, and loss, are used only for the sake of the following discussion; they do not imply anything about the game structure and solution (for example, both agents may win the game by making a pre-play agreement).

Figure 17: Weak competition, failure in reaching a successful joint plan

Before analyzing the behavior shown in Fig. 17, it is instructive to begin by identifying the class of MASs to which this game belongs. Clearly, the only possible solution to this game is that one agent x moves up to the cell (2,3) to allow the other agent y to pass through (remember that switching cells and trying to occupy the same cell result in a collision, and both agents are bounced back to their original cells). But agent x, who makes this voluntary move, becomes one step behind y who, given that he is rational, has no motivation to wait (unlike in the strong cooperation setting, where both agents act as a team), and so he moves straight to his goal cell (his terminal state), winning the game, which in turn is lost by the other.
This implies that both agents cannot win together (given that communication and negotiation are not allowed). So, given that both agents are rational, at least one of them will not be able to reach his goal cell, that is, at least one agent will not be able to achieve his ideal optimality, so this game is definitely competitive. But to which subclass of competitive MASs does this game belong? Obviously, the solution of this game depends on which agent makes the voluntary move, so it may be modeled by the simple matrix game shown in Fig. 18. The payoffs are the actual rewards received by both agents when the game terminates. V and ¬V stand for volunteer and not volunteer, respectively. If agent x volunteers and y does not, then y, since he is rational, eventually wins and receives 100, while x loses and receives 0. If neither agent volunteers (the emergent case in the experiment, as shown in Fig. 17), then the game is a draw and both agents receive 0. If both try to volunteer, then a collision occurs and both agents receive -1. There are three pure Nash equilibrium strategy profiles: (V,¬V), (¬V,V), and (¬V,¬V). It may seem surprising that the first two profiles are equilibria, since they mean that one agent will always win and the other will always lose. But this can be explained as follows. Assume agent x adopts the strategy of not volunteering; then the best response of agent y is either to volunteer or not, and it does not matter which, since in both cases he does not win and receives 0 (an important remark to mention here is that the reward taken by the winner is not paid by the loser; the loser does not receive a negative payoff, he simply gets nothing).
But the interesting thing about this game is that the action ¬V is a dominating action (0 ≥ 0 and 100 ≥ −1) for both agents, so by reducing this matrix game, allowing only the dominating actions, the resulting matrix game has only one entry, (¬V,¬V), which is actually the pure equilibrium joint profile adopted by both agents in the experiment, as shown in Fig. 17. This profile can be interpreted as follows: both agents choose to draw all the time. From this model it is clear that this game is weakly competitive.

Figure 18: Modeling the second version of grid game 2 by a simple matrix game.

To see whether the game model in Fig. 18 has any mixed equilibrium strategy profile, assume that both agents adopt the mixed joint profile ((p,1−p),(q,1−q)), where p and q are the probabilities that B and R, respectively, play ¬V. Agent B's goal is then to find the best value of p to maximize his expected payoff. The expected payoff of B given this mixed profile is

EP(B) = 100p(1−q) − (1−p)(1−q)
      = 100p − 100pq − 1 + q + p − pq
      = 101p − 101pq + q − 1
      = 101p(1−q) + q − 1 (42)

Clearly the best value for p is 1. By the same argument it can be shown that the best value for q is also 1. This gives the dominating pure equilibrium profile (¬V,¬V). So there is no mixed Nash equilibrium profile for this game. But what if both agents adopt the non-equilibrium mixed profile ((0.5,0.5),(0.5,0.5)), that is, each agent volunteers with probability 0.5? This means that, given that the game is played infinitely often, each agent will win 25% of the games, lose 25% of the games, draw 25% of the games, and collide with the other agent in 25% of the games. The expected payoffs for both agents are then

EP(B) = EP(R) = 0.25(100) + 0.25(−1) + 0.5(0) = 24.75 (43)

It is conjectured that this result is better than the 100% draw of the adopted pure equilibrium profile (the result of action domination), through which both agents receive zero, and also better than the other two pure equilibrium profiles, in which only one agent wins 100% of the games.
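These expected payoffs can be reproduced numerically. A sketch of the simplified volunteer game of Fig. 18, with p and q the probabilities of not volunteering (the helper name is hypothetical):

```python
def expected_payoff_B(p, q):
    """Eq. 42: B wins 100 when only R volunteers, both get -1 on a joint
    volunteer attempt, and every other outcome pays 0."""
    return 100 * p * (1 - q) - (1 - p) * (1 - q)

# Not volunteering dominates: EP(B) is maximized at p = 1 for any fixed q < 1.
best = max((0.0, 0.5, 1.0), key=lambda p: expected_payoff_B(p, 0.5))
half = 0.25 * 100 + 0.25 * (-1)   # Eq. 43: the half-volunteering profile
```

The maximizing response is always p = 1, that is, never volunteer; with p = q = 0.5 the sketch reproduces the 24.75 of Eq. 43.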
Unfortunately, this profile is not an equilibrium. This is another example of the weakness of the Nash equilibrium solution concept in some weakly competitive MASs (the first example was the prisoner's dilemma subclass of matrix games).

Figure 19: Payoff matrix at state ((1,4),(3,4))

Given the theoretical analysis above (using the simplified game model in Fig. 18), it is time to investigate the experimental results. It is instructive to examine the local matrix games at the critical states. First, the matrix game at state ((1,4),(3,4)) is shown in Fig. 19. The ideal optimal value for both agents is 99.3, but there is no joint strategy profile that achieves this value for both agents, so this matrix game is competitive. All the rectangled joint profiles can easily be verified to be Nash equilibria, so this matrix game is clearly weakly competitive. These pure equilibrium profiles can be divided into three groups as follows: The first group includes the profiles (E,N), (E,S), (N,W), and (S,W). In this group the game always ends in a win for one and the same agent. The second group includes the profiles (W,E), (N,N), (N,S), (S,N), and (S,S). In this group the game always ends in a draw. The third group includes the profiles (W,N), (W,S), (N,E), and (S,E). In this group the game always ends in a win for one and the same agent. Concerning the first group, one agent x always volunteers by moving to cell (2,4), while the other agent y remains still by moving north or south, hitting the border or a forbidden cell. But what happens after that? Assume that agent R makes this voluntary move, leading to the state ((1,4),(2,4)); the matrix game corresponding to this state is shown in Fig. 20.

Figure 20: Payoff matrix at state ((1,4),(2,4))

The ideal optimal value for B is 99.4 (he needs 7 steps to reach his goal cell, 99.4 ≈ 0.999⁶ × 100) and for R it is 99.3 (he needs 8 steps to reach his goal cell, 99.3 ≈ 0.999⁷ × 100).
There is an ideal optimal profile (E,N), so this matrix game is cooperative, and specifically weakly cooperative, since the profile (E,E) is ideal optimal for agent R only. Clearly both agents adopt the profile (E,N), which clears the way for agent B to reach his goal cell and win. Both agents then move straight ahead until they reach the state ((4,1),(0,2)), whose corresponding matrix game is shown in Fig. 21. The ideal optimal value for B is 100 (he needs one step to reach his goal cell), while for R it is 99.9 (he needs two steps to reach his goal cell, 99.9 = 0.999 × 100). There is no ideal optimal joint profile, but there are pure Nash equilibrium profiles, so this matrix game is weakly competitive. The north action is a dominating action for B, so by playing this action agent B, regardless of the action of R (actually there is no need for coordination), will reach his goal cell and win the game.

Figure 21: Payoff matrix at state ((4,1),(0,2))

So, concerning the first group of pure equilibrium profiles of Fig. 19, the agent who makes the voluntary move to cell (2,4) will always lose the game and the other agent will always win; the winning agent takes only 9 steps. So the game will never end in a draw. This group corresponds to the equilibrium profiles (V,¬V) and (¬V,V) of the simplified game model shown in Fig. 18. Concerning the second group, the last four profiles, (N,N), (N,S), (S,N), and (S,S), lead to both agents remaining fixed in their cells by continuously hitting the game board boundary or the forbidden cells. The first profile, (W,E), returns the agents to their initial situation ((0,4),(4,4)). The matrix game corresponding to this state is shown in Fig. 22.
The ideal optimal value for both agents is 99.2 (each agent needs 9 steps to reach his goal cell, 99.2 ≈ 0.999⁸ × 100) and there is an ideal optimal joint profile (E,W), so this matrix game is cooperative, and specifically weakly cooperative, since there exist some profiles that are ideal optimal for only one agent, such as (E,E) and (W,W). This joint ideal optimal profile returns the agents to state ((1,4),(3,4)). As a result, both agents keep moving back and forth between this state and the initial state, which ultimately leads to a draw. So the game, given these equilibrium profiles, will always end in a draw. This group of pure equilibrium profiles corresponds to the equilibrium profile (¬V,¬V) of the simplified game model shown in Fig. 18. The actual behavior shown by the agents in Fig. 17 resulted from coordinating on equilibrium profiles belonging to the second group. This can be explained as follows. By carefully investigating the payoff matrix shown in Fig. 19, it is clear that the north and south actions are dominating actions for both agents. So by following the lexicographic convention each agent plays north, which leads to coordination on the pure equilibrium strategy profile (N,N).

Figure 22: Payoff matrix at state ((0,4),(4,4))

The third group of equilibrium profiles shows a surprising aspect. Take for example the profile (W,N). This profile leads to state ((0,4),(3,4)), whose corresponding matrix game is shown in Fig. 23. The ideal optimal value for B is 99.3 and for R it is 99.2, and there is an ideal optimal joint profile (E,W), so this matrix game is cooperative, and specifically weakly cooperative, since some profiles are ideal optimal for only one agent, such as (E,E). Following this profile, both agents are taken to state ((1,4),(2,4)), whose corresponding matrix game is shown in Fig. 20. Given the discussion above, agent B will ultimately win this game.
So the equilibrium profiles in this group lead to a win for one and the same agent; the game never ends in a draw. However, the winning agent takes ten steps, unlike the nine steps taken by the winning agent under the equilibrium profiles of the first group. Moreover, winning the game using the equilibrium profiles of the third group happens in a strange way: it seems that winning is achieved by deception rather than by a voluntary move from the losing agent. The losing agent always chooses to remain fixed in his cell when in state ((1,4),(3,4)) by playing north or south, implying that he has no intention to volunteer.
Figure 23: Payoff matrix at state ((0,4),(3,4))
The third group of equilibrium profiles is unparalleled in the simplified game model shown in Fig. 18: neither agent volunteers, yet the game ends in a win. This may be due to one or both of the following reasons: (1) The simplified game model in Fig. 18 is just an approximate model; it does not fully represent the structure of the real game. (2) Decomposing the real game, which was shown to be weakly competitive, into local cooperative and competitive matrix games, in the same way as CMMDP decomposes cooperative MASs into local cooperative matrix games, is just an approximation to the real problem. Specifically, this decomposition may add noise such as that represented by the third group of equilibrium profiles. Even if one or both reasons hold, it is believed that both are good representations of the game and, in general, of weakly competitive MASs. The discussion above can be summarized as follows. The pure equilibrium strategy profiles at the critical state ((1,4),(3,4)) lead to one of the following: (1) 100% of the games end in a draw (the second group). This result is favored by the agents because of the domination of the actions that lead to a draw. (2) 100% of the games end in a win for one and the same agent, that is, a 100% loss for the other agent (the first and third groups).
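The ideal optimal values quoted throughout this analysis are simply the terminal reward of 100 discounted once per remaining move with the discount factor 0.999 (e.g., 99.9 for 2 remaining moves, 99.2 for 9, as stated in the text). A quick sketch (the helper name is ours):

```python
def ideal_optimal_value(steps, gamma=0.999, reward=100.0):
    # Value of reaching the goal in `steps` moves: the terminal reward is
    # discounted once per move after the first, i.e. gamma**(steps-1) * reward.
    return gamma ** (steps - 1) * reward

for steps in (2, 8, 9, 10, 19):
    print(steps, round(ideal_optimal_value(steps), 1))
```

Rounding to one decimal reproduces the values 99.9, 99.3, 99.2, 99.1, and 98.2 that appear in the payoff matrices of this section.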
It is believed that the solutions provided by the pure Nash equilibrium profiles are not satisfactory, so a mixed equilibrium solution of the matrix game at ((1,4),(3,4)) is required. To test for the existence of mixed Nash equilibria for this matrix game, assume that both agents adopt the mixed profile ((p1,p2,p3,1-p1-p2-p3),(q1,1-q1-q2-q3,q3,q2)). Agent B's goal is then to find the best values of p1, p2, and p3 so as to maximize his expected payoff. The expected payoff of agent B given this profile is

EP(B) = 99.1 q1 + 99.2 (q2 + q3) + 99.3 (1 - q1 - q2 - q3)(1 - p1 - p2) + 99.2 p2 (1 - q1 - q2 - q3) + 98.2 p1 (1 - q1 - q2 - q3)
      = 99.3 - 0.1 (2 q1 + q2 + q3) - 0.1 (11 p1 + p2)(1 - q1 - q2 - q3)   (44)

Obviously the maximum value of the last term is zero, which is achieved when p1 = p2 = 0. So the best response of agent B is any mixed strategy of the form (0,0,p,1-p). Similarly, it can be shown that the best response of agent R is of the form (0,0,q,1-q). So there are infinitely many mixed equilibrium profiles ((0,0,p,1-p),(0,0,q,1-q)). These mixed equilibrium profiles give both agents a real optimal value of 99.2. Unfortunately, these mixed equilibria also belong to the second group, which always leads to a draw; this is expected because of the domination of the north and south actions for both agents. This verifies the inadequacy of the Nash equilibrium solution concept in some competitive MASs, and it also confirms the result drawn from the simplified game model shown in Fig. 18. From the discussion above it is clear that rational play at the critical state ((1,4),(3,4)) leads to a 100% draw, with both agents receiving zero. In search of a better real optimal value, a non-equilibrium mixed profile ((0.5,0.0,0.5,0.0),(0.0,0.5,0.5,0.0)) at the state ((1,4),(3,4)) is investigated. Note that this profile is similar to the non-equilibrium mixed profile ((0.5,0.5),(0.5,0.5)) in the simplified game model shown in Fig. 18, where each agent volunteers 50% of the time.
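The best-response claim behind Eq. 44 can be checked numerically: the simplified right-hand side is maximized exactly at p1 = p2 = 0 for any fixed strategy of R (a small sketch; the function name is ours):

```python
import itertools

def ep_b(p1, p2, q1, q2, q3):
    # Agent B's expected payoff, simplified right-hand side of Eq. 44.
    return 99.3 - 0.1*(2*q1 + q2 + q3) - 0.1*(11*p1 + p2)*(1 - q1 - q2 - q3)

# For any fixed strategy of R with q1 + q2 + q3 < 1, the last term is
# non-positive and vanishes only at p1 = p2 = 0, so any (0,0,p,1-p)
# is a best response.  A grid search over (p1, p2) confirms this:
q = (0.1, 0.2, 0.3)                      # an arbitrary strategy for R
grid = [i / 10 for i in range(11)]
best = max(itertools.product(grid, grid),
           key=lambda pp: ep_b(pp[0], pp[1], *q))
print(best)  # (0.0, 0.0)
```

Note that p3 does not appear in Eq. 44 at all, which is why the best response is a whole family of strategies rather than a single point.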
For the following analysis assume that the maximum allowed length of the game is k. The new mixed profile reduces the original matrix game to that shown in Fig. 24. The expected payoff of both agents is

EP(B) = EP(R) = 0.25 (98.2 + 99.2 + 99.3 + 99.2) = 98.975   (45)

Figure 24: The reduced matrix game at state ((1,4),(3,4))

To calculate the probability that an agent x wins the game:

Pr{agent x wins the game}
= Pr{agent x reaches his goal cell in at most k steps}
= Pr{agent x passes the critical state ((1,4),(3,4)) on his side in at most k - 8 steps (the agent needs 7 steps after passing the critical state, plus an additional initial step)}
= Pr{agent x passes the critical state at the first trial}
+ Pr{agent x passes the critical state at the second trial | both agents failed at the first trial} * Pr{both agents fail at the first trial}
+ ...
+ Pr{agent x passes the critical state at the (k-8)th trial | both agents failed at the first k-9 trials} * Pr{both agents fail at the first k-9 trials}

Pr{agent x passes the critical state at the ith trial | both agents failed at the first i-1 trials}
= Pr{x remains fixed by playing north and the other agent y moves to cell (2,4) at the ith trial | both agents failed at the first i-1 trials}
= 0.5 * 0.5 = 0.25

Pr{both agents fail at the first j trials} = Pr{either (N,N) or (E,W) is played at each of the first j trials} = 0.5^j

Pr{agent x wins the game}
= 0.25 + 0.25 * 0.5 + 0.25 * 0.5^2 + ... + 0.25 * 0.5^(k-9)
= 0.25 (1 + 0.5 + 0.5^2 + ... + 0.5^(k-9))   (geometric series)
= 0.5 (1 - 0.5^(k-8)), k >= 8

To calculate the probability of a collision:

Pr{agent x makes at least one collision} = 1 - Pr{no collision occurs at all}

Pr{no collision occurs at all}
= Pr{no collision | k-1 trials at the critical state and the last trial fails} * Pr{k-1 trials at the critical state and the last trial fails}
+ Pr{no collision | k-1 trials at the critical state and the last trial succeeds} * Pr{k-1 trials at the critical state and the last trial succeeds}
+ Pr{no collision | k-2 trials at the critical state} * Pr{k-2 trials at the critical state}
+ ...
+ Pr{no collision | 2 trials at the critical state} * Pr{2 trials at the critical state}
+ Pr{no collision | 1 trial at the critical state} * Pr{1 trial at the critical state}

Pr{no collision | i < k-1 trials at the critical state} = Pr{(N,N) is played at the first i-1 steps at the critical state | i < k-1 trials at the critical state} = 0.5^(i-1)
Pr{no collision | k-1 trials at the critical state and the last trial fails} = 0.5^(k-1)
Pr{no collision | k-1 trials at the critical state and the last trial succeeds} = 0.5^(k-2)
Pr{i < k-1 trials at the critical state} = Pr{i-1 failures at the first i-1 trials} * Pr{success at the ith trial} = 0.5^i
Pr{k-1 trials at the critical state and the last trial fails} = 0.5^(k-1)
Pr{k-1 trials at the critical state and the last trial succeeds} = 0.5^(k-1)

Pr{no collision occurs at all}
= 0.5^(k-1) * 0.5^(k-1) + 0.5^(k-2) * 0.5^(k-1) + 0.5^(k-3) * 0.5^(k-2) + ... + 0.5 * 0.5^2 + 0.5
= 0.25^(k-1) + 0.5 * 0.25^(k-2) + 0.5 * 0.25^(k-3) + ... + 0.5 * 0.25 + 0.5
= 0.25^(k-1) + 0.5 (0.25^(k-2) + 0.25^(k-3) + ... + 0.25 + 1)
= (1/3)(2 + 0.25^(k-1))   (46)

Pr{agent x makes at least one collision} = (1/3)(1 - 0.25^(k-1)), k >= 1   (47)

Let Z be a random variable denoting the number of trials performed at the critical state ((1,4),(3,4)) until the first successful pass. Z is a geometric random variable with the following distribution:

P(Z = j) = p (1 - p)^(j-1), j = 1, 2, 3, ...   (48)

, where p is the probability of success, here the probability that one of the two profiles (E,N) and (N,W) is played, so p = 0.5. The expectation of Z is

E[Z] = 1/p = 1/0.5 = 2   (49)

Let L be a random variable denoting the length of a game; then its expectation is

E[L] = min(k, 8 + E[Z]) = min(k, 10), k >= 1   (50)

Assume that C is a random variable denoting the number of collisions in a game, and let Y be a random variable denoting the number of failures at the critical state. Then the expected number of collisions in a game can be calculated as follows:

E[C] = E[E[C|Y]] = Σ_{j=1}^{k-1} Pr{Y = j} * E[C | Y = j]

Pr{Y = j} = 0.5^(j+1) for 0 <= j <= k-2, and Pr{Y = k-1} = 0.5^(k-1) (so that Σ_{j=0}^{k-1} Pr{Y = j} = 1)

E[C | Y = j] = 0.5 j (the conditional distribution of C given Y = j is binomial with parameters j and 0.5, so E[C | Y = j] = 0.5 j)

E[C] = Σ_{j=1}^{k-2} 0.5^(j+1) * 0.5 j + 0.5^(k-1) * 0.5 (k-1) = 0.5 - 0.5^k, k >= 1   (51)

Then the expected payoff of an agent x in a game of maximum length k can be calculated as follows:

EP(x) = Pr{agent x wins the game} * (reward of winning) + (expected number of collisions in a game) * (penalty of a collision)
      = 100 * 0.5 (1 - 0.5^(k-8)) - (0.5 - 0.5^k)
      = 49.5 - 12799 (0.5^k), k >= 8   (52)

Given that k is sufficiently large, Table 1 compares the rational equilibrium joint profile (N,N) that was actually played by the agents in the experiment and the newly suggested non-equilibrium mixed joint profile at the critical state ((1,4),(3,4)).

Equilibrium joint profile                 | Non-equilibrium joint profile
100% of the games end in a draw           | 100% of the games end in a win
Each agent has a 0% chance of winning     | Each agent has a 50% chance of winning
The expected game length is k >> 10       | The expected game length is 10
Probability of collision occurrence is 0  | Probability of collision occurrence is 1/3
Expected number of collisions is 0        | Expected number of collisions is 0.5
Expected payoff for each agent is 0.0     | Expected payoff for each agent is 49.5
Table 1: Performance comparison between the equilibrium and non-equilibrium profiles

An experiment is performed to test the correctness of the above theoretical analysis for large k. A modified version of the Extended-Q algorithm that deals with weakly competitive MASs as well as cooperative MASs, Algorithm 6, is used. In the experiment, data are gathered to compute the performance indices shown in Table 1.
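The analysis above can also be checked with a small Monte Carlo sketch of the critical-state process (all names are ours; we model only the repeated trials at ((1,4),(3,4)), where each trial ends in a pass for one agent, a harmless (N,N) miss, or a colliding (E,W) miss, each with probability 1/4):

```python
import random

def play_game(k=50, rng=random):
    # One game under the non-equilibrium mixed profile.  A winner needs
    # 1 initial step, the trials at the critical state, and 7 more steps,
    # so the game length is 8 + trials (capped at k for a draw).
    trials = collisions = 0
    while trials < k - 8:
        trials += 1
        outcome = rng.choice(["B", "R", "miss", "collide"])
        if outcome == "collide":
            collisions += 1
        elif outcome != "miss":
            return outcome, 8 + trials, collisions
    return "draw", k, collisions

rng = random.Random(0)
n = 100_000
wins = b_wins = total_len = total_coll = 0
for _ in range(n):
    winner, length, coll = play_game(rng=rng)
    wins += winner != "draw"
    b_wins += winner == "B"
    total_len += length
    total_coll += coll
print(wins / n, b_wins / n, total_len / n, total_coll / n)
```

With k = 50 the simulated win probability, win share per agent, game length, and collision count should approach 1.0, 0.5, 10, and 0.5 respectively, matching the indices of Table 1.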
The experiment results, averaged over ten trials, are as follows:
The probability that the game ends in a win = 1.0
The probability that agent B wins the game = 0.4988
The probability that a collision occurs in a game = 0.3385
The expected number of collisions in a game = 0.5136
The expected length of a game = 10.0287
The expected payoff of agent B = 49.3664
Comparing these results with those shown in Table 1, it is clear that the experimental results agree with the theoretical ones. From Table 1 and the experiment results it is clear that the non-equilibrium mixed strategy profile ((0.5,0.0,0.5,0.0),(0.0,0.5,0.5,0.0)) at the critical state ((1,4),(3,4)) outperforms the pure equilibrium profile (N,N) at the same state. It is important to note that the expected payoff of each agent given the non-equilibrium profile, 49.5, differs from both the value predicted by Eq. 43, 24.75, and the value predicted by Eq. 45, 98.975 (actually it is greater than that because of discounting). This indicates the following:
The simplified game model shown in Fig. 18 is just an approximation of the real multi-agent planning problem. It ignores the local details of the game (it views it as a single-stage problem), but its global perspective is so good that it gives, in a very simple way, the main features of the game.
Decomposing the multi-agent planning problem in weakly competitive MASs into local matrix games planning problems using the CMMDP framework is also an approximation of the real problem. The local matrix games cannot accurately predict the expected final outcome of the game for each agent. Note that this problem does not exist in cooperative MASs, where this decomposition is believed to represent the real planning problem exactly. Still, the local matrix games can give the global relative importance of the different local pure, mixed, equilibrium, and non-equilibrium strategy profiles at each stage of the problem.
This decomposition also greatly simplifies the planning problem and goes in harmony with reinforcement learning. From the discussion above the following important conclusions can be drawn:
CMMDP can be used to model weakly competitive MASs as well as cooperative ones. However, the local matrix games in the new model may in general be cooperative or weakly competitive.
Given the definition of CMMDP (with the extension mentioned above), the Nash equilibrium solution concept is not the best to use in some weakly competitive MASs. A non-equilibrium mixed profile was shown to give much better results in the second setting of grid game 2.
Algorithm 6 shows the exploitation step (solving the local matrix game) of a two-agent modified version of the Extended-Q algorithm; the remaining steps are the same as in the original algorithm. This modified version handles weakly competitive as well as cooperative MASs. It provides the starting step towards extending the Extended-Q algorithm to fully handle weakly competitive, and then strongly competitive, MASs as well as cooperative MASs.

Using the current learned estimate of the optimal joint Q-function, find all pure Nash equilibrium strategy profiles and sort them lexicographically. Let N be the ordered list of these profiles, and let ā be the first profile in N.
Let Ij be the ideal optimal value of agent j and Ik be the ideal optimal value of agent k.
Find the first profile ājk ∈ N such that Pj(ājk) = Ij and Pk(ājk) = Ik.
Find the first profile āj ∈ N such that Pj(āj) = Ij, and the first profile āk ∈ N such that Pk(āk) = Ik.
Do one of the following:
If N = ∅, then it is a strongly competitive matrix game; exit.
If ājk ≠ nil, then it is a cooperative matrix game; play ājk; exit.
If exactly one of āj and āk is nil, then it is a weakly competitive matrix game; play the non-nil profile; exit.
If āj = āk = nil, then it is a weakly competitive matrix game; play ā; exit.
If āj ≠ āk, then it is a weakly competitive matrix game; play each of them with probability 0.5; exit.
Algorithm 6: The exploitation step of a two-agent modified version of Extended-Q that handles weakly competitive as well as cooperative MASs

5.3 Grid Game 3
5.3.1 Rules of the game
The game is shown in Fig. 25. In this game, the blue and red agents are trying to reach a shared goal cell G. In one version of the game a collision occurs if both agents either try to occupy the same cell (except the goal cell) or try to switch cells; in this case each agent is penalized and bounced back to his cell. This definition of collision increases the difficulty of the game. In the other, simpler version of the game, agents are allowed to occupy the same cell or to switch cells. The game ends either when at least one agent reaches the goal cell or when a specified amount of time expires without any agent succeeding in reaching it. All agents' movements are deterministic except north movements from the lower left and lower right cells, where the agent moves up with probability p and remains fixed in his cell with probability 1 - p. The game has two settings. In the first setting, both agents act as a team and gain positive payoffs only when they reach the goal cell simultaneously; this represents strong cooperation. In the second setting, any agent gains his positive payoff when he reaches the goal cell regardless of the status of the other agent; this represents weak cooperation in the simpler version of the game and weak competition in the harder one, where collisions are defined.
Figure 25: Grid game 3
5.3.2 Parameters
5.3.2.1 Agents' rewards
The first setting: each agent gets 100 only when both agents reach the goal cell simultaneously.
The second setting: an agent gets 100 as soon as he reaches the goal cell regardless of the status of the other agent.
In all other positions each agent gets 0 (and, in the harder version, only provided that no collision occurs).
Each agent gets -1 in case of a collision in the harder version. The reward function as defined above is chosen to be similar to the reward function used in similar experiments in [18].
5.3.2.2 Learning parameters
Discount factor = 0.999
Exploration strategy: simple, with exploration probability 0.8
Several values between 0.0 and 1.0 were tried for the exploration probability, but 0.8 proved to be the best in terms of successful convergence to a joint ideal optimal plan in all trials.
5.3.2.2.1 Learning rate
In the presentation of the Normal-Q learning algorithm it was mentioned that one of the conditions that should hold to guarantee convergence is that the learning rate is decreased slowly during the training period. In both grid game 1 and grid game 2 the learning rate was held constant during the whole training period and convergence was still reached. This is due to the deterministic nature of the environment model, that is, the state transition and reward functions are deterministic. But in the current game the model of the system is nondeterministic, and hence continual learning (a constant learning rate) causes oscillation in the Q-function and hence oscillation in the agents' adopted joint strategy profile. Experimenting with a constant learning rate in grid game 3 showed that the oscillation led in most cases to adopting real optimal non-equilibrium strategy profiles, and in the remaining cases to oscillation between different equilibrium profiles. Converging to a stationary optimal Q-function, and hence a stationary ideal equilibrium joint strategy profile, is strongly desired. So the learning rate is chosen to decay in inverse proportion to the frequency of encountering the state-action pair (s,ā). That is, the learning rate used when updating Q(s,ā) depends on the frequency of visits to this particular state-action pair.
α(s,ā) =
  1,                                       0 <= f(s,ā) <= 1
  1 / log2 f(s,ā),                         2 <= f(s,ā) <= c
  1 / (f(s,ā) - (c - ceil(log2 c) + 1)),   c + 1 <= f(s,ā)
(53)

where f(s,ā) is the update frequency of the state-action pair (s,ā), ceil() is the usual ceiling function, and c is some threshold value. The first and third parts of Eq. 53 were first tried following [18], but the decay rate of the learning factor was too rapid to permit enough learning and useful use of the encountered experience; that is, the learning period was too short. So the logarithmic part of Eq. 53 was inserted to slow down the decay of the learning factor and hence allow a learning period long enough to make use of the experience the agents encounter. The parameter c controls the length of this period. Different values were tried and finally c = 2000 proved to be the best. The third part is defined so that the learning rate continues just after the stopping point of the second part.
5.3.2.3 Experiment parameters
The probability p of a successful north movement from the lower left or lower right cells is varied from 0.0 to 1.0 in steps of 0.1, making 11 sub-experiments, each with the parameters described in the following subsection.
5.3.2.3.1 Sub-experiment parameters
In each trial of a sub-experiment the agents play a specified number of games for training. Each game is initialized to the configuration shown in Fig. 25. Then a number of test games are played to investigate the type of joint plan reached by the agents. Results are averaged over all trials.
Number of trials = 30
Number of training games in the learning phase of each trial = 15000
Number of test games in the test phase of each trial = 10000
Maximum length of a game = 50 (this limit is reached when no agent succeeds in reaching the goal cell)
Various values for the number of training games and the maximum game length were tried until 15000 and 50, respectively, were found to be the minimum best values.
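The schedule of Eq. 53 can be written directly as a function (a sketch under our reading of the three cases; the function name is ours, with c = 2000 as chosen above). The third branch is constructed so that the rate just after f = c continues from the value 1/log2(c) reached by the middle branch:

```python
import math

def learning_rate(f, c=2000):
    # Piecewise-decaying learning rate of Eq. 53, as a function of the
    # visit count f of the state-action pair being updated.
    if f <= 1:
        return 1.0                      # full learning rate at first
    if f <= c:
        return 1.0 / math.log2(f)       # slow logarithmic decay
    # 1/f-style decay shifted so it continues where the middle branch stops.
    return 1.0 / (f - (c - math.ceil(math.log2(c)) + 1))
```

At c = 2000 the middle branch ends at 1/log2(2000) ≈ 0.0912 and the last branch starts at 1/11 ≈ 0.0909, so the decay is continuous to a good approximation.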
The criterion used is that both agents converge to a joint ideal optimal plan in all 30 trials (verified through the test games). The number of trials is chosen to be 30 to guarantee that the obtained results are reliable (as hinted by theorems of probability theory, in particular the Central Limit Theorem). The same reason applies to the number of test games (note that the cost of running this large number of test games is small).
5.3.3 Game solution
5.3.3.1 Strong cooperation
In the strong cooperation setting both agents receive their positive rewards if and only if they reach the goal cell simultaneously; that is, they act as one team. The next two subsections show convergence results for both versions of the game.
5.3.3.1.1 The simple version
The simple version of the game does not include the definition of collisions, so both agents can freely occupy the same cell or switch cells. In all sub-experiments, that is for all values of p, both agents converge to the same joint Nash equilibrium plan, which is depicted in Fig. 26. Both agents take only three steps to reach the goal cell, which is the minimum possible number of steps. In reaching this joint plan the agents always avoid choosing the non-deterministic north action from their initial cells, since this choice may increase the number of steps required to reach the goal cell.
Figure 26: Joint plan in the simple version of grid game 3, strong cooperation setting
5.3.3.1.2 The hard version
In the hard version of the game the notion of collision is defined: the agents can neither occupy the same cell nor switch cells.
Figure 27: An instance of the first class of learned policies, a) when the north movement succeeds, b) when the north movement fails
By investigating the joint policies learned over the whole range of p, it is observed that there are two main classes of learned policies. The first class is depicted in Fig. 27.
In this class one agent occupies the lower middle cell while the other tries to move up. If the north movement succeeds, the agents go on as shown in Fig. 27.a, taking only three steps. If the north movement fails, the agent making it gives up and both agents follow another path, as shown in Fig. 27.b; this path takes five steps. The expected number of movements in a game under the first class of policies, E1(p), can be computed as follows:

E1(p) = 3p + 5(1 - p) = 5 - 2p   (54)

The second class of policies is depicted in Fig. 28. In this class the agent choosing the north action keeps trying it until the movement succeeds; meanwhile the other agent waits for him in the lower middle cell by repeatedly moving south, hitting the game board. After the north movement succeeds, both agents move straight to the goal cell. To calculate the expected number of steps in a game under the second class of policies, let M be a random variable denoting the number of north tries made by an agent until success. M is a geometric random variable with the following distribution:

Pr{M = m} = p (1 - p)^(m-1)   (55)

Figure 28: An instance of the second class of learned policies

The expected number of steps in a game under the second class of policies, E2(p), can then be computed as follows:

E2(p) = 2 + E[M] = 2 + 1/p   (56)

The actual class of policies to which both agents converge was found to depend on p. The learned policy was found to satisfy the following equation:

π*(p) = argmin_{π ∈ Π} {E1(p), E2(p)}   (57)

, where Π is the set of all possible joint policies. This equation can be verified from Table 2. The first row shows the values of p, the probability of success of a north movement from the lower left and lower right cells. The second row shows Av(p), the measured average number of steps in a game. The third row shows E1(p), and the last row shows E2(p).
p      0.0  0.1   0.2  0.3   0.4   0.5  0.6   0.7   0.8   0.9   1.0
Av(p)  5.0  4.8   4.6  4.4   4.23  4.0  3.71  3.44  3.25  3.11  3.0
E1(p)  5.0  4.8   4.6  4.4   4.2   4.0  3.8   3.6   3.4   3.2   3.0
E2(p)  ∞    12.0  7.0  5.33  4.5   4.0  3.67  3.43  3.25  3.11  3.0
Table 2: Effect of p on the class of learned policies

From Table 2 it can be seen that for 0.0 <= p <= 0.4 the agents converge to a policy from the first class, where E1(p) <= E2(p), and for 0.6 <= p <= 0.9 the agents converge to a policy from the second class, where E2(p) <= E1(p). At p = 0.5 it is found that in half of the sub-experiments the agents converge to a policy from the first class, and in the other half to a policy from the second class; this is self-justified since E1(p) = E2(p) = 4.0. At p = 1.0 the two classes are equivalent, since no failure can occur when trying to move north from the initial cells; both formulas give E1(p) = E2(p) = 3.0.
5.3.3.2 Weak cooperation
In the weak cooperation setting each agent receives a positive reward as soon as he reaches the goal cell regardless of the status of the other agent. The game does not include the notion of collision, that is, both agents can freely occupy the same cell or switch cells; otherwise it becomes a weakly competitive game. In all sub-experiments, that is for all values of p, both agents converge to the joint plan shown in Fig. 26 (the same joint plan as in the simple version of the strong cooperation setting). This plan is natural since no agent has any motivation to try to move up from his initial cell and hence take the risk of losing the game.
5.3.3.3 Weak competition
This setting is similar to the weak cooperation setting except that the game in this case defines the notion of collision: both agents can neither occupy the same cell nor switch cells; they are penalized for trying to do so and are bounced back to their original cells. Using the first version of Extended-Q, both agents converge to the situation depicted in Fig. 29.
There is an inherent conflict around the cell (1,2): the agent who occupies this cell is guaranteed to reach the goal cell, while the other possible way, through the non-deterministic north movement, is not guaranteed to succeed, so the agent who takes it may fail to reach the goal cell.
Figure 29: Irresolvable situation in the weak competition configuration
Fig. 30 shows the matrix game corresponding to the initial state ((0,2),(2,2)). The ideal optimal value for both agents is 99.8 (for p < 1: 99.8 > yB = yR). Assuming 0 < p < 1, there are only two pure Nash equilibrium profiles: (E,N) and (N,W). The first profile is ideal optimal for agent B only, and the second is ideal optimal for agent R only. Clearly, this matrix game is weakly competitive. Actually, by following the same line of reasoning as in Section 5.2.3.2, it can be shown that grid game 3 in this setting is a weakly competitive MAS.
Figure 30: Payoff matrix at state ((0,2),(2,2))
What actually occurred, leading to the situation depicted in Fig. 29, is that each agent plays his own ideal optimal equilibrium profile. This results in the non-equilibrium joint profile (E,W) that causes the indefinite competition around the cell (1,2). So the game always ends in a draw and both agents receive nothing. Even if both agents coordinate on the same equilibrium profile then, assuming the game is played repeatedly, one agent wins 100% of the games while the other wins only 100p% of them (note that the notions of draw, win, and loss have the same meaning as defined in Section 5.2.3.2); this is unfair. Using the second version of Extended-Q, as defined in Section 5.2.3.2, yields the mixed joint strategy profile ((0.5,0.0,0.5,0.0),(0.0,0.5,0.5,0.0)). Given this profile, the probability that an agent wins the game can be calculated as follows.
Pr{agent x wins the game}
= 1 - Pr{agent x loses the game}
= 1 - Pr{agent x moves north and his movement fails while agent y moves to cell (1,2)}
= 1 - 0.5(1 - p)
= 0.5(1 + p)

It is conjectured that this result is better than the 100% draw of the profile actually played by the agents, and also better than the unfair pure equilibrium profiles.
5.4 Discussion
Section 5 describes experimental work done on some MASs. Three main experiments are performed: grid game 1, grid game 2, and grid game 3. These games represent, through different settings, instances of strongly cooperative, weakly cooperative, and weakly competitive MASs. The main results and conclusions of these experiments are:
Using the Extended-Q algorithm, defined in Section 4, agents in cooperative MASs converge to an ideal optimal rational joint plan. This is confirmed for deterministic environments (grid game 1 and grid game 2) and a non-deterministic environment (grid game 3). This convergence of the Extended-Q algorithm provides evidence for the correctness of CMMDP as a mathematical model for cooperative MASs.
A smooth transition is made from cooperative MASs to weakly competitive ones in grid game 2 and grid game 3. The Nash equilibrium solution concept (rational play) was shown to be a weak concept in the weakly competitive MASs presented in this Section. It was shown that the proposed CMMDP for cooperative MASs can also be applied to competitive MASs with one modification: the constituent local matrix games may be weakly competitive and/or cooperative instead of just cooperative. Based on this, an extension of Extended-Q was proposed to handle weakly competitive MASs as well as cooperative ones. This new version assumes rational play in all cases of local matrix games except one, where rational play is weak.
This case appears when the agents face a weakly competitive matrix game with the following property: each agent has an ideal optimal pure equilibrium profile, but there is no shared ideal optimal pure equilibrium profile for all agents. The new version of Extended-Q was shown (analytically and experimentally) to give better results than rational play in the weakly competitive settings (grid game 2 and grid game 3).
A decaying learning rate is necessary for the convergence of the Extended-Q algorithm in nondeterministic environments (grid game 3). A constant learning rate in such environments leads to continual oscillation of the Q-function and hence oscillation of the adopted joint policy.
The exploration strategy followed in the experiments of this Section is very simple: at each time step explore a random action with probability ε and exploit the current estimate of the Q-function with probability 1 - ε. Despite its simplicity it was effective in converging to an ideal optimal plan.
The multi-agent planning problem may be simplified to a matrix game planning problem (grid game 2). This simplification, although not practically useful for solving the planning problem, is important for theoretically identifying the nature of the planning problem and the class to which the MAS belongs.
6. Generalization
6.1 The need for generalization
So far it has been assumed that it is possible to enumerate the state and joint action spaces and store them as lookup tables. Let m be the space required by this representation in a MAS; then m can be calculated as follows:

m = k t^n   (58)

, where k is the number of the environment's states, t is the number of actions available to each agent (assuming all agents have the same number of actions), and n is the number of agents. Except in very small environments, this means impractical space requirements. Additionally, the time and data needed to accurately fill these tables may be prohibitively large.
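The growth of Eq. 58 is easy to illustrate with a toy calculation (the 5x5 board and 4 actions per agent below are illustrative assumptions, not taken from the text):

```python
def table_size(k, t, n):
    # m = k * t**n (Eq. 58): one table entry per (state, joint action)
    # pair, with t**n joint actions for n agents of t actions each.
    return k * t ** n

# Illustration: n agents on a 5x5 board with 4 actions each; the joint
# state space 25**n is taken as k, so m = 100**n entries in total.
for n in (1, 2, 3, 4):
    print(n, table_size(25 ** n, 4, n))
```

Already at four agents the toy table needs 10^8 entries, which is the motivation for the generalization techniques discussed next.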
In large smooth state spaces similar states are expected to have similar values and similar optimal joint actions; said another way, similar matrix games have similar payoffs and equilibrium solutions. Therefore, there should be some representation more compact than a table. Most problems have continuous or large discrete state spaces; some have large or continuous action spaces [20]; and some involve a large number of agents. The problem of learning in large spaces is addressed through generalization techniques, which allow experience with a limited subset of these spaces to be usefully generalized to produce a good approximation over a much larger subset. According to Eq. 58 the space requirement grows polynomially with the size of the state space and with the size of the action space of an individual agent, and it grows exponentially with the number of agents. In this paper the problem of generalizing over large state spaces is addressed; generalization over the action space may be handled in similar ways. The most severe problem is the exponential growth with the number of agents. This may be handled in a hierarchical manner. A team of agents may be split into sub-teams, each with a leader or controller. Agents within a sub-team coordinate their actions only within that sub-team; they do not need to know anything about the agents in the other sub-teams. At the top level, the leaders of the sub-teams coordinate their actions. This idea can be applied recursively, with the sub-teams themselves split into further sub-teams, and so on. Generalization from examples has already been extensively studied in supervised learning. What is needed is to combine the Extended-Q algorithm with one of the existing generalization methods.
This kind of generalization is called function approximation because it takes examples from a desired function, here the action-value function, and attempts to generalize from them to construct an approximation of the entire function.

6.2 Joint action evaluation with function approximation

Using function approximation to predict the value of joint actions means that the action-value function at time t, Q_t(s, ā), is represented not as a table but as a parameterized functional form with parameter vector w_t. So Q_t(s, ā) depends totally on w_t, varying from time step to time step only as w_t varies. For example, Q_t(s, ā) might be the function computed by an artificial neural network, with w_t being the vector of connection weights. By adjusting the weights, any of a wide range of different functions Q_t(s, ā) can be implemented by the network. Typically, the number of parameters (the number of components of w_t) is much less than the number of states or state-action pairs, and changing one parameter changes the estimated values of many state-action pairs. Consequently, when a single state-action pair is backed up, the change generalizes from that state-action pair to affect the values of many other state-action pairs. As mentioned above, this paper addresses only generalization over the state space, so it is assumed that each agent has a small finite number of actions and that the number of agents is small. This assumption is exploited by assigning one function approximator per joint action ā. This reduces the interference between the values of different joint actions and hence reduces the noise induced by the function approximator on the output predictions. From now on, Q_ā(s) is used to denote the output of the function approximator for the joint action ā.
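The one-approximator-per-joint-action scheme can be sketched as a dictionary keyed by joint-action tuples. The `LinearApproximator` class below is a placeholder standing in for the ANNs introduced later, and the agent/action/feature counts are illustrative assumptions.

```python
# Sketch (not the paper's code): one function approximator per joint action ā.
# A trivial linear model stands in for the per-joint-action ANN.
from itertools import product

class LinearApproximator:
    def __init__(self, n_features):
        self.w = [0.0] * n_features  # untrained weights
    def predict(self, x):
        # q_ā(s) = w · x over the state feature vector x
        return sum(wi * xi for wi, xi in zip(self.w, x))

n_agents, actions, n_features = 2, (0, 1, 2), 4
# Q_ā(s) is the output of the approximator indexed by the joint action ā.
q_nets = {a_bar: LinearApproximator(n_features)
          for a_bar in product(actions, repeat=n_agents)}
print(len(q_nets))  # t^n approximators: 3^2 = 9
```

Keeping the approximators separate is exactly what limits interference between joint actions: updating `q_nets[(0, 1)]` cannot disturb the predictions of `q_nets[(2, 2)]`.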
In ordinary supervised learning, the learner is given fixed (input, output) examples, in this case (s, Q*_ā(s)), where s is the environment's state and the target value Q*_ā(s) is the optimal action-value function of the joint action ā (represented by the function approximator ā) at state s. Unfortunately, Q*_ā is not known, so it has to be replaced by its estimate Q_ā based on the current experience tuple (s, ā, r, s'):

    Q_ā(s) ← r + γ max_{ū∈Ā} Q_ū(s')    (59)

where Q_ū(s') is itself an estimate and may come from a function approximator different from the one producing the Q_ā(s) estimate. As an agent gains more experience, its estimate Q_ā(s) improves, meaning that the examples presented to the function approximators change over time, which complicates the problem of estimating the optimal action-value function. As Sutton pointed out [33], the purpose of both backpropagation (a function approximation algorithm for Artificial Neural Networks) and TD methods (Extended-Q is a TD method) is accurate credit assignment (action evaluation based on long-term consequences). Backpropagation decides which part(s) of a neural network to change so as to influence the network's prediction and thus reduce its overall error, whereas TD methods decide how each prediction of a temporal sequence of predictions should be changed. Backpropagation (or in general any function approximator) addresses a structural credit-assignment issue, whereas TD methods address a temporal credit-assignment issue.

6.3 The gradient descent methods

Gradient descent methods are among the most widely used of all function approximation methods and are particularly well-suited to reinforcement learning [35]. In gradient descent methods the parameter vector is a column vector with a fixed number of real-valued components, w_t = (w_1^t, w_2^t, ..., w_m^t)^T (T denotes transpose), and Q_ā^t(s) is a smooth differentiable function of w_t. On each time step t the function approximator is fed with a new training example.
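The moving-target character of Eq. 59 can be shown in a few lines. Here `q_values` is a hypothetical table of current estimates standing in for the approximators, and the discount factor γ = 0.9 is an assumed value for illustration.

```python
# Sketch of the bootstrapped target in Eq. 59: the training value for Q_ā(s)
# is built from the reward and the CURRENT estimates Q_ū(s'), so the training
# examples drift as learning proceeds. q_values maps joint action -> {state: Q}.

def td_target(r, s_next, q_values, gamma=0.9):
    # r + γ · max over joint actions ū of the current estimate Q_ū(s')
    return r + gamma * max(q[s_next] for q in q_values.values())

q_values = {('up', 'up'): {'s1': 1.0}, ('up', 'down'): {'s1': 3.0}}
print(td_target(0.5, 's1', q_values))  # 0.5 + 0.9 * 3.0 = 3.2
```

Because the max is taken over estimates that are themselves being updated, the same (s, ā, r, s') tuple can yield a different target at a later time step, which is the complication the text describes.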
The ideal training example is (s_t, Q*_ā(s_t)), but unfortunately Q*_ā is not known, so a first-level estimate is made based on the current joint policy π_t, providing the training example (s_t, Q_ā^{π_t}(s_t)). Unfortunately, even Q_ā^{π_t} is not exactly known, so a second level of estimation is made based on the current interaction with the environment and the current estimate of Q_ā^t(s_{t+1}), providing the actual training example (s_t, Q_ā^t(s_t)), where Q_ā^t(s_t) is calculated as follows (same as Eq. 59 but in time notation):

    Q_ā^t(s_t) ← r_{t+1} + γ max_{ū∈Ā} Q_ū^t(s_{t+1})    (60)

The key idea behind the gradient descent methods is to search the space of possible weight vectors to find the weights that best fit the temporal training examples. In order to derive a weight learning rule, that is, a weight update rule, an error measure of a particular weight vector must be specified. Although there are many ways to define this error, one common measure that will be used in the following analysis is:

    E_t(w_t) = ½ (Q_ā^t(s_t) − q_ā^t(s_t))²    (61)

where q_ā^t is the output of the function approximator ā at time t. Gradient descent search determines a weight vector that minimizes E by starting with an arbitrary initial weight vector and then repeatedly modifying it in small steps. At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface [25]. This process continues until some local minimum of the error is reached. This direction can be found by computing the partial derivative of E with respect to each component of the weight vector w. This vector derivative, written ∇E(w), is called the gradient of E with respect to w; it is calculated as follows:

    ∇E(w) = (∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_m)^T    (62)

The gradient is itself a vector whose components are the partial derivatives of E with respect to each of the w_i. When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E.
The negative of this vector therefore gives the direction of steepest decrease. The learning rule for gradient descent is then:

    w ← w + Δw,  where Δw = −α ∇E(w)    (63)

where α is a positive constant called the learning rate, which determines the step size in the gradient descent search. The negative sign is present to move the weight vector in the direction that decreases E. This learning rule can also be written in its component form:

    w_i ← w_i + Δw_i,  where Δw_i = −α ∂E/∂w_i    (64)

Given the particular definition of the error function in Eq. 61, the weight update rule specializes to the following rule:

    Δw_i = α (Q_ā^t(s_t) − q_ā^t(s_t)) ∂q_ā^t(s_t)/∂w_i    (65)

The only remaining task is the determination of the approximating function q_ā^t(s_t). The most important requirement on this function is that it must be differentiable with respect to the weight vector w. In this paper an Artificial Neural Network is proposed as an implementation of q_ā^t(s_t).

Figure 31: ANN Architecture used in function approximation

6.4 Neural network implementation

In this paper the function approximators are implemented using feed-forward multilayer ANNs. An artificial neural network can be defined as follows:

Definition 28 (Artificial Neural Network): A neural network is an interconnected assembly of simple processing elements, units or nodes, whose functionality is loosely based on the animal neuron. The processing ability of the network is stored in the inter-unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns [14].

The architecture of the ANN proposed in this paper is depicted in Fig. 31. The input pattern is a vector of n_p + 1 components. The first component is a threshold input and is always 1. The last n_p components describe the state of the environment. Each arrow in the figure represents an interconnection between either an input node and a hidden layer unit or a hidden layer unit and the output unit.
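A minimal sketch of the update in Eqs. 63-65 for the simplest differentiable approximator, a linear function q(s) = w · x, where ∂q/∂w_i = x_i. The feature vector and target value are made up for illustration.

```python
# Gradient descent on E = ½(Q - q)² (Eqs. 61-65) for a linear approximator:
# Δw_i = α · (target - q) · ∂q/∂w_i, and for q = w · x we have ∂q/∂w_i = x_i.

def gd_step(w, x, target, alpha=0.1):
    q = sum(wi * xi for wi, xi in zip(w, x))
    err = target - q                          # the (Q - q) term of Eq. 65
    return [wi + alpha * err * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
for _ in range(200):
    w = gd_step(w, [1.0, 2.0], 5.0)           # repeatedly fit one example
q = w[0] * 1.0 + w[1] * 2.0
print(round(q, 3))                            # approaches the target 5.0
```

Each step moves w opposite the gradient of the squared error, so the prediction converges to the target for this fixed example; with the drifting targets of Eq. 60 the same rule chases a moving objective.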
A weight or strength w is associated with each interconnection. The set of these weights comprises the weight vector w, which controls the nature of the approximating function. Each unit i in the hidden and output layers is a simple processing element that first computes a weighted linear combination of its inputs. This linear sum is called the net input net_i to unit i. The unit then produces its output by applying a function f, called the unit's activation function, to net_i. The output unit assumes the simplest case where f is the identity function, so the unit's output is just its net input; this is called a linear unit. For each hidden unit, f is the sigmoid function, which is defined as follows:

    f(net) = 1 / (1 + e^{−k·net})    (66)

where k is some positive constant. Hence, each hidden unit is called a sigmoid unit, as illustrated in Fig. 32.

Figure 32: The sigmoid unit

6.5 Derivation of the weight update rule

This section presents the derivation of the weight-tuning rule for the ANN shown in Fig. 31. It begins by giving the definitions of the symbols used in the derivation. Then it gives the derivation of the learning rule for the output unit weights. Finally, it gives the derivation of the learning rule for the hidden unit weights.
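The forward pass through the architecture of Figs. 31-32 can be sketched in a few lines: each hidden unit applies the sigmoid of Eq. 66 to its net input, and the output unit is linear. The specific weights and inputs below are arbitrary illustrative values.

```python
# Forward pass sketch: sigmoid hidden units (Eq. 66, with k = 1), linear output.
import math

def sigmoid(net, k=1.0):
    return 1.0 / (1.0 + math.exp(-k * net))

def forward(x, hidden_w, out_w):
    # x includes the constant threshold input x[0] = 1
    hidden_out = [sigmoid(sum(w * xi for w, xi in zip(ws, x)))
                  for ws in hidden_w]
    # linear output unit: its output is just its net input
    return sum(w * h for w, h in zip(out_w, hidden_out)), hidden_out

o, h = forward([1.0, 0.5],                 # threshold input + one state feature
               [[0.1, -0.2], [0.3, 0.4]],  # one weight row per hidden unit
               [1.0, -1.0])                # output-unit weights
print(round(o, 4))
```

The hidden-unit outputs `h` are the x_j values fed to the output unit in the derivation that follows.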
6.5.1 Notation

The derivation of the weight update rule uses the following notation:

- x_j = the jth input to the output unit (the output of hidden unit j)
- w_j = the weight associated with the jth input to the output unit (i.e., with the output of hidden unit j)
- x_ji = the jth input to hidden unit i
- w_ji = the weight associated with the jth input to hidden unit i
- l = Σ_j w_j x_j (the net input of the output unit)
- l_i = Σ_k w_ki x_ki (the net input of hidden unit i)
- f = the sigmoid function
- o = f(l) (the neural network output, computed by the output unit)
- o_i = f(l_i) (the output computed by hidden unit i)
- o' = ∂o/∂l (the partial derivative of o with respect to l)
- o_i' = ∂o_i/∂l_i (the partial derivative of o_i with respect to l_i)
- t = the target output of the neural network (the target output of the output unit)
- t_i = the target output of hidden unit i
- α = the learning rate
- E(w) = the error function

6.5.2 Learning rule for the output unit weights

As seen before, the error function was defined as follows:

    E(w) = ½ (t − o)²

Then, according to the gradient descent rule:

    Δw_j = −α ∂E/∂w_j
         = −α (∂E/∂o)(∂o/∂l)(∂l/∂w_j)
         = −α · (−(t − o)) · o' · x_j        [∂E/∂o = −(t − o); ∂o/∂l = o'; ∂l/∂w_j = ∂(Σ_i w_i x_i)/∂w_j = x_j]
         = α (t − o) o' x_j = α δ x_j    (67)

where δ = (t − o) o' is the error term of the output unit. Eq. 67 gives the weight update rule for a general output unit. For a linear output unit, o' = 1 (since o = f(l) = l); hence the weight update rule for the linear output unit is:

    Δw_j = α (t − o) x_j = α δ x_j    (68)

where δ = (t − o) is the error term of the output unit.

6.5.3 Learning rule for the hidden unit weights

According to the gradient descent rule:

    Δw_ji = −α ∂E/∂w_ji
          = −α (∂E/∂o)(∂o/∂l)(∂l/∂o_i)(∂o_i/∂l_i)(∂l_i/∂w_ji)
          = α (t − o) · o' · w_i · o_i' · x_ji        [∂l/∂o_i = ∂(Σ_j w_j o_j)/∂o_i = w_i; ∂l_i/∂w_ji = ∂(Σ_k w_ki x_ki)/∂w_ji = x_ji]
          = α δ_i x_ji    (69)

where δ_i = (t − o) o' o_i' w_i is the error term of hidden unit i. Eq. 69 represents the weight update rule of a general hidden unit, assuming a general output unit.
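The hidden-unit error term of Eq. 69 can be checked numerically: for a linear output unit (o' = 1) and sigmoid hidden units, ∂E/∂w_ji should equal −δ_i · x_ji. The tiny network and weights below are hypothetical test values.

```python
# Finite-difference check (sketch) that δ_i = (t - o)·o_i'·w_i from Eq. 69
# gives ∂E/∂w_ji = -δ_i · x_ji, for sigmoid hidden units and a linear output.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def error(w_hidden, w_out, x, t):
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = sum(w * hi for w, hi in zip(w_out, h))       # linear output unit
    return 0.5 * (t - o) ** 2, h, o                  # E = ½(t - o)²

x, t = [1.0, 0.7], 1.0
w_hidden = [[0.2, -0.1], [0.05, 0.3]]
w_out = [0.4, -0.6]

E, h, o = error(w_hidden, w_out, x, t)
i, j = 0, 1                                # hidden unit 0, its jth input (j = 1)
delta_i = (t - o) * h[i] * (1 - h[i]) * w_out[i]   # o_i' = o_i(1 - o_i), o' = 1
analytic = -delta_i * x[j]                 # claimed ∂E/∂w_ji

eps = 1e-6                                 # numerical partial derivative
w_hidden[i][j] += eps
E2, _, _ = error(w_hidden, w_out, x, t)
numeric = (E2 - E) / eps
print(abs(analytic - numeric) < 1e-5)      # the two gradients agree
```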
For a sigmoid hidden unit and a linear output unit, the derivatives o' and o_i' become o' = 1 and o_i' = o_i (1 − o_i). The weight update rule then becomes:

    Δw_ji = α (t − o) o_i (1 − o_i) w_i x_ji = α δ_i x_ji    (70)

where δ_i = (t − o) o_i (1 − o_i) w_i is the error term of hidden unit i.

6.6 Neuro-Extended-Q

Algorithm 7 defines a new version of the Extended-Q algorithm in which the Q-function is represented using a set of ANNs, each with the architecture defined in Section 6.4. Algorithm 7 shows only the learning step of the algorithm; the remaining steps are the same as in Algorithm 6. This means that Neuro-Extended-Q handles cooperative as well as weakly competitive MASs.

For each agent j of which agent i is aware, given the experience tuple (s, ā, s', r_j), update the Q-function based on this experience as follows:

1. Calculate a new estimate Q_ā^j(s):

       Q_ā^j(s) ← r_j + γ max_{ū∈Ā} Q_ū^j(s')    (71)

2. Let x be a vector that describes the environment's state s. Then the new training example for the ANN concerned with the joint action ā is <x, Q_ā^j(s)>.
3. Propagate the input forward through the network by presenting the input vector x to the network and computing the output o_k of every unit k in the network.
4. Propagate the error backward through the network:
   - For the output unit, calculate its error term δ:

         δ = Q_ā^j(s) − o    (72)

   - For each hidden unit h, calculate its error term δ_h:

         δ_h = o_h (1 − o_h) w_h δ    (73)

   - Update each weight to the output unit:

         w_j ← w_j + α δ x_j    (74)

   - Update each weight to hidden unit h:

         w_jh ← w_jh + α δ_h x_jh    (75)

Algorithm 7: Neuro-Extended-Q algorithm for agent i

Algorithm 7 is actually a combination of Extended-Q with the BACKPROPAGATION algorithm. The BACKPROPAGATION algorithm learns the weights of a multilayer ANN, given a fixed set of units and interconnections. It employs gradient descent to minimize the squared error between the network output values and the target values for those outputs [25].

7.
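One learning step of Algorithm 7 can be sketched end to end: form the TD target of Eq. 71 from the current networks, then apply one backpropagation step (Eqs. 72-75) to the network of the taken joint action. The network weights, γ, and α below are illustrative assumptions, not values from the paper.

```python
# Sketch of one Neuro-Extended-Q learning step (Algorithm 7), not the paper's
# code: TD target from Eq. 71, then one BACKPROPAGATION step per Eqs. 72-75.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class Net:  # one layer of sigmoid hidden units, linear output (Fig. 31)
    def __init__(self, w_hidden, w_out):
        self.w_hidden, self.w_out = w_hidden, w_out
    def forward(self, x):
        self.h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)))
                  for ws in self.w_hidden]
        return sum(w * h for w, h in zip(self.w_out, self.h))
    def backprop(self, x, target, alpha=0.05):
        o = self.forward(x)
        delta = target - o                                    # Eq. 72
        for i, hi in enumerate(self.h):
            delta_h = hi * (1 - hi) * self.w_out[i] * delta   # Eq. 73
            self.w_out[i] += alpha * delta * hi               # Eq. 74
            self.w_hidden[i] = [w + alpha * delta_h * xj      # Eq. 75
                                for w, xj in zip(self.w_hidden[i], x)]

# One net per joint action; weights, gamma, alpha are made-up examples.
nets = {a: Net([[0.1, 0.2], [0.0, -0.1]], [0.3, 0.1])
        for a in [('u', 'u'), ('u', 'd')]}
x, x_next, a_bar, r, gamma = [1.0, 0.5], [1.0, 0.8], ('u', 'u'), 1.0, 0.9
target = r + gamma * max(n.forward(x_next) for n in nets.values())  # Eq. 71
before = abs(target - nets[a_bar].forward(x))
nets[a_bar].backprop(x, target)
after = abs(target - nets[a_bar].forward(x))
print(after < before)  # one step reduces the TD error on (s, ā)
```

Note that only the network for the taken joint action ā is updated, while all networks contribute their estimates to the max in the target, mirroring the per-joint-action decomposition of Section 6.2.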
Conclusions and Future Work

7.1 Conclusions

The major contributions of this paper are:

- A review of the RL paradigm and its basic elements is given.
- A review of MASs is given and a new taxonomy of them is proposed. There are two main classes: cooperative MASs and competitive MASs. Within each class there are two subclasses: weak and strong. This taxonomy is based on the following notions: rationality, ideal optimality, real optimality, and the type of action profiles (deterministic versus non-deterministic).
- Single-agent Q-learning is shown to lack a convergence guarantee when used in the multi-agent setting.
- The key aspect of successful planning in MASs is that each agent becomes aware of the remaining agents. Four degrees of awareness are proposed in this paper; the research done assumes level-2 awareness.
- A mathematical formulation of MASs, the Multi-Agent Markov Decision Process (MMDP), is provided. All subclasses of matrix games (a special type of MASs), according to the proposed MAS taxonomy, are formally defined.
- The CMMDP (Cooperative Multi-agent Markov Decision Process) is defined. It is a mathematical formulation of the planning problem in cooperative MASs that greatly facilitates the task of finding a global joint optimal plan. CMMDP is based on the decomposition of the global planning problem into local matrix game planning problems. CMMDP is also shown, with little modification, to be able to formulate the planning problem in at least a subclass of weakly competitive MASs.
- A new learning algorithm, called Extended-Q, is proposed to solve CMMDPs. Each matrix game is solved using the Nash equilibrium concept. It is shown that in cooperative MASs rational play (action choice according to the concept of Nash equilibrium) is completely compatible with ideal optimality for all agents involved in the MAS.
- It is conjectured that the optimal equilibrium local solutions of the matrix games yield a globally optimal solution, that is, a globally optimal joint plan.
- Another version of Extended-Q is proposed that can handle weakly competitive MASs as well as cooperative ones. It is conjectured that, for the new version to give optimal results, the agents should ignore the equilibrium solutions (that is, the rational play) in some local situations of the weakly competitive MASs.
- Experiments are performed on some grid games, as examples of MASs. These experiments show: (1) the convergence of the first version of Extended-Q to an ideal optimal joint plan in cooperative MASs for both deterministic and non-deterministic environments, (2) the weakness of the Nash equilibrium solution concept in some weakly competitive MASs, (3) the convergence of the second version of Extended-Q in weakly competitive MASs, and (4) that a decaying learning rate is necessary for the convergence of Extended-Q in non-deterministic environments.
- Both versions of Extended-Q mentioned above assume using a lookup table for the representation of the Q-function. Unfortunately, this representation is infeasible for large problems in terms of the space required for storage and the time and data required for accurate estimation. It is shown that in MASs there are three sources of this problem: (1) a continuous or large discrete state space, (2) a continuous or large discrete action space, and (3) a large number of agents. A new version of Extended-Q, called Neuro-Extended-Q, is proposed to handle the first of these, the large state space. This algorithm uses an ANN as a function approximator for each joint action. It combines Extended-Q with BACKPROPAGATION, which is a gradient descent learning technique for feed-forward ANNs.
The research done in this paper may also be of importance to the Game Theory community. It provides: (1) a taxonomy of MASs or games, (2) formal frameworks for MASs or games, (3) algorithms for general classes of MASs or games, (4) experiments and their thorough analysis for some interesting games, (5) a critique of the Nash equilibrium solution concept, and (6) a generalization technique for a class of MASs or games.

7.2 Future Work

Some suggestions for future work are:

- In this paper new notions were defined, ideal optimality and real optimality, and a new MAS taxonomy was proposed based on them. These concepts should be examined thoroughly to clarify their importance in the study of MASs.
- The convergence of the first version of Extended-Q to an ideal optimal rational joint plan in cooperative MASs should be proved analytically. Such a proof would also validate the correctness of CMMDP as a framework for cooperative multi-agent planning.
- The CMMDP framework should be extended to formalize the planning problem in competitive MASs as well as cooperative ones. An attempt was made in this paper to extend it to the weakly competitive setting (the second version of Extended-Q). Algorithms based on this extension of CMMDP (essentially the update rule of the Q-function) should also be developed.
- In this paper it was shown that rational play (playing Nash equilibrium profiles) may not be the best choice in competitive MASs; an informal alternative play was proposed in weakly competitive MASs that was shown to give better results for repeated play (repeatedly facing the planning problem). This proposal should be formally investigated. Other alternatives for action choice in competitive MASs should also be investigated; Game Theory plays a crucial role in this investigation.
- The work in this paper assumes that the environment is completely observable, that is, each agent can perfectly sense the complete state of the environment.
This is not always true in real applications, where the environment is partially observable. Each agent must then have the ability to predict the current environmental state. Some work has been done for partially observable single-agent systems; this work needs to be extended to the multi-agent setting.
- A pre-imposed lexicographic convention was proposed to overcome the problem of equilibrium selection. Another possible method to overcome this problem is fictitious play, where each agent models the behaviors of the others (level-3 awareness). Extended-Q may be integrated with fictitious play and the resulting algorithm tested.
- An experimental study should be done to test the performance of Neuro-Extended-Q.
- Neuro-Extended-Q handles only large state spaces, assuming that the size of the joint action space is manageable. Extended-Q should be combined with generalization techniques over large joint action spaces as well as large state spaces.
- All versions of the Extended-Q algorithm developed in this paper assume offline learning. Exploration techniques that support online learning should be developed and integrated with Extended-Q.

Acknowledgments

This work is supported in part by an E-JUST Research Fellowship awarded to Dr. Mohamed A. Khamis.

References

1. Aha D. W., "Machine Learning," A tutorial presented at the Fifth International Workshop on Artificial Intelligence & Statistics, 1995.
2. Benda M., Jagannathan V., and Dodhiawala R., "On Optimal Cooperation of Knowledge Sources - An Empirical Investigation," Technical Report BCS-G2010-28, Boeing Advanced Technology Center, Boeing Computing Services, Seattle, Washington, 1986.
3. Bose N. K. and Liang P., "Neural Network Fundamentals with Graphs, Algorithms, and Applications," McGraw-Hill International Editions, 1996.
4. Boutilier C., "Planning, Learning and Coordination in Multiagent Decision Processes," In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195-210, Amsterdam, 1996.
5. Boutilier C., Dean T., and Hanks S., "Decision Theoretic Planning: Structural Assumptions and Computational Leverage," Journal of Artificial Intelligence Research, Vol. 11, pp. 1-94, 1999.
6. Bowling M. and Veloso M., "An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning," Technical Report CMU-CS-00-165, Carnegie Mellon University, 2000.
7. Cao Y. U., Fukunaga A. S., and Kahng A. B., "Cooperative Mobile Robotics: Antecedents and Directions," Autonomous Robots, Vol. 4, pp. 7-27, 1997.
8. Charniak E. and McDermott D., "Introduction to Artificial Intelligence," Addison-Wesley Publishing Company, 1985.
9. Claus C. and Boutilier C., "The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems," In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.
10. Dautenhahn K., "Getting to Know Each Other - Artificial Social Intelligence for Autonomous Robots," Robotics and Autonomous Systems, Vol. 16, pp. 333-356, 1995.
11. Dietterich T. G., "Machine Learning," ACM Computing Surveys, Vol. 28A, No. 4, December 1996.
12. Garcia A., Reaume D., and Smith R. L., "Fictitious Play for Finding System Optimal Routings in Dynamic Traffic Networks," Transportation Research Part B: Methodological, Vol. 34, No. 2, pp. 147-156, 2000.
13. Garrido L. and Sycara K., "Multiagent Meeting Scheduling: Preliminary Experimental Results," In Proceedings of the Second International Conference on Multiagent Systems, pp. 95-102, Menlo Park, CA: AAAI Press, 1996.
14. Gurney K., "An Introduction to Neural Networks," UCL Press, UK, 1996.
15. Harmon M. E. and Harmon S. S., "Reinforcement Learning: A Tutorial," URL: http://citeseer.nj.nec.com/harmon96reinforcement.html, 1996.
16. Haykin S., "Neural Networks - A Comprehensive Foundation," Second Edition, Prentice-Hall, Inc., 1999.
17. Hu J. and Wellman M.
P., "Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm," In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 242-250, San Francisco, CA, 1998.
18. Hu J. and Wellman M. P., "Experimental Results on Q-Learning for General-Sum Stochastic Games," In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 407-414, 2000.
19. Jung D. and Zelinsky A., "Grounded Symbolic Communication between Heterogeneous Cooperating Robots," Autonomous Robots, 2000.
20. Kaelbling L. P., Littman M. L., and Moore A. W., "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, Vol. 4, pp. 237-285, 1996.
21. Littman M. L., "Markov Games as a Framework for Multi-Agent Reinforcement Learning," In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157-163, New Brunswick, NJ, 1994.
22. Ljungberg M. and Lucas A., "The OASIS Air-Traffic Management System," In Proceedings of the Second Pacific Rim International Conference on AI (PRICAI-92), Seoul, Korea, 1992.
23. Luger G. F. and Stubblefield W. A., "Artificial Intelligence: Structures and Strategies for Complex Problem Solving," Third Edition, Addison Wesley Longman, Inc., 1998.
24. McCarthy J., "What is Artificial Intelligence," URL: http://www-formal.stanford.edu/jmc/whatisai.html, 2001.
25. Mitchell T. M., "Machine Learning," McGraw-Hill International Edition, 1997.
26. Moder J. J. and Elmaghraby S. E., Editors, "Handbook of Operations Research: Foundations and Fundamentals," Van Nostrand Reinhold Company, 1978.
27. Nwana H. S., "Software Agents: An Overview," Knowledge Engineering Review, Vol. 11, No. 3, pp. 1-40, Cambridge University Press, 1996.
28. Nwana H. S. and Ndumu D. T., "A Perspective on Software Agents Research," Knowledge Engineering Review, Vol. 14, No. 2, pp. 125-142, 1999.
29. Sandholm T. W. and Crites R. H., "Multiagent Reinforcement Learning in the Iterated Prisoner's Dilemma," Biosystems, Vol. 37, pp. 147-166, 1995.
30.
Sen S., "Multiagent Systems: Milestones and New Horizons," Trends in Cognitive Sciences, Vol. 1, No. 9, pp. 334-339, 1997.
31. Sheppard J. W., "Multi-Agent Reinforcement Learning in Markov Games," PhD Thesis, Johns Hopkins University, 1997.
32. Stone P. and Veloso M., "Multiagent Systems: A Survey from a Machine Learning Perspective," Autonomous Robots, Vol. 8, No. 3, July 2000.
33. Sutton R. S., "Learning to Predict by the Methods of Temporal Differences," Machine Learning, Vol. 3, pp. 9-44, 1988.
34. Sutton R. S., "On the Significance of Markov Decision Processes," In Proceedings of ICANN'97, pp. 273-282, Springer, 1997.
35. Sutton R. S. and Barto A. G., "Reinforcement Learning: An Introduction," MIT Press, Cambridge, MA, 1998.
36. Sycara K., Decker K., Pannu A., Williamson M., and Zeng D., "Distributed Intelligent Agents," IEEE Expert, Vol. 11, No. 6, pp. 36-46, 1996.
37. Sycara K., "Multiagent Systems," AI Magazine, Vol. 19, No. 2, pp. 79-92, 1998.
38. Sycara K., Decker K., and Zeng D., "Intelligent Agents in Portfolio Management," In Agent Technology: Foundations, Applications, and Markets, eds. Wooldridge M. and Jennings N. R., pp. 267-283, Berlin: Springer, 1998.
39. Tan M., "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents," In Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337, 1993.
40. Thrun S. and Schwartz A., "Issues in Using Function Approximation for Reinforcement Learning," In Proceedings of the Fourth Connectionist Models Summer School, Lawrence Erlbaum, Hillsdale, NJ, 1993.
41. Watkins C. J., "Learning from Delayed Rewards," PhD Thesis, Cambridge University, Cambridge, England, 1989.
42. Winston P. H., "Artificial Intelligence," Second Edition, Addison-Wesley Publishing Company, 1984.
43. Gomaa W. E., Saad A. A., and Ismail M. A., "Learning Joint Coordinated Plans in Multi-agent Systems," In P.W.H. Chung, C.J. Hinde, M. Ali (Eds.): IEA/AIE 2003, LNAI 2718, pp. 154-165,
Springer-Verlag Berlin Heidelberg, 2003.

Appendix A

1. Bellman equation for V^π:

V^π(s) = E_π[R_t | s_t = s]
       = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]
       = E_π[ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s ]
       = Σ_{a∈A} π(s,a) E[r_{t+1} | s_t = s, a_t = a] + γ Σ_{a∈A} π(s,a) E_π[ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s, a_t = a ]
       = Σ_{a∈A} π(s,a) R(s,a) + γ Σ_{a∈A} π(s,a) Σ_{s'∈S} T(s,a,s') E_π[ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_{t+1} = s' ]
       = Σ_{a∈A} π(s,a) R(s,a) + γ Σ_{a∈A} π(s,a) Σ_{s'∈S} T(s,a,s') E_π[ R_{t+1} | s_{t+1} = s' ]
       = Σ_{a∈A} π(s,a) R(s,a) + γ Σ_{a∈A} π(s,a) Σ_{s'∈S} T(s,a,s') V^π(s')
       = Σ_{a∈A} π(s,a) [ R(s,a) + γ Σ_{s'∈S} T(s,a,s') V^π(s') ]

2. Bellman optimality equation for V*:

V*(s) = max_{a∈A} Q*(s,a)
      = max_{a∈A} E*[ R_t | s_t = s, a_t = a ]
      = max_{a∈A} E*[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
      = max_{a∈A} E*[ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s, a_t = a ]
      = max_{a∈A} { E*[ r_{t+1} | s_t = s, a_t = a ] + γ E*[ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s, a_t = a ] }
      = max_{a∈A} { R(s,a) + γ Σ_{s'∈S} T(s,a,s') E*[ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_{t+1} = s' ] }
      = max_{a∈A} { R(s,a) + γ Σ_{s'∈S} T(s,a,s') E*[ R_{t+1} | s_{t+1} = s' ] }
      = max_{a∈A} { R(s,a) + γ Σ_{s'∈S} T(s,a,s') V*(s') }