i i i i Achieving Socially Optimal Outcomes in Multiagent Systems with Reinforcement Social Learning JIANYE HAO and HO-FUNG LEUNG, The Chinese University of Hong Kong In multiagent systems, social optimality is a desirable goal to achieve in terms of maximizing the global efficiency of the system. We study the problem of coordinating on socially optimal outcomes among a population of agents, in which each agent randomly interacts with another agent from the population each round. Previous work [Hales and Edmonds 2003; Matlock and Sen 2007, 2009] mainly resorts to modifying the interaction protocol from random interaction to tag-based interactions and only focus on the case of symmetric games. Besides, in previous work the agents’ decision making processes are usually based on evolutionary learning, which usually results in high communication cost and high deviation on the coordination rate. To solve these problems, we propose an alternative social learning framework with two major contributions as follows. First, we introduce the observation mechanism to reduce the amount of communication required among agents. Second, we propose that the agents’ learning strategies should be based on reinforcement learning technique instead of evolutionary learning. Each agent explicitly keeps the record of its current state in its learning strategy, and learn its optimal policy for each state independently. In this way, the learning performance is much more stable and also it is suitable for both symmetric and asymmetric games. The performance of this social learning framework is extensively evaluated under the testbed of two-player general-sum games comparing with previous work [Hao and Leung 2011; Matlock and Sen 2007]. The influences of different factors on the learning performance of the social learning framework are investigated as well. 15 Categories and Subject Descriptors: I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence General Terms: Algorithms, Experimentation, Performance Additional Key Words and Phrases: Reinforcement social learning, social optimality, multiagent coordination, general-sum games ACM Reference Format: Hao, J. and Leung, H.-F. 2013. Achieving socially optimal outcomes in multiagent systems with reinforcement social learning. ACM Trans. Auton. Adapt. Syst. 8, 3, Article 15 (September 2013), 23 pages. DOI:http://dx.doi.org/10.1145/2517329 1. INTRODUCTION In multiagent systems (MASs), an agent usually needs to coordinate effectively with other agents in order to achieve desirable outcomes, since the outcome not only depends on the action it takes but also the actions taken by others. How to achieve effective coordination between agents in MASs is significant and challenging especially when the environment is unknown and each agent only has access to a limited amount of information. Multiagent learning has been proved to be a promising and effective approach to allow the agents to coordinate their actions on optimal outcomes through repeated interactions [Hoen et al. 2005; Panait and Luke 2005]. Authors’ address: J. Hao and H.-F. Leung, Department of Computer Science and Engineering, The Chinese University of Hong Kong; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. 
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. c 2013 ACM 1556-4665/2013/09-ART15 $15.00 DOI:http://dx.doi.org/10.1145/2517329 ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i 15:2 J. Hao and H.-F. Leung Fig. 1. An instance of the prisoner’s dilemma game. Fig. 2. An instance of the anticoordination game. The most commonly used solution concepts in current multiagent learning literature are Nash equilibrium and its variants (e.g., correlated equilibrium) adopted from Game theory [Osborne and Rubinstein 1994], and various learning algorithms have been proposed targeting at converging to these solution concepts [Greenwald and Hall 2003; Hu and Wellman 1998; Kapetanakis and Kudenko 2002; Littman 1994]. Convergence is a desirable property to pursue in multiagent system, however, converging to Nash equilibrium solutions may not be the most preferred outcome since they do not guarantee that Pareto optimal solutions can be produced. For example, in the well-known prisoner’s dilemma game (see Figure 1), the agents will converge to mutual defection (D, D) if the Nash equilibrium solution is adopted, while both players could have obtained much higher payoffs by reaching the Pareto optimal outcome of mutual cooperation (C, C), which is also optimal from the overall system’s perspective. One desirable alternative solution concept is called social optimality which targets at maximizing the sum of all agents’ payoffs involved. A socially optimal outcome is not only optimal from the system’s perspective but also always Pareto optimal. One major challenge that the agents are faced with is how they should select their corresponding actions to prevent miscoordination, when there coexist multiple socially optimal joint actions. One supporting example is the anticoordination game in Figure 2, in which two socially optimal joint actions ((C, D) and (D, C)) coexist. In this game the agents need to carefully coordinate and select their actions to avoid miscoordination on (C, C) or (D, D) through repeated interactions. The difficulty of coordination can be further increased when there exist a large number of agents in the system in which the interaction partners of each agent are not fixed. In this case, an agent’s current strategy under which it can coordinate on one socially optimal outcome with its current interaction partner may fail when it interacts with another agent next time. For example, considering the anticoordination game again, an agent with the policy of choosing action D can successfully coordinate on the socially optimal outcome (D, C) or (C, D), when it interacts with another agent choosing action C next time. However, miscoordination (achieving (D, D)) occurs when it interacts with another agent choosing action D. Thus, it is a particularly challenging problem to ensure that ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. 
i i i i i i i i Achieving Socially Optimal Outcomes in Multiagent Systems 15:3 all agents can learn to converge to a consistent policy of achieving one socially optimal outcome based on their local information. One line of previous work is that a number of learning strategies [Crandall and Goodrich 2005; Hao and Leung 2012; Verbeeck et al. 2006] have been proposed for agents to achieve socially optimal outcomes in the context of two-player repeated normal-form games. However, in the real world environments, the interactions among agents coexisting in a system are usually not very frequent. It may take a long time before an agent has the opportunity to interact with a previously interacted agent again. To more accurately model the practical interactions among agents in a system, we consider the problem of coordinating on socially optimal outcomes in an alternative framework different from the fixed-player repeated interaction scenario. Under this framework, there exists a population of agents in which each agent randomly interacts with another agent from the population each round. The interaction between each pair of agents (in the same group) is still modeled as a two-player normal-form game, and the payoff each agent receives in each round only depends on the joint action of itself and its interaction partner. Each agent learns its policy through repeated interactions with multiple agents, which is termed as social learning [Sen and Airiau 2007], in contrast to the case of learning from repeated interactions against the same opponent in the two-agents case. This learning framework is similar to the one commonly adopted in evolutionary game theory [Maynard 1982], except that the learning strategies of the agents are not necessarily based on evolutionary learning (copy and mutation). Compared with the fixed-player repeated interaction scenario, this social learning framework can better reflect the realistic interactions among agents and has been commonly adopted as the framework for investigating different multiagent problems such as norm emergence problem [Sen and Airiau 2007; Villatoro et al. 2011]. Besides, as previously mentioned, it is more challenging to ensure that all agents can learn a consistent policy of achieving socially optimal outcome(s) under the social learning framework compared with the two-agent repeated interaction framework. Under this social learning framework, previous work [Hales and Edmonds 2003; Matlock and Sen 2007, 2009] mainly focuses on modifying the interaction protocol from random interaction to tag-based interaction to evolve coordination on socially optimal outcomes. However, the agents’ decision making processes in previous work are based on evolutionary learning (i.e., imitation and mutation) following evolutionary game theory and thus the deviation of percentage of coordination is significant even though the average coordination rate is good. Another limitation of previous approaches is that it is explicitly assumed that each agent has access to all other agents’ information (i.e., their payoffs and actions) in previous rounds, which may be unrealistic in many practical situations due to communication limitations. Finally, all previous work only focuses on two representative symmetric games: prisoner dilemma game and anticoordination game, and cannot be directly applied in the settings of asymmetric games. 
In asymmetric games, the roles of the agents (row agent or column agent) can have significant influences on the agents’ behaviors and the game outcomes, which, however, are not distinguished in previous work’s learning approaches [Hales and Edmonds 2003; Matlock and Sen 2007, 2009]. To handle the aforementioned problems, in our social learning framework, we introduce the observation mechanism, under which each agent is only allowed to have access to a limited number of agents’ information in the system, in contrast to the global observation assumption commonly adopted in previous work. In this way, the amount of communication required among agents can be greatly reduced, making our learning framework more applicable for practical scenarios with high communication cost. Besides, the learning strategy adopted by each agent in our learning framework is based on reinforcement learning technique instead of evolutionary learning. Each ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i 15:4 J. Hao and H.-F. Leung agent explicitly keeps the record of its current state (row or column player) in its learning strategy, and learn its optimal policy for each state independently. Therefore, our learning framework is suitable for the settings of both symmetric and asymmetric games. Specifically, the learning strategy we design here is an extension of the traditional Q-learning algorithm, in which we augment each agent’s learning process with an additional step of determining how to update its Q-values, that is, individually rational update or socially rational update. Individually rational update refers to the way of updating based on each agent’s individual payoff only, while socially rational update refers to the way of updating based on the sum of all the agents’ payoffs in the same group. Which way is better to update the agents’ strategies depends on the type of game they are currently playing, and the agents should be able to adaptively adjust their update schemes. To this end, each agent is associated with a weighting factor for each update scheme, and a greedy strategy is proposed to adjust the weighting factors adaptively in a periodical way. The successful achievement of socially optimal outcomes requires the cooperative coordinations among agents on their update schemes. By allowing the agents to adaptively adjust their update schemes, there are two major benefits as follows. First, the agents can automatically determine the appropriate way of updating their strategies in different learning environments (games) to the best interest of themselves. Second, when the system is within an open environment in which there may exist some uncooperatively purely selfish agents which always update their strategies in the individually rational manner, the agents can be resistant to the exploitation of those uncooperative agents. The agents can always avoid being exploited by those selfish agents by adopting individually rational update scheme when there is no benefit for adopting socially rational update scheme. We empirically evaluate the performance of this social learning framework extensively in the testbed of two-player two action games, and simulation results show that the agents can learn to achieve full coordination on socially optimal outcomes in most of the games. 
We also compare the performance with two previous learning frameworks [Hao and Leung 2011; Matlock and Sen 2007] and show that higher level of coordination under our social learning framework can be achieved with no deviation. Besides, we also show that the agents in the social learning framework are partially resistant to the exploitation from purely selfish agents: they can automatically select individually rational update scheme to play against those purely selfish agents when there are no benefits for them to adopt socially rational update anymore. The remainder of this article is organized as follows. In Section 2, we give an overview of different learning frameworks proposed in previous work to achieve coordination on (socially) optimal outcomes. In Section 3.1, we review some background in game theory and introduce the problem we target at. In Section 3, the social learning framework we propose is introduced in details. In Section 4, we present our simulation results in a large set of two-player games and also the influences of different factors are also considered and investigated. Lastly, conclusion and future work are given in Section 5. 2. PREVIOUS WORK Learning to achieve coordination on (socially) optimal outcomes in the setting of a population of agents has been received a lot of attention in multiagent learning literature [Chao et al. 2008; Hales and Edmonds 2003; Hao and Leung 2011; Matlock and Sen 2005, 2007, 2009]. Previous work mainly focuses on the coordination problem in two different games: prisoner’s dilemma game and anticoordination games, which are representatives of two different types of coordination scenarios. For the first type of games, ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i Achieving Socially Optimal Outcomes in Multiagent Systems 15:5 the agents need to coordinate on the outcomes with identical actions to achieve socially optimal outcomes, while in the second type of games the achievement of socially optimal outcomes requires the agents to coordinate on outcomes with complementary actions. In previous work, tag has been extensively used as an effective mechanism to bias the interactions among agents in the population. Using tags to bias the interaction among agents in prisoner dilemma game is first proposed in Holland et al. [1986] and Allison [1992], and a number of further investigations on the tag mechanism in one-shot or iterated prisoner dilemma game are conducted in, Riol and Cohen [2001], Hales [2000], and Howley and O’Riordan [2005]. Hales and Edmonds [2003] firstly introduce the tag mechanism originated in other fields such as artificial life and biological science into multiagent systems research to design effective interaction mechanism for autonomous agents. In their model for prisoner dilemma game, each agent is represented by L + 1 bits. The first bit indicates the agent’s strategy (i.e., playing C or D), and the remaining L bits are the tag bits, which are used for biasing the interaction among the agents and are assumed to be observable by all agents. In each generation, each agent is allowed to play the prisoner dilemma game with another agent owning the same tag string. If an agent cannot find another agent with the same tag string, then it will play the prisoner dilemma game with a randomly chosen agent in the population. 
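The self-matching pairing rule just described can be sketched as follows; this is a minimal Python illustration of our reading of the Hales and Edmonds model, and the agent representation and function name are hypothetical, not taken from their work.

import random

def tag_based_pairing(population):
    # Each agent is a record with a strategy bit and a tag string; an agent
    # prefers a partner with an identical tag string and otherwise plays a
    # randomly chosen member of the population (our reading of the model).
    pairs = []
    for agent in population:
        same_tag = [other for other in population
                    if other is not agent and other["tag"] == agent["tag"]]
        others = [other for other in population if other is not agent]
        partner = random.choice(same_tag) if same_tag else random.choice(others)
        pairs.append((agent, partner))
    return pairs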
The agents in the next generation are formed via fitness (i.e., the payoff received in the current generation) proportional reproduction scheme. Besides, low level of mutation is applied on both the agents’ strategies and tags. This mechanism is demonstrated to be effective in promoting high level of cooperations among agents when the length of tag is large enough. They also apply the same technique to other more complex task domains which shares the same dilemma with prisoner dilemma game (i.e., the socially rational action cannot be achieved if the agents are making purely individually rational decisions), and show that the tag mechanism can motivate the agents towards choosing socially rational behaviors instead of individually rational ones. However, there are some limitations of this tag mechanism. In their work, the key point in supporting the successful emergence of socially optimal outcomes is that the agents mimic the strategy and tags of other more successful agents, which finally leads to the achievement of socially optimal outcomes among most of the interacting agents in the system. However, since the interactions among agents are based on self-matching tag scheme, each agent can only interact with others with the same tag, and thus always play the same strategy with the interacting partners. Therefore, socially optimal outcomes can be obtained only in the cases where the socially optimal outcome corresponds to a pair of identical actions (e.g., (C, C) in prisoner dilemma game). Chao et al. [2008] propose a tag mechanism similar with the one proposed in Hales and Edmonds [2003] and the only difference is that mutation is applied on the agents’ tags only. They evaluate the performance of their tag mechanism with a number of learning mechanisms commonly used in the field of multiagent system such as Generalized Tit-For-Tat [Chao et al. 2008], WSLS (Win stay, loose shift) [Nowak and Sigmund 1993], basic reinforcement learning algorithm [Wakano and Yamamura 2001] and evolutionary learning. They conduct their evaluations using two games: prisoner dilemma game and coordination game. Simulation results show that their tag mechanism can have comparable or better performance than other learning algorithms they compared under both games. However, this tag mechanism suffers from similar limitations with Hales and Edmonds’ model [Hales and Edmonds 2003]: it only works when the agents are only needed to learn the identical strategies in order to achieve socially optimal outcomes and this is indeed the case in both games they evaluated. It cannot handle the case when complementary strategies are needed to achieve coordination on socially optimal outcomes. ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i 15:6 J. Hao and H.-F. Leung Matlock and Sen [2005] reinvestigate the Hales and Edmonds’ model [Hales and Edmonds 2003] and the reason why large tags are required to sustain cooperation. The main explanation they come up with is that in order to keep high level of cooperation sufficient number of cooperative groups is needed to prevent mutation from infecting most of the cooperative groups with defectors at the same time. There are two ways to guarantee the coexistence of sufficient number of cooperative groups, that is, adopting high tag mutation rate or using large tags, which are all verified by simulation results. 
The limitation of Hales and Edmonds’ model [Hales and Edmonds 2003] is also pointed out in this article, and they also provide an example (i.e., anticoordination game (Figure 2) in which Hales and Edmonds’ model fails. In anticoordination game, complementary actions (either (C, D) or (D, C)) are requires for agents to coordinate on socially optimal outcomes. However, under the self-matching tag mechanism in Hales and Edmonds’ model, each agent only interacts with others with the same tag strings, and each interacting pair of agents usually share the same strategy. Therefore, their mechanism only works when identical actions are needed to achieve socially optimal outcomes, while fails when it comes to games like anticoordination game. Considering the limitation of Hales and Edmonds’ model, Matlock and Sen [2007, 2009] propose three new tag mechanisms to tackle the limitation. The first tag mechanism is called Tag matching patterns (one and two-sided). Each agent is equipped with both a tag and a tag-matching string, and agent i is allowed to interact with agent j if its tag-matching string match agent j ’s tag in one-sided matching. For two-sided matching, it also requires agent j ’s tag-matching string to match agent i’s tag in order to allow the interaction between agent i and j. By introducing this tag mechanism, the agents are allowed to play different strategy with their interacting partners. The second mechanism is payoff sharing mechanism, which requires each agent to share part of its payoff with its opponent. This mechanism is shown to be effective in promoting socially optimal outcomes in both prisoner dilemma game and anticoordination game, however, it can be applied only when side-payment is allowed. The last mechanism they propose is called paired reproduction mechanism. It is a special reproduction mechanism that makes copies of matching pairs of individuals with mutation at corresponding place on the tag of one and the tag-matching string of the other at the same time. The purpose of this mechanism is to preserve the matching between this pair of agents after mutation in order to promote the survival rate of cooperators. Simulation results show that this mechanism can help sustaining the percentage of agents coordinating on socially optimal outcomes at a high level in both the prisoner dilemma and anticoordination games, and this mechanism can be applied in more general multiagent settings where payoff sharing is not allowed. However, similar to Hales and Edmonds’ model [Hales and Edmonds 2003], these mechanisms all heavily depend on mutation to sustain the diversity of groups in the system, thus this leads to the undesired result that the variation of the percentage of coordination on socially optimal outcomes is very high all the time (shown in details later in the experimental Section 4.1.3). Besides, the paired reproduction mechanism is shown to be effective in two types of symmetric games (prisoner dilemma game and anticoordination game), but it cannot be applied to asymmetric games. The reason is that in asymmetric games, given a pair of interacting agents and their corresponding strategies, each agent may receive different payoffs depending on its current role (row or column agent). For example, considering the asymmetric game in Figure 3, suppose that a pair of interacting agents (agent A and agent B) choosing action D and C, respectively. 
In this situation, agent A’s payoff is different when it acts as the row (payoff of 2) or column player (payoff of 4), even though both agents’ strategies keep unchanged. However, in Matlock and Sen’s learning framework, the role of each agent during interaction is not distinguished. They implicitly assume that all ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i Achieving Socially Optimal Outcomes in Multiagent Systems 15:7 Fig. 3. Asymmetric anticoordination game: the agents need to coordinate on the outcome (D, C) to achieve socially rational solution. agents’ roles are always the same in accordance with the definition of symmetric game. Thus the payoffs of the interacting agents become nondeterministic when it comes to asymmetric games. Therefore, their learning framework is only applicable to those games that are agent-symmetric (prisoner’s dilemma game and anticoordination game), that is, the roles of the agents have no influence on their payoffs received. Hao and Leung [2011] propose a novel tag-based learning framework in which each agent employs a reinforcement learning strategy instead of evolutionary learning to make its decision. In their learning framework, they propose two different types of update schemes available for the agents to update their policy: individual learning update and social learning update. For each agent, each learning update scheme is associated with a weighting factor to determine which update scheme to use to update its policy, and these two weighting factors are adjusted adaptively based on the agent’s past performance in a periodic way. Another difference from pervious tag-based framework is that the reinforcement learning strategy the agents adopt is used for controlling and updating their action choices only and the agents’ tags are given at the beginning and not evolving with time. Compared with previous approach based on evolutionary learning [Matlock and Sen 2007, 2009], the agents under their learning framework are able to achieve more stable and high-level coordination on socially optimal outcomes in both the prisoner dilemma game and anticoordination game. However, similar with Matlock and Sen’s model [Matlock and Sen 2007], in their learning framework the role of each agent during interaction is also not distinguished. Therefore, their framework also can only be applied in the context of symmetric games for the same reason previously explained. To sum up, most of the previous work [Chao et al. 2008; Hales and Edmonds 2003; Matlock and Sen 2005, 2007, 2009] focuses mainly on modifying the interaction protocol of the agents through introducing the concept of tag-based mechanism, to handles the problem of coordinating on socially optimal outcomes among a population of agent. However, in this work, we handle this problem mainly from the perspective of learning strategy design, that is, reinforcement social learning with adaptive selection between socially rational update and individually rational update. The interaction protocol of the agents is not modified in our learning framework and still follows the most commonly adopted setting, that is, each agent randomly interacts with another agent in the system each round. 3. SOCIAL LEARNING FRAMEWORK 3.1. 
Problem Description: Reinforcement Learning in Multiagent Systems Normal-form games have been commonly used to model the interactions between agents when studying multiagent learning and coordination problems in multiagent systems. Usually, the agents are assumed to repeatedly play a single stage game ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i 15:8 J. Hao and H.-F. Leung simultaneously and independently, and receive the corresponding reward which is determined by the joint action of all agents each round. We omit the basic concepts of game theory and interested readers may refer to Osborne and Rubinstein [1994] for details. The multiagent learning criteria proposed in Bowling and Veloso [2003] require that an agent should be able to converge to a stationary policy against some class of opponent (convergence) and the best-response policy against any stationary opponent (rationality). If both agents adopt rational learning strategies and their strategies converge, then they will converge to the Nash equilibrium of the stage game. Indeed, a large bundle of previous work adopts the solution concept of Nash equilibrium or its variants as the learning goal when they design their learning algorithms [Bowling and Veloso 2003; Claus and Boutilier 1998; Conitzer and Sandholm 2006; Greenwald and Hall 2003; Hu and Wellman 1998; Littman 1994]. Unfortunately, in many cases Nash equilibrium is not optimal in that it does not guarantee that the agents receive their best payoffs. One well-known example is the prisoner’s dilemma game shown in Figure 1. By converging to the Nash equilibrium (D, D), both agents obtain a low payoff of 1, and also the sum of the agents’ payoffs is minimized. On the other hand, both agents could have received better payoffs by converging to the nonequilibrium outcome of (C, C) in which the sum of the players’ payoffs is maximized. An alternative solution concept is that of social optimality, that is, those outcomes under which the sum of all players’ payoffs is maximized. This solution concept is optimal from the system’s perspective, but may not be in the best interest of all individual agents involved. For example, consider a two-player anticoordination game in Figure 2. There exist two socially optimal outcomes (C, D) and (D, C), and the agents have conflicting interests in that agent A prefers the outcome (C, D), while agent B prefers another outcome (D, C). This problem can be solved by proposing that both socially optimal outcomes are achieved in an alternative way [Hao and Leung 2010; Verbeeck et al. 2006], and thus both agents can obtain both equal and highest payoff on average. Until now most previous work studies the multiagent learning problem within the setting of repeated (Markov) games, which is based on an implicit assumption that two (or more) fixed agents have to interact with each other repeatedly for a long time to learn the optimal policy from past interaction experience. However, the interactions among agents may be very sparse in practice, and it may take a long time before an agent can have the chance to interact with another previously met agent again. Therefore, the agents need to be able to learn the best policy through repeated interacting with different agents each time. To this end, we consider an alternative setting involving a population of agents in which each agent interacts with another agent randomly chosen from the population each round. 
The interaction between each pair of agents is modeled as a single-stage normal-form game, and the payoff each agent receives in each round depends only on the joint action of itself and its interaction partner. Under this learning framework, each agent learns its policy through repeated interactions with multiple agents, which is termed social learning [Sen and Airiau 2007], in contrast to learning from repeated interactions against the same opponent in the two-agent case. In this work, we aim at solving the problem of how the agents can learn to coordinate on socially optimal outcomes in general-sum games through social learning. The overall interaction protocol under the social learning framework is presented in Figure 4. During each round, each agent interacts with another agent randomly chosen from the population. The interaction between agents is modeled as a general-sum game, and during each interaction one agent is randomly assigned as the row player and the other agent as the column player. At the end of each round, each agent updates its policy based on the learning experience it has received so far.

1: for a number of rounds do
2:   repeat
3:     two agents are randomly chosen from the population, and one of them is assigned as the row player and the other one as the column player
4:     both agents play a two-player normal-form game and receive their corresponding payoffs
5:   until all agents in the population have been selected
6:   for each agent in the population do
7:     update its policy based on its past experience
8:   end for
9: end for

Fig. 4. Overall interaction protocol of the social learning framework.

3.2. Observation

In each round, each agent is only allowed to interact with one other agent randomly chosen from the population, and each pair of interacting agents is regarded as being in the same group. On one hand, during each (local) interaction, each agent typically can perceive its own group's information (i.e., the action choices and payoffs of itself and its interaction partner), which is also known as the perfect monitoring assumption [Brafman and Tennenholtz 2004]. This assumption is commonly adopted in previous work in two-player repeated interaction settings. On the other hand, in the social learning framework, different from the two-player interaction scenario, agents may be exposed to information of agents from other groups. Allowing agents to observe the information of other agents outside their direct interactions may result in a faster learning rate and facilitate coordination on optimal solutions. In previous work [Hales and Edmonds 2003; Matlock and Sen 2007], it is assumed that each agent has access to all the other agents' information (their action choices and payoffs) in the system (global observation), which may be impractical in realistic environments due to the high communication cost involved. To balance the tradeoff between global observation and local interaction, here we only allow each agent to have access to the information of M other groups randomly chosen from the system at the end of each round. If M = N/2 (N is the total number of agents), this becomes equivalent to the case of global observation; if M = 0, it reduces to the case of purely local interaction.
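To make the protocol in Figure 4 and the observation step concrete, the following minimal sketch (in Python) shows how one round could be organized. It is only an illustration of the framework described here, not the authors' implementation, and all identifiers (play_round, choose_action, update, and so on) are hypothetical.

import random

def play_round(agents, game, M):
    # One round of the protocol in Figure 4 (illustrative sketch).
    # `agents` is a list of learner objects exposing choose_action(role) and
    # update(own_group, observed_groups); `game` maps a joint action to the
    # (row payoff, column payoff) pair.
    random.shuffle(agents)
    groups = []
    # Pair the agents at random; within each pair, one agent acts as the
    # row player and the other as the column player.
    for i in range(0, len(agents) - 1, 2):
        row, col = agents[i], agents[i + 1]
        a_row = row.choose_action("Row")
        a_col = col.choose_action("Column")
        r_row, r_col = game[(a_row, a_col)]
        groups.append((row, a_row, r_row, col, a_col, r_col))
    # Observation step: besides its own group, each agent is shown M other
    # groups sampled at random (M = 0 is purely local interaction,
    # M = N/2 recovers global observation).
    for g in groups:
        others = random.sample([h for h in groups if h is not g],
                               min(M, len(groups) - 1))
        g[0].update(own_group=g, observed_groups=others)
        g[3].update(own_group=g, observed_groups=others)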
The value of M represents the degree of connectivity in terms of information sharing among different groups. By spreading this information into the population, the observation mechanism serves as an alternative biased exploration mechanism that accelerates the agents' convergence to the desired outcomes while keeping the communication overhead at a low level. As will be shown in the experimental part (Section 4), each agent only needs a small amount of information from other groups (M ≪ N/2) to guarantee coordination on socially optimal outcomes among all agents; thus the communication cost in the system is significantly reduced compared with previous work [Hales and Edmonds 2003; Matlock and Sen 2007].

3.3. Learning Strategy

When an agent interacts with its partner in each round, it needs to employ a learning strategy to make its decisions. In general-sum games, and especially in asymmetric games, the payoff matrices of the row and column players are usually different, and thus each agent may need to behave differently according to its current role (either as the row player or the column player) to reach socially optimal outcomes. To this end, we propose that each agent holds a pair of strategies, one used as the row player and the other used as the column player, to play with any other agent from the population. The strategy we develop here is based on the Q-learning technique [Watkins and Dayan 1992], in which each agent holds a Q-value Q(s, a) for each action a in each state s ∈ {Row, Column}, which keeps a record of the action's past performance and serves as the basis for making decisions. We assume that each agent knows its current role during each interaction. Accordingly, each agent chooses its action based on the set of Q-values corresponding to its current role during each interaction.

3.3.1. How to Select the Action. Each agent chooses its action based on the ε-greedy mechanism, which is commonly adopted for performing biased action selection in the reinforcement learning literature. Under this mechanism, each agent chooses the action with the highest Q-value with probability 1 − ε, exploiting the action with the currently best performance (choosing randomly in case of a tie), and makes a random choice with probability ε for the purpose of exploring new actions with potentially better performance.

3.3.2. How to Update the Q-Values. According to Section 3.2, in each round the information each agent has access to consists of its own group's information and the information of M other groups randomly chosen from the population. The traditional and commonly adopted approach for a selfish agent is to use the individual payoff received from performing an action as the reinforcement signal to update its strategy (Q-values). However, since an agent is situated in a social context, its selfish behaviors can have implicit and detrimental effects on the behaviors of other agents coexisting in the environment, which may in turn lead to undesirable results for both the individual agents and the overall system. Accordingly, an alternative notion proposed for solving this problem is social rationality [Hogg and Jennings 1997, 2001], under which each agent evaluates the performance of each action in terms of its worth to the overall group instead of to itself.
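Before describing the two update schemes in detail, the role-indexed Q-table and the ε-greedy rule of Section 3.3.1 can be sketched as follows; this is a minimal illustration with hypothetical names, not the authors' code.

import random

def choose_action(q_values, role, actions, epsilon=0.1):
    # q_values[role][a] stores Q(s, a) for the current role s in {"Row", "Column"}.
    if random.random() < epsilon:
        return random.choice(actions)      # explore with probability epsilon
    best = max(q_values[role][a] for a in actions)
    # Exploit: pick the highest-valued action, breaking ties at random.
    return random.choice([a for a in actions if q_values[role][a] == best])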
We consider an agent as an individually rational agent if its strategy is updated based on its own payoff only, and this kind of update scheme is called individually rational update; we consider an agent as a socially rational entity if its strategy is updated based on the sum of all the agents' payoffs in the same group, and this kind of update scheme is called socially rational update. Which update scheme is better depends on the type of game being played. For cooperative games, the difference between socially rational update and individually rational update is irrelevant, since the agents share common interests. For competitive (zero-sum) games, it is meaningless to perform socially rational update, since the sum of both players' payoffs under every outcome is always equal to zero, and it is more reasonable for the agents to update their strategies based on their individual payoffs only. For mixed games, it can be beneficial for the agents to update their strategies in the socially rational manner, which can facilitate coordination on win-win outcomes rather than loss-loss outcomes (e.g., in the prisoner's dilemma game). On the other hand, it may be better to adopt individually rational update to prevent malicious exploitation when interacting against individually rational agents. Since we assume that the agents have no prior information about the game they are playing, each agent should learn to determine the appropriate way of updating its strategy adaptively, that is, updating its strategy based on either its individual payoff, the joint payoff(s) of the group(s), or a combination of both. To handle this individually/socially rational update selection problem, we propose that each agent i is equipped with two weighting factors w_{i,il} and w_{i,sl}, representing its individually rational and socially rational degrees, satisfying the constraint w_{i,il} + w_{i,sl} = 1.

ALGORITHM 1: Adaptive Weighting Factors Updating Strategy for Agent i
1: initialize the values of w_{i,il} and w_{i,sl}
2: initialize the updating direction UD^0, i.e., towards increasing the weight on individual learning or on social learning
3: for each period T (at its end) do
4:   keep a record of its own performance (e.g., the average payoff avg_T) within this period^1
5:   if it is not the first adjustment period (T > 1) then
6:     if avg_T > avg_{T−1} or |avg_T − avg_{T−1}| / max(avg_T, avg_{T−1}) ≤ Threshold then
7:       keep the updating direction UD^T unchanged
8:     else
9:       reverse the updating direction UD^T
10:    end if
11:  else
12:    keep the updating direction UD^T unchanged
13:  end if
14:  if UD^T is towards social learning then
15:    w_{i,sl} = w_{i,sl} + δ
16:    w_{i,il} = w_{i,il} − δ
17:  else
18:    w_{i,il} = w_{i,il} + δ
19:    w_{i,sl} = w_{i,sl} − δ
20:  end if
21: end for

In general, each agent's Q-values are updated in the individually rational and socially rational manners as follows:

Q_i^{t+1}(s, a) = Q_i^t(s, a) + α^t(s) w_{i,il}^t (r_{i,il}(s, a) − Q_i^t(s, a)),  s ∈ {Row, Column},   (1)

Q_i^{t+1}(s, a) = Q_i^t(s, a) + α^t(s) w_{i,sl}^t (r_{i,sl}(s, a) − Q_i^t(s, a)),  s ∈ {Row, Column}.   (2)
Here Q_i^t(s, a) is agent i's Q-value for action a at time step t in state s, α^t(s) is the learning rate in state s at time step t, r_{i,il}(s, a) is the payoff agent i uses to update its Q-value of action a in the individually rational manner, and r_{i,sl}(s, a) is the payoff agent i uses to update its Q-value in the socially rational manner. We discuss how to determine the values of r_{i,il}(s, a) and r_{i,sl}(s, a) later. Notice that each agent shares the same pair of weighting factors w_{i,il} and w_{i,sl} no matter whether it is the row player or the column player. The value of α^t(s) is decreased gradually: α^t(s) = 1 / No.(s), where No.(s) is the number of times that state s has been visited until time t.

In order to optimize its own performance in different environments, each agent's weighting factors should be adaptively updated based on its own past performance. The natural question is therefore how to update the values of w_{i,il} and w_{i,sl} for each agent. Here we propose the adaptive strategy shown in Algorithm 1: each agent adaptively updates its two weighting factors in a greedy way based on its own past performance periodically.

^1 In the current implementation, we only take each agent's payoff during the second half of each period into consideration in order to obtain a more accurate evaluation of the actual performance of the period.

To update its weighting factors, each agent maintains an updating direction, that is, towards either increasing the weight on individually rational update or increasing the weight on socially rational update. We consider a fixed number m of rounds as a period. At the end of each period, each agent i adaptively adjusts its two weighting factors by δ following the current updating direction, where δ is the adjustment degree. For example, if the current updating direction is to increase the weight on socially rational update, its weighting factor w_{i,sl} is increased by δ and the value of w_{i,il} is decreased by δ accordingly (Lines 14–20). During each period, each agent also keeps a record of its performance, that is, the average payoff obtained within the period. The updating direction is adjusted by comparing the relative performance between two consecutive periods: an agent switches its current updating direction to the opposite one if its performance in the current period is worse than that of the previous period (Lines 5–10). For the first adjustment period, since there is no previous performance record for comparison, each agent simply keeps its initial updating direction unchanged and updates the weighting factors accordingly (Lines 11–12 and 14–20).

If an agent i updates its Q-values in the individually rational manner, it acts as an individually rational entity that uses the information on the individual payoffs received from performing different actions. Let us denote by S_{il}^t the set of information consisting of the action choices and payoffs of itself and of the same-role agents from the other M groups, that is, S_{il}^t = {⟨a_i^t, r^t⟩, ⟨b_1^t, r_1^t⟩, ⟨b_2^t, r_2^t⟩, . . . , ⟨b_M^t, r_M^t⟩}. Here ⟨a_i^t, r^t⟩ is the action and payoff of agent i itself, and the rest are the actions and payoffs of the M agents with the same role from the other M groups.
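To summarize how such an observed set is turned into a reinforcement signal and folded into Eqs. (1) and (2), the following sketch may help; the construction r(a) · freq(a) used here is detailed in the paragraph that follows (the socially rational case is analogous, with group payoffs R), the interpretation of freq(a) as a fraction of the observed entries is our assumption, and all names are illustrative rather than the authors' implementation.

import random

def reinforcement_signal(observations):
    # `observations` is a list of (action, payoff) pairs: the agent's own pair
    # plus those of the same-role agents from the M observed groups.  For the
    # individually rational scheme the payoff is the individual payoff r; for
    # the socially rational scheme it is the group payoff R.
    best = max(p for _, p in observations)
    candidates = [a for a, p in observations if p == best]
    action = random.choice(candidates)          # break ties at random
    # freq(a): how often the chosen action appears with the highest payoff
    # (taken here as a fraction of the observed entries -- an assumption).
    freq = sum(1 for a, p in observations
               if a == action and p == best) / len(observations)
    return action, best * freq                  # r_{i,il} (or r_{i,sl})

def weighted_q_update(q_values, counts, role, action, signal, weight):
    # One application of Eq. (1) or Eq. (2) with alpha^t(s) = 1 / No.(s).
    counts[role] += 1
    alpha = 1.0 / counts[role]
    q = q_values[role][action]
    q_values[role][action] = q + alpha * weight * (signal - q)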
At the end of each round t, agent i picks the action a with the highest payoff (randomly choosing one action in case of a tie) from the set S_{il}^t, and updates its Q-value of action a according to Eq. (1) by assigning r_{i,il} the value r(a) · freq(a). Here r(a) is the highest payoff of action a among all elements in the set S_{il}^t, and freq(a) is the frequency with which action a occurs with the highest reward r(a) in the set S_{il}^t.

If an agent i learns its Q-values based on socially rational update, it updates its Q-values based on the joint benefits of the agents in the same group. Let us denote by S_{sl}^t the set of information consisting of the action choices and group payoffs of itself and of the same-role agents from the other M groups, that is, S_{sl}^t = {⟨a_i^t, R^t⟩, ⟨b_1^t, R_1^t⟩, ⟨b_2^t, R_2^t⟩, . . . , ⟨b_M^t, R_M^t⟩}. Here ⟨a_i^t, R^t⟩ is the action of agent i and the sum of the payoffs of agent i and its interaction partner, and the rest are the actions and group payoffs of the agents with the same role from the other M groups. Similar to the case of individually rational update, at the end of each round t, each agent i picks the action a with the highest group payoff (randomly choosing one action in case of a tie) from the set S_{sl}^t, and updates the value of Q(s, a) according to Eq. (2) by assigning r_{i,sl} the value R(a) · freq(a). Here R(a) is the highest group payoff of action a among all elements in the set S_{sl}^t, and freq(a) is the frequency with which action a occurs with the highest group reward R(a) in the set S_{sl}^t.

When all agents behave as socially rational entities when updating their Q-functions, it is not difficult to see that for any general-sum game G, the agents are actually learning over its corresponding fully cooperative game G′, in which the payoff profile (p_x, p_y) of each outcome (x, y) in G is changed to (p_x + p_y, p_x + p_y) in G′. From the definition of socially rational outcome, we know that every socially rational outcome in the original game G corresponds to a Pareto-optimal Nash equilibrium in the transformed game G′, and vice versa. Therefore, the goal of learning to achieve socially rational outcomes in game G becomes equivalent to that of learning to achieve a Pareto-optimal Nash equilibrium in the transformed cooperative game G′.

3.4. Property of the Social Learning Framework

From Section 3.3, we know that in the social learning framework each agent holds a pair of strategies, one used as the row player and the other used as the column player, to play with any other agent from the population. The rationale behind this is that the optimal strategies for an agent to achieve socially optimal outcomes under different roles (row or column player) are not necessarily the same, and this is especially true when it comes to asymmetric games. In general, given one socially optimal outcome (x_s, y_s) in game G, where action x_s and action y_s are not necessarily the same, let us suppose that the agents have finally learned a consistent optimal policy of coordinating on this outcome (x_s, y_s). Specifically, under this optimal policy, each agent always chooses action x_s as the row player and action y_s as the column player.
Since each agent is assigned as the row or column player randomly each round, as specified by the interaction protocol in Figure 4, the average payoff of each agent approaches p̄ = (p_{x_s} + p_{y_s}) / 2 in the limit, where p_{x_s} and p_{y_s} are the payoffs of the row and column agents under the outcome (x_s, y_s). This indicates that the social learning framework has the desirable property that the agents are able to achieve both fair (equal) and highest payoffs whenever they can successfully learn a consistent policy of coordinating on one socially optimal outcome. Formally, we have the following property.

PROPERTY 3.1. If the agents finally learn a consistent policy of coordinating on one socially optimal outcome (x_s, y_s), then each agent will obtain both an equal and the highest payoff of p̄ on average.

PROOF. From the preceding analysis it is obvious that the agents' average payoffs are fair (equal) and equal to p̄ = (p_{x_s} + p_{y_s}) / 2, so we only need to prove that this is also the highest payoff each agent can obtain. Suppose that there exists an agent i that can achieve an average payoff p̄′ > (p_{x_s} + p_{y_s}) / 2 by coordinating on an outcome (x_s′, y_s′) different from (x_s, y_s). Since the agents within the system are symmetric, all other agents in the system should receive the same average payoff p̄′ as well. Denoting by p_{x_s′} and p_{y_s′} the payoffs of the row and column players, respectively, under the outcome (x_s′, y_s′), we have (p_{x_s′} + p_{y_s′}) / 2 > (p_{x_s} + p_{y_s}) / 2, which contradicts the definition of a socially optimal outcome; thus no agent can ever achieve an average payoff higher than (p_{x_s} + p_{y_s}) / 2.

Fig. 5. Frequency of reaching different joint actions in the prisoner's dilemma game, averaged over 100 runs (period length m = 500, adjustment degree δ = 1.0, w_{i,il} = 0, and w_{i,sl} = 1).

In previous work [Matlock and Sen 2007, 2009], even though the authors have proved that their tag-based paired reproduction mechanism favors the most fair outcome among all possible socially optimal outcomes, some agents' average payoffs can still be much lower than others' due to the nature of their mechanism. For example, consider the symmetric anticoordination game in Figure 2: under the paired reproduction mechanism, the population of agents is expected to be divided into a number of subgroups, and within each subgroup some agents always choose action C while the rest always choose action D. This results in unfairness, in that the agents choosing action D always obtain a much lower payoff of 1 than the agents choosing action C (payoff of 2). In our social learning framework, however, this kind of unfairness is easily eliminated by assigning the roles to each agent with equal probability and also allowing each agent to learn a different policy for each role. For example, in the anticoordination game in Figure 2, under our social learning framework each agent learns the policy of choosing action C as the row player and choosing action D as the column player, and each agent is assigned as either the row or the column player with equal probability. Therefore, the average payoff of each agent is the same in the long run. Moreover, under our social learning framework, the agents are able to achieve both fair and highest average payoffs in both symmetric and asymmetric games.
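Using the payoffs just cited for Figure 2 (2 for the agent choosing C and 1 for the agent choosing D in a coordinated outcome), the fair average payoff under the learned policy can be written out explicitly:

p̄ = (1/2) · p_row(C, D) + (1/2) · p_col(C, D) = (1/2)(2) + (1/2)(1) = (p_{x_s} + p_{y_s}) / 2 = 1.5,

since each agent plays the row role (and hence action C) in half of its interactions and the column role (action D) in the other half.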
Compared with previous work [Matlock and Sen 2007, 2009], another advantage of our social learning framework is that the communication cost is significantly reduced. As mentioned in Section 3.2, in our social learning framework, each agent is only allowed to observe the information of other M (M N/2) groups of agents in the system, and this can also be verified from the simulation part in Section 4. However, previous work [Matlock and Sen 2007, 2009] implicitly requires that each agent can observe the information of all agents in the system. 4. SIMULATION In this section, we empirically evaluate the performance of the social learning framework and investigate the influences of different factors on the learning performance. First, we use two well-known examples (the prisoner’s dilemma game and learning “rules of the road”) to illustrate the learning dynamics and the performance of the learning framework by comparing with previous work in Section 4.1.2 In Section 4.2, to illustrate the generality of our social learning framework, we further evaluate the performance of the social learning framework under a larger set of general-sum games which consists of different types of symmetric and asymmetric games. Finally, to evaluate the robustness and sensitivity of the social learning framework, the influences of different factors (against purely selfish agents, information sharing degree and against fixed policy agents) on the learning performance are investigated from Section 4.3 to Section 4.5. 4.1. Examples 4.1.1. Prisoner’s Dilemma Game. Prisoner’s dilemma game is one of the most wellknown games which has attracted much attention in both game theory and multiagent system literatures. One typical example of the prisoner’s dilemma game is shown in Figure 1. For any individually rational player, choosing defection (action D) is always its dominating strategy, however, mutual defection will lead the agents to achieve the inefficient outcome (D, D), while there exists another socially optimal outcome (C, C) in which each agent could obtain a much higher and fair payoff. Figure 5 shows the 2 Note that these two examples are the most commonly adopted testbeds in previous work [Hales and Edmonds 2003; Hao and Leung 2011; Matlock and Sen 2007, 2009]. ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i Achieving Socially Optimal Outcomes in Multiagent Systems 15:15 Fig. 6. Frequency of reaching different joint actions in anticoordination game averaged over 100 times (the length of each period m = 500, adjustment degree δ = 1.0, wi, il = 0, and wi, sl = 1). average frequency of each joint action that the agents learn to coordinate on under the social learning framework in a population of 100 agents.3 We can see that the agents can eventually learn to converge to the outcome (C, C), which corresponds to the policy of “choosing action C as both row and column players”. During the first period, all agents update their policies based on socially rational update scheme and the policy of “always choosing action C” is reached after approximately 250 rounds. It is worth noting that the outcome (C, C) actually corresponds to the only Nash equilibrium in the cooperative game after transformation. 
Starting from the second period, the agents change their update scheme to individually rational update, and they learn to converge to the policy of “always choosing action D” and thus converging to outcome (D, D) starting from the middle point of the second period. Since the average payoff obtained from mutual cooperation in the first period is much higher than that obtained from mutual defection in the second period, the agents changed their update scheme back to socially rational update at the beginning of the third period, and will stick to this scheme forever. Therefore mutual cooperation (C, C) is always achieved among the agents starting from the middle point of the third period. 4.1.2. Learning “Rules of the Road”. The problem of learning “rules of the road” [Sen and Airiau 2007] can be used to model the practical car-driving coordination scenario: who yields if two drivers arrive at an intersection at the same time from neighborhood roads. Both drivers would like to go across the intersection first without waiting, however, if both drivers choose to not yield to the other side, they will collide with each other. If both of them choose to wait, then neither of them can go across the intersection and it is a waste of their time. Their behaviors are well coordinated if and only if one of them chooses to wait and the other driver chooses to go first. This problem can be naturally modeled as a two-player two-action anticoordination game (see Figure 2 where action C corresponds to “Yield” and action D corresponds to “Go”). The natural solution of this problem is that the agents adopt the policy of “yields to the driver on left (right)”, and this solution is optimal in that both agents can obtain equal and highest payoff on average provided that their chances of driving on the left or right road are the same. Figure 6 shows the average frequency of different joint actions that the agents learn 3 In Figure 5, only the frequencies of the relevant action profiles (C, C) and (D, D) are shown for the sake of clarification. ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013. i i i i i i i i 15:16 J. Hao and H.-F. Leung Fig. 7. Average frequency of achieving socially optimal outcome (D, C) under the learning framework in asymmetric anticoordination game averaged over 100 runs (the length of each period m = 500, adjustment degree δ = 1.0, wi, il = 0, and wi, sl = 1). to converge to within a population of 100 agents.4 The results are obtained averaging over 100 independent runs, in which half of the runs the agents learn to acquire the policy of “yielding to the driver on the left” ((D, C)), and acquire the policy of “yielding to the driver on the right” ((C, D)) the rest of the times. Another observation is that the agents learn to converge to one of the previous two optimal policies under individual learning update schemes (transition from social learning update to individual learning update at 500th round). This is in accordance with the results found in previous work [Sen and Airiau 2007], in which the authors investigate whether the agents can learn to coordinate on one of the previous two optimal policies if they learn based on their private information only (equivalent with the individually rational update here). Next we consider an asymmetric version of the anticoordination game shown in Figure 3. 
Fig. 7. Average frequency of achieving the socially optimal outcome (D, C) under the learning framework in the asymmetric anticoordination game, averaged over 100 runs (the length of each period m = 500, adjustment degree δ = 1.0, w_{i,il} = 0, and w_{i,sl} = 1).

Different from the symmetric anticoordination game, in which it is sufficient for the agents to coordinate on complementary actions (either (C, D) or (D, C)), here the row and column agents need to coordinate their joint actions on the unique outcome (D, C) in order to achieve the socially optimal solution. Specifically, each agent needs to learn the unique policy of row → D, column → C, that is, selecting action D as the row player and selecting action C as the column player. Figure 7 shows the frequency of each outcome that the agents achieve, averaged over 100 runs, under the social learning framework. We can observe that the agents learn to converge to the socially optimal outcome (D, C) during the first period under the socially rational update scheme, and each agent receives an average payoff of 8.5 after convergence. Starting from the second period, the agents adopt the individually rational update for the purpose of exploration, and either (C, D) or (D, C) is converged to, with probabilities of approximately 0.6 and 0.4, respectively. If the agents learn to converge to (C, D) during the second period, then the average payoff each agent receives after convergence is only 4.5. Accordingly, the agents have the incentive to change their update scheme back to the socially rational update, and they stick to this scheme thereafter since they can always receive the higher average payoff of 8.5. Notice that, different from the symmetric anticoordination game, using only the individually rational update is not sufficient to guarantee convergence to the socially optimal outcome (D, C) in the asymmetric case. Under the individually rational update scheme, (C, D) is converged to much more often even though (D, C) is the only socially optimal outcome. Therefore the socially rational update is necessary to guarantee that the socially optimal outcome (D, C) can be achieved in this example.

The experimental results in Sections 4.1.1 and 4.1.2 show that different update schemes (the social learning update or the individual learning update) may be required to achieve socially optimal solutions in different interaction scenarios, and that agents using our adaptive update strategy are able to learn to choose the most suitable update scheme accordingly. Next, we evaluate the learning performance of the social learning framework by comparing it with previous work on the same examples.

4.1.3. Comparison with Previous Work. In this section, we compare the performance of the social learning framework with the two learning frameworks proposed in previous work [Hao and Leung 2011; Matlock and Sen 2007]. Since all previous work is only applicable to symmetric games, we evaluate their performance under two representative games: the prisoner's dilemma game (Figure 1) and the anticoordination game (Figure 2), which are the most commonly adopted testbeds in previous work [Chao et al. 2008; Hao and Leung 2011; Matlock and Sen 2007].

Fig. 8. Learning performance of previous work in the prisoner's dilemma game.

Fig. 9. Learning performance of previous work in the anticoordination game.
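Both comparisons use the same performance measure: the fraction of interacting agent pairs whose joint action in a given round is socially optimal. The small sketch below shows one straightforward way to compute this measure; it is our own illustrative reading of the metric rather than code taken from any of the compared frameworks.

def socially_optimal_outcomes(payoffs):
    # Joint actions that maximize the sum of the two players' payoffs;
    # 'payoffs' maps a joint action (a_row, a_col) to (r_row, r_col).
    best = max(sum(p) for p in payoffs.values())
    return {joint for joint, p in payoffs.items() if sum(p) == best}

def coordination_rate(joint_actions, payoffs):
    # Fraction of agent pairs whose joint action this round is socially optimal;
    # 'joint_actions' lists the joint actions played by all pairs in the round.
    optima = socially_optimal_outcomes(payoffs)
    return sum(1 for joint in joint_actions if joint in optima) / len(joint_actions)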
Figure 8 and Figure 9 show the dynamics of the percentage of agents coordinating on socially optimal outcomes as a function of the number of rounds for the learning frameworks of Hao and Leung [2011] and Matlock and Sen [2007] under the prisoner's dilemma game and the anticoordination game, respectively. We can see that both previous learning frameworks can lead a high percentage of agents to coordinate on socially optimal outcomes on average; however, there is always a small proportion of agents that fails to do so. This phenomenon is particularly obvious for the paired-reproduction-based learning framework of Matlock and Sen [2007], in which the deviation of the percentage of agents coordinating on socially optimal outcomes is very high, since the agents under their learning framework use evolutionary learning (copying and mutation) to make their decisions. In their approach, mutation plays an important role in evolving high levels of coordination on socially optimal outcomes, but the corresponding side effect is a high deviation of the percentage of socially optimal outcomes. In contrast, under our social learning framework, all agents can stably learn to coordinate on socially optimal outcomes in both the prisoner's dilemma game (see Figure 5) and the anticoordination game (see Figure 6).

4.2. General Performance in Two-Player General-Sum Games

The previous section has shown the effectiveness of the social learning framework compared with previous work in two representative symmetric games, which are the most commonly adopted testbeds in previous work. In this section, we further evaluate the general performance of the social learning framework in a large set of two-player general-sum games, which consists of different types of symmetric and asymmetric games. The payoffs of all games are generated randomly within the range of 0 to 100. Table I lists the average success rate, in terms of the percentage of games in which full coordination on socially optimal outcomes can be achieved, when the values of the adjustment period length m and the acceptance threshold thres vary. All results are averaged over 100 randomly generated games.

Table I. Success Rate in Terms of the Percentage of Games in which Full Coordination on Socially Optimal Outcomes Can Be Achieved in Randomly Generated General-Sum Games for Different Values of m and thres

thres \ m    1000     1500     2000
0            0.586    0.504    0.97
0.001        0.624    0.554    0.968
0.01         0.79     0.856    0.964
0.05         0.916    0.92     0.962
0.1          0.812    0.84     0.806
0.2          0.676    0.69     0.738

First, we can see that under our social learning framework the agents achieve a high success rate, with the highest value of 0.97 when m = 2000 and thres = 0. However, there still exist 3% of games in which the agents cannot achieve full coordination on socially optimal outcomes. The underlying reason is that in those games there exists a nonsocially optimal outcome whose average payoff is very close to that of the socially optimal outcome; some agents may therefore fail to distinguish which one is better, learn the incorrect update scheme, and fail to learn their individual actions corresponding to the socially optimal outcome, which results in a certain degree of miscoordination on nonsocially optimal outcomes.
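The parameters m and thres analyzed next govern a period-based scheme-selection rule: at the end of each adjustment period of length m, an agent decides whether to keep its current update scheme based on the average payoff obtained during that period. The following sketch illustrates one plausible form of this decision; the multiplicative tolerance is an assumption made for illustration, and the exact comparison rule used in the paper may differ.

def next_update_scheme(current_scheme, avg_payoff_current, best_avg_other, thres):
    # Decide which update scheme ("social" or "individual") to use in the next
    # adjustment period, given the average per-round payoff of the period that
    # has just ended and the best average payoff recorded under the other scheme.
    #
    # Assumption: switch only if the other scheme looked better by more than the
    # tolerance 'thres'. A longer period m gives less noisy payoff estimates, so a
    # smaller thres can be used; with short periods a moderate thres filters noise.
    other = "individual" if current_scheme == "social" else "social"
    if best_avg_other > avg_payoff_current * (1.0 + thres):
        return other
    return current_scheme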
Next we analyze the influence of the values of thres and m on the success rate of coordinating on socially optimal outcomes. The value of the threshold thres reflects the tolerance regarding the performance of the current learning period. The larger the length of each adjustment period m is, the more accurate the estimate of the actual performance of each period. When the estimate of the actual performance of each period is not accurate enough (i.e., the value of m is relatively small), setting the threshold thres too large or too small can result in a high rate of miscoordination on nonsocially optimal outcomes. If the value of thres is too small, it is likely that some agents choose the wrong update scheme for the next period due to the noise in the inaccurate performance estimates; if the value of thres is too large, some agents may make wrong judgements about the optimal update scheme due to the high tolerance, which again results in a certain degree of miscoordination. Indeed, Table I verifies that when the length of the adjustment period is small (m = 1000 or 1500), the highest success rate of achieving full coordination on socially optimal outcomes is obtained when the threshold is set to 0.05; either decreasing or increasing the value of thres decreases the success rate of coordinating on socially optimal outcomes. When the length of the adjustment period is further increased to 2000, the agents are able to obtain a more accurate estimate of the performance of each period, and thus the highest success rate is achieved when the threshold is set to the minimum value of 0.

Table II. Average Payoffs of Agents Following the Strategy in Section 3.3 and Selfish Agents (SAs) in the Prisoner's Dilemma Game when the Percentage of SAs Varies

Percentage of SAs    Agents following the strategy in Section 3.3    Selfish agents (SAs)
10%                  2.75                                            4.39
20%                  2.45                                            4.15
30%                  2.2                                             3.8
40%                  1.81                                            3.43
50%                  1.48                                            3.02
60%                  1.08                                            1.28
≥70%                 1                                               1

4.3. Against Purely Individually Rational Agents

Successful coordination on socially optimal outcomes in the social learning framework requires the cooperation of all agents in the system, that is, each agent should have the intention to follow the learning strategy proposed in Section 3.3. However, in an open multiagent environment, there may exist some selfish agents that exploit, to their own benefit, those agents that update their Q-values in the socially rational manner. For example, consider the prisoner's dilemma game: to achieve the socially optimal outcome (C, C), all agents must cooperatively learn to adopt the socially rational update scheme and finally learn to choose action C each round. However, they can easily be exploited by a small number of purely individually rational agents, which can learn to choose action D and thus obtain a much higher payoff. In this section, we empirically show that the agents following the strategy in Section 3.3 can automatically learn to adopt the individually rational update scheme again when the number of selfish agents is so large that there is no longer any benefit for them in adopting the socially rational update scheme. In this way, the agents can effectively avoid being exploited by those selfish agents.
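A sketch of how such a mixed population could be set up in simulation is shown below; the record-based agent representation, the selfish_fraction parameter, and the pairing routine are illustrative assumptions rather than details taken from the paper.

import random

def make_population(n_agents, selfish_fraction):
    # Tag each agent either as purely selfish (it always uses the individually
    # rational update) or as an adaptive agent following the strategy of
    # Section 3.3 (it may switch between the two update schemes).
    agents = []
    for i in range(n_agents):
        selfish = i < int(n_agents * selfish_fraction)
        agents.append({"id": i,
                       "selfish": selfish,
                       "scheme": "individual" if selfish else "social"})
    return agents

def random_pairing(agents):
    # Each round, every agent is randomly paired with another agent from the
    # population; within each pair, one agent acts as the row player and the
    # other as the column player.
    shuffled = agents[:]
    random.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]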
Table II shows the average payoffs of the agents following the strategy in Section 3.3 and of the purely selfish agents when the percentage of purely selfish agents varies in the prisoner's dilemma game. If all agents are purely selfish (corresponding to the case of 100% purely selfish agents), then all agents learn to converge to choosing action D and obtain an average payoff of 1. However, when there is only a relatively small percentage of selfish agents, the remaining agents can still obtain an average payoff higher than 1 by performing the socially rational update. This corresponds to the cases in Table II in which the percentage of purely selfish agents is less than 70%: the remaining agents still learn to adopt the socially rational update and finally choose action C, and thus obtain an average payoff higher than 1.0. When the percentage of purely selfish agents reaches 70% or more, the remaining agents realize that there is no longer any benefit in adopting the socially rational update, switch to the individually rational update, and finally learn to choose action D; all agents then obtain an average payoff of 1.

4.4. Influences of the Amount of Information Shared Among Agents

One major difference between the social learning framework and the commonly adopted setting of two-agent repeated interactions is that the agents have more information (from the other groups of interacting agents in the population) at their disposal when they learn their coordination policies. Therefore, it is particularly important to have a clear understanding of the influence of the amount of information shared among the agents (the value of M) on the system-level performance. To this end, in this section we evaluate the influence of the value of M on the learning performance of the system using the symmetric anticoordination game in Figure 2.

Fig. 10. Dynamics of the average payoff of agents with different values of M during the period of social learning update.

Figure 10 shows the dynamics of the average payoff of each agent as a function of the number of rounds for different values of M under the socially rational update scheme. We can see that each agent's average payoff increases more quickly as the value of M increases. Intuitively, a larger value of M means that the agents are exposed to more information each round, so they can be expected to learn their policies in a more informed way and thus coordinate their actions on socially optimal outcomes more efficiently. However, it is worth noticing that the time required before reaching the maximum average payoff is approximately the same for different values of M.
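The value of M enters the framework through the observation mechanism: besides its own interaction, each agent may learn from the experiences of M randomly chosen other groups in the same round. The sketch below illustrates this step; the sampling scheme and the form of the observed experience tuples are assumptions made for illustration.

import random

def observe_and_learn(agent, round_experiences, M, learn_from):
    # round_experiences: one (role, action, reward) tuple per interacting group
    # in the current round, excluding the agent's own group.
    # M: number of other groups whose information the agent is allowed to
    #    observe (much smaller than N/2, the total number of groups).
    # learn_from: callback that applies the agent's current update scheme to
    #    an observed experience.
    observed = random.sample(round_experiences, min(M, len(round_experiences)))
    for role, action, reward in observed:
        learn_from(agent, role, action, reward)

A larger M lets each agent process more observed experiences per round, which matches the faster payoff growth seen in Figure 10, while the communication cost remains far below that of observing the whole population.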
4.5. Influence of Fixed Agents

We have shown in Section 4.1.2 that both socially optimal outcomes (C, D) and (D, C) have an equal chance of being converged to in the symmetric anticoordination game. This is reasonable and expected, since the game itself is symmetric and all agents are identical, and thus there is no bias towards either of the two outcomes. We have also investigated the influence of the game structure on the convergence results when the game itself is not symmetric. In this section, we consider another situation, in which some agents adopt a fixed policy without any learning ability, and evaluate how the existence of a small number of such fixed agents influences the convergence outcome of the system.

Fig. 11. Average frequency of socially optimal outcomes converged to under the learning framework in the anticoordination game with different numbers of fixed agents (averaged over 100 runs).

Figure 11 shows the frequency with which each socially optimal outcome is converged to when different numbers of fixed agents are inserted into the system, using the symmetric anticoordination game (Figure 2). We assume that the fixed agents always make their decisions according to the policy of "choosing action D as the row player and choosing action C as the column player". When no fixed agents are inserted into the system, both socially optimal outcomes (C, D) and (D, C) are converged to with equal probability. When we insert only one fixed agent into the system, it is surprising to find that the probability of converging to the outcome (D, C) (i.e., converging to the policy of the fixed agent) increases to 0.7! When the number of fixed agents is further increased, the probability of converging to (D, C) increases as well. With only 4 fixed agents inserted into the system, the agents converge to the outcome (D, C) with approximately 100% probability. There might thus be some truth to the adage that most fashion trends are decided by a handful of trend setters in Paris! [Sen and Airiau 2007].

Fig. 12. Average frequency of achieving each outcome under the learning framework in the anticoordination game with one fixed agent (averaged over 100 runs).

Figure 12 illustrates the detailed dynamics of the frequencies of the different outcomes in a population of 100 agents when only one fixed agent is inserted into the system. First, we can see that the agent with the fixed policy has a significant influence on the convergence outcomes in the first period under the socially rational update, that is, about a 70% chance of converging to (D, C) instead of 50% without any fixed agent. During the second period, under the individually rational update, there is about a 37% chance that the agents miscoordinate their action choices (reaching (D, D)) due to the existence of the fixed-policy agent, and the chance of converging to (D, C) remains higher than that of converging to (C, D). The agents change their update scheme back to the socially rational update in the third period, and the chance of converging to (D, C) gradually increases to its maximum value (approximately 70% of the runs) by the end of the fourth period.
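The fixed agents in this experiment can be represented simply as agents that ignore all learning updates; a minimal illustrative sketch (the class name is ours) follows.

class FixedPolicyAgent:
    # An agent without learning ability that always follows the fixed policy
    # used in Section 4.5: choose D as the row player and C as the column player.
    POLICY = {"row": "D", "col": "C"}

    def act(self, role):
        return self.POLICY[role]

    def update(self, role, action, reward):
        # Fixed agents never change their policy, so learning updates are ignored.
        pass

Replacing even one learning agent in the population with such an agent is enough to bias which convention the learners settle on, as Figure 11 shows.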
5. CONCLUSION AND FUTURE WORK

In this work, we propose a social learning framework for a population of agents to coordinate on socially optimal outcomes in the context of general-sum games. We introduce the observation mechanism into the social learning framework to reduce the amount of communication required among agents. The agents' learning strategies are developed based on reinforcement learning techniques instead of evolutionary learning. Each agent explicitly keeps a record of its current state (row or column player) in its learning strategy and learns its optimal policy for each state independently. In this way, the agents are able to achieve much more stable coordination on socially optimal outcomes compared with previous work, and our social learning framework is suitable for both symmetric and asymmetric games. The performance of the social learning framework is extensively evaluated under the testbed of two-player general-sum games, and better performance is achieved compared with previous work [Hao and Leung 2011; Matlock and Sen 2007]. We investigate the influences of different factors on the learning performance of the social learning framework in detail. Successful coordination on socially optimal outcomes requires that the agents cooperatively coordinate their selection between the individual learning update and the social learning update, which largely depends on the values of the adjustment period length and the acceptance threshold. We also show that agents in our social learning framework are resistant to exploitation by purely noncooperative selfish agents, and that a small number of fixed-policy agents can have a tremendous influence on which socially optimal outcome the agents converge to.

In the current social learning framework, the interactions among agents are determined randomly and modeled as two-player normal-form games. As future work, some generalizations are necessary to make the framework more applicable in complex real-world environments. First, some practical problems involve interactions among more than two players, each of whom may have numerous actions, and thus need to be modeled as n-player games (n > 2) rather than two-player games. For example, resource allocation scenarios in ad hoc networks and sensor networks need to be modeled as n-player public goods games [Pitt et al. 2012]. Second, the interactions among agents may not necessarily be random and may follow patterns determined by the structure of an underlying social network (e.g., a lattice or scale-free network). Therefore, another interesting direction is to explicitly take into consideration the underlying topology of the agents' interaction network. In this way, more efficient coordination towards socially optimal outcomes may be achieved due to the availability of alternative mechanisms, for example, a rewiring mechanism (allowing the agents themselves to determine their interaction partners).

REFERENCES

Allison, P. D. 1992. The cultural evolution of beneficent norms. Social Forces, 279–301.
Bowling, M. H. and Veloso, M. M. 2003. Multiagent learning using a variable learning rate. Artif. Intell. 136, 215–250.
Brafman, R. I. and Tennenholtz, M. 2004. Efficient learning equilibrium. Artif. Intell. 159, 27–47.
Chao, I., Ardaiz, O., and Sanguesa, R. 2008. Tag mechanisms evaluated for coordination in open multi-agent systems. In Proceedings of the 8th International Workshop on Engineering Societies in the Agents World. 254–269.
Claus, C. and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of AAAI'98. 746–752.
Conitzer, V. and Sandholm, T. 2006. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. In Proceedings of ICML'06. 83–90.
Crandall, J. W. and Goodrich, M. A. 2005. Learning to teach and follow in repeated games. In Proceedings of the AAAI Workshop on Multiagent Learning.
Greenwald, A. and Hall, K. 2003. Correlated Q-learning. In Proceedings of ICML'03. 242–249.
Hales, D. 2000. Cooperation without space or memory: Tags, groups and the Prisoner's Dilemma. In Multi-Agent-Based Simulation, Lecture Notes in Artificial Intelligence.
Hales, D. and Edmonds, B. 2003. Evolving social rationality for MAS using "tags". In Proceedings of AAMAS'03. 497–503.
Hao, J. Y. and Leung, H. F. 2010. Strategy and fairness in repeated two-agent interaction. In Proceedings of ICTAI'10. 3–6.
Hao, J. Y. and Leung, H. F. 2011. Learning to achieve social rationality using tag mechanism in repeated interactions. In Proceedings of ICTAI'11. 148–155.
Hao, J. Y. and Leung, H. F. 2012. Learning to achieve socially optimal solutions in general-sum games. In Proceedings of PRICAI'12. 88–99.
Hoen, P. J., Tuyls, K., Panait, L., Luke, S., and La Poutré, J. A. 2005. An overview of cooperative and competitive multiagent learning. In Proceedings of the 1st International Workshop on Learning and Adaption in Multi-Agent Systems. 1–46.
Hogg, L. M. and Jennings, N. R. 1997. Socially rational agents. In Proceedings of the AAAI Fall Symposium on Socially Intelligent Agents. 61–63.
Hogg, L. M. J. and Jennings, N. R. 2001. Socially intelligent reasoning for autonomous agents. IEEE Trans. Syst. Man Cybern., Part A: Syst. Humans, 381–393.
Holland, J. H., Holyoak, K., Nisbett, R., and Thagard, P. 1986. Induction: Processes of Inference, Learning, and Discovery. MIT Press, Cambridge, MA.
Howley, E. and O'Riordan, C. 2005. The emergence of cooperation among agents using simple fixed bias tagging. In Proceedings of the IEEE Congress on Evolutionary Computation.
Hu, J. and Wellman, M. 1998. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of ICML'98.
Kapetanakis, S. and Kudenko, D. 2002. Reinforcement learning of coordination in cooperative multi-agent systems. In Proceedings of AAAI'02. 326–331.
Littman, M. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of ICML'94. 322–328.
Matlock, M. and Sen, S. 2005. The success and failure of tag-mediated evolution of cooperation. In Proceedings of the 1st International Workshop on Learning and Adaption in Multi-Agent Systems. Springer, 155–164.
Matlock, M. and Sen, S. 2007. Effective tag mechanisms for evolving coordination. In Proceedings of AAMAS'07. 1340–1347.
Matlock, M. and Sen, S. 2009. Effective tag mechanisms for evolving cooperation. In Proceedings of AAMAS'09. 489–496.
Maynard Smith, J. 1982. Evolution and the Theory of Games. Cambridge University Press, Cambridge, UK.
Nowak, M. and Sigmund, K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner's Dilemma game. Nature, 56–58.
Osborne, M. J. and Rubinstein, A. 1994. A Course in Game Theory. MIT Press, Cambridge, MA.
Panait, L. and Luke, S. 2005. Cooperative multi-agent learning: The state of the art. Auton. Agents Multi-Agent Syst., 387–434.
Pitt, J., Schaumeier, J., Busquets, D., and Macbeth, S. 2012. Self-organising common-pool resource allocation and canons of distributive justice. In Proceedings of SASO'12. 119–128.
Riolo, R. and Cohen, M. D. 2001. Cooperation without reciprocity. Nature, 441–443.
Sen, S. and Airiau, S. 2007. Emergence of norms through social learning. In Proceedings of IJCAI'07. 1507–1512.
Verbeeck, K., Nowé, A., Parent, J., and Tuyls, K. 2006. Exploring selfish reinforcement learning in repeated games with stochastic rewards. In Proceedings of AAMAS'06. 239–269.
Villatoro, D., Sabater-Mir, J., and Sen, S. 2011. Social instruments for robust convention emergence. In Proceedings of IJCAI'11. 420–425.
Wakano, J. Y. and Yamamura, N. 2001. A simple learning strategy that realizes robust cooperation better than Pavlov in iterated Prisoner's Dilemma. J. Ethol. 19, 9–15.
Watkins, C. J. C. H. and Dayan, P. 1992. Q-learning. Mach. Learn., 279–292.

Received January 2013; revised May 2013; accepted August 2013