Achieving Socially Optimal Outcomes in Multiagent Systems with
Reinforcement Social Learning
JIANYE HAO and HO-FUNG LEUNG, The Chinese University of Hong Kong
In multiagent systems, social optimality is a desirable goal to achieve in terms of maximizing the global
efficiency of the system. We study the problem of coordinating on socially optimal outcomes among a
population of agents, in which each agent randomly interacts with another agent from the population
each round. Previous work [Hales and Edmonds 2003; Matlock and Sen 2007, 2009] mainly resorts to
modifying the interaction protocol from random interaction to tag-based interactions and only focus on
the case of symmetric games. Besides, in previous work the agents’ decision making processes are usually
based on evolutionary learning, which usually results in high communication cost and high deviation on
the coordination rate. To solve these problems, we propose an alternative social learning framework with
two major contributions as follows. First, we introduce the observation mechanism to reduce the amount of
communication required among agents. Second, we propose that the agents’ learning strategies should be
based on reinforcement learning technique instead of evolutionary learning. Each agent explicitly keeps the
record of its current state in its learning strategy, and learn its optimal policy for each state independently.
In this way, the learning performance is much more stable and also it is suitable for both symmetric and
asymmetric games. The performance of this social learning framework is extensively evaluated under the
testbed of two-player general-sum games comparing with previous work [Hao and Leung 2011; Matlock and
Sen 2007]. The influences of different factors on the learning performance of the social learning framework
are investigated as well.
Categories and Subject Descriptors: I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence
General Terms: Algorithms, Experimentation, Performance
Additional Key Words and Phrases: Reinforcement social learning, social optimality, multiagent coordination, general-sum games
ACM Reference Format:
Hao, J. and Leung, H.-F. 2013. Achieving socially optimal outcomes in multiagent systems with reinforcement social learning. ACM Trans. Auton. Adapt. Syst. 8, 3, Article 15 (September 2013), 23 pages.
DOI:http://dx.doi.org/10.1145/2517329
1. INTRODUCTION
In multiagent systems (MASs), an agent usually needs to coordinate effectively with other agents in order to achieve desirable outcomes, since the outcome depends not only on the action it takes but also on the actions taken by others. Achieving effective coordination among agents in MASs is a significant and challenging problem, especially when the environment is unknown and each agent only has access to a limited amount of information. Multiagent learning has proven to be a promising and effective approach for allowing agents to coordinate their actions on optimal outcomes through repeated interactions [Hoen et al. 2005; Panait and Luke 2005].
Fig. 1. An instance of the prisoner’s dilemma game.
Fig. 2. An instance of the anticoordination game.
The most commonly used solution concepts in the current multiagent learning literature are Nash equilibrium and its variants (e.g., correlated equilibrium), adopted from game theory [Osborne and Rubinstein 1994], and various learning algorithms have been proposed that aim at converging to these solution concepts [Greenwald and Hall 2003; Hu and Wellman 1998; Kapetanakis and Kudenko 2002; Littman 1994]. Convergence is a desirable property to pursue in multiagent systems; however, converging to Nash equilibrium solutions may not be the most preferred outcome, since Nash equilibria are not guaranteed to be Pareto optimal. For example, in the well-known prisoner's dilemma game (see Figure 1), the agents will converge to mutual defection (D, D) if the Nash equilibrium solution is adopted, while both players could have obtained much higher payoffs by reaching the Pareto optimal outcome of mutual cooperation (C, C), which is also optimal from the overall system's perspective. One desirable alternative solution concept is social optimality, which targets maximizing the sum of the payoffs of all agents involved. A socially optimal outcome is not only optimal from the system's perspective but also always Pareto optimal. One major challenge the agents face is how to select their actions to prevent miscoordination when multiple socially optimal joint actions coexist. A supporting example is the anticoordination game in Figure 2, in which two socially optimal joint actions, (C, D) and (D, C), coexist. In this game the agents need to carefully coordinate their action selections through repeated interactions to avoid miscoordination on (C, C) or (D, D). The difficulty of coordination is further increased when there is a large number of agents in the system and the interaction partners of each agent are not fixed. In this case, an agent's current strategy, under which it can coordinate on one socially optimal outcome with its current interaction partner, may fail when it interacts with another agent next time. For example, considering the anticoordination game again, an agent with the policy of choosing action D can successfully coordinate on the socially optimal outcome (D, C) or (C, D) when it interacts with another agent choosing action C. However, miscoordination (reaching (D, D)) occurs when it interacts with another agent choosing action D. Thus, it is a particularly challenging problem to ensure that
all agents can learn to converge to a consistent policy of achieving one socially optimal
outcome based on their local information.
In one line of previous work, a number of learning strategies [Crandall and Goodrich 2005; Hao and Leung 2012; Verbeeck et al. 2006] have been proposed for agents to achieve socially optimal outcomes in the context of two-player repeated normal-form games. However, in real-world environments, the interactions among agents coexisting in a system are usually not very frequent. It may take a long time before an agent has the opportunity to interact with a previously encountered agent again. To more accurately model the practical interactions among agents in a system, we consider the problem of coordinating on socially optimal outcomes in an alternative framework different from the fixed-player repeated interaction scenario. Under this framework, there exists a population of agents in which each agent randomly interacts with another agent from the population each round. The interaction between each pair of agents (in the same group) is still modeled as a two-player normal-form game, and the payoff each agent receives in each round depends only on the joint action of itself and its interaction partner. Each agent learns its policy through repeated interactions with multiple agents, which is termed social learning [Sen and Airiau 2007], in contrast to learning from repeated interactions against the same opponent in the two-agent case. This learning framework is similar to the one commonly adopted in evolutionary game theory [Maynard 1982], except that the learning strategies of the agents are not necessarily based on evolutionary learning (copying and mutation). Compared with the fixed-player repeated interaction scenario, this social learning framework better reflects realistic interactions among agents and has been commonly adopted as the framework for investigating different multiagent problems such as norm emergence [Sen and Airiau 2007; Villatoro et al. 2011]. Moreover, as previously mentioned, it is more challenging to ensure that all agents can learn a consistent policy of achieving socially optimal outcome(s) under the social learning framework than under the two-agent repeated interaction framework.
Under this social learning framework, previous work [Hales and Edmonds 2003; Matlock and Sen 2007, 2009] mainly focuses on modifying the interaction protocol from random interaction to tag-based interaction to evolve coordination on socially optimal outcomes. However, the agents' decision-making processes in previous work are based on evolutionary learning (i.e., imitation and mutation) following evolutionary game theory, and thus the deviation of the coordination percentage is significant even though the average coordination rate is good. Another limitation of previous approaches is that each agent is explicitly assumed to have access to all other agents' information (i.e., their payoffs and actions) in previous rounds, which may be unrealistic in many practical situations due to communication limitations. Finally, all previous work focuses only on two representative symmetric games, the prisoner's dilemma game and the anticoordination game, and cannot be directly applied in the settings of asymmetric games. In asymmetric games, the roles of the agents (row agent or column agent) can significantly influence the agents' behaviors and the game outcomes; these roles, however, are not distinguished in the learning approaches of previous work [Hales and Edmonds 2003; Matlock and Sen 2007, 2009].
To handle the aforementioned problems, in our social learning framework we introduce an observation mechanism, under which each agent is only allowed to access the information of a limited number of agents in the system, in contrast to the global observation assumption commonly adopted in previous work. In this way, the amount of communication required among agents can be greatly reduced, making our learning framework more applicable to practical scenarios with high communication costs. Moreover, the learning strategy adopted by each agent in our learning framework is based on reinforcement learning techniques instead of evolutionary learning. Each
agent explicitly keeps a record of its current state (row or column player) in its learning strategy and learns its optimal policy for each state independently. Therefore, our learning framework is suitable for the settings of both symmetric and asymmetric games.
Specifically, the learning strategy we design here is an extension of the traditional Q-learning algorithm, in which we augment each agent's learning process with an additional step of determining how to update its Q-values, that is, an individually rational update or a socially rational update. An individually rational update updates the strategy based on each agent's individual payoff only, while a socially rational update updates it based on the sum of the payoffs of all agents in the same group. Which update scheme is better depends on the type of game currently being played, and the agents should be able to adaptively adjust their update schemes. To this end, each agent is associated with a weighting factor for each update scheme, and a greedy strategy is proposed to adjust the weighting factors adaptively in a periodic way. The successful achievement of socially optimal outcomes requires cooperative coordination among agents on their update schemes. Allowing the agents to adaptively adjust their update schemes has two major benefits. First, the agents can automatically determine the appropriate way of updating their strategies in different learning environments (games) in their own best interest. Second, when the system is an open environment in which there may exist purely selfish agents that always update their strategies in the individually rational manner, the agents can resist exploitation by those selfish agents: they can avoid being exploited by adopting the individually rational update scheme whenever there is no benefit in adopting the socially rational update scheme.
We empirically and extensively evaluate the performance of this social learning framework on a testbed of two-player two-action games, and simulation results show that the agents can learn to achieve full coordination on socially optimal outcomes in most of the games. We also compare the performance with two previous learning frameworks [Hao and Leung 2011; Matlock and Sen 2007] and show that a higher level of coordination can be achieved under our social learning framework with no deviation. Moreover, we show that the agents in the social learning framework are partially resistant to exploitation by purely selfish agents: they automatically select the individually rational update scheme to play against those purely selfish agents when there is no longer any benefit in adopting the socially rational update.
The remainder of this article is organized as follows. In Section 2, we give an overview of different learning frameworks proposed in previous work to achieve coordination on (socially) optimal outcomes. In Section 3.1, we review some background on game theory and introduce the problem we target. In Section 3, the social learning framework we propose is introduced in detail. In Section 4, we present our simulation results on a large set of two-player games, and the influences of different factors are also investigated. Lastly, conclusions and future work are given in Section 5.
2. PREVIOUS WORK
Learning to achieve coordination on (socially) optimal outcomes in the setting of a population of agents has received a lot of attention in the multiagent learning literature [Chao et al. 2008; Hales and Edmonds 2003; Hao and Leung 2011; Matlock and Sen 2005, 2007, 2009]. Previous work mainly focuses on the coordination problem in two different games, the prisoner's dilemma game and the anticoordination game, which are representatives of two different types of coordination scenarios. For the first type of game,
the agents need to coordinate on outcomes with identical actions to achieve socially optimal outcomes, while in the second type of game the achievement of socially optimal outcomes requires the agents to coordinate on outcomes with complementary actions. In previous work, tags have been extensively used as an effective mechanism to bias the interactions among agents in the population. Using tags to bias the interaction among agents in the prisoner's dilemma game was first proposed in Holland et al. [1986] and Allison [1992], and a number of further investigations on the tag mechanism in the one-shot or iterated prisoner's dilemma game are conducted in Riolo and Cohen [2001], Hales [2000], and Howley and O'Riordan [2005].
Hales and Edmonds [2003] first introduced the tag mechanism, which originated in other fields such as artificial life and the biological sciences, into multiagent systems research to design effective interaction mechanisms for autonomous agents. In their model for the prisoner's dilemma game, each agent is represented by L + 1 bits. The first bit indicates the agent's strategy (i.e., playing C or D), and the remaining L bits are the tag bits, which are used for biasing the interaction among the agents and are assumed to be observable by all agents. In each generation, each agent is allowed to play the prisoner's dilemma game with another agent owning the same tag string. If an agent cannot find another agent with the same tag string, then it plays the prisoner's dilemma game with a randomly chosen agent in the population. The agents in the next generation are formed via a fitness-proportional reproduction scheme, where fitness is the payoff received in the current generation. Besides, a low level of mutation is applied to both the agents' strategies and tags. This mechanism is demonstrated to be effective in promoting a high level of cooperation among agents when the length of the tag is large enough. They also apply the same technique to other, more complex task domains that share the same dilemma as the prisoner's dilemma game (i.e., the socially rational outcome cannot be achieved if the agents make purely individually rational decisions), and show that the tag mechanism can motivate the agents towards choosing socially rational behaviors instead of individually rational ones. However, this tag mechanism has some limitations. In their work, the key to the successful emergence of socially optimal outcomes is that the agents mimic the strategies and tags of other, more successful agents, which finally leads to the achievement of socially optimal outcomes among most of the interacting agents in the system. However, since the interactions among agents are based on a self-matching tag scheme, each agent can only interact with others with the same tag and thus always plays the same strategy as its interaction partners. Therefore, socially optimal outcomes can be obtained only in the cases where the socially optimal outcome corresponds to a pair of identical actions (e.g., (C, C) in the prisoner's dilemma game).
Chao et al. [2008] propose a tag mechanism similar to the one proposed in Hales and Edmonds [2003]; the only difference is that mutation is applied to the agents' tags only. They evaluate the performance of their tag mechanism against a number of learning mechanisms commonly used in the field of multiagent systems, such as generalized Tit-For-Tat [Chao et al. 2008], WSLS (win-stay, lose-shift) [Nowak and Sigmund 1993], a basic reinforcement learning algorithm [Wakano and Yamamura 2001], and evolutionary learning. They conduct their evaluations using two games: the prisoner's dilemma game and the coordination game. Simulation results show that their tag mechanism achieves comparable or better performance than the other learning algorithms compared in both games. However, this tag mechanism suffers from limitations similar to those of Hales and Edmonds' model [Hales and Edmonds 2003]: it only works when the agents need to learn identical strategies in order to achieve socially optimal outcomes, which is indeed the case in both games they evaluated. It cannot handle the case when complementary strategies are needed to achieve coordination on socially optimal outcomes.
Matlock and Sen [2005] reinvestigate Hales and Edmonds' model [Hales and Edmonds 2003] and the reason why large tags are required to sustain cooperation. The main explanation they come up with is that, in order to maintain a high level of cooperation, a sufficient number of cooperative groups is needed to prevent mutation from infecting most of the cooperative groups with defectors at the same time. There are two ways to guarantee the coexistence of a sufficient number of cooperative groups, that is, adopting a high tag mutation rate or using large tags, both of which are verified by simulation results. The limitation of Hales and Edmonds' model [Hales and Edmonds 2003] is also pointed out in that article, and they provide an example (i.e., the anticoordination game in Figure 2) in which Hales and Edmonds' model fails. In the anticoordination game, complementary actions (either (C, D) or (D, C)) are required for the agents to coordinate on socially optimal outcomes. However, under the self-matching tag mechanism in Hales and Edmonds' model, each agent only interacts with others with the same tag string, and each interacting pair of agents usually shares the same strategy. Therefore, their mechanism only works when identical actions are needed to achieve socially optimal outcomes, and fails when it comes to games like the anticoordination game.
Considering the limitation of Hales and Edmonds' model, Matlock and Sen [2007, 2009] propose three new tag mechanisms to tackle it. The first is called tag matching patterns (one- and two-sided). Each agent is equipped with both a tag and a tag-matching string, and in one-sided matching agent i is allowed to interact with agent j if agent i's tag-matching string matches agent j's tag. Two-sided matching additionally requires agent j's tag-matching string to match agent i's tag in order to allow the interaction between agents i and j. By introducing this tag mechanism, the agents are allowed to play different strategies from their interaction partners. The second mechanism is the payoff sharing mechanism, which requires each agent to share part of its payoff with its opponent. This mechanism is shown to be effective in promoting socially optimal outcomes in both the prisoner's dilemma game and the anticoordination game; however, it can be applied only when side payments are allowed. The last mechanism they propose is called the paired reproduction mechanism. It is a special reproduction mechanism that makes copies of matching pairs of individuals, mutating corresponding places on the tag of one and the tag-matching string of the other at the same time. The purpose of this mechanism is to preserve the matching between the pair of agents after mutation in order to promote the survival rate of cooperators. Simulation results show that this mechanism can help sustain the percentage of agents coordinating on socially optimal outcomes at a high level in both the prisoner's dilemma and anticoordination games, and it can be applied in more general multiagent settings where payoff sharing is not allowed. However, similar to Hales and Edmonds' model [Hales and Edmonds 2003], these mechanisms all heavily depend on mutation to sustain the diversity of groups in the system, which leads to the undesirable result that the variation of the percentage of coordination on socially optimal outcomes remains very high (shown in detail later in the experimental Section 4.1.3). Besides, the paired reproduction mechanism is shown to be effective in two types of symmetric games (the prisoner's dilemma game and the anticoordination game), but it cannot be applied to asymmetric games. The reason is that in asymmetric games, given a pair of interacting agents and their corresponding strategies, each agent may receive different payoffs depending on its current role (row or column agent). For example, consider the asymmetric game in Figure 3, and suppose that a pair of interacting agents (agent A and agent B) choose actions D and C, respectively. In this situation, agent A's payoff is different when it acts as the row player (payoff of 2) or the column player (payoff of 4), even though both agents' strategies remain unchanged. However, in Matlock and Sen's learning framework, the role of each agent during interaction is not distinguished. They implicitly assume that all agents' roles are always the same, in accordance with the definition of a symmetric game. Thus the payoffs of the interacting agents become nondeterministic when it comes to asymmetric games. Therefore, their learning framework is only applicable to games that are agent-symmetric (the prisoner's dilemma game and the anticoordination game), that is, games in which the roles of the agents have no influence on the payoffs received.

Fig. 3. Asymmetric anticoordination game: the agents need to coordinate on the outcome (D, C) to achieve the socially rational solution.
Hao and Leung [2011] propose a novel tag-based learning framework in which each agent employs a reinforcement learning strategy instead of evolutionary learning to make its decisions. In their learning framework, two different types of update schemes are available for the agents to update their policies: individual learning update and social learning update. For each agent, each update scheme is associated with a weighting factor that determines which scheme is used to update its policy, and these two weighting factors are adjusted adaptively and periodically based on the agent's past performance. Another difference from previous tag-based frameworks is that the reinforcement learning strategy the agents adopt is used for controlling and updating their action choices only; the agents' tags are given at the beginning and do not evolve with time. Compared with previous approaches based on evolutionary learning [Matlock and Sen 2007, 2009], the agents under their learning framework are able to achieve more stable and higher-level coordination on socially optimal outcomes in both the prisoner's dilemma game and the anticoordination game. However, similar to Matlock and Sen's model [Matlock and Sen 2007], their learning framework does not distinguish the role of each agent during interaction. Therefore, their framework can also only be applied in the context of symmetric games, for the same reason explained previously.
To sum up, most of the previous work [Chao et al. 2008; Hales and Edmonds 2003; Matlock and Sen 2005, 2007, 2009] focuses mainly on modifying the interaction protocol of the agents through tag-based mechanisms to handle the problem of coordinating on socially optimal outcomes among a population of agents. In this work, by contrast, we handle this problem mainly from the perspective of learning strategy design, that is, reinforcement social learning with adaptive selection between socially rational and individually rational updates. The interaction protocol of the agents is not modified in our learning framework and still follows the most commonly adopted setting, that is, each agent randomly interacts with another agent in the system each round.
3. SOCIAL LEARNING FRAMEWORK
3.1. Problem Description: Reinforcement Learning in Multiagent Systems
Normal-form games have been commonly used to model the interactions between
agents when studying multiagent learning and coordination problems in multiagent
systems. Usually, the agents are assumed to repeatedly play a single stage game
simultaneously and independently, and receive the corresponding reward which is
determined by the joint action of all agents each round. We omit the basic concepts of
game theory and interested readers may refer to Osborne and Rubinstein [1994] for
details.
The multiagent learning criteria proposed in Bowling and Veloso [2003] require that
an agent should be able to converge to a stationary policy against some class of opponent (convergence) and the best-response policy against any stationary opponent
(rationality). If both agents adopt rational learning strategies and their strategies converge, then they will converge to the Nash equilibrium of the stage game. Indeed, a
large bundle of previous work adopts the solution concept of Nash equilibrium or its
variants as the learning goal when they design their learning algorithms [Bowling
and Veloso 2003; Claus and Boutilier 1998; Conitzer and Sandholm 2006; Greenwald
and Hall 2003; Hu and Wellman 1998; Littman 1994]. Unfortunately, in many cases
Nash equilibrium is not optimal in that it does not guarantee that the agents receive
their best payoffs. One well-known example is the prisoner’s dilemma game shown in
Figure 1. By converging to the Nash equilibrium (D, D), both agents obtain a low payoff of 1, and also the sum of the agents’ payoffs is minimized. On the other hand, both
agents could have received better payoffs by converging to the nonequilibrium outcome
of (C, C) in which the sum of the players’ payoffs is maximized.
An alternative solution concept is that of social optimality, that is, those outcomes
under which the sum of all players’ payoffs is maximized. This solution concept is optimal from the system’s perspective, but may not be in the best interest of all individual
agents involved. For example, consider a two-player anticoordination game in Figure 2.
There exist two socially optimal outcomes (C, D) and (D, C), and the agents have
conflicting interests in that agent A prefers the outcome (C, D), while agent B prefers the other outcome (D, C). This problem can be solved by having the two socially optimal outcomes achieved in an alternating manner [Hao and Leung 2010; Verbeeck et al. 2006], so that both agents can obtain an equal and highest payoff on average.
Until now, most previous work has studied the multiagent learning problem within the setting of repeated (Markov) games, which is based on the implicit assumption that two (or more) fixed agents interact with each other repeatedly for a long time to learn the optimal policy from past interaction experience. However, the interactions among agents may be very sparse in practice, and it may take a long time before an agent has the chance to interact with a previously met agent again. Therefore, the agents need to be able to learn the best policy through repeatedly interacting with different agents. To this end, we consider an alternative setting
involving a population of agents in which each agent interacts with another agent
randomly chosen from the population each round. The interaction between each pair
of agents is modeled as a single-stage normal-form game, and the payoff each agent
receives in each round only depends on the joint action of itself and its interaction
partner. Under this learning framework, each agent learns its policy through repeated
interactions with multiple agents, which is termed social learning [Sen and Airiau 2007], in contrast to the case of learning from repeated interactions against the same opponent in the two-agent case. In this work, we aim at solving the problem of how
the agents are able to learn to coordinate on socially optimal outcomes in general-sum
games through social learning.
The overall interaction protocol under the social learning framework is presented in Figure 4. During each round, each agent interacts with another agent randomly chosen from the population. The interactions between agents are modeled as general-sum games, and during each interaction one agent is randomly assigned as the row player and the other agent as the column player. At the end of each round, each agent updates its policy based on the learning experience it has received so far.
1: for a number of rounds do
2:   repeat
3:     two agents are randomly chosen from the population, and one of them is assigned as the row player and the other one as the column player
4:     both agents play a two-player normal-form game and receive their corresponding payoffs
5:   until all agents in the population have been selected
6:   for each agent in the population do
7:     update its policy based on its past experience
8:   end for
9: end for

Fig. 4. Overall interaction protocol of the social learning framework.
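As a minimal illustration of this protocol, the following Python sketch pairs agents randomly each round and assigns roles. The agent methods (act, record, update) and the game representation are hypothetical placeholders for illustration, not part of the framework's specification.

import random

def run_social_learning(agents, game, num_rounds):
    # Sketch of the interaction protocol in Figure 4.
    # `agents` is a list of agent objects with assumed act/record/update methods;
    # `game` maps a joint action (row_action, col_action) to (row_payoff, col_payoff).
    # Assumes an even number of agents so that everyone is paired each round.
    for _ in range(num_rounds):
        random.shuffle(agents)
        # Pair agents so that every agent is selected exactly once per round.
        for row_agent, col_agent in zip(agents[0::2], agents[1::2]):
            a_row = row_agent.act(role="Row")
            a_col = col_agent.act(role="Column")
            r_row, r_col = game[(a_row, a_col)]
            group = r_row + r_col
            row_agent.record(role="Row", action=a_row, payoff=r_row, group_payoff=group)
            col_agent.record(role="Column", action=a_col, payoff=r_col, group_payoff=group)
        # At the end of each round, every agent updates its policy.
        for agent in agents:
            agent.update()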
3.2. Observation
In each round, each agent is only allowed to interact with one other agent randomly
chosen from the population, and each pair of interacting agents is regarded as being in
the same group. On one hand, during each (local) interaction, each agent typically can
perceive its own group’s information (i.e., the action choices and payoffs of itself and
its interaction partner), which is also known as the perfect monitoring assumption
[Brafman and Tennenholtz 2004]. This assumption is commonly adopted in previous
work in two-player repeated interaction settings. On the other hand, in the social learning framework, unlike in the two-player interaction scenario, agents may be exposed to information of agents from other groups. Allowing agents to observe the information of other agents outside their direct interactions may result in a faster learning rate and facilitate coordination on optimal solutions. In previous work [Hales and Edmonds 2003; Matlock and Sen 2007], it is assumed that each agent has access to all the other agents' information (their action choices and payoffs) in the system (global observation), which may be impractical in realistic environments due to the high communication cost involved.
To balance the tradeoff between global observation and local interaction, we only allow each agent to access the information of M other groups randomly chosen from the system at the end of each round. If M = N/2 (where N is the total number of agents), this becomes equivalent to allowing global observation; if M is zero, it reduces to the case of purely local interactions. The value of M represents the connectivity degree in terms of information sharing among different groups. Spreading this information through the population serves as an alternative biased exploration mechanism that accelerates the agents' convergence to the desired outcomes while keeping the communication overhead at a low level. As will be shown in the experimental part (Section 4), each agent only needs a small amount of information from other groups (M ≪ N/2) to guarantee coordination on socially optimal outcomes among all agents; thus the communication cost in the system is significantly reduced compared with previous work [Hales and Edmonds 2003; Matlock and Sen 2007].
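To make the observation mechanism concrete, the sketch below samples M other groups at the end of a round and returns the records an agent is allowed to see; the per-group record format used here is an assumption for illustration only.

import random

def observe_other_groups(own_group, all_groups, M):
    # Return the round information visible to one agent.
    # `all_groups` is a list of per-group records for the current round, e.g.
    # {"Row": (action, payoff), "Column": (action, payoff)} (assumed format).
    # The agent always sees its own group plus M other groups sampled uniformly
    # at random; M = 0 reduces to purely local observation.
    others = [g for g in all_groups if g is not own_group]
    sampled = random.sample(others, min(M, len(others)))
    return [own_group] + sampled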
3.3. Learning Strategy
When an agent interacts with its partner in each round, it needs to employ a learning strategy to make its decisions. In general-sum games, especially asymmetric ones, the payoff matrices of the row and column players are usually different; thus each agent may need to behave differently according to its current role (either as the row player or the column player) to reach socially optimal outcomes. To this end, we propose that each agent holds a pair of strategies, one used as the row
player and the other used as the column player, to play with any other agent from the population. The strategy we develop here is based on the Q-learning technique [Watkins and Dayan 1992], in which each agent holds a Q-value Q(s, a) for each action a in each state s ∈ {Row, Column}; the Q-value keeps a record of the action's past performance and serves as the basis for making decisions. We assume that each agent knows its current role during each interaction. Accordingly, each agent chooses its action based on the set of Q-values corresponding to its current role.
3.3.1. How to Select the Action. Each agent chooses its action based on the ε-greedy mechanism, which is commonly adopted for performing biased action selection in the reinforcement learning literature. Under this mechanism, each agent chooses the action with the highest Q-value with probability 1 − ε to exploit the currently best-performing action (selecting randomly in case of a tie), and makes a random choice with probability ε for the purpose of exploring new actions with potentially better performance.
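The following sketch shows one plausible data structure for the per-role Q-tables and the ε-greedy action selection described above; the class name and field layout are illustrative assumptions.

import random

class SocialLearner:
    # Illustrative agent with one Q-table per role (Row / Column).
    def __init__(self, actions, epsilon=0.1):
        self.actions = list(actions)            # e.g., ["C", "D"]
        self.epsilon = epsilon
        # Separate Q-values for each state (role), initialized to zero.
        self.q = {"Row": {a: 0.0 for a in self.actions},
                  "Column": {a: 0.0 for a in self.actions}}

    def act(self, role):
        # epsilon-greedy selection over the Q-values of the current role.
        if random.random() < self.epsilon:
            return random.choice(self.actions)  # explore
        q_role = self.q[role]
        best = max(q_role.values())
        # Break ties randomly among the best-valued actions.
        return random.choice([a for a, v in q_role.items() if v == best])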
3.3.2. How to Update the Q-Values. According to Section 3.2, we know that in each
round the information that each agent has access to consists of its own group’s information and the information of other M groups randomly chosen from the population.
The traditional and commonly adopted approach for a selfish agent is to use the individual payoff received from performing an action as the reinforcement signal to update its strategy (Q-values). However, since an agent is situated in a social context, its selfish behaviors can have implicit and detrimental effects on the behaviors of other agents coexisting in the environment, which may in turn lead to undesirable results for both the individual agents and the overall system. Accordingly, an alternative notion proposed for solving this problem is social rationality [Hogg and Jennings 1997, 2001], under which each agent evaluates the performance of each action in terms of its worth to the overall group instead of to itself. We consider an agent individually rational if its strategy is updated based on its own payoff only, and this kind of update scheme is called the individually rational update; we consider an agent a socially rational entity if its strategy is updated based on the sum of all the agents' payoffs in the same group, and this kind of update scheme is called the socially rational update.
Which update scheme is better for updating the agents' strategies depends on the type of game being played. For cooperative games, the difference between the socially rational and the individually rational update is irrelevant, since the agents share common interests. For competitive (zero-sum) games, it is meaningless to perform a socially rational update, since the sum of both players' payoffs under every outcome is always equal to zero, and it is more reasonable for the agents to update their strategies based on their individual payoffs only. For mixed games, it can be beneficial for the agents to update their strategies in the socially rational manner, which helps the agents coordinate on win-win outcomes rather than loss-loss outcomes (e.g., in the prisoner's dilemma game). On the other hand, it may be better to adopt the individually rational update to prevent malicious exploitation when interacting with individually rational agents.
Since we assume that the agents have no prior information about the game they are playing, each agent should learn to determine the appropriate way of updating its strategy adaptively, that is, based on either its individual payoff, the joint payoff(s) of the group(s), or a combination of both. To handle this individually/socially rational update selection problem, we propose that each agent i is equipped with two weighting factors w_{i,il} and w_{i,sl}, representing its individually rational and socially rational degrees and satisfying the constraint w_{i,il} + w_{i,sl} = 1. In general, each
agent's Q-values are updated in the individually rational and the socially rational manner as follows:

Q_i^{t+1}(s, a) = Q_i^t(s, a) + α^t(s) w_{i,il}^t (r_{i,il}(s, a) − Q_i^t(s, a)),  s ∈ {Row, Column},   (1)

Q_i^{t+1}(s, a) = Q_i^t(s, a) + α^t(s) w_{i,sl}^t (r_{i,sl}(s, a) − Q_i^t(s, a)),  s ∈ {Row, Column}.   (2)

Here Q_i^t(s, a) is agent i's Q-value for action a at time step t in state s, α^t(s) is the learning rate in state s at time step t, r_{i,il}(s, a) is the payoff agent i uses to update its Q-value of action a in the individually rational manner, and r_{i,sl}(s, a) is the payoff agent i uses to update its Q-value in the socially rational manner. We discuss how to determine the values of r_{i,il}(s, a) and r_{i,sl}(s, a) later. Notice that each agent shares the same set of weighting factors w_{i,il} and w_{i,sl} no matter whether it is the row player or the column player. The value of α^t(s) is decreased gradually: α^t(s) = 1/No.(s), where No.(s) is the number of times that state s has been visited up to time t.

ALGORITHM 1: Adaptive Weighting Factors Updating Strategy for Agent i
1: initialize the values of w_{i,il} and w_{i,sl}
2: initialize the updating direction UD^0, i.e., towards increasing the weight on individual learning or on social learning
3: for each period T (at its end) do
4:   keep a record of its own performance (e.g., the average payoff avg_T) within this period¹
5:   if it is not the first adjustment period (T > 1) then
6:     if avg_T > avg_{T−1} or |avg_T − avg_{T−1}| / max(avg_T, avg_{T−1}) ≤ Threshold then
7:       keep the updating direction UD^T unchanged
8:     else
9:       reverse the updating direction UD^T
10:    end if
11:  else
12:    keep the updating direction UD^T unchanged
13:  end if
14:  if UD^T points towards social learning then
15:    w_{i,sl} = w_{i,sl} + δ
16:    w_{i,il} = w_{i,il} − δ
17:  else
18:    w_{i,il} = w_{i,il} + δ
19:    w_{i,sl} = w_{i,sl} − δ
20:  end if
21: end for
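Returning to the Q-value updates themselves, here is a minimal Python sketch of Eqs. (1) and (2) above, including the 1/No.(s) learning-rate decay. The attribute names (q, w_il, w_sl, visits) are illustrative assumptions, and applying the two weighted updates in sequence is one reading of how the two schemes are combined.

def update_q(agent, role, action, r_il, r_sl):
    # Apply Eqs. (1) and (2) for one (state, action) pair.
    # `agent.q[role][action]` is the current Q-value, `agent.w_il` and
    # `agent.w_sl` are the weighting factors (w_il + w_sl = 1), and
    # `agent.visits[role]` counts visits to the state, giving alpha = 1 / No.(s).
    agent.visits[role] += 1
    alpha = 1.0 / agent.visits[role]
    q = agent.q[role][action]
    q += alpha * agent.w_il * (r_il - q)   # individually rational update, Eq. (1)
    q += alpha * agent.w_sl * (r_sl - q)   # socially rational update, Eq. (2)
    agent.q[role][action] = q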
In order to optimize its own performance in different environments, each agent should adaptively update the values of its weighting factors based on its own past performance. The natural question, therefore, is how to update the values of w_{i,il} and w_{i,sl} for each agent. Here we propose an adaptive strategy, shown in Algorithm 1, by which each agent periodically updates its two weighting factors in a greedy way based on its own past performance.
1 In the current implementation, we only take each agent's payoff during the second half of each period into consideration, in order to obtain a more accurate evaluation of the actual performance of that period.
To update its weighting factors, each agent maintains an updating direction, that is, towards increasing either the weight on the individually rational update or the weight on the socially rational update. We consider a fixed number m of steps as a period. At the end of each period, each agent i adaptively adjusts these two weighting factors by δ following the current updating direction, where δ is the adjustment degree. For example, if the current updating direction is to increase the weight on the socially rational update, the weighting factor w_{i,sl} is increased by δ and the value of w_{i,il} is decreased by δ accordingly (Lines 14–20). During each period, each agent also keeps a record of its performance, that is, the average payoff obtained within the period. The updating direction is adjusted by comparing the relative performance of two consecutive periods: an agent switches its current updating direction to the opposite one if its performance in the current period is worse than that of the previous period (Lines 5–10). For the first adjustment period, since there is no previous performance record for comparison, each agent simply keeps its initial updating direction unchanged and updates the weighting factors accordingly (Lines 11–12 and 14–20).
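A compact Python rendering of Algorithm 1's greedy, periodic adjustment might look as follows; the attribute names mirror the pseudocode, while the clamping of the weights to [0, 1] and the zero-division guard are added assumptions.

def adjust_weights(agent, avg_now, avg_prev, delta=0.1, threshold=0.05):
    # One end-of-period adjustment step following Algorithm 1.
    # `agent.direction` is either "social" or "individual"; `avg_prev` is None
    # for the first adjustment period. The direction is kept if performance did
    # not degrade (beyond a relative threshold) and reversed otherwise, then the
    # weights are shifted by delta while keeping w_il + w_sl = 1.
    if avg_prev is not None:
        denom = max(avg_now, avg_prev)
        improved = avg_now > avg_prev
        close = denom > 0 and abs(avg_now - avg_prev) / denom <= threshold
        if not (improved or close):
            agent.direction = "individual" if agent.direction == "social" else "social"
    if agent.direction == "social":
        agent.w_sl = min(1.0, agent.w_sl + delta)
        agent.w_il = 1.0 - agent.w_sl
    else:
        agent.w_il = min(1.0, agent.w_il + delta)
        agent.w_sl = 1.0 - agent.w_il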
If an agent i updates its Q-values in the individually rational manner, it acts as an individually rational entity that uses the information of the individual payoffs received from performing different actions. Let us denote by S_{il}^t the set of information consisting of the action choices and payoffs of itself and of the agents with the same role from the other M groups, that is, S_{il}^t = {⟨a_i^t, r^t⟩, ⟨b_1^t, r_1^t⟩, ⟨b_2^t, r_2^t⟩, ..., ⟨b_M^t, r_M^t⟩}. Here ⟨a_i^t, r^t⟩ is the action and payoff of agent i itself, and the rest are the actions and payoffs of the M agents with the same role from the other M groups. At the end of each round t, agent i picks the action a with the highest payoff (randomly choosing one action in case of a tie) from the set S_{il}^t, and updates its Q-value of action a according to Eq. (1) by assigning r_{i,il} the value r(a) ∗ freq(a). Here r(a) is the highest payoff of action a among all elements in the set S_{il}^t, and freq(a) is the frequency with which action a occurs with the highest reward r(a) in the set S_{il}^t.
If an agent i learns its Q-values based on the socially rational update, it updates its Q-values based on the joint benefits of the agents in the same group. Let us denote by S_{sl}^t the set of information consisting of the action choices and group payoffs of itself and of the agents with the same role from the other M groups, that is, S_{sl}^t = {⟨a_i^t, R^t⟩, ⟨b_1^t, R_1^t⟩, ⟨b_2^t, R_2^t⟩, ..., ⟨b_M^t, R_M^t⟩}. Here ⟨a_i^t, R^t⟩ is the action of agent i together with the sum of the payoffs of agent i and its interaction partner, and the rest are the actions and group payoffs of the agents with the same role from the other M groups. Similar to the case of the individually rational update, at the end of each round t, each agent i picks the action a with the highest group payoff (randomly choosing one action in case of a tie) from the set S_{sl}^t, and updates the value of Q(s, a) according to Eq. (2) by assigning r_{i,sl} the value R(a) ∗ freq(a). Here R(a) is the highest group payoff of action a among all elements in the set S_{sl}^t, and freq(a) is the frequency with which action a occurs with the highest group reward R(a) in the set S_{sl}^t.
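The reinforcement values r_{i,il} and r_{i,sl} described above can be computed from the observed sets as in the sketch below. The list-of-pairs input format is an assumption, and taking freq(a) as the fraction of all observations in which the best action appears with its highest payoff is one reading of the definition in the text.

import random

def best_action_and_reward(observations):
    # Compute (a, r(a) * freq(a)) from a set of (action, payoff) pairs.
    # `observations` plays the role of S_il^t (individual payoffs) or
    # S_sl^t (group payoffs): the agent's own pair plus those of the M
    # observed agents with the same role.
    best_payoff = {}
    for action, payoff in observations:
        best_payoff[action] = max(payoff, best_payoff.get(action, float("-inf")))
    top = max(best_payoff.values())
    # Randomly break ties among actions achieving the overall highest payoff.
    action = random.choice([a for a, p in best_payoff.items() if p == top])
    count = sum(1 for a, p in observations if a == action and p == best_payoff[action])
    freq = count / len(observations)
    return action, best_payoff[action] * freq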
When all agents behave as socially rational entities when updating their Q-functions, it is not difficult to see that for any general-sum game G, the agents are actually learning over its corresponding fully cooperative game G', where the payoff profile (p_x, p_y) of each outcome (x, y) in G is changed to (p_x + p_y, p_x + p_y) in G'. From the definition of a socially rational outcome, we know that every socially rational outcome in the original game G corresponds to a Pareto-optimal Nash equilibrium in the transformed game G', and vice versa. Therefore, the goal of learning to achieve socially rational outcomes in game G becomes equivalent to that of learning to achieve a Pareto-optimal Nash equilibrium in the transformed cooperative game G'.
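The transformation from G to its fully cooperative counterpart G' can be written directly on the payoff table, as in this sketch; the example payoffs in the usage comment are standard illustrative prisoner's-dilemma numbers, not necessarily those of Figure 1.

def to_cooperative_game(game):
    # Transform a general-sum game G into its cooperative version G'.
    # `game` maps a joint action (x, y) to a payoff pair (p_x, p_y); in the
    # transformed game both players receive the group payoff p_x + p_y, so
    # socially optimal outcomes of G become Pareto-optimal Nash equilibria of G'.
    return {joint: (px + py, px + py) for joint, (px, py) in game.items()}

# Example usage (hypothetical payoffs):
# pd = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
# pd_coop = to_cooperative_game(pd)   # (C, C) now yields (6, 6), the unique best outcome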
3.4. Property of the Social Learning Framework
From Section 3.3, we know that in the social learning framework each agent holds a pair of strategies, one used as the row player and the other used as the column player, to play with any other agent from the population. The rationale is that the optimal strategies for achieving socially optimal outcomes under different roles (row or column player) are not necessarily the same, and this is especially true when it comes to asymmetric games. In general, given one socially optimal outcome (x_s, y_s) in game G, where action x_s and action y_s are not necessarily the same, let us suppose that the agents have finally learned a consistent optimal policy of coordinating on this outcome (x_s, y_s). Specifically, under this optimal policy, each agent always chooses action x_s as the row player and action y_s as the column player. Since each agent is randomly assigned as the row or column player each round, as specified by the interaction protocol in Figure 4, the average payoff of each agent approaches p̄ = (p_{x_s} + p_{y_s})/2 in the limit, where p_{x_s} and p_{y_s} are the payoffs of the row and column agents under the outcome (x_s, y_s). This indicates that the social learning framework has the desirable property that the agents are able to achieve both a fair (equal) and the highest payoff whenever they can successfully learn a consistent policy of coordinating on one socially optimal outcome. Formally, we have the following property.
PROPERTY 3.1. If the agents can finally learn a consistent policy of coordinating on one socially optimal outcome (x_s, y_s), then each agent will obtain both an equal and the highest payoff of p̄ on average.

PROOF. It is obvious from the preceding analysis that the agents' average payoffs are equal (fair) and equal to p̄ = (p_{x_s} + p_{y_s})/2, so we only need to prove that this is also the highest payoff that each agent can obtain. Suppose that there exists an agent i that can achieve an average payoff p̄' > (p_{x_s} + p_{y_s})/2 by coordinating on an outcome (x_s', y_s') different from (x_s, y_s). Since the agents within the system are symmetric, all other agents in the system should receive the same average payoff p̄' as well. Denoting by p_{x_s'} and p_{y_s'} the payoffs of the row and column players, respectively, under the outcome (x_s', y_s'), we have (p_{x_s'} + p_{y_s'})/2 > (p_{x_s} + p_{y_s})/2, which contradicts the definition of a socially optimal outcome. Thus no agent can ever achieve an average payoff higher than (p_{x_s} + p_{y_s})/2.
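As a quick worked illustration (using the payoffs of the anticoordination game of Figure 2 as described in the next paragraph, where the agent playing C receives 2 and the agent playing D receives 1 under each socially optimal outcome): if the agents learn to coordinate on (x_s, y_s) = (C, D), then p̄ = (p_{x_s} + p_{y_s})/2 = (2 + 1)/2 = 1.5, which every agent obtains on average because it plays each role with equal probability.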
In previous work [Matlock and Sen 2007, 2009], even though the authors proved that their tag-based paired reproduction mechanism favors the fairest outcome among all possible socially optimal outcomes, some agents' average payoffs can still be much lower than others' due to the nature of their mechanism. For example, consider the symmetric anticoordination game in Figure 2: under the paired reproduction mechanism, the population of agents is expected to be divided into a number of subgroups, and within each subgroup some agents always choose action C while the rest always choose action D. This results in unfairness in that the agents choosing action D always obtain a much lower payoff of 1 than the agents choosing action C (payoff of 2). In our social learning framework, however, this kind of unfairness is easily eliminated by assigning the two roles to each agent with equal probability and allowing each agent to learn a different policy for each role. For example, in the anticoordination game in Figure 2, under our social learning framework each agent learns the policy of choosing action C as the row player and action D as the column player, and each agent is assigned as either the row or the column player with equal probability. Therefore, the average payoff of each agent is the same in the long run. Moreover, under our social learning framework, the agents are able to achieve both fair and highest average payoffs in both symmetric and asymmetric games.

Fig. 5. Frequency of reaching different joint actions in the prisoner's dilemma game, averaged over 100 runs (period length m = 500, adjustment degree δ = 1.0, w_{i,il} = 0, and w_{i,sl} = 1).
Compared with previous work [Matlock and Sen 2007, 2009], another advantage of our social learning framework is that the communication cost is significantly reduced. As mentioned in Section 3.2, in our social learning framework each agent is only allowed to observe the information of M (M ≪ N/2) other groups of agents in the system, and this is also verified in the simulation part in Section 4. In contrast, previous work [Matlock and Sen 2007, 2009] implicitly requires that each agent can observe the information of all agents in the system.
4. SIMULATION
In this section, we empirically evaluate the performance of the social learning framework and investigate the influences of different factors on its learning performance. First, we use two well-known examples (the prisoner's dilemma game and learning the "rules of the road") to illustrate the learning dynamics and the performance of the learning framework, comparing with previous work, in Section 4.1.² In Section 4.2, to illustrate the generality of our social learning framework, we further evaluate its performance on a larger set of general-sum games consisting of different types of symmetric and asymmetric games. Finally, to evaluate the robustness and sensitivity of the social learning framework, the influences of different factors (playing against purely selfish agents, the information sharing degree, and playing against fixed-policy agents) on the learning performance are investigated in Sections 4.3 through 4.5.
4.1. Examples
4.1.1. Prisoner's Dilemma Game. The prisoner's dilemma game is one of the most well-known games and has attracted much attention in both the game theory and multiagent systems literatures. One typical example of the prisoner's dilemma game is shown in Figure 1. For any individually rational player, choosing defection (action D) is always the dominant strategy; however, mutual defection leads the agents to the inefficient outcome (D, D), while there exists another, socially optimal outcome (C, C)
in which each agent could obtain a much higher and fair payoff. Figure 5 shows the
2 Note that these two examples are the most commonly adopted testbeds in previous work [Hales and
Edmonds 2003; Hao and Leung 2011; Matlock and Sen 2007, 2009].
ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013.
i
i
i
i
i
i
i
i
Achieving Socially Optimal Outcomes in Multiagent Systems
15:15
Fig. 6. Frequency of reaching different joint actions in anticoordination game averaged over 100 times (the
length of each period m = 500, adjustment degree δ = 1.0, wi, il = 0, and wi, sl = 1).
average frequency of each joint action that the agents learn to coordinate on under
the social learning framework in a population of 100 agents.3 We can see that the
agents can eventually learn to converge to the outcome (C, C), which corresponds to the policy of "choosing action C as both the row and the column player". During the first period, all agents update their policies based on the socially rational update scheme, and the policy of "always choosing action C" is reached after approximately 250 rounds. It is worth noting that the outcome (C, C) actually corresponds to the only Nash equilibrium of the cooperative game obtained after the transformation. Starting from the second period, the agents change their update scheme to the individually rational update, and they learn to converge to the policy of "always choosing action D", thus converging to the outcome (D, D) from the middle of the second period onward. Since the average payoff obtained from mutual cooperation in the first period is much higher than that obtained from mutual defection in the second period, the agents change their update scheme back to the socially rational update at the beginning of the third period and stick to this scheme thereafter. Therefore, mutual cooperation (C, C) is always achieved among the agents from the middle of the third period onward.
4.1.2. Learning "Rules of the Road". The problem of learning the "rules of the road" [Sen and Airiau 2007] models a practical car-driving coordination scenario: who yields when two drivers arrive at an intersection at the same time from neighboring roads. Both drivers would like to cross the intersection first without waiting; however, if both drivers choose not to yield to the other side, they will collide. If both of them choose to wait, then neither of them can cross the intersection and their time is wasted. Their behaviors are well coordinated if and only if one of them chooses to wait and the other driver chooses to go first. This problem can be naturally modeled as a two-player two-action anticoordination game (see Figure 2, where action C corresponds to "Yield" and action D corresponds to "Go"). The natural solution to this problem is that the agents adopt the policy "yield to the driver on the left (right)", and this solution is optimal in that both agents obtain an equal and highest payoff on average, provided that their chances of driving on the left or the right road are the same.
Figure 6 shows the average frequency of different joint actions that the agents learn to converge to within a population of 100 agents.⁴

3 In Figure 5, only the frequencies of the relevant action profiles (C, C) and (D, D) are shown for the sake of clarity.

Fig. 7. Average frequency of achieving the socially optimal outcome (D, C) under the learning framework in the asymmetric anticoordination game, averaged over 100 runs (period length m = 500, adjustment degree δ = 1.0, w_{i,il} = 0, and w_{i,sl} = 1).

The results are obtained by averaging
over 100 independent runs, in which half of the runs the agents learn to acquire the
policy of “yielding to the driver on the left” ((D, C)), and acquire the policy of “yielding
to the driver on the right” ((C, D)) the rest of the times. Another observation is that the
agents learn to converge to one of the previous two optimal policies under individual
learning update schemes (transition from social learning update to individual learning
update at 500th round). This is in accordance with the results found in previous work
[Sen and Airiau 2007], in which the authors investigate whether the agents can learn
to coordinate on one of the previous two optimal policies if they learn based on their
private information only (equivalent with the individually rational update here).
Next we consider an asymmetric version of the anticoordination game shown in
Figure 3. Different from the symmetric anticoordination game, in which it is sufficient
for the agents to coordinate on complementary actions (either (C, D) or (D, C)), here
the row and column agents need to coordinate their joint actions on the unique
outcome (D, C) in order to achieve the socially optimal solution. Specifically, each
agent needs to learn the unique policy of row → D, column → C, that is, selecting
action D as the row player and selecting action C as the column player. Figure 7 shows
the frequency of each outcome that the agents achieve, averaged over 100 runs, under
the social learning framework. We can observe that the agents learn to converge to
the socially optimal outcome (D, C) during the first period under the socially rational
update scheme, and each agent receives an average payoff of 8.5 after convergence.
Starting from the second period, the agents begin to adopt individually rational
update for the purpose of exploration, and either (C, D) or (D, C) is converged to with
probabilities of approximately 0.6 and 0.4, respectively. If the agents learn to converge
to (C, D) during the second period, then the average payoff each agent receives after
convergence is only 4.5. Accordingly, the agents have the incentive to change their
update scheme back to socially rational update, and they stick to this scheme
thereafter since they can always receive the higher average payoff of 8.5. Notice that,
different from the symmetric anticoordination game, using individually rational
update alone is not sufficient to guarantee convergence to the socially optimal outcome
(D, C) in the asymmetric case. Under the individually rational update scheme, (C, D)
has a much higher chance of being converged to even though (D, C) is the only socially
optimal outcome. Therefore, socially rational update is necessary to guarantee that
the socially optimal outcome (D, C) can be achieved in this example.
4 In Figure 6, only the frequencies of the relevant action profiles (D, C) and (C, D) are shown for clarity.
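The role-dependent policy described in this example can be realized by letting each agent keep an
independent Q-table for each of its possible states (row player or column player). The sketch below
is a simplified illustration in this spirit, not the exact learning strategy of Section 3.3; the
learning rate, exploration rate, and choice of reward signal are assumptions for illustration.

```python
import random

ACTIONS = ["C", "D"]

class SocialLearner:
    """Simplified agent keeping one Q-table per state (its role in the current interaction)."""

    def __init__(self, alpha=0.1, epsilon=0.1):
        self.alpha = alpha        # learning rate (illustrative value)
        self.epsilon = epsilon    # exploration rate (illustrative value)
        self.q = {"row": {a: 0.0 for a in ACTIONS},
                  "column": {a: 0.0 for a in ACTIONS}}

    def choose_action(self, state):
        # epsilon-greedy action selection over the Q-table of the current state.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(self.q[state], key=self.q[state].get)

    def update(self, state, action, reward):
        # reward is the agent's own payoff under the individually rational update
        # scheme, or the sum of both players' payoffs under the socially rational
        # update scheme (an assumption on our part).
        q = self.q[state][action]
        self.q[state][action] = q + self.alpha * (reward - q)
```

Trained in the asymmetric game, such an agent would ideally end up with q["row"] favoring D and
q["column"] favoring C, matching the unique socially optimal policy discussed above.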
The experimental results in Sections 4.1.1 and 4.1.2 have shown that different update
schemes (social learning update or individual learning update) may be required to
achieve socially optimal solutions in different interaction scenarios, and that agents
using our adaptive update strategy are able to learn to choose the most suitable update
scheme accordingly. Next, we evaluate the learning performance of the social learning
framework by comparing it with previous work on the same examples.
4.1.3. Comparison with Previous Work. In this section, we compare the performance
of the social learning framework with the two learning frameworks proposed in
previous work [Hao and Leung 2011; Matlock and Sen 2007]. Since all previous work
is applicable only to symmetric games, here we evaluate their performance on two
representative games: the prisoner’s dilemma game (Figure 1) and the anticoordination
game (Figure 2), which are the most commonly adopted testbeds in previous work
[Chao et al. 2008; Hao and Leung 2011; Matlock and Sen 2007]. Figure 8 and Figure 9
show the dynamics of the percentage of agents coordinating on socially optimal
outcomes as a function of rounds for the learning frameworks of Hao and Leung [2011]
and Matlock and Sen [2007] under the prisoner’s dilemma game and the
anticoordination game, respectively.
Fig. 8. Learning performance of previous work in the prisoner’s dilemma game.
Fig. 9. Learning performance of previous work in the anticoordination game.
We can see that both previous learning frameworks can lead a high percentage of
agents to coordinate on socially optimal outcomes on average; however, there is always
a small proportion of agents that fails to do so. This phenomenon is particularly obvious
for the paired-reproduction-based learning framework of Matlock and Sen [2007], in
which the deviation of the percentage of agents coordinating on socially optimal
outcomes is very high, since the agents under their learning framework use
evolutionary learning (copy and mutation) to make their decisions. In their approach,
mutation plays an important role in evolving a high level of coordination on socially
optimal outcomes, while the corresponding side effect is that the deviation of the
percentage of socially optimal outcomes is very high. In contrast, under our social
learning framework, all the agents can stably learn to coordinate on socially optimal
outcomes in both the prisoner’s dilemma game (see Figure 5) and the anticoordination
game (see Figure 6).
Table I. Success Rate in Terms of Percentage of Games in which Full Coordination on
Socially Optimal Outcomes Can be Achieved in Randomly Generated General-Sum
Games for Different Values of m and thres
success rate (m / thres)    0       0.001    0.01     0.05     0.1      0.2
m = 1000                    0.586   0.624    0.79     0.916    0.812    0.676
m = 1500                    0.504   0.554    0.856    0.92     0.84     0.69
m = 2000                    0.97    0.968    0.964    0.962    0.806    0.738
4.2. General Performance in Two-Player General-Sum Games
The previous section has shown the effectiveness of the social learning framework
compared with previous work in two representative symmetric games, which are the
most commonly adopted testbeds in previous work. In this section, we further evaluate
the general performance of the social learning framework on a large set of two-player
general-sum games, which consists of different types of symmetric and asymmetric
games. The payoffs of all games are generated randomly within the range of 0 to 100.
Table I lists the average success rate in terms of the percentage of games in which
full coordination on socially optimal outcomes can be achieved when the values of the
adjustment period length m and the acceptance threshold thres vary. All results are
averaged over 100 randomly generated games. First, we can see that under our social
learning framework the agents achieve a high success rate, with the highest value
of 0.97 when m = 2000 and thres = 0. However, there still exist 3% of games in
which the agents cannot achieve full coordination on socially optimal outcomes. The
underlying reason is that in those games there exists a non-socially-optimal outcome
such that the agents’ average payoff from coordinating on it is very close to that from
coordinating on the socially optimal outcome. Some agents may therefore fail to
distinguish which one is better, learn the incorrect update scheme, and fail to learn
the individual actions corresponding to the socially optimal outcome, which results in
a certain degree of miscoordination on non-socially-optimal outcomes.
Next we analyze the influence of the values of thres and m on the success rate of
coordinating on socially optimal outcomes. The value of the threshold thres reflects
the tolerance degree towards the performance of the current learning period. The
larger the length of each adjustment period (m) is, the more accurate an estimation of
the actual performance of each period we can obtain. When the estimation of the actual
performance of each period is not accurate enough (the value of m is relatively small),
setting the value of the threshold thres too large or too small can result in a high rate
of miscoordination on non-socially-optimal outcomes. If the value of thres is too small,
it is likely that some agents choose the wrong update scheme for the next period due
to the noise of inaccurate performance estimates; if the value of thres is too large,
some agents may make wrong judgements about the optimal update scheme due to the
high tolerance degree, which also results in a certain degree of miscoordination. Indeed,
this can be verified from Table I: when the length of the adjustment period is small
(m = 1000 or 1500), the highest success rate of achieving full coordination on socially
optimal outcomes is obtained when the threshold is set to 0.05; either decreasing or
increasing the value of thres decreases the success rate of coordinating on socially
optimal outcomes. When the length of the adjustment period is further increased to
2000, the agents are able to obtain a more accurate estimation of the performance of
each period, and thus the highest success rate is achieved when the threshold is set to
the minimum value of 0.
Table II. Average Payoffs of Agents Following the Strategy in Section 3.3 and SAs in the Prisoner’s Dilemma
Game when the Percentage of SAs Varies
Average Payoffs                               10%     20%     30%     40%     50%     60%     ≥70%
Agents following strategy in Section 3.3      2.75    2.45    2.2     1.81    1.48    1.08    1
Selfish agents (SAs)                          4.39    4.15    3.8     3.43    3.02    1.28    1
4.3. Against Purely Individually Rational Agents
The successful coordination on socially optimal outcomes in the social learning
framework requires the cooperation of all agents in the system, that is, each agent
should have the intention to follow the learning strategy proposed in Section 3.3.
However, in an open multiagent environment, there may exist some selfish agents
that exploit, for their own benefit, those agents that update their Q-values in the
socially rational manner. For example, consider the prisoner’s dilemma game: to
achieve the socially optimal outcome (C, C), all agents are required to cooperatively
learn to adopt the socially rational update scheme and finally learn to choose action C
each round. However, they can easily be exploited by a small number of purely
individually rational agents, which can learn to choose action D and thus obtain a
much higher payoff. In this section, we empirically show that the agents following the
strategy in Section 3.3 can automatically learn to switch back to the individually
rational update scheme when the number of selfish agents is so large that there is no
longer any benefit in adopting the socially rational update scheme. In this way, the
agents can effectively prevent themselves from being exploited by those selfish agents.
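The vulnerability described above can be seen directly in the reward signals the two kinds of agents
feed into their updates. The sketch below assumes that socially rational update uses the sum of both
players' payoffs while individually rational update uses the agent's own payoff, and it uses the
standard textbook prisoner's dilemma payoffs (R, S, T, P) = (3, 0, 5, 1) purely for illustration;
the actual values are those of Figure 1.

```python
def reward_signal(own_payoff, partner_payoff, scheme):
    """Reward fed into the Q-update, under our assumed reading of the two schemes."""
    if scheme == "social":
        return own_payoff + partner_payoff   # socially rational: social-welfare signal
    return own_payoff                        # individually rational: selfish signal

# A socially rational learner sees mutual cooperation (3 + 3 = 6) as better than
# exploiting a cooperator (5 + 0 = 5), whereas a selfish learner prefers 5 to 3,
# which is exactly why socially rational agents can be exploited.
assert reward_signal(3, 3, "social") > reward_signal(5, 0, "social")
assert reward_signal(5, 0, "individual") > reward_signal(3, 3, "individual")
```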
Table II shows the average payoffs of the agents following the strategy in Section 3.3
and of purely selfish agents when the percentage of purely selfish agents varies in the
prisoner’s dilemma game. If all agents are purely selfish (corresponding to the case of
100% purely selfish agents), then all agents learn to converge to choosing action D and
obtain an average payoff of 1. However, when there is only a relatively small percentage
of selfish agents, the remaining agents can still obtain an average payoff higher than 1
by performing socially rational update. This corresponds to the cases in Table II in
which the percentage of purely selfish agents is less than 70%: the remaining agents
still learn to adopt socially rational update and finally choose action C, and thus obtain
an average payoff higher than 1.0. When the percentage of purely selfish agents reaches
70% or more, the agents following the strategy in Section 3.3 realize that there is no
longer any benefit in adopting socially rational update, switch to individually rational
update, and finally learn to choose action D, so that all agents obtain an average payoff
of 1.
4.4. Influences of the Amount of Information Shared Among Agents
One major difference between the social learning framework and the commonly
adopted setting of two-agent repeated interactions is that the agents have more
information (other interacting pairs of agents in the population) at their disposal when
they learn their coordination policies. Therefore, it is particularly important to have a
clear understanding of the influence of the amount of information shared among the
agents (the value of M) on the system-level performance. To this end, in this section we
evaluate the influence of the value of M on the learning performance of the system
using the symmetric anticoordination game in Figure 2. Figure 10 shows the dynamics
of the average payoff of each agent as a function of rounds with different values of M
under the socially rational update scheme. We can see that each agent’s average payoff
increases more quickly as the value of M increases. Intuitively, a larger value of M
means that the agents are exposed to more information each round, and it is expected
that they are able to learn their policies in a more informed way and thus coordinate
their actions on socially optimal outcomes more efficiently. However, it is worth noting
that the time required before reaching the maximum average payoff is approximately
the same for different values of M.
Fig. 10. Dynamics of the average payoff of agents with different values of M during the period of social
learning update.
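One plausible reading of M, in line with the observation mechanism of the framework, is the number
of other interactions each agent gets to observe (and additionally learn from) per round; the
sampling below is an assumption for illustration rather than the framework's exact mechanism.

```python
import random

def observe_interactions(round_interaction_log, M):
    """Sample up to M of the interactions that took place this round.

    round_interaction_log -- list of (state, action, reward) experience tuples
                             generated by other agent pairs this round
    A larger M exposes the observing agent to more experience per round, which it
    can feed into the same update rule used for its own interaction.
    """
    k = min(M, len(round_interaction_log))
    return random.sample(round_interaction_log, k)
```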
4.5. Influence of Fixed Agents
We have shown that both socially optimal outcomes (C, D) and (D, C) have equal
chances to be converged to in the symmetric anticoordination game in Section 4.1.2.
This is reasonable and expected since the game itself is symmetric and all agents are
identical, thus there is no bias towards any of the two outcomes. We also have investigated the influence of game structure on the convergence results of the system when
the game itself is not symmetric. In this section, we consider another situation when
there exist some agents that adopt a fixed policy without any learning ability, and evaluate how the existence of small number of fixed agents can influence the convergence
outcome of the system.
Figure 11 shows the frequency that each socially optimal outcome can be converged
to when different number of fixed agents are inserted into the system using the symmetric anticoordination game (Figure 2). We assume that the fixed agents always make
their decision following the policy of “choosing action D as the row player and choosing
action C as the column player”. When there is no fixed agents inserted into the system,
both socially optimal outcomes (C, D) and (D, C) can be converged to with equal probability. When we insert only one fixed agent into the system, it is surprising to find
that the probability of converging to the outcome (D, C) (corresponding to converging
to the policy of the fixed agent) is increased to 0.7! When the number of fixed agents
is further increased, the probability of converging to (D, C) is increased as well. With
only 4 fixed agents inserted into the system, the agents converge to the outcome (D, C)
ACM Transactions on Autonomous and Adaptive Systems, Vol. 8, No. 3, Article 15, Publication date: September 2013.
i
i
i
i
i
i
i
i
Achieving Socially Optimal Outcomes in Multiagent Systems
15:21
Fig. 11. Average frequency of socially optimal outcomes converged to under the learning framework in
asymmetric anticoordination game with different number of fixed agents (averaged over 100 runs).
Fig. 12. Average frequency of achieving each outcome under the learning framework in asymmetric
anticoordination game with one fixed agent (averaged over 100 runs).
with approximately 100% chance. There might thus be some truth to the adage that
most fashion trends are decided by a handful of trend setters in Paris! [Sen and Airiau
2007].
Figure 12 illustrates the detailed dynamics of the frequencies of the different outcomes
in a population of 100 agents when only one fixed agent is inserted into the system.
First, we can see that the agent with the fixed policy has a significant influence on the
convergence outcomes in the first period under socially rational update, that is, about
a 70% chance of converging to (D, C) instead of 50% without any fixed agent. During
the second period under individually rational update, there is about a 37% chance that
the agents miscoordinate their action choices (reaching (D, D)) due to the existence of
the fixed-policy agent, and the chance of converging to (D, C) remains higher than that
of converging to (C, D). The agents change their update scheme back to socially rational
update in the third period, and the chance of converging to (D, C) gradually increases
to its maximum value (approximately 70% of the runs) by the end of the fourth period.
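The setup of this experiment is straightforward to reproduce in simulation: a handful of fixed-policy
agents are mixed into an otherwise learning population. The sketch below uses hypothetical class and
parameter names; the learner could be any agent following the strategy of Section 3.3, such as the
simplified per-role learner sketched earlier.

```python
import random

class FixedAgent:
    """Fixed-policy agent: always plays D as the row player and C as the column player."""

    def choose_action(self, state):
        return "D" if state == "row" else "C"

    def update(self, state, action, reward):
        pass  # fixed agents never learn

def build_population(learner_factory, n_agents=100, n_fixed=1):
    """Mix n_fixed fixed-policy agents into a population of learning agents."""
    population = [learner_factory() for _ in range(n_agents - n_fixed)]
    population += [FixedAgent() for _ in range(n_fixed)]
    random.shuffle(population)
    return population
```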
5. CONCLUSION AND FUTURE WORK
In this work, we propose a social learning framework for a population of agents to
coordinate on socially optimal outcomes in the context of general-sum games. We
introduce the observation mechanism into the social learning framework to reduce the
amount of communication required among agents. The agents’ learning strategies are
developed based on reinforcement learning techniques instead of evolutionary learning.
Each agent explicitly keeps a record of its current state (row or column player) in its
learning strategy, and learns its optimal policy for each state independently. In this
way, the agents are able to achieve much more stable coordination on socially optimal
outcomes compared with previous work, and our social learning framework is suitable
for both symmetric and asymmetric games. The performance of this social learning
framework is extensively evaluated on the testbed of two-player general-sum games,
and better performance is achieved compared with previous work [Hao and Leung
2011; Matlock and Sen 2007].
We investigate the influences of different factors on the learning performance of the
social learning framework in detail. Successful coordination on socially optimal
outcomes requires the agents to cooperatively coordinate on the selection between
individual learning update and social learning update, which largely depends on the
values of the adjustment period length and the acceptance threshold. We also show
that agents in our social learning framework can resist exploitation by purely
noncooperative selfish agents, and that a small number of fixed-policy agents can have
a tremendous influence on which socially optimal outcome the agents converge to.
In the current social learning framework we adopt, the interactions among agents
are determined randomly and modeled as two-player normal-form games. As future
work, some generalizations are necessary to make it more applicable to real-world
complex environments. First, some practical problems involve interactions among
more than two players, and each agent may have numerous actions; such problems
need to be modeled as n-player games (n > 2) rather than two-player games. For
example, the resource allocation scenarios in ad hoc networks and sensor networks
need to be modeled as n-player public goods games [Pitt et al. 2012]. Second, the
interactions among agents may not necessarily be random, and may follow patterns
determined by the structure of an underlying social network (e.g., a lattice or scale-free
network). Therefore, another interesting direction is to explicitly take into consideration
the underlying topology of the agents’ interaction network. In this way, more efficient
coordination towards socially optimal outcomes may be achieved due to the availability
of alternative mechanisms, for example, a rewiring mechanism that allows agents
themselves to determine their interaction partners.
REFERENCES
Allison, P. D. 1992. The cultural evolution of beneficent norms. Social Forces. 279–301.
Bowling, M. H. and Veloso, M. M. 2003. Multiagent learning using a variable learning rate. Artif. Intell. 136,
215–250.
Brafman, R. I. and Tennenholtz, M. 2004. Efficient learning equilibrium. Artif. Intell. 159, 27–47.
Chao, I., Ardaiz, O., and Sanguesa, R. 2008. Tag mechanisms evaluated for coordination in open multi-agent
systems. In Proceedings of the 8th International Workshop on Engineering Societies in the Agents World.
254–269.
Claus, C. and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems.
In Proceedings of AAAI’98. 746–752.
Conitzer, V. and Sandholm, T. 2006. AWESOME: A general multiagent learning algorithm that converges
in self-play and learns a best response against stationary opponents. In Proceedings of ICML’06. 83–90.
Crandall, J. W. and Goodrich, M. A. 2005. Learning to teach and follow in repeated games. In Proceedings of
the AAAI Workshop on Multiagent Learning.
Greenwald, A. and Hall, K. 2003. Correlated Q-Learning. In Proceedings of ICML’03. 242–249.
Hales, D. 2000. Cooperation without space or memory: Tags, groups and the Prisoner’s Dilemma. In Multi-Agent-Based Simulation, Lecture Notes in Artificial Intelligence.
Hales, D. and Edmonds, B. 2003. Evolving social rationality for MAS using “Tags”. In Proceedings of
AAMAS’03. 497–503.
Hao, J. Y. and Leung, H. F. 2010. Strategy and fairness in repeated two-agent interaction. In Proceedings of
ICTAI’10. 3–6.
Hao, J. Y. and Leung, H. F. 2011. Learning to achieve social rationality using tag mechanism in repeated
interactions. In Proceedings of ICTAI’11. 148–155.
Hao, J. Y. and Leung, H. F. 2012. Learning to achieve socially optimal solutions in general-sum games. In
Proceedings of the PRICAI’12. 88–99.
Hoen, P. J., Tuyls, K., Panait, L., Luke, S., and La Poutré, J. A. 2005. An overview of cooperative and competitive multiagent learning. In Proceedings of the 1st International Workshop on Learning and Adaption
in Multi-Agent Systems. 1–46.
Hogg, L. M. and Jennings, N. R. 1997. Socially rational agents. In Proceedings of the AAAI Fall Symposium
on Socially Intelligent Agents. 61–63.
Hogg, L. M. J. and Jennings, N. R. 2001. Socially intelligent reasoning for autonomous agents. IEEE Trans.
Syst. Man. Cybernetics, Part A: Syst. Humans. 381–393.
Holland, J. H., Holyoak, K., Nisbett, R., and Thagard, P. 1986. Induction: Processes of Inferences, Learning,
and Discovery. MIT Press, Cambridge, MA.
Howley, E. and O’Riordan, C. 2005. The emergence of cooperation among agents using simple fixed bias
tagging. In Proceedings of the IEEE Congress on Evolutionary Computation.
Hu, J. and Wellman, M. 1998. Multiagent reinforcement learning: theoretical framework and an algorithm.
In Proceedings of the ICML’98.
Kapetanakis, S. and Kudenko, D. 2002. Reinforcement learning of coordination in cooperative multi-agent
systems. In Proceedings of the AAAI’02. 326–331.
Littman, M. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of
ICML’94. 322–328.
Matlock, M. and Sen, S. 2005. The success and failure of tag-mediated evolution of cooperation. In Proceedings of the 1st International Workshop on Learning and Adaption in Multi-Agent Systems. Springer,
155–164.
Matlock, M. and Sen, S. 2007. Effective tag mechanisms for evolving coordination. In Proceedings of
AAMAS’07. 1340–1347.
Matlock, M. and Sen, S. 2009. Effective tag mechanisms for evolving cooperation. In Proceedings of
AAMAS’09. 489–496.
Maynard Smith, J. 1982. Evolution and the Theory of Games. Cambridge University Press, Cambridge, UK.
Nowak, M. and Sigmund, K. 1993. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner’s Dilemma game. Nature, 56–58.
Osborne, M. J. and Rubinstein, A. 1994. A Course in Game Theory. MIT Press, Cambridge, MA.
Panait, L. and Luke, S. 2005. Cooperative multi-agent learning: The state of the art. Auton. Agents Multi-Agent Syst. 11, 3, 387–434.
Pitt, J., Schaumeier, J., Busquets, D., and Macbeth, S. 2012. Self-organising common-pool resource allocation
and canons of distributive justice. In Proceedings of SASO’12. 119–128.
Riolo, R. and Cohen, M. D. 2001. Cooperation without reciprocity. Nature, 441–443.
Sen, S. and Airiau, S. 2007. Emergence of norms through social learning. In Proceedings of IJCAI’07.
1507–1512.
Verbeeck, K., Nowé, A., Parent, J., and Tuyls, K. 2006. Exploring selfish reinforcement learning in repeated
games with stochastic rewards. In Proceedings of AAMAS’06. 239–269.
Villatoro, D., Sabater-Mir, J., and Sen, S. 2011. Social instruments for robust convention emergence. In
Proceedings of IJCAI’11. 420–425.
Wakano, J. Y. and Yamamura, N. 2001. A simple learning strategy that realizes robust cooperation better
than Pavlov in iterated Prisoner’s Dilemma. J. Ethol. 19, 9–15.
Watkins, C. J. C. H. and Dayan, P. 1992. Q-learning. Mach. Learn. 8, 279–292.
Received January 2013; revised May 2013; accepted August 2013