APPROXIMATE ACTION SELECTION FOR LARGE, COORDINATING,
MULTIAGENT SYSTEMS
by
SCOTT T. SOSNOWSKI
Submitted in partial fulfillment of the requirements
For the degree of Master of Science
Thesis Advisor: Dr. Soumya Ray
Department of Electrical Engineering and Computer Science
CASE WESTERN RESERVE UNIVERSITY
May, 2016
CASE WESTERN RESERVE UNIVERSITY
SCHOOL OF GRADUATE STUDIES
We hereby approve the thesis of
Scott T. Sosnowski
candidate for the
Master of Science
degree*.
Soumya Ray
(signed)
(chair of the committee)
Marc Buchner
M. Cenk Çavuşoğlu
Michael Lewicki
(date)
February 26, 2016
*We also certify that written approval has been obtained for any proprietary material
contained therein
© Copyright by Scott T. Sosnowski May, 2016
All Rights Reserved
TABLE OF CONTENTS

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
   1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
   2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
      2.1.1 Multi-agent Extension of the MDP . . . . . . . . . . . . . . . . . . . . . 22
   2.2 Reinforcement Learning: Q-Learning and SARSA . . . . . . . . . . . . . . . . . 23
   2.3 State Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
   2.4 Coordination Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
   2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 Action Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
   3.1 Exact Action Selectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
      3.1.1 Brute Force Action Selector . . . . . . . . . . . . . . . . . . . . . . . . 36
      3.1.2 Agent Elimination Action Selector . . . . . . . . . . . . . . . . . . . . . 37
         3.1.2.1 Advantages and Drawbacks . . . . . . . . . . . . . . . . . . . . . . 42
   3.2 Max Plus Action Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
      3.2.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
      3.2.2 Advantages, Drawbacks, and Comparison . . . . . . . . . . . . . . . . . . 49
   3.3 Mean Field Action Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
      3.3.1 Advantages, Drawbacks, and Comparison . . . . . . . . . . . . . . . . . . 56
   3.4 Approximate Coordination Through Local Search . . . . . . . . . . . . . . . . . 57
      3.4.1 Action Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
      3.4.2 Action Graph Traversal Action Selector . . . . . . . . . . . . . . . . . . 58
         3.4.2.1 Epsilon-Greedy Successor . . . . . . . . . . . . . . . . . . . . . . 61
         3.4.2.2 Boltzmann Successor . . . . . . . . . . . . . . . . . . . . . . . . . 61
         3.4.2.3 Greedy-With-Restarts Successor . . . . . . . . . . . . . . . . . . . 61
         3.4.2.4 Advantages, Drawbacks, and Comparison . . . . . . . . . . . . . . . 62
4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
   4.1 Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
      4.1.1 Predator-and-Prey Domain . . . . . . . . . . . . . . . . . . . . . . . . . 65
         4.1.1.1 State and Action Space . . . . . . . . . . . . . . . . . . . . . . . 67
      4.1.2 Sepia Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
         4.1.2.1 State and Action Space . . . . . . . . . . . . . . . . . . . . . . . 69
   4.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
      4.2.1 Discoordinated Action Selection . . . . . . . . . . . . . . . . . . . . . . 71
         4.2.1.1 Advantages, Drawbacks, and Comparison . . . . . . . . . . . . . . . 72
         4.2.1.2 Optimistic Variant . . . . . . . . . . . . . . . . . . . . . . . . . . 73
   4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
      4.3.1 Sample Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
         4.3.1.1 Sampling Directly from Learning Results . . . . . . . . . . . . . . . 77
         4.3.1.2 Sampling Indirectly from Multiple Learning Results . . . . . . . . . 77
   4.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
   4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
      4.5.1 Scaling Agent Elimination . . . . . . . . . . . . . . . . . . . . . . . . . 79
      4.5.2 Tuning Action Graph Traversal . . . . . . . . . . . . . . . . . . . . . . 79
         4.5.2.1 Tuning Greediness . . . . . . . . . . . . . . . . . . . . . . . . . . 81
         4.5.2.2 Comparing the Successor Algorithms . . . . . . . . . . . . . . . . . 83
         4.5.2.3 Restarting Epsilon-Greedy . . . . . . . . . . . . . . . . . . . . . . 89
         4.5.2.4 Choosing a Starting Point . . . . . . . . . . . . . . . . . . . . . . 93
      4.5.3 Action Graph Traversal vs. Other Action Selectors . . . . . . . . . . . . 94
      4.5.4 Coordination Requirements . . . . . . . . . . . . . . . . . . . . . . . . 105
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
   5.1 Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
APPENDICES
Appendix A: Domain Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
List of Figures
1.1 Humans participating in multi-agent situations . . . . . . . . . . . . . . . . . . . 15
1.2 An agent interacting with an environment . . . . . . . . . . . . . . . . . . . . . . 17
1.3 Three independent agents interacting with an environment . . . . . . . . . . . . . 17
1.4 Three centrally controlled agents interacting with an environment. The central
controller here can be a central computer where all actions are computed together
or just a representation of a high-bandwidth communication channel through which
independent agents can come to a mutually agreeable course of action prior to acting. 18
2.1 A simple coordination graph with five agents. This graph represents the decomposition Q(S, A) = Q1(S, A1, A2) + Q2(S, A1, A3) + Q3(S, A2, A3) +
Q4 (S, A3 , A4 ) + Q5 (S, A4 , A1 ) + Q6 (S, A0 , A4 ) . . . . . . . . . . . . . . . . 29
3.1 The coordination graph from Figure 2.1 with concrete Q value functions. It represents each Qi,j(S, Ai, Aj) as a table for a specific state S and places the table
on the edge that links agent i and agent j. The cell at (row r, column c) in this
m × m table contains the value, Qi,j (S, r, c), that Qi,j contributes when Ai = r
and Aj = c. For this example, we have chosen two as the number of actions m
per agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Duplicate of Figure 3.1. The coordination graph from Figure 2.1 with concrete Q
value functions. It represents each Qi,j (S, Ai , Aj ) as a table for a specific state
S and places the table on the edge that links agent i and agent j. The cell at
(row r, column c) in this m × m table contains the value, Qi,j (S, r, c), that Qi,j
contributes when Ai = r and Aj = c. For this example, we have chosen two as
the number of actions m per agent. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 The structure of an action graph with two agents that each have three actions. . . . 58
3.4 An example of a greedy Action Graph Traversal running over a sample state with
two agents and randomly generated feature values. It is restricted to only two
agents and ε is set to zero for ease of understanding. It starts at (2,0) with a Q
value of 1.52, picks the highest neighbor at (2,4) with a Q value of 2.31, then goes
to (10,4). (10,4) has a lower Q value of 2.26, because (2,4) is a local optimum,
but (10,4) is still the highest neighbor. From (10,4), it proceeds to (10,7) with
a Q value of 2.42, then to (5,7) with a Q value of 2.28. Like (10,4), this is a
decrease, but the next iteration would lead back to (10,7), which is, in fact, the
global optimum. With restarts enabled, this would cause it to start again at a fully
random action. Without restarts, it will bounce between these two actions until it
runs out of time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 An example state in the predator-and-prey domain with sixteen agents and eight
prey (P: Predator, p: prey). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 The initial state in the Sepia domain with sixteen agents. Each team has sixteen
archers arranged in two staggered columns. . . . . . . . . . . . . . . . . . . . . . 68
4.3 Example coordination graphs with the same topology used in our testing . . . . . . 75
4.4 The impact of greed on the action picking quality of the ε-greedy successor in
Sepia with sixteen agents. Smaller numbers (at the top) are better, since they are
closest to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 The impact of greed on the action picking quality of the ε-greedy successor in
predator-and-prey with sixteen agents. Smaller numbers (at the top) are better,
since they are closest to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . 82
4.6 The impact of greed on the action picking quality of the Boltzmann successor in
Sepia with sixteen agents. Smaller numbers (at the top) are better, since they are
closest to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.7 The impact of greed on the action picking quality of the Boltzmann successor in
predator-and-prey with sixteen agents. Smaller numbers (at the top) are better,
since they are closest to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . 83
4.8 Learning results of predator-and-prey with eight agents for various Action Graph
Traversal successor strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.9 Learning results of predator-and-prey with eight agents for various Action Graph
Traversal successor strategies, zoomed in to show the differences among the best. . 84
4.10 Learning results of Sepia with eight agents for various Action Graph Traversal
successor strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.11 Learning results of predator-and-prey with sixteen agents for various Action Graph
Traversal successor strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.12 Learning results of predator-and-prey with sixteen agents for various Action Graph
Traversal successor strategies, zoomed in to show the differences among the best. . 86
4.13 Learning results of Sepia with sixteen agents for various Action Graph Traversal
successor strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.14 Learning results of Sepia with sixteen agents for various Action Graph Traversal
successor strategies, zoomed in to show the differences among the best. . . . . . . 87
4.15 Comparing the abilities of various Action Graph Traversal successor strategies to
select good actions in Sepia with sixteen agents. . . . . . . . . . . . . . . . . . . 88
4.16 Comparing the abilities of various Action Graph Traversal successor strategies to
select good actions in predator-and-prey with sixteen agents. . . . . . . . . . . . . 88
4.17 Results of Sepia with 16 agents showing ε-greedy AGT with 20 iterations with
and without restarts. Smaller numbers (at the top) are better, since they are closest
to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.18 Results of predator-and-prey with 16 agents showing ε-greedy AGT with 20 iterations with and without restarts. Smaller numbers (at the top) are better, since
they are closest to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.19 Results of Sepia with 16 agents showing ε-greedy AGT with 200 iterations with
and without restarts. Smaller numbers (at the top) are better, since they are closest
to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.20 Results of predator-and-prey with 16 agents showing ε-greedy AGT with 200
iterations with and without restarts. Smaller numbers (at the top) are better, since
they are closest to the best seen. . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.21 Results of predator-and-prey with 16 agents showing the effect of starting in different ways, then using the Greedy-With-Restarts Action Graph Traversal selector
with various numbers of iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.22 Results of Sepia with 16 agents showing the effect of starting in different ways,
then using the Greedy-With-Restarts Action Graph Traversal selector with various
numbers of iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.23 Results of Sepia with 32 agents showing the effect of starting in different ways,
then using the Greedy-With-Restarts Action Graph Traversal selector with various
numbers of iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.24 Results of predator-and-prey with eight agents for Action Graph Traversal vs.
Agent Elimination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.25 Results of predator-and-prey with eight agents for Action Graph Traversal vs.
Agent Elimination, zoomed in to show the differences among the best. . . . . . . . 95
4.26 Results of Sepia with eight agents for Action Graph Traversal vs. Agent Elimination. 95
4.27 Results of predator-and-prey with eight agents for Action Graph Traversal vs.
other approximate methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.28 Results of predator-and-prey with eight agents for Action Graph Traversal vs.
other approximate methods, zoomed in to show the differences among the best. . . 97
4.29 Results of Sepia with eight agents for Action Graph Traversal vs. other approximate methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.30 Results of predator-and-prey with sixteen agents for Action Graph Traversal vs.
other approximate methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.31 Results of predator-and-prey with sixteen agents for Action Graph Traversal vs.
other approximate methods, zoomed in to show the differences among the best. . . 99
4.32 Results of Sepia with sixteen agents for Action Graph Traversal vs. other approximate methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.33 Results of Sepia with sixteen agents for Action Graph Traversal vs. other approximate methods, zoomed in to show the differences among the best. . . . . . . . . . 100
4.34 Results of Sepia with thirty-two agents for Action Graph Traversal vs. other approximate methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.35 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for predator-and-prey with sixteen agents. . . . . . 101
4.36 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for predator-and-prey with sixteen agents, shown
in terms of average difference from the optimum value for the state. This is shown
in a modified “log scale” that shows minor differences among near-zero values
like a true log scale, but with an unseen gap in the y-axis so that we can also show
zero itself. Up is better, since it is the closest to the best. . . . . . . . . . . . . . . 102
4.37 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for Sepia with sixteen agents. . . . . . . . . . . . . 103
4.38 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for Sepia with thirty-two agents. . . . . . . . . . . 103
4.39 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for predator-and-prey with sixteen agents, shown
in terms of how frequently the selector finds the best action. . . . . . . . . . . . . 104
4.40 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for Sepia with sixteen agents, shown in terms of
how frequently the selector finds the best action. . . . . . . . . . . . . . . . . . . 106
4.41 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for Sepia with thirty-two agents, shown in terms of
how frequently the selector finds the best action. . . . . . . . . . . . . . . . . . . 106
4.42 The trade-off between values of actions picked and the amount of time required
for each of the action selectors for Sepia with thirty-two agents, shown in terms
of how frequently the selector finds the best action. This contains only the states
generated by a path using a random Q value function. . . . . . . . . . . . . . . . . 107
4.43 Results of predator-and-prey with eight agents at various coordination levels, showing how, at each level, more coordination is beneficial. . . . . . . . . . . . . . . . 107
4.44 A possible state in the predator-and-prey domain (P: Predator, p: prey). This is an
example of an advantage for coordination beyond pairs. Two of the agents should
deliberately choose to pursue a prey that is not the closest for the sole reason that
the other two agents will be pursuing it. . . . . . . . . . . . . . . . . . . . . . . . 108
List of Tables
3.1 The Q function Q1,2(S, A1, A2) from Figure 3.1. This is the basis of agent 2's
message to agent 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 The Q function Q2,3(S, A2, A3) from Figure 3.1. This is the basis of agent 2's
message to agent 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Max Plus Example: The messages after one iteration of Max Plus, organized by
recipient. Agent 2’s messages, for which we have shown the calculations, are bolded. 46
3.4 A Q function made from Q1,2(S, A1, A2) from Figure 3.1 and agent 3's message
to agent 2 (see Table 3.3). This is the basis of agent 2’s message to agent 1. . . . . 47
3.5 A Q function made from Q2,3(S, A2, A3) from Figure 3.1 and agent 1's message
to agent 2 (see Table 3.3). This is the basis of agent 2’s message to agent 3. . . . . 47
3.6 Max Plus Example: The messages after two iterations of Max Plus, organized by
recipient. Agent 2’s messages, for which we have shown the calculations, are bolded. 48
3.7 Max Plus Example: The messages after many iterations, organized by recipient. . . 48
3.8 Mean Field Example: Probabilities are initialized to 1/m . . . . . . . . . . . . . . 52
3.9 Mean Field Example: Calculated probabilities after the first iteration. . . . . . . . 54
3.10 Mean Field Example: Calculated probabilities after the second iteration. . . . . . . 55
3.11 Mean Field Example: Calculated probabilities after the third iteration. . . . . . . . 55
3.12 Mean Field Example: Calculated probabilities after many iterations lead to convergence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1 Parameters for learning experiments. ε and γ are the overall exploration rate and
the discount factor, respectively. Size is not a relevant parameter for Sepia because
units only move toward each other. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2 A comparison on the eight-agent Sepia domain between 300 iterations of Action
Graph Traversal using a Greedy-With-Restarts successor and Agent Elimination,
each with the agents coordinating with different numbers of other agents. These
numbers include a consistent 2.5-3.5 ms of overhead per step for tasks
other than action selection. There is a scaling in action selection as the graph
becomes more complete with more decomposed functions and each decomposed
function becomes more complex. Unfortunately, we were unable to run Agent
Elimination fully when coordinating with all seven other agents due to the large
increase in the time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
ACKNOWLEDGMENTS
I would like to extend my gratitude to my adviser Dr. Soumya Ray. Without his insights,
his encouragement, his perseverance, and (fittingly) his advice, I never would have been able
to do this. His part, as well as those of Tim Ernsberger and Feng Cao, in the conception and
implementation of the Sepia domain central to my experiments was essential.
I want to thank my coworkers, the many experienced developers, designers, and testers who
have helped me grow so much over this time from an inexperienced undergraduate into what I
am now.
I would like to thank Joseph Sosnowski and Peter Hedman for reading through my drafts
and showing me what I was not saying and helping me to say what I wanted to say.
Finally, I would like to thank my friends and family for their unrelenting support and guidance. They have taught me so much over the years and have been instrumental in my success.
Approximate Action Selection For Large, Coordinating,
Multiagent Systems
SCOTT T. SOSNOWSKI
Abstract
Many practical decision-making problems involve coordinating teams of agents. In our
work, we focus on the problem of coordinated action selection in reinforcement learning with
large stochastic multi-agent systems that are centrally controlled. Previous work has shown
how to formulate coordination as exact inference in a Markov network, but this becomes intractable for large teams of agents. We investigate the idea of “approximate coordination”, which
treats coordinated action selection as approximate inference in a Markov network. We look at a pursuit domain and a simplified real-time strategy game and find that in these situations, such approaches
are able to find good policies when exact approaches become intractable.
Chapter 1
Introduction
Multi-agent problems, in which independent actors interact, are plentiful. Humans participate
in them constantly as they go about their lives. In daily life, the other agents are typically other
people, but interactions with and between organizations, machines, warehouses, and many
other entities also occur. Despite their ubiquity, selecting the best course of action in a
multi-agent situation can be difficult.
Figure 1.1 Humans participating in multi-agent situations
In a single-agent problem, an agent interacts only with the environment. A learning agent
interacts with the environment through a simple feedback loop (see Figure 1.2). The agent,
observing the state of the environment, picks a course of action. These actions have an impact
on how the state of the environment changes. The agent observes this change and, in a few
special states, it may receive a positive or negative reward. After seeing this reward and how
the action impacts the state, the agent revises its understanding of the value of that state or
the state-action pair, learning to seek or avoid such actions in the future, in accordance with
the value. These steps repeat, allowing the agent to refine its understanding of the values.
Moreover, every time it visits the state, it has more knowledge about the value of the states it
could end up in. Over the course of many repetitions, information about states many actions
in the future will propagate to earlier states. Actions that give access to later states and actions
that are rewarded thereby become associated with that anticipated reward, allowing the agent
to look beyond the immediate benefit and consider the long term consequences of its decision.
Together, the agent's task of observing the environment's response to its actions and its task
of picking future actions based on that response form the core of the learning process; we
focus on the latter.
In a multi-agent problem, there may be many such agents. Without central control, each
will have its own independent feedback loop, as in Figure 1.3. With these separate loops, each
agent is on its own. Despite this, agents can still seize upon consistent behavior by the others
and come to rely on it in their own behavior, just as humans have come to rely on gravity. This
is possible because each other agent still acts on the environment. This leads to agents adopting
a sort of “implied” coordination, where they can work together without modeling each other at
all. Two agents on a grid that want to push a box to a particular place, for example, might come
to a strategy where one of them pushes the box along the X axis to where the other is waiting
to immediately start pushing along the Y axis, with no time wasted moving around the box to
push it in the new direction. The first agent knows only that if it pushes the box to a specific
point, it will eventually be rewarded. The second agent knows only that if it waits in a specific
place until a box comes and then pushes the box to another point, it will be rewarded. Despite
this mutual ignorance, they are able to act in a sophisticated and cooperative way.
Figure 1.2 An agent interacting with an environment (feedback loop: 1. Action, 2. State Transition, 3. Reward and Observation, 4. Learning)
Figure 1.3 Three independent agents interacting with an environment (each agent runs its own action, state transition, reward and observation, and learning loop)
In some situations, however, implied coordination is not enough. Humans often struggle
with just such a deficiency at an intersection without traffic lights, where two drivers approaching each
other at the same time both want to turn left. Each driver repeatedly starts and stops in reaction
to the other. Without asymmetry, this cannot be simply resolved by looking for differentiating
factors between the drivers’ situations. Without communication, the drivers are also unable to
collectively decide who has the right-of-way. It is this sort of situation that requires that each
agent’s course of action be evaluated not in a vacuum, but in the context of the actions of other
agents. A learning agent must expressly consider the other agents’ behavior to decide its own,
which is an “explicit” coordination.
Figure 1.4 Three centrally controlled agents interacting with an environment. The central
controller here can be a central computer where all actions are computed together or just a
representation of a high-bandwidth communication channel through which independent
agents can come to a mutually agreeable course of action prior to acting.
To allow for this, agents must be able to consult each other and pick actions together (as in
Figure 1.4). It is only with the introduction of communication that the “joint action” chosen by
the agents truly represents a joint consensus.
An exact resolution to this situation requires computation that grows exponentially with the number of agents that must
work together, and it rapidly becomes untenable to calculate with even moderate numbers of
agents. However, we show that approximation strategies can yield good empirical results, even
when scaled to large problems.
1.1 Contributions
We focus on the exponentially scaling problem of action selection during the process of
learning. To mitigate this formidable challenge, we examine the concept of “approximate
coordination”. By using a graph-based approach introduced in prior work (described in Section
2.4), we reduce the problem of coordinated action selection to one of inference in a Markov
network.
The contribution of this paper is the investigation of an exploration-based methodology for
approximate action selection and a comparison to existing approximate methods. This method
treats action selection as a problem of inference on the Markov network, inspired by Gibbs
sampling [5], which walks through the Markov network as a graph of conditional probabilities,
inferring overall probability based on the frequency with which it sees (in our case) a joint
action. It also takes inspiration from local searches that mutate a candidate solution to investigate other similar possibilities. In this view, the action space is treated as a solution space
to be explored. We present an algorithm that, with appropriate parameterization, joins these
two seemingly disparate approaches. We found that it is able to rapidly arrive at good, even
optimal, actions, despite the great size of the exponential action space, of which it can cover
only a negligible fraction. During the remainder of this paper, we will show how advanced
representations and approximate action selection can make the problem of action selection
easier.
Chapter 2
Background and Related Work
Several concepts serve as the groundwork for our contribution. A multi-agent form of
the Markov Decision Process (see Section 2.1) formalism provides an environment. On this
environment, our agents use SARSA, a type of reinforcement learning related to Q-learning, to
discover effective strategies (see Section 2.2). The addition of state approximation (see Section
2.3) speeds up the learning and allows it to proceed on large problems. The agents learn to
work together in a pattern described by a coordination graph (see Section 2.4). Together, these
concepts underlie our work.
2.1 Markov Decision Processes
One of the foundations of our work rests on the single-agent Markov Decision Process
(MDP) and its extension, the Fully Cooperative Multi-agent Markov Decision Process (described in Section 2.1.1). The basic MDP, shown in Definition 2.1, provides a representation
of the world in which the agent lives and describes how the agent interacts with that world. It
drives the feedback loop of Figure 1.2. It defines the possible states that the agent can observe,
the available actions that the agent can send to the environment, the rewards that an agent receives when an action is taken in a particular state, and the probabilities with which one state
proceeds to the next when an action is taken.
The MDP is separate from the agent, but can define things that might be considered to be
part of what a human agent generally does, such as choosing goals or interpreting what it sees.
It also includes physical parts of the agent (for example, the position of an agent’s body) as part
of the environment. The agent is only responsible for associating reward with state and picking
actions. A state in the MDP may, therefore, include raw observation, processed perceptions,
or memory. Because the state is the only way for the agent to receive information about the
environment, it must contain everything relevant for the agent to make a decision. Similarly,
choosing an action is the sole way for the agent to affect its surroundings and must therefore
include everything that the agent is capable of doing to impact the state. The reward function
gives the agent a goal to pursue. The rewards play a vital part in any learning system; without
feedback for incentive or punishment, the agent cannot learn. A discounting factor determines
how farsighted the agents should be in pursuing reward, further refining the agent’s goal. It is
also vital in keeping the future reward well-defined and therefore learnable, even in the face
of an unbounded number of future states. Choices like how much processing to do on a state,
how abstract to make the actions (moving a leg vs. going to a location), and what to reward
can result in very different MDP representations for a single scenario, though this is outside of
the scope of learning.
The MDP is characterized by the Markov property, which guarantees that each transition
depends only on the current state (and the action to be taken in that state), not on the prior
path taken to reach that state. The transition is also time-independent, so that the dynamics of
the transition do not change with the passage of time, only with the state. Some domains may
have terminal states which are inescapable and in which nothing interesting happens. When
designing experiments with this kind of domain, one can safely stop execution when it reaches
a terminal state and start again, splitting the interaction into a number of episodes.
Definition 2.1. A Markov Decision Process is:
• The state space (the set of all possible states for the process): S
• The action space (the set of all possible actions for the agent): A
• The probabilities of transitions between the states when actions are undertaken:
P(S ∈ S, A ∈ A, S′ ∈ S) = Pr(S′ | S, A)
• The reward feedback that defines the agent’s goals: R(S ∈ S, A ∈ A) → r ∈ R
• The discounting factor for future rewards: γ ∈ [0, 1)
2.1.1 Multi-agent Extension of the MDP
The standard MDP formalism is capable of representing a multi-agent problem by expanding the state space to accommodate the agents and expanding the action space to be the Cartesian product of each individual agent’s actions, with similar changes to the transition probabilities and rewards. Such a representation does not include any understanding by the MDP that
its state and actions are made up of contributions from individual agents. In order to better describe the choices and interactions of multiple distinct agents, we use an extension that models
them explicitly.
Definition 2.2. A Fully Cooperative Multi-agent Markov Decision Process (MMDP) is:
• The state space (the set of all possible states for the process): S
• The number of agents: n
• The joint action space (the set of sets of all possible actions for each agent): A = {Ai | i = 0, . . . , n − 1}
• The probabilities of transitions between the states when joint actions are undertaken:
P(S ∈ S, A ∈ A, S′ ∈ S) = Pr(S′ | S, A)
• The reward feedback that defines the agents’ goals: R(S ∈ S, A ∈ A) → r ∈ R
• The discounting factor for future rewards: γ ∈ [0, 1)
The Fully Cooperative Multi-agent Markov Decision Process in Definition 2.2 expands the
MDP from a single agent to many agents, adding a number of agents and adjusting the other
components accordingly. The state space, while unchanged in concept, will generally increase
dramatically in size to accommodate the permutations of placement and sensory input for the
additional agents. The action space becomes a joint action space made up of one individual
action space per agent, each one akin to the MDP's concept of an action space. The joint action
can be considered a mapped set containing one individual action for each agent. We
are focusing on multi-agent problems in which all agents are working toward the same goal
with no ulterior motives, so there is a single reward feedback for the whole team of agents,
which depends on the joint action of the whole team.
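To make the structure of Definition 2.2 concrete, the following sketch shows one minimal way such an MMDP could be held in code. It is purely illustrative; the class and field names are hypothetical and do not come from the implementation used in our experiments.

from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple

State = Tuple[int, ...]          # placeholder state encoding (illustrative)
JointAction = Tuple[int, ...]    # one individual action index per agent

@dataclass
class MultiAgentMDP:
    """Minimal container mirroring Definition 2.2 (hypothetical names)."""
    states: List[State]                                             # S
    n_agents: int                                                   # n
    agent_actions: List[List[int]]                                  # A_i for each agent i
    transition: Callable[[State, JointAction], Dict[State, float]]  # Pr(S' | S, A)
    reward: Callable[[State, JointAction], float]                   # shared team reward R(S, A)
    gamma: float                                                    # discount factor in [0, 1)

    def joint_actions(self):
        """The joint action space: the Cartesian product of the agents' individual action sets."""
        return product(*self.agent_actions)

Even in this toy form, joint_actions() enumerates m^n joint actions when each of the n agents has m actions, which is the growth that motivates the approximate selectors of Chapter 3.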
2.2 Reinforcement Learning: Q-Learning and SARSA
Reinforcement learning is another basis of our work. It is a set of
techniques by which agents acquire the capability to navigate an environment through repeated
exposure to that environment. Through trial and error over many samples, the agents learn
how to react to each state. These learned reactions take the form of a policy, as expressed in
Definition 2.3.
Definition 2.3. A policy π : S → A is the function that maps from the state to the joint action
to be taken in that state. While every MMDP has a deterministic policy that is optimal, agents
may also adopt probabilistic policies, either from a desire to explore or from an inability to
achieve determinism. When we use a policy, it is generally a probabilistic policy π : S → [0, 1]^{|A|},
which maps to a probability distribution over the joint action space instead of to a
single joint action. This probability distribution describes how likely the agents in a state are
to choose any particular joint action.
The goal of reinforcement learning is to discover an “optimal” policy that always takes the
joint action that will maximize the expected value of all future rewards, discounted based on
how far in the future they are (Equation 2.1). We refer to this expected cumulative discounted
reward as the “value” of the state. The value of a state can only be expressed with reference
to a policy, since it is the policy that determines the actions, and through that, the rewards and
future states.
V^\pi(S^t) = E\left[\,\sum_{c=0}^{\infty} \gamma^c\, R(S^{t+c}, \pi(S^{t+c}))\right] \qquad (2.1)
The value may also be expressed in terms of both state and action. In this form, shown as a
sum in Equation 2.2 and recursively in Equation 2.3, we call it the Q value. The optimal policy
is reached when a Q value function is found that satisfies Equation 2.4, which is a form of the
Bellman equation[2]. Policies can be learned by attempting to directly optimize the value or
the policy by iteratively recalculating Q values. This is referred to as offline learning, because
it does not rely on an active, running instance of the MMDP. Online algorithms, on the other
hand, get samples of states, actions, and rewards by repeatedly interacting with an MMDP,
moving from state to state according to the transition function.
Q^\pi(S, A) = R(S, A) + \gamma R(S', \pi(S')) + \gamma^2 R(S'', \pi(S'')) + \ldots \qquad (2.2)

Q^\pi(S, A) = R(S, A) + \gamma\, Q^\pi(S', \pi(S')) \qquad (2.3)

Q^{\pi^*}(S, A) = R(S, A) + \gamma \max_{A' \in \mathcal{A}} Q^{\pi^*}(S', A') \qquad (2.4)
Q-learning[20] (see Algorithm 2.1) is an online reinforcement learning algorithm that attempts to find the optimal Q value functions by progressively updating its understanding of
the Q values. It is model-free, meaning that it neither has nor attempts to learn the transition
function or the reward function, counting instead on incorporating their distributions into its
understanding of the Q values via the samples. These samples are drawn from the feedback
between the MMDP’s transitions and the Q-learning agents choosing joint actions based on
prior learning. At each state, the algorithm picks an optimizing joint action, then feeds that
and the resultant state into the update function on line 7. The update includes an error term,
r^t + γ max_{B∈A} Q(S^{t+1}, B) − Q(S^t, A^t): the difference between the return that was “observed” (the
reward received by the agents plus the discounted value of the best joint action in the observed
next state, r^t + γ max_{B∈A} Q(S^{t+1}, B)) and the return that was anticipated (Q(S^t, A^t)). Thus, no
change occurs to the estimation if the previous estimation was correct. If the observed value is
higher than anticipated, the term will be positive, so it will anticipate a higher value for that action
the next time it is in that state. If the observed value is lower than anticipated, then it will
anticipate a lower value. Because we
do not want to be too trusting of our current estimation of the value of the later state, we do not
set the anticipated value to be the observed value; we just nudge it in that direction. A learning
rate parameter α ∈ (0, 1) (also on line 7) defines how strong the nudge is.
Algorithm 2.1 Q-Learning
Require: π(S): Agent's policy function
Require: P(S, A, S′): MDP's transition function
Require: R(S, A): MDP's reward function
1: S^0 ← Initial State
2: t ← 0
3: while S^t ∉ Terminal States do
4:    A^t ← π(S^t)
5:    r^t ← R(S^t, A^t)
6:    S^{t+1} ← sample from P(S^t, A^t, S^{t+1})
7:    Q(S^t, A^t) ← Q(S^t, A^t) + α(r^t + γ max_{B∈A} Q(S^{t+1}, B) − Q(S^t, A^t))
8:    t ← t + 1
9: end while
By generally choosing actions that it believes to have high values, the algorithm is able to
gain greater experience in the more valuable states, where it should spend the bulk of its time.
Such a policy, where the agents always choose the joint action that they believe to have the
highest value, is referred to as the “greedy” policy. However, to ensure that an optimal policy is
eventually reached, the greedy exploitation of what has already been learned must be balanced
by exploration which actively seeks experiences that may overturn current beliefs. The GLIE
(Greedy in the Limit of Infinite Exploration) condition fulfills the need for exploration and
allows Q-learning to find the optimal policy. GLIE requires that, over an infinite horizon, each
state-action pair can be visited an unbounded number of times and the action selection becomes
completely greedy. This can be satisfied by ε-greedy exploration, which takes a random action
with probability ε and chooses the optimal calculated action with the remaining probability
1 − ε, as long as the ε value is allowed to decay to zero. This ε-greedy approach is the policy
in line 4 of Algorithm 2.1.
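As a concrete illustration of Algorithm 2.1, and of the ε-greedy policy used on line 4, the sketch below implements the action choice and the line-7 update for a small, enumerable action space with dictionary-based Q storage. The names and parameter values are hypothetical; this is not the implementation used in our experiments.

import random
from collections import defaultdict

Q = defaultdict(float)                  # Q[(state, action)] -> current value estimate
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # illustrative parameter values

def epsilon_greedy(state, actions):
    """Line 4 of Algorithm 2.1: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions, terminal=False):
    """Line 7 of Algorithm 2.1: nudge Q(s, a) toward the observed return by a factor of alpha."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])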
Algorithm 2.2 SARSA Q-Learning Variant
Require: π(S): Agent's policy function
Require: P(S, A, S′): MDP's transition function
Require: R(S, A): MDP's reward function
1: S^0 ← Initial State
2: t ← 0
3: while S^t ∉ Terminal States do
4:    A^t ← π(S^t)
5:    if t ≠ 0 then {SARSA waits to see the actual action, updating the previous state's Q}
6:       Q(S^{t−1}, A^{t−1}) ← Q(S^{t−1}, A^{t−1}) + α(r^{t−1} + γQ(S^t, A^t) − Q(S^{t−1}, A^{t−1}))
7:    end if
8:    r^t ← R(S^t, A^t)
9:    S^{t+1} ← sample from P(S^t, A^t, S^{t+1})
10:   t ← t + 1
11: end while
12: Q(S^{t−1}, A^{t−1}) ← Q(S^{t−1}, A^{t−1}) + α(r^{t−1} − Q(S^{t−1}, A^{t−1})) {Final state's Q update}
We use the SARSA (State, Action, Reward, State, Action) variant of Q-learning, shown in
Algorithm 2.2. It differs from Q-learning only in that it uses on-policy learning: it calculates Q
values on line 6 (and line 12) using the actual joint action that was taken by the policy, rather
than the maximizing joint action. This means it includes the effect of exploration in its value
computation, causing it to be more cautious about states that have actions with both very high
and very low rewards, since exploration will cause it to sometimes pick those very low rewards.
By contrast, Q-learning values actions under the assumption that it will always be able to use
the optimal action in all future states. SARSA is no less capable of yielding the optimal policy
than Q-learning. In situations where the action that maximizes Q (S, A) is difficult to find,
SARSA will also be slightly more efficient by not needing to compute the maximization when
it explores.
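For comparison, the on-policy SARSA update on line 6 of Algorithm 2.2 replaces the maximization over next actions with the action the policy actually chose in the next state. A minimal sketch, again with hypothetical names:

from collections import defaultdict

Q = defaultdict(float)
alpha, gamma = 0.1, 0.9

def sarsa_update(s, a, r, s_next, a_next, terminal=False):
    """On-policy update: uses the action actually taken in s_next rather than the maximizing
    action, so the effects of exploration are folded into the learned values."""
    next_value = 0.0 if terminal else Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])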
2.3 State Approximation
A challenging aspect of reinforcement learning as described in Section 2.2 is the need to
learn separate values for each state. As complexity is added to the state, the size of the state
space increases exponentially. For example, in a world of size x × y filled with n people, each
with integer coordinates, the state space must be able to describe the position of each. Every
permutation of positions becomes a unique state, yielding x^n y^n states (with x = y = 10 and
n = 8, that is already 10^16 states). The exponential growth, which has been referred
to as the curse of dimensionality, causes problems for Q-learning. The process of Q-learning
relies on the agent revisiting each state repeatedly and choosing each joint action there many
times, calculating a Q value for each permutation of state and joint action. As the state size
increases, it can easily become infeasible to do all this visitation in a reasonable amount of
time or even to store the agent’s understanding of all these Q values.
In order to continue learning by experience despite the exponential number of states, some
form of approximation is needed to update groups of similar states at the same time. This can
be done by learning according to a vector of features, f, of the state and joint action, rather than the state
and joint action directly[13]. The features have values based on the state and joint action and
carry information about select salient aspects of that state-action pair, replacing the pair for the
purpose of representing the values. These features may be such things as a number of nearby
accomplices to an agent’s game piece or, more commonly, binary features like whether a target
is close by or if a game piece would walk into a boundary if the joint action were taken. The
features we use are described in Section A.2, but require an understanding of the domains in
Section 4.1 and of how the coordination graphs in Section 2.4 interact with feature approximation.
The agents, instead of learning values for the state-action pair, learn weights associated with
the features of that state-action pair.
In the simplest feature approximation (shown in Equation 2.5), the weights θ are the coefficients of a linear combination of the features. One feature is set to always be 1, to absorb any
consistent bias in the reward. The introduction of this approximation also requires that the
normal Q-learning value update (line 7 of Algorithm 2.1) be replaced with Equation 2.6.
The SARSA update (line 6 of Algorithm 2.2) is similarly changed to Equation 2.7 (line 12 is
likewise altered), differing slightly because it occurs on the next step, when t is one higher.
Q(S, A) = \vec{\theta} \cdot \vec{f}(S, A) \qquad (2.5)

\vec{\theta}^{\,t+1} \leftarrow \vec{\theta}^{\,t} + \alpha\left(r^t + \gamma \max_{B \in \mathcal{A}} Q(S^{t+1}, B) - Q(S^t, A^t)\right)\vec{f}(S^t, A^t) \qquad (2.6)

\vec{\theta}^{\,t} \leftarrow \vec{\theta}^{\,t-1} + \alpha\left(r^{t-1} + \gamma Q(S^t, A^t) - Q(S^{t-1}, A^{t-1})\right)\vec{f}(S^{t-1}, A^{t-1}) \qquad (2.7)
Because features have similar values across similar states and dissimilar values in dissimilar states, an update to one state will impact all other states. An unexpectedly high reward
received on a state-action pair with a positive feature, for example, will increase the value of
all state-action pairs where that feature is positive, but decrease the value of all state-action pairs where that feature is negative. This allows the learning procedure to assign values for a
state-action pair without ever having seen the state or ever having taken that joint action, just
by its similarities to and differences from others that have been explored. Because of this, the
time it takes to learn is drastically reduced.
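A minimal sketch of the linear approximation in Equation 2.5 and the update in Equation 2.6 follows. The feature function here is a stand-in with made-up features, not one of the domain features described in Section A.2, and the parameter values are arbitrary.

import numpy as np

theta = np.zeros(3)   # one weight per feature; feature 0 is the constant bias feature

def features(state, joint_action):
    """Hypothetical feature vector f(S, A); real features are domain-specific."""
    return np.array([1.0,                                      # bias feature, always 1
                     float(len(joint_action)),                 # placeholder numeric feature
                     float(hash((state, joint_action)) % 2)])  # placeholder binary feature

def q_value(state, joint_action):
    """Equation 2.5: Q(S, A) is the dot product of the weights and the features."""
    return float(theta @ features(state, joint_action))

def td_update(s, a, r, s_next, candidate_actions, alpha=0.05, gamma=0.9):
    """Equation 2.6: move the weights along the feature vector by the scaled TD error."""
    global theta
    best_next = max(q_value(s_next, b) for b in candidate_actions)
    td_error = r + gamma * best_next - q_value(s, a)
    theta = theta + alpha * td_error * features(s, a)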
Feature approximation could be used losslessly by having an identity feature for each state-action pair (i.e., binary features that say “this is state S and joint action A” and are 1 only for
that pair). This retains the state-by-state and action-by-action specificity that exact Q value
functions have, which is required to describe the most complicated possible Q value functions.
It does not, however, allow multiple states to be learned at the same time. With well chosen
features, the loss in descriptiveness may be kept to a minimum while speeding the learning
process. Picking the features is a task that is not normally part of the learning process. It is
difficult to automate, as it requires a preexisting knowledge of the salient aspects of the state.
As a result, it is typically done manually by a human expert.
2.4 Coordination Graphs
Just as the state space becomes exponential and must be approximated, so too does the
action space. To reduce this, one must relax the premise, implicit in central control, that every agent
coordinates with all other agents at all times. When making decisions, the agent may only
want to consider in detail the actions of a much smaller number of other agents.
Figure 2.1 A simple coordination graph with five agents (Agent 0 through Agent 4). This graph
represents the decomposition Q(S, A) = Q1(S, A1, A2) + Q2(S, A1, A3) + Q3(S, A2, A3) +
Q4(S, A3, A4) + Q5(S, A4, A1) + Q6(S, A0, A4)
We represent this reduced coordination with a coordination graph[6][7], such as the one
shown in Figure 2.1. The coordination graph is an undirected graph with a node for each agent
and p edges connecting agents that must coordinate. For each edge between agent i and j,
there is a Q value function that represents the contribution of that particular pair of agents. We
vary the notation depending on the needs of the equation or algorithm and refer to this function
either by the agents in it, as Qi,j , or by an iterator, as Qh ; we always pass the state S ∈ S, but
depending again on the needs of the algorithm, we may pass it the pair of individual actions
(Ai ∈ Ai and Aj ∈ Aj ), a full joint action (A ∈ A), or even a partial joint action containing
the two individual actions and some number of other agents’ actions; regardless of what form
we use, however, it is always the same function using the state and the two individual actions
to compute the Q value. The coordination graph corresponds to a decomposition in which
these pairwise Q value functions sum up to the full Q value function as per Equation 2.8. This
decomposition is generally a lossy approximation. Decomposing into pairwise functions is
unable to represent all possible functions of the actions of all n agents. Further losses may be seen unless the agents'
decisions are completely independent in all pairs of agents that do not share an edge, though
these losses can be minimized or eliminated with an appropriate coordination graph.
Q(S, A) = \sum_{h=1}^{p} Q_h(S, A) \qquad (2.8)
A coordination graph is a type of Markov network. A Markov network is an undirected
graphical representation of a multivariate probability function. In such a representation, each
of the nodes is a variable and each edge is associated with a contribution to the full probability.
The neighbors also represent a dependence relationship: the probability of a specific value for
a variable will generally depend on the values of its neighbors. A coordination graph works
the same way, but the variables are called agents and the values that they can take are called
actions. Maximizing the Q value of these actions is equivalent to the maximization of the
probability on a standard Markov network, a task of probabilistic inference.
The coordination graph decomposition can be used alongside the feature approximation described in Section 2.3 by applying a separate linear approximation to each of the
pairwise Q functions. The features for each pairwise function can be a mix of features relevant
to each individual agent and its individual action and features about the pair of agents and the
pair of actions they can take. The features thereby provide a means to restrict the decomposed
Q value functions to only use a subset of the state concerning the two agents in the function.
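The decomposition in Equation 2.8 can be stored directly as a set of graph edges, each owning its own pairwise Q function. The sketch below is a hypothetical illustration of that bookkeeping, not the data structure used in our experiments; evaluating the full Q value of a joint action is then a sum over the edges.

from typing import Callable, Dict, Tuple

# Each edge (i, j) of the coordination graph carries a pairwise function Q_ij(S, A_i, A_j).
PairwiseQ = Callable[[object, int, int], float]

class CoordinationGraph:
    def __init__(self, edges: Dict[Tuple[int, int], PairwiseQ]):
        self.edges = edges

    def q_value(self, state, joint_action):
        """Equation 2.8: the full Q value is the sum of the pairwise contributions."""
        return sum(q_ij(state, joint_action[i], joint_action[j])
                   for (i, j), q_ij in self.edges.items())

    def neighbors(self, agent):
        """The agents sharing a decomposed Q value function with `agent`."""
        linked = set()
        for (i, j) in self.edges:
            if i == agent:
                linked.add(j)
            elif j == agent:
                linked.add(i)
        return linked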
2.5 Related Work
Our goal is to investigate known action selection algorithms and determine which perform best.
Approximate action selection algorithms are of particular interest due to their potential to scale.
All of the techniques that we have described are in service of this investigation and our own
addition to the ranks of the action selectors. In the course of our research, we encountered
others that use techniques similar to our own.
One parallel thread in coordination graphs is utile coordination [11]. In utile coordination,
the coordination graph is assembled during the learning process by adding state-specific “value
rules”. The value rules specify the conditions (in terms of states and actions) to which they
apply and the value they contribute when those conditions are met. Value rules are added as
it becomes clear that an adjustment is needed. This process builds the coordination graph on
an as-needed basis to ensure not only that all necessary coordination can take place, but also
that valuable time is not wasted computing coordinated actions when they aren’t needed. This
approach alters the decomposed Q functions, but does not change how they are used, so it
is fully compatible with our own approximate action selector, and their effects are cumulative. In
contrast, our context-independent coordination graph is fixed before learning and is the same
for all states. We made this choice to reduce the complexity of the model.
Zhang and Lesser [22] also attempt to learn a coordination structure. They use “coordination sets”. These represent a broader concept than coordination graphs, allowing unidirectional
coordination and functions of more than two agents. Each agent i tracks the decomposed Q
value functions for all possible groups it can coordinate with, as well as the probabilities of
each of their partial joint actions conditional on agent i’s action. The scaling of this is kept
in check through the use of domains that heavily restrict which agents can meaningfully interact. The agent uses these decomposed Q value functions in conjunction with the probabilities
to compute its own values. The probabilities also provide the ability for the agents to give a
meaningful estimate of how much potential value they might be missing out on by not coordinating (or how much they would lose by no longer coordinating), allowing them to modify
the coordination structure as the need arises. This, again, contrasts with our static coordination
structure, but also differs from it in other ways that make it less suitable for our situation. The
ability to use unidirectional coordination is less useful with our centralized control. With centralized control, there is little reason not to reuse a value calculated for one direction for the
other. Decomposing the Q value into functions of more than two agents allows it to be more
descriptive, reducing losses from the approximation. However, in domains such as ours where
any subset of agents is capable of meaningful interaction, an attempt to track rewards and probabilities for all possible groups would be infeasible. Instead, we would be left with a choice of
which groups to exclude from consideration, which is not much different from just using a static
coordination structure as we did.
Another approach, ad hoc teamwork[16][1], considers the problem of cooperating in a quite
different way. When we approach cooperation, we try to train a team of agents to work together
to solve a problem. Ad hoc teamwork, by contrast, takes the path of developing an individual
agent that is a flexible teammate. When placed with agents that it has no experience with,
this kind of agent is capable of considering the capabilities and tendencies of those agents and
picking actions that are good in that situation. That is a powerful skill in many situations that humans find themselves in, but such an agent may struggle to overcome problems that genuinely
require coordination, just as an agent trained as part of a team may flounder when that team
is replaced by a group of unknown agents. The two approaches, though both aiming to create
good teammates, are applicable to very different circumstances.
The Decentralized Markov Decision Process (DEC-MDP) concept developed by Bernstein
et al.[3], like our own fully coordinated MMDP, expands the concept of an MDP to have
multiple agents. However, because these agents can be provided with individualized reward
signals, it differs from our fully coordinated MMDP, which has a single shared reward signal. This makes
the DEC-MDP a generalization of the fully coordinated MMDP. By leaving the individual
rewards the same, it is able to represent anything that a fully coordinated MMDP can, but it
can also make the rewards different to help in attributing rewards to agents or even to make a
domain in which the agents compete. They further generalize this to a Decentralized Partially
Observable Markov Decision Process (DEC-POMDP), to allow each agent different localized
views of the state. These features make the DEC-MDP and especially the DEC-POMDP ideal
for situations that, because of costly inter-agent communication, demand decentralized control.
This contrasts with the centralized control that we use in order to focus on the computational
limits of action selection.
Gast et al. [4] attempt to tackle the problem of increasing the number of agents in a very
different way. Like us, they use a mean field approximation for action selection (our use of
a mean field approximation is explained in Section 3.3). However, rather than adapting the
mean field methods to work in an MDP, they adapt the MDP to be suitable for a mean field
method. They do this by defining an “occupancy measure” of the number of agents in a specific
state. Using this measure, they define a representation of the evolution of the measure over
multiple time steps, and a value function expressed in terms of the measure. Using these,
the occupancy measure can be approximated by its mean field limit. This affords them far
better scaling. However, they must make assumptions of homogeneity in the construction of
an occupancy measure. In contrast, our emphasis on coordination is based on the expectation
of heterogeneous agents with complex interactions that rely on their identity.
Chapter 3
Action Selection
One of the core tasks of a learning agent is leveraging what it has learned into a course of action, turning
new input into new output. Greedy action selection consists of a single maximization, as shown
in Equation 3.1. While this may be a trivial maximization in a small environment with few
agents, scaling up the task makes action selection into a challenge in its own right.
A∗ = arg max Q (S, A)
(3.1)
A∈A
With an online algorithm like Q-learning or SARSA, action selection gains even greater
importance, as the learning process relies on action selection to guide it to high-reward portions of the state space. An action selection method on a large multi-agent system needs to be
fast and scalable, but cannot sacrifice the ability to choose high-quality actions. There are a
number of considerations in the evaluation of action selectors. Quality of results is extremely
important, with exact selectors naturally finding the optimal actions. Equally important is the
time complexity, which, if sufficiently poor, can immediately exclude a selector from consideration. Other factors include space complexity and compatibility with parallelization. As the
time constraint is the primary reason to use an approximate action selector, it is also preferable
that the selector be adjustable to whatever time is available, such as through the anytime property defined in Definition 3.1. Such a selector can exploit the full calculation window without
risking not providing an action if the calculation takes too long.
Definition 3.1. Anytime property: An action selector has the anytime property if it can be
stopped partway through its execution and provide an interim solution. To satisfy this property,
a method must maintain a complete proposed solution at all times.
We will discuss several existing action selection methods before introducing our own.
These methods represent the state-of-the-art in action selection. They approach the problem
very differently, but they have the potential to go beyond what is possible with a naïve brute-force
algorithm.
In characterizing the action selection methods, we use the following terms and notation:
• n: The number of agents, from the MMDP definition.
• m: The number of actions per agent. This may vary by agent, with mi actions for agent
i. For notational purposes, we typically assume that each agent has m actions.
• Qh : One of the decomposed Q value functions that are summed to make the full Q value
function Q. These are numbered from 1 to p.
• k: The connectivity of the coordination graph. This is a measure of how many edges
connect each agent.
• A partial joint action with respect to a set of agents is a member of a smaller joint action
space consisting only of individual actions corresponding to the set of agents.
• agents\i is the set of agents without agent i.
• A\i represents a partial joint action consisting of only the actions from the agents in
agents\i.
• Γ (node): The neighbors of a node on a graph structure. This is the set of all other nodes
that have an edge connecting them to the specified node.
• ΓC(i ∈ agents) : agents → 2^agents: A function mapping from an agent i to all agents sharing a decomposed Q value function with i. These are its neighbors on the coordination graph.
• ΓA(A ∈ A): A function mapping from a joint action A to the set of other joint actions that share all but one individual action. This is also the neighbors on the Action Graph, a graph of joint actions described in more detail in Section 3.4.1. The set notation is {B | B ∈ A & B ≠ A & ∃i A\i = B\i}.
3.1 Exact Action Selectors
Exact action selectors will always pick the best possible action, so they can only be equaled,
never surpassed.
3.1.1 Brute Force Action Selector
The simplest action selection algorithm is, of course, an exhaustive search of the action
space. As each permutation of actions for each agent must be checked and each joint action
takes nk/2 operations to add together the decomposed Q value functions into the full Q value, this has time
complexity $O(m^n nk)$, which is woefully unacceptable on all but the smallest problems.
Algorithm 3.1 Brute Force Algorithm
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
Require: A = {A_i}_1^n: The action space
 1: Initialize bestA to the first joint action
 2: for each A in A do
 3:   if Q(S, bestA) < Q(S, A) then
 4:     bestA ← A
 5:   end if
 6: end for
 7: return bestA
The Brute Force algorithm benefits slightly from the decomposed Q value functions of the
coordination graph from Section 2.4. Because the parameters to the decomposed functions
are heavily shared between similar actions (Qi,j (S, Ai , Aj ) won’t change value unless agents
i or j have different actions), they can be reused in the value calculations of different joint
actions. Two joint actions with the same individual actions for all but one agent will differ
only by the k decomposed Q value functions that that agent is part of, one for each neighbor
on the coordination graph. An order where all consecutively computed joint actions differ
by only a single agent’s individual action, such as the order defined by an m-ary Gray code,
can guarantee that only k operations will need to be done for each joint action computation.
This improves the time complexity to $O(m^n k)$. This is, of course, not nearly enough to allow
Brute Force to be a viable action selector, but the same concept can be used whenever we must
compare Q values while varying the individual actions of several agents.
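As a concrete illustration, the following Python sketch enumerates the joint action space over a pairwise-decomposed Q function. The dictionary-of-edge-tables representation, the helper names, and the toy coordination graph are assumptions made purely for illustration; they are not the implementation used in this thesis.

# A minimal sketch of Brute Force selection over a pairwise-decomposed Q.
import itertools
import random

random.seed(0)
n, m = 4, 3                                   # agents, actions per agent
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # a hypothetical coordination graph
# Q[(i, j)][a_i][a_j] = Q_ij(S, a_i, a_j) for the (fixed) current state S
Q = {e: [[random.uniform(-1, 1) for _ in range(m)] for _ in range(m)] for e in edges}

def q_value(joint_action):
    """Sum the decomposed Q value functions for one joint action."""
    return sum(Q[(i, j)][joint_action[i]][joint_action[j]] for (i, j) in edges)

def brute_force_select():
    """Exhaustively score every joint action: O(m^n) evaluations."""
    best_a, best_q = None, float("-inf")
    for a in itertools.product(range(m), repeat=n):
        q = q_value(a)
        if q > best_q:
            best_a, best_q = a, q
    return best_a, best_q

print(brute_force_select())

A Gray-code-style ordering of the enumeration in brute_force_select, as described above, would let each consecutive evaluation reuse all but roughly k of the edge lookups.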
In the most general case, each joint action's value shares no component with, and has no correlation to,
any other joint action's value, making the Brute Force algorithm impossible to improve on. However,
the coordination graph approximation constrains the problem in a predictable way, which we
can take advantage of with more specialized solutions.
3.1.2 Agent Elimination Action Selector
Agent Elimination (called Variable Elimination by Guestrin et al.[6]) is an exact action
selector that was presented in conjunction with coordination graphs as a way to take advantage
of the structure to speed up selection.
Instead of treating the full Q value function as a single function of all agents’ actions and
optimizing that function, the Agent Elimination method embraces the decomposition. One
agent at a time, it rewrites the decomposed Q value functions containing that agent, combining
them into single functions with that agent’s contribution maximized.
In order to demonstrate Agent Elimination (shown in Algorithm 3.2), we use an example
based on Figure 2.1. The maximization of the value of this coordination graph on a given state
S is expressed in Equation 3.2.
Algorithm 3.2 Agent Elimination
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
Require: agents: The set of agents
Require: A = {A_i}_1^n: The action space
 1: F ← {Q_h}_{h=1}^{p}  {F starts as Q, decomposed}
 2: for i ← 1 . . . |agents| do {Eliminate agents one at a time}
 3:   neighbors ← ∅
 4:   neighborFunctions ← ∅
 5:   for F_itr ∈ F do {Find functions with agent i and save other agents in them}
 6:     if i ∈ agents in F_itr then
 7:       neighbors ← neighbors ∪ (agents in F_itr \ i)
 8:       neighborFunctions ← neighborFunctions ∪ F_itr
 9:     end if
10:   end for
11:   for each partial joint action B using the agents in neighbors do {Find best responses}
12:     e_i(B) ← max_{A_i ∈ A_i} Σ_{F_itr ∈ neighborFunctions} F_itr(B ∪ A_i)
13:     responses_i(B) ← argmax_{A_i ∈ A_i} Σ_{F_itr ∈ neighborFunctions} F_itr(B ∪ A_i)
14:   end for
15:   F ← F \ neighborFunctions
16:   F ← F ∪ e_i
17: end for
18: bestA ← ∅ {Build best action as a set}
19: for i ← |agents| . . . 1 do {Start with the parameterless response of the last agent eliminated; use responses to known actions to find more}
20:   bestA ← bestA ∪ responses_i(bestA)
21: end for
\[ \max_{A_0,A_1,A_2,A_3,A_4} Q_1(S, A_1, A_2) + Q_2(S, A_1, A_3) + Q_3(S, A_2, A_3) + Q_4(S, A_3, A_4) + Q_5(S, A_4, A_1) + Q_6(S, A_0, A_4) \tag{3.2} \]
Agents can be eliminated in any order and we designed this example to be illustrative of
the process. If we choose to first eliminate agent 1, then we start by identifying functions that
agent 1 depends on as per lines 5-10 of Algorithm 3.2. This is shown in Equation 3.3
\[ \max_{A_0,A_2,A_3,A_4} Q_3(S, A_2, A_3) + Q_4(S, A_3, A_4) + Q_6(S, A_0, A_4) + \max_{A_1} \big( Q_1(S, A_1, A_2) + Q_2(S, A_1, A_3) + Q_5(S, A_4, A_1) \big) \tag{3.3} \]
We can call the inner maximization in Equation 3.3 e1 and isolate it in Equation 3.4. It represents agent 1’s best response to all possible actions of the agents that agent 1 must coordinate
with (ΓC (agent 1) =agents 2, 3, and 4). Solving the maximization to describe this function
requires enumerating over all combinations of actions from A1 , A2 , A3 , and A4 (see lines
11-14 of Algorithm 3.2). Because this function is specific to a state, it doesn’t need the state as
a parameter.
\[ e_1(A_2, A_3, A_4) = \max_{A_1} \big( Q_1(S, A_1, A_2) + Q_2(S, A_1, A_3) + Q_5(S, A_4, A_1) \big) \tag{3.4} \]
Equation 3.4 can be plugged into Equation 3.3, replacing the functions it was constructed
from as per lines 15-16 of Algorithm 3.2. This alters the goal maximization to Equation 3.5,
which no longer is parameterized by agent 1’s action.
\[ \max_{A_0,A_2,A_3,A_4} Q_3(S, A_2, A_3) + Q_4(S, A_3, A_4) + Q_6(S, A_0, A_4) + e_1(A_2, A_3, A_4) \tag{3.5} \]
Agent 1 has thereby been fully eliminated. This alters the effective coordination graph
used to determine the neighbors during the remainder of the selection process. Agent 1 is
gone, but all agents connected to agent 1 are now connected to each other. The process can be
repeated with agent 2. We begin by separating the components that include agent 2’s action, as
in Equation 3.6.
\[ \max_{A_0,A_3,A_4} Q_4(S, A_3, A_4) + Q_6(S, A_0, A_4) + \max_{A_2} \big( Q_3(S, A_2, A_3) + e_1(A_2, A_3, A_4) \big) \tag{3.6} \]
The contribution of agent 2 is also stored as a function, e2 , which is shown in Equation 3.7.
\[ e_2(A_3, A_4) = \max_{A_2} \big( Q_3(S, A_2, A_3) + e_1(A_2, A_3, A_4) \big) \tag{3.7} \]
Once again, the newly defined function (now Equation 3.7) can be plugged into the maximization (Equation 3.6), further refining the maximization to one that does not include the
agent (shown in Equation 3.8). This not only eliminates agent 2, but it also replaces the previously defined function, e1 . It is important, however, that the selector retains the original e1
function to use it to recover the maximizing action once all of the parameters are known.
\[ \max_{A_0,A_3,A_4} Q_4(S, A_3, A_4) + Q_6(S, A_0, A_4) + e_2(A_3, A_4) \tag{3.8} \]
We can use the same process to eliminate agent 0. This does not subsume e2 , so both are
included.
\[ e_0(A_4) = \max_{A_0} \big( Q_6(S, A_0, A_4) \big) \tag{3.9} \]
Using the new function to eliminate agent 0 reduces the problem to a two-action maximization.
\[ \max_{A_3,A_4} Q_4(S, A_3, A_4) + e_0(A_4) + e_2(A_3, A_4) \tag{3.10} \]
Agent 3 is eliminated next.
\[ e_3(A_4) = \max_{A_3} \big( Q_4(S, A_3, A_4) + e_2(A_3, A_4) \big) \tag{3.11} \]
This leaves only one agent’s action to be maximized.
\[ \max_{A_4} e_0(A_4) + e_3(A_4) \tag{3.12} \]
The final elimination is performed on agent 4.
\[ e_4() = \max_{A_4} \big( e_0(A_4) + e_3(A_4) \big) \tag{3.13} \]
This leaves a “maximization” that is a single scalar value.
\[ \max e_4() \tag{3.14} \]
Taking the final introduced function, e4, which is just the single optimal reward, and using
Equation 3.13 as an argmax, we can find the optimal individual action A4. This can be plugged
into an argmax version of Equation 3.11 to give us the optimal A3 and Equation 3.9 to give
us the optimal A0 . A4 and A3 in turn plug into the argmax of Equation 3.7 to give us A2 ,
which allows computation of A1 from the argmax version of Equation 3.4, completing the
joint action. If the maximizing value is stored during the initial computation of the functions’
values, then these computations are just retrievals with an increasingly large partial joint action
(as in lines 18-21 of Algorithm 3.2).
This example requires finding the value of a joint action $m^3 + 2m^2 + 2m$ times, whereas a
Brute Force solution would take $m^5$ value calculations to reach the same answer. This is a very
real advantage even on the relatively small five-agent case. Despite the improvement, the step
of creating a new derived function means that the time remains exponential in the number
of agents in each best response function plus one (the eliminated agent). The space required is
also exponential in the number of agents in the best response functions. This number is also the largest
width among the coordination graphs induced during the process of elimination. On our coordination graph (explained later and shown in Figure 4.3) and with the order that we eliminate
agents, this value is k. This gives Agent Elimination a time complexity of $O(nkm^{k+1})$ and a
space complexity of $O(nm^k)$ for our structure. This time complexity requires the Gray code
optimization from Section 3.1.1, which we did not use in our testing. Without it, the time is
$O(nk^2 m^{k+1})$ (a factor of k is added instead of n because we were computing values of partial
joint actions, not full joint actions, so many of the decomposed Q value functions were already
being skipped).
The order in which the agents are eliminated does not affect the correctness of the solution,
but it does have an effect on the practical performance, as different orders can change the
connectivity of the induced graphs[6].
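The following Python sketch walks through the same eliminate-then-backtrack procedure as Algorithm 3.2 on a toy two-edge coordination graph. The factor representation, the example tables, and the elimination order are illustrative assumptions rather than the thesis implementation, and the sketch assumes a connected graph.

# A minimal sketch of Agent (Variable) Elimination on pairwise Q tables.
import itertools

m = 2  # actions per agent in this toy example
# A factor is (scope tuple of agent ids, {partial joint action over scope: value}).
def pairwise_factor(i, j, table):
    return ((i, j), {(ai, aj): table[ai][aj] for ai in range(m) for aj in range(m)})

factors = [
    pairwise_factor(0, 1, [[1.0, 0.0], [0.0, 2.0]]),   # hypothetical Q_{0,1}
    pairwise_factor(1, 2, [[0.5, 1.5], [2.0, 0.0]]),   # hypothetical Q_{1,2}
]

def eliminate(factors, order):
    """Eliminate agents one at a time, recording best responses for backtracking."""
    responses = []  # (agent, scope of its best-response table, argmax table)
    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        rest = [f for f in factors if agent not in f[0]]
        new_scope = tuple(sorted({v for s, _ in involved for v in s if v != agent}))
        new_table, argmax_table = {}, {}
        for assignment in itertools.product(range(m), repeat=len(new_scope)):
            ctx = dict(zip(new_scope, assignment))
            best_val, best_act = float("-inf"), None
            for a in range(m):             # the agent's best response to this context
                ctx[agent] = a
                val = sum(tab[tuple(ctx[v] for v in s)] for s, tab in involved)
                if val > best_val:
                    best_val, best_act = val, a
            new_table[assignment] = best_val
            argmax_table[assignment] = best_act
        responses.append((agent, new_scope, argmax_table))
        factors = rest + [(new_scope, new_table)]
    optimal_value = factors[-1][1][()]      # last factor is parameterless
    # Backtrack in reverse elimination order to recover the joint action.
    chosen = {}
    for agent, scope, argmax_table in reversed(responses):
        chosen[agent] = argmax_table[tuple(chosen[v] for v in scope)]
    return chosen, optimal_value

print(eliminate(factors, order=[0, 2, 1]))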
3.1.2.1 Advantages and Drawbacks
Agent Elimination is our only viable exact method. It is exponentially faster than Brute
Force, but is still exponential in the amount of coordination. Because it builds the exact
solution only after the full calculation, it is not an anytime method. While it can be counted on to
give the exact best action after a predictable amount of time, it cannot be halted early for a
merely good action.
Agent Elimination can be parallelized for use in a distributed system of agents [11]. This is
accomplished by having an agent calculate the conditional response function locally, then pass
its result to a single one of its neighbors. This replaces the agent’s contribution with a function
of other agents, effectively eliminating it from the coordination graph. This is repeated until
no neighbors of a receiving agent remain, at which point the responses chain normally.
3.2 Max Plus Action Selector
Our first approximate method is Max Plus[10][18][19]. It is shown in Algorithm 3.3 and
named for the maximization and summation in line 7. Like Agent Elimination, it works with
best responses. However, whereas Agent Elimination creates exponentially large tables to
indicate best responses to all remaining coordinated agents, Max Plus has each agent (the
“sender”) create best responses individually to each other agent (the “recipient”) with which it
coordinates. The sending agents notify recipient agents of the responses by sending messages
containing value offsets for each of the recipient’s actions based on how good the best response
to that action is (as in line 7 of Algorithm 3.3). The messages take into account the Q value function
component that the sender shares with the recipient and the most recently received messages
from other neighbors of the sender. This allows value information to propagate through the
coordination graph to all other agents over the course of several iterations. The construction
of the messages (see Algorithm 3.3), which dominates other factors, takes $O(nkm^2)$ time per
iteration. Max Plus's $O(nkm)$ space requirement is less than the $O(nkm^2)$ required to store
the decomposed Q value functions.
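A minimal Python sketch of one round of message construction and the resulting per-agent maximization is given below. The edge-table representation, the helper names, and the toy graph in the driver are assumptions for illustration, not the thesis implementation; on a cyclic graph the best-action tracking of line 17 would also be needed.

# A minimal sketch of one Max Plus message round (lines 4-12 of Algorithm 3.3).
def q_edge(Q, i, j, ai, aj):
    """Look up Q_{i,j}(S, a_i, a_j) from tables stored with i < j."""
    return Q[(i, j)][ai][aj] if (i, j) in Q else Q[(j, i)][aj][ai]

def max_plus_round(Q, neighbors, m, old_msgs):
    """old_msgs[(i, j)][a_j]: previous message from i to j; returns new messages."""
    new_msgs = {}
    for i in neighbors:
        outgoing = {}
        for j in neighbors[i]:
            outgoing[j] = [
                max(q_edge(Q, i, j, ai, aj) +
                    sum(old_msgs[(k, i)][ai] for k in neighbors[i] if k != j)
                    for ai in range(m))
                for aj in range(m)
            ]
        # Normalize by the average outgoing value (lines 10-11) to tame cycles.
        total = sum(sum(vals) for vals in outgoing.values())
        norm = total / (m * len(neighbors[i]))
        for j, vals in outgoing.items():
            new_msgs[(i, j)] = [v - norm for v in vals]
    return new_msgs

def select(msgs, neighbors, m):
    """Each agent picks the action maximizing its summed incoming messages."""
    return tuple(max(range(m),
                     key=lambda a: sum(msgs[(j, i)][a] for j in neighbors[i]))
                 for i in sorted(neighbors))

if __name__ == "__main__":
    m = 2
    Q = {(0, 1): [[1.0, 0.0], [0.0, 2.0]], (1, 2): [[0.5, 1.5], [2.0, 0.0]]}
    neighbors = {0: [1], 1: [0, 2], 2: [1]}
    msgs = {(i, j): [0.0] * m for i in neighbors for j in neighbors[i]}
    for _ in range(10):
        msgs = max_plus_round(Q, neighbors, m, msgs)
    print(select(msgs, neighbors, m))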
3.2.1 Example
[Figure: the coordination graph over agents 0-4, with a 2×2 Q value table attached to each edge. Reading rows as the lower-numbered agent's action and columns as the higher-numbered agent's action, the tables are Q0,4 = [4 7; 5 2], Q1,2 = [0 9; 7 1], Q1,3 = [2 1; 7 6], Q1,4 = [1 3; 0 8], Q2,3 = [2 3; 9 8], and Q3,4 = [1 3; 6 3].]

Figure 3.1 The coordination graph from Figure 2.1 with concrete Q value functions. It represents each Qi,j(S, Ai, Aj) as a table for a specific state S and places the table on the edge that links agent i and agent j. The cell at (row r, column c) in this m × m table contains the value, Qi,j(S, r, c), that Qi,j contributes when Ai = r and Aj = c. For this example, we have chosen two as the number of actions m per agent.
To demonstrate Max Plus, we can use the example coordination graph in Figure 3.1. Agent
2 in this graph is a good example for discussion due to its placement, which allows us to show
interaction with neighbors and indirectly with non-neighbors. Each agent creates messages to
each of its neighbors as per lines 4-12 of Algorithm 3.3.
Because the messages from other agents are not yet available, the first messages that agent
2 sends to its neighboring agents (agents 1 and 3) are only impacted by the Q value functions
that it shares with them (line 7 of Algorithm 3.3). For its message to agent 1, it maximizes its
Algorithm 3.3 Max Plus
Require: S: The current state
Require: Q: The Q value function. Reference each decomposed Q value function of agents i and j as Qi,j(S, Ai, Aj) = Qj,i(S, Aj, Ai), rather than by index
Require: agents: The set of agents
Require: A = {A_i}_1^n: The action space
 1: Initialize message i→j(Aj) to 0 for all agent pairs where j ∈ ΓC(i) and all actions Aj in Aj
 2: while not converged and not halted do
 3:   message i→j_old(Aj) ← message i→j(Aj) for all agent pairs where j ∈ ΓC(i) and all actions Aj in Aj
 4:   for each agent i in agents do
 5:     for each agent j in ΓC(i) do {Agent i sends a message to each neighbor j}
 6:       for each individual action Aj ∈ Aj do
 7:         message i→j(Aj) ← max_{Ai ∈ Ai} ( Qi,j(S, Ai, Aj) + Σ_{k ∈ ΓC(i)\j} message k→i_old(Ai) )
 8:       end for
 9:     end for
10:     normalizingConstant ← ( Σ_{k ∈ ΓC(i)} Σ_{Ak ∈ Ak} message i→k(Ak) ) / ( Σ_{k ∈ ΓC(i)} m_k )
11:     message i→j(Aj) ← message i→j(Aj) − normalizingConstant for each j ∈ ΓC(i) and Aj ∈ Aj
12:   end for
13:   currentA ← ∅ {Build current action as a set}
14:   for each agent i in agents do {Current action maximizes the received messages for each agent}
15:     currentA ← currentA ∪ argmax_{Ai ∈ Ai} Σ_{j ∈ ΓC(i)} message j→i(Ai)
16:   end for
17:   bestA ← argmax_{A ∈ {bestA, currentA}} Q(S, A) {Track the best joint action seen}
18: end while
19: return bestA
own action parameterized by agent 1’s possible actions, picking its best response to each. The
Q function from Figure 3.1, used to construct the message, is shown in Table 3.1.
Q1,2      A2 = 0   A2 = 1
A1 = 0       0        9
A1 = 1       7        1

Table 3.1 The Q function Q1,2(S, A1, A2) from Figure 3.1. This is the basis of agent 2's message to agent 1.
To maximize the value according to Table 3.1, agent 2 responds to agent 1's first action
by using agent 2's second action and responds to agent 1's second action with its first. The
unnormalized message is therefore (9, 7) over agent 1's actions, the partial Q values tied to these responses.
For the message to agent 3, it uses the Q value function it shares with that agent, shown in
Table 3.2, but originally taken from Figure 3.1.
Q2,3      A3 = 0   A3 = 1
A2 = 0       2        3
A2 = 1       9        8

Table 3.2 The Q function Q2,3(S, A2, A3) from Figure 3.1. This is the basis of agent 2's message to agent 3.
To maximize the value according to Table 3.2, agent 2 responds to agent 3's first action by
using agent 2's second action and responds to agent 3's second action with its second. The
unnormalized message is therefore (9, 8) over agent 3's actions.
The normalization constant for agent 2's messages, as per line 10 of Algorithm 3.3, is ((9+7)+(9+8))/(2+2) = 8.25. Subtracting this constant makes the final normalized messages (0.75, −1.25) over agent 1's actions and (0.75, −0.25) over agent 3's actions, using line 11 of Algorithm 3.3. Without these normalization constants, the messages become incorrect in a cyclic graph. In the example graph, this would happen when a message from agent 1 including agent 2's messages changes the value of agent 2's message to agent 3, which contributes to agent 3's message to agent 1, which will cause agent 1's message to agent 2 to incorrectly contain some of agent 2's message to agent 1, which throws off the calculation. With the specified values, this will also cause the values of the messages to increase constantly.

The same process is followed for every agent. This yields the full set of normalized messages shown in Table 3.3. Agent 2 receives two messages, one from each of its neighbors. With these messages, following the maximization in lines 13-16 of Algorithm 3.3, it combines the messages it received and chooses the action with the highest value. Agent 2 can see that taking action 0 is valued at 0.67 by agent 1 and -2 by agent 3, but action 1 is valued at 2.67 by agent 1 and 4 by agent 3. The total value of 6.67 for action 1 is greater than the -1.33 provided by action 0, so agent 2 will select action 1. The full joint action picked by the agents with these messages is {0, 1, 1, 1, 1}. When this joint action is plugged into all of the Q functions in Figure 3.1, the Q value can be calculated to be Q0,4(S, 0, 1) + Q1,2(S, 1, 1) + Q1,3(S, 1, 1) + Q1,4(S, 1, 1) + Q2,3(S, 1, 1) + Q3,4(S, 1, 1) = 7 + 1 + 6 + 8 + 8 + 3 = 33.

To Agent 0:  from Agent 4: A0 = ( 1.67, -0.33)
To Agent 1:  from Agent 2: A1 = ( 0.75, -1.25)*   from Agent 3: A1 = (-3.00,  2.00)   from Agent 4: A1 = (-2.33,  2.67)
To Agent 2:  from Agent 1: A2 = ( 0.67,  2.67)    from Agent 3: A2 = (-2.00,  4.00)
To Agent 3:  from Agent 1: A3 = ( 0.67, -0.33)    from Agent 2: A3 = ( 0.75, -0.25)*   from Agent 4: A3 = (-2.33,  0.67)
To Agent 4:  from Agent 0: A4 = (-1.00,  1.00)    from Agent 1: A4 = (-5.33,  1.67)    from Agent 3: A4 = ( 1.00, -2.00)

Table 3.3 Max Plus Example: The messages after one iteration of Max Plus, organized by recipient. Each entry gives the message values over the recipient's two actions. Agent 2's messages, for which we have shown the calculations, are marked with an asterisk.
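The worked numbers above can be reproduced with a few lines of Python; the edge tables are the ones read off Figure 3.1, and this is only a sketch of the arithmetic, not the thesis code.

# A small numeric check of agent 2's first Max Plus round and the value of the
# joint action {0, 1, 1, 1, 1}.
Q12 = [[0, 9], [7, 1]]   # Q_{1,2}(S, A1, A2)
Q23 = [[2, 3], [9, 8]]   # Q_{2,3}(S, A2, A3)

# Agent 2's unnormalized best-response messages (line 7 of Algorithm 3.3).
msg_to_1 = [max(Q12[a1][a2] for a2 in range(2)) for a1 in range(2)]   # [9, 7]
msg_to_3 = [max(Q23[a2][a3] for a2 in range(2)) for a3 in range(2)]   # [9, 8]

# Normalize by the average outgoing value (lines 10-11 of Algorithm 3.3).
norm = (sum(msg_to_1) + sum(msg_to_3)) / 4.0                          # 8.25
print([v - norm for v in msg_to_1], [v - norm for v in msg_to_3])
# -> [0.75, -1.25] [0.75, -0.25], matching Table 3.3

# Value of the joint action {0, 1, 1, 1, 1} chosen after this round.
Q13, Q14 = [[2, 1], [7, 6]], [[1, 3], [0, 8]]
Q34, Q04 = [[1, 3], [6, 3]], [[4, 7], [5, 2]]
A = (0, 1, 1, 1, 1)
print(Q04[A[0]][A[4]] + Q12[A[1]][A[2]] + Q13[A[1]][A[3]] +
      Q14[A[1]][A[4]] + Q23[A[2]][A[3]] + Q34[A[3]][A[4]])            # 33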
The second iteration of messages takes the first iteration of messages into account. Because
all other messages that an agent receives are sent to each neighbor, neighbors of neighbors
receive their first information about each other with the second iteration of messages.
For agent 2, when it sends a message to agent 1 in the second iteration, it takes into account
the message it received from agent 3 and Q1,2 (S, A1 , A2 ), as per lines 4-12 of Algorithm 3.3.
Similarly, when sending messages to agent 3, agent 2 takes into account the message it received
from agent 1 and Q2,3 (S, A2 , A3 ). These derived value functions are shown in Tables 3.4 and
3.5, respectively. When running the algorithm, there is no need to store them as tables, but for
the sake of example, we show what the terms calculated in lines 4-12 of Algorithm 3.3 look
like.
          A2 = 0    A2 = 1
A1 = 0    0 − 2     9 + 4
A1 = 1    7 − 2     1 + 4

Table 3.4 A Q function made from Q1,2(S, A1, A2) from Figure 3.1 and agent 3's message to agent 2 (see Table 3.3). This is the basis of agent 2's message to agent 1.
          A3 = 0      A3 = 1
A2 = 0    2 + 0.67    3 + 0.67
A2 = 1    9 + 2.67    8 + 2.67

Table 3.5 A Q function made from Q2,3(S, A2, A3) from Figure 3.1 and agent 1's message to agent 2 (see Table 3.3). This is the basis of agent 2's message to agent 3.
To maximize its response to agent 1, agent 2 responds to agent 1's first action by using agent
2's second action and responds to agent 1's second action with either. To maximize its response
to agent 3, agent 2 responds to agent 3's first action by using agent 2's second action and
responds to agent 3's second action with its second. The unnormalized messages are therefore
(13, 5) over agent 1's actions and (11.67, 10.67) over agent 3's actions. The normalization constant of
((13+5)+(11.67+10.67))/(2+2) = 10.08 is used, leaving (2.92, −5.08) over A1 and (1.58, 0.58) over A3.
Again, the other agents follow the same process to give the normalized messages in Table
3.6.
Maximizing these messages results in the selector picking joint action {0, 1, 0, 0, 1}. This has a Q value of 34, which surpasses the previous high of 33 and, therefore, replaces the previous best joint action (see line 17 of Algorithm 3.3). This now gives agent 2 its first information based on agent 4, through the messages from agents 1 and 3. All further rounds of messages follow the same procedures.

To Agent 0:  from Agent 4: A0 = ( 1.89, -3.11)
To Agent 1:  from Agent 2: A1 = ( 2.92, -5.08)*   from Agent 3: A1 = (-3.47,  1.53)   from Agent 4: A1 = (-2.78,  2.22)
To Agent 2:  from Agent 1: A2 = ( 4.56, -1.44)    from Agent 3: A2 = (-1.56,  3.44)
To Agent 3:  from Agent 1: A3 = ( 1.31,  0.31)    from Agent 2: A3 = ( 1.58,  0.58)*   from Agent 4: A3 = ( 0.89,  0.89)
To Agent 4:  from Agent 0: A4 = (-1.00,  1.00)    from Agent 1: A4 = (-6.36,  1.64)    from Agent 3: A4 = ( 0.53, -0.47)

Table 3.6 Max Plus Example: The messages after two iterations of Max Plus, organized by recipient. Agent 2's messages, for which we have shown the calculations, are marked with an asterisk.

After many rounds, the messages converge to those shown in Table 3.7. With a Q value of 35, the joint action picked by maximizing the converged messages, {0, 1, 1, 0, 1}, is the optimal action.

To Agent 0:  from Agent 4: A0 = ( 1.17, -2.83)
To Agent 1:  from Agent 2: A1 = ( 4.00, -4.00)   from Agent 3: A1 = (-2.67,  2.33)   from Agent 4: A1 = (-1.33,  3.67)
To Agent 2:  from Agent 1: A2 = ( 5.42, -0.58)   from Agent 3: A2 = (-2.08,  4.92)
To Agent 3:  from Agent 1: A3 = (-0.17, -1.17)   from Agent 2: A3 = ( 0.00,  0.00)   from Agent 4: A3 = (-0.33, -0.33)
To Agent 4:  from Agent 0: A4 = (-1.00,  1.00)   from Agent 1: A4 = (-3.75,  0.25)   from Agent 3: A4 = (-0.25, -2.25)

Table 3.7 Max Plus Example: The messages after many iterations, organized by recipient.
3.2.2 Advantages, Drawbacks, and Comparison
Max Plus is an iterative approximation that attempts for each agent to build a response table
that takes into account the optimal actions from all other agents. This is exactly the kind of
table that is built for the second-to-last agent in the Agent Elimination method.
Because it is iterative, Max Plus can easily gain the anytime property by computing a
solution after each iteration and tracking the best one. While it is typically guaranteed to
converge to an optimal solution, cycles, such as those in the coordination graphs we
are using, disrupt the guarantee (which can be restored at great computational cost). While a
cyclic coordination graph means that it is possible for Max Plus to converge to a non-optimal
result or even to fail to converge at all, prior work shows that this does not prevent Max Plus from achieving
high-quality results[18].
Since the agents are the message creators, Max Plus handles decentralization well by parallelizing on the agent level. When distributed, it can also abandon iterations, passing and
receiving messages asynchronously.
3.3 Mean Field Action Selector
We also examined a variant of mean field methods for graphical inference [9][8]. Mean
field methods can be used to approximate functions of many variables. We approximate the
Q value function. A mean field method tries to find a set of single-agent functions $\{f_i\}_{i=0}^{n-1}$
such that $Q(S, A) \approx \sum_{i=0}^{n-1} f_i(S, A_i)$. Because this simplifies the Q value function to n non-overlapping
single-variable functions, once it is computed, the maximization only requires the
trivial step of maximizing each agent's individual action independently. The challenge of the
Mean Field action selector therefore lies in computing the marginal functions.
As we compute the marginals, we treat them in terms of probability, rather than value. In
the goal distribution, the probability that the ith action has a specific value a is proportional to
the expected value of the Q value using the probabilities as the distribution, as in Equation 3.15.
As this is self-referential, we compute it iteratively, feeding its result back into itself repeatedly
so that it can converge. This is equivalent to finding the value achieved by visiting each action
with probability according to the previous distribution and normalizing that as a Boltzmann
distribution. In our algorithms, we refer to the probability P r(Ai = a) from Equation 3.15 as
probability (i, a).
\[ \Pr(A_i = a) \;\propto\; E\!\left[ Q\!\left(S, a \cup A_{\setminus i}\right) \right] \;=\; \sum_{b \in \mathcal{A}} \prod_{b_j \in b} \Pr(A_j = b_j \mid A_i = a)\; Q\!\left(S, a \cup b_{\setminus i}\right) \tag{3.15} \]
The basic approach can be seen in Algorithm 3.4. Implemented like that, it checks the
values of all actions every iteration, making it even slower than Brute Force. We use an implementation that takes advantage of the coordination graph structure and which has the anytime
property. The optimized implementation is shown in Algorithm 3.5, which references the update function in Algorithm 3.6 due to the length of the algorithm. It is far more efficient at
$O(nkm^2)$ per iteration. For both implementations, the space complexity is the $O(nm)$ required to store the probabilities, though that is dominated by the $O(nkm^2)$ needed to store the Q value
functions shared by all of the action selection methods.
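The optimized update can be sketched in Python as follows. The edge-table representation, the helper names, and the toy graph are assumptions for illustration; only the neighbors on the coordination graph are visited, giving the per-iteration cost noted above.

# A minimal sketch of the coordination-graph Mean Field update (Algorithms 3.5 and 3.6).
import math

def q_edge(Q, i, j, ai, aj):
    """Look up Q_{i,j}(S, a_i, a_j) from tables stored with i < j."""
    return Q[(i, j)][ai][aj] if (i, j) in Q else Q[(j, i)][aj][ai]

def recalculate_probabilities(Q, neighbors, m, prob_old):
    """One sweep of Algorithm 3.6: expected contributions become Boltzmann weights."""
    prob = {}
    for i in neighbors:
        # Unnormalized log-probability of each action: the expected contribution of
        # agent i's decomposed Q functions under the neighbors' old probabilities.
        logp = [sum(q_edge(Q, i, j, ai, aj) * prob_old[j][aj]
                    for j in neighbors[i] for aj in range(m))
                for ai in range(m)]
        z = sum(math.exp(v) for v in logp)
        prob[i] = [math.exp(v) / z for v in logp]
    return prob

def mean_field_select(Q, neighbors, m, iterations=20):
    prob = {i: [1.0 / m] * m for i in neighbors}        # lines 1-5 of Algorithm 3.5
    for _ in range(iterations):
        prob = recalculate_probabilities(Q, neighbors, m, prob)
    return tuple(max(range(m), key=lambda a: prob[i][a]) for i in sorted(neighbors))

if __name__ == "__main__":
    Q = {(0, 1): [[1.0, 0.0], [0.0, 2.0]], (1, 2): [[0.5, 1.5], [2.0, 0.0]]}
    neighbors = {0: [1], 1: [0, 2], 2: [1]}
    print(mean_field_select(Q, neighbors, m=2))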
[Figure: duplicate of Figure 3.1 — the coordination graph over agents 0-4 with the same Q value tables on each edge.]

Figure 3.2 Duplicate of Figure 3.1. The coordination graph from Figure 2.1 with concrete Q value functions. It represents each Qi,j(S, Ai, Aj) as a table for a specific state S and places the table on the edge that links agent i and agent j. The cell at (row r, column c) in this m × m table contains the value, Qi,j(S, r, c), that Qi,j contributes when Ai = r and Aj = c. For this example, we have chosen two as the number of actions m per agent.
Algorithm 3.4 Mean Field (Naïve Method)
Require: S: The current state
Require: Q: The Q value function
Require: agents: The set of agents
Require: A = {A_i}_1^n: The action space
 1: while not converged and not halted do
 2:   for each agent i in agents do
 3:     normalizationSum ← 0
 4:     for each individual action Ai ∈ Ai do
 5:       newValue ← 0 {new unnormalized probability for agent i choosing action Ai}
 6:       for each joint action A in A that has agent i using Ai do
 7:         newValue ← newValue + Q(S, A) ∗ Π_{j ∈ agents, j ≠ i} probability(j, Aj)
 8:       end for
 9:       unnormalizedNextProbability(i, Ai) ← newValue
10:       normalizationSum ← normalizationSum + newValue
11:     end for
12:     for each individual action Ai ∈ Ai do
13:       probability(i, Ai) ← unnormalizedNextProbability(i, Ai) / normalizationSum {Normalize agents individually}
14:     end for
15:   end for
16: end while
17: currentA ← ∅ {Build current action as a set}
18: for each agent i in agents do {Current action maximizes the probabilities}
19:   currentA ← currentA ∪ argmax_{Ai ∈ Ai} (probability(i, Ai))
20: end for
21: return currentA
Algorithm 3.5 Mean Field
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
Require: agents: The set of agents
Require: A = {A_i}_1^n: The action space
 1: for each agent i in agents do {Initialize probabilities}
 2:   for each individual action Ai ∈ Ai do
 3:     probability(i, Ai) ← 1/m_i
 4:   end for
 5: end for
 6: while not converged and not halted do
 7:   probability ← recalculateProbabilities(probability, S, Q, agents, A) (See Algorithm 3.6)
 8:   currentA ← ∅ {Build current action as a set}
 9:   for i in agents do {Current action maximizes the probabilities}
10:     currentA ← currentA ∪ argmax_{Ai ∈ Ai} probability(i, Ai)
11:   end for
12: end while
13: return currentA
Using the same coordination graph as Agent Elimination and Max Plus (Figure 3.2), we
can demonstrate the Mean Field algorithm. It begins, as per lines 1-5 of Algorithm 3.5, with
all probabilities equal. In the case of our example, each agent has an equal 1/2 probability for
each of its two actions, as shown in Table 3.8.
    A0            A1            A2            A3            A4
 1/2   1/2     1/2   1/2     1/2   1/2     1/2   1/2     1/2   1/2

Table 3.8 Mean Field Example: Probabilities are initialized to 1/m.
Algorithm 3.6 Mean Field - recalculateProbabilities
Require: probability_old: The probabilities from the previous iteration
Require: S: The current state
Require: Q: The Q value function. Reference each decomposed Q value function of agents i and j as Qi,j(S, Ai, Aj) = Qj,i(S, Aj, Ai), rather than by index
Require: agents: The set of agents
Require: A = {A_i}_1^n: The action space
 1: for each agent i in agents do
 2:   for each individual action Ai ∈ Ai do
 3:     sum ← 0 {new unnormalized probability for agent i choosing action Ai}
 4:     for each agent j in ΓC(i) do {For each of agent i's neighbors}
 5:       for each individual action Aj ∈ Aj do
 6:         sum ← sum + Qi,j(S, Ai, Aj) ∗ probability_old(j, Aj) {Weight the values by their probabilities}
 7:       end for
 8:     end for
 9:     unnormalizedLogProbability(i, Ai) ← sum
10:   end for
11:   for each individual action Ai ∈ Ai do
12:     probability(i, Ai) ← e^{unnormalizedLogProbability(i, Ai)} / Σ_{Bi ∈ Ai} e^{unnormalizedLogProbability(i, Bi)}
13:   end for
14: end for
15: return probability
For agent 0, there is only one neighbor, agent 4. This means that the unnormalized log
probability of action 0 is calculated (as per lines 3-9 of Algorithm 3.6) with Q0,4 (S, 0, 0) ∗
probability (4, 0) + Q0,4 (S, 0, 1) ∗ probability (4, 1) = 4 ∗ (0.5) + 7 ∗ (0.5) = 5.5. This is
part of the expected value of action 0, under the assumption that all other agents act according
to the previous probabilities. In particular, it is the only part of the expected value that will
differentiate it from other actions of the same agent (the shared parts of the expected value
nicely cancel out during normalization, so there is no need to include them). Action 1 looks
similar, with the unnormalized log probability calculated as Q0,4 (S, 1, 0) ∗ probability (4, 0) +
Q0,4(S, 1, 1) ∗ probability(4, 1) = 2 ∗ (0.5) + 5 ∗ (0.5) = 3.5. The normalization in lines
11-13 of Algorithm 3.6 reduces these to $e^{5.5}/(e^{5.5}+e^{3.5}) = 0.881$ and $e^{3.5}/(e^{5.5}+e^{3.5}) = 0.119$. These will
become the new values for agent 0's probabilities.
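The same arithmetic for agent 0 can be checked with a short Python snippet, using the Q0,4 table read off Figure 3.2; this is a sketch of the calculation, not the thesis code.

# A quick check of agent 0's first Mean Field update in the worked example.
import math

Q04 = [[4, 7], [5, 2]]          # Q_{0,4}(S, A0, A4)
p4 = [0.5, 0.5]                 # agent 4's initial probabilities

logp = [sum(Q04[a0][a4] * p4[a4] for a4 in range(2)) for a0 in range(2)]  # [5.5, 3.5]
z = sum(math.exp(v) for v in logp)
print([round(math.exp(v) / z, 3) for v in logp])   # [0.881, 0.119]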
This operation is repeated for all agents using the original 1/2 probabilities. The probabilities
at the end of the first iteration are shown in Table 3.9. Calculating the joint action that would be
selected at this iteration is a simple matter of each agent independently picking the individual
action with the highest probability (as per lines 8-11 of Algorithm 3.5). This joint action is
{0, 1, 1, 1, 1}. Plugging this into all of the Q functions in Figure 3.2, the Q value can be
calculated to be Q0,4 (S, 0, 1) + Q1,2 (S, 1, 1) + Q1,3 (S, 1, 1) + Q1,4 (S, 1, 1) + Q2,3 (S, 1, 1) +
Q3,4 (S, 1, 1) = 7 + 1 + 6 + 8 + 8 + 3 = 33.
      A0              A1              A2              A3              A4
 0.881   0.119    0.002   0.998    0.001   0.999    0.182   0.818    0.011   0.989

Table 3.9 Mean Field Example: Calculated probabilities after the first iteration.
The next iteration has the same computation. For action 0 of agent 0, the unnormalized log
probability is 4 ∗ (0.011) + 7 ∗ (0.989) = 6.967 and for action 1 it is 5 ∗ (0.011) + 2 ∗ (0.989) =
2.033. The normalized probabilities for agent 0 in this iteration are $e^{6.967}/(e^{6.967}+e^{2.033}) = 0.993$ and $e^{2.033}/(e^{6.967}+e^{2.033}) = 0.007$.
Agent 1, with its three edges in the coordination graph, is slightly more intensive to calculate, but works the same: with each of its individual actions, it goes through each neighbor, summing up the value of taking its action with each of the neighbor’s action, weighting by how likely the neighbor is to act that way. For action 0, this is (Q1,2 (S, 0, 0) ∗
probability (2, 0) + Q1,2 (S, 0, 1) ∗ probability (2, 1)) + (Q1,3 (S, 0, 0) ∗ probability (3, 0) +
Q1,3 (S, 0, 1) ∗ probability (3, 1)) + (Q1,4 (S, 0, 0) ∗ probability (4, 0) + Q1,4 (S, 0, 1) ∗
probability (4, 1)), which is (0 ∗ 0.001 + 9 ∗ 0.999) + (2 ∗ 0.182 + 1 ∗ 0.818) + (1 ∗ 0.011 +
3 ∗ 0.989) = 13.115.
Action 1 for agent 1 can be computed similarly, differing only in parameterization:
(Q1,2 (S, 1, 0) ∗ probability (2, 0) + Q1,2 (S, 1, 1) ∗ probability (2, 1)) +
(Q1,3 (S, 1, 0) ∗ probability (3, 0) + Q1,3 (S, 1, 1) ∗ probability (3, 1)) +
(Q1,4 (S, 1, 0) ∗ probability (4, 0) + Q1,4 (S, 1, 1) ∗ probability (4, 1)). This is (7 ∗ 0.001 + 1 ∗
0.999) + (7 ∗ 0.182 + 6 ∗ 0.818) + (0 ∗ 0.011 + 8 ∗ 0.989) = 15.098. These unnormalized log
probabilities are normalized to $e^{13.115}/(e^{13.115}+e^{15.098}) = 0.125$ and $e^{15.098}/(e^{13.115}+e^{15.098}) = 0.875$.
Again, the other agents are calculated in the same way, yielding the probabilities in Table
3.10, which would result in picking {0, 1, 0, 0, 1} with a Q value of 34.
      A0              A1              A2              A3              A4
 0.993   0.007    0.125   0.875    0.649   0.351    0.875   0.125    0.000   1.000

Table 3.10 Mean Field Example: Calculated probabilities after the second iteration.
      A0              A1              A2              A3              A4
 0.993   0.007    0.000   1.000    0.067   0.933    0.669   0.331    0.000   1.000

Table 3.11 Mean Field Example: Calculated probabilities after the third iteration.
As the agents start to get meaningful probabilities from their neighbors, they begin to assign
their own probabilities based on a greater understanding of what the other agents will do. This
also gives them information about agents with which they are not coordinating. If we continue
to repeat the same process, it will converge to the probabilities in Table 3.12, which picks {0,
1, 1, 0, 1}, the global optimum with a Q value of 35.
      A0              A1              A2              A3              A4
 0.993   0.007    0.001   0.999    0.360   0.640    0.782   0.218    0.000   1.000

Table 3.12 Mean Field Example: Calculated probabilities after many iterations lead to convergence.
3.3.1 Advantages, Drawbacks, and Comparison
Mean Field is an approximate method. It has the anytime property because it can be halted
at any iteration to give a best effort result or allowed to continue until convergence. Like Max
Plus on our coordination graphs, Mean Field convergence is not guaranteed to occur at the
global optimum. Once optimized for coordination graphs, Mean Field shows a lot of structural
similarity to Max Plus. They both have iterative updates with each agent calculating new values
based on the decomposed Q value functions shared with the neighbors and the previous values
of those neighbors. As a result, information propagates across the coordination graph in the
same way in both algorithms.
Mean Field attempts to find the closest set of single-agent functions to the full Q value
function, as calculated with Kullback-Leibler divergence. The KL divergence measures the
difference between the marginal distribution and the full multivariate function in terms of information. Such a set of marginal functions guarantees that the lowest possible amount of
information is lost to the approximation. A metric like this, which takes all joint actions into
account when calculating closeness, is not ideal for our purposes. We only care about the very
narrow purpose of finding the best action, so a closer approximation that is better able to predict the Q values of very poor actions does us no good. This contrasts with the emphasis that
Max Plus places on the best responses, despite other similarities between the two algorithms.
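For reference, the quantity being minimized can be written in the standard variational form below, where q is the factored distribution built from the per-agent probabilities and p is the Boltzmann distribution induced by the full Q value function; this formulation and its notation are supplied here for illustration rather than taken from the thesis.

\[ D_{\mathrm{KL}}\!\left(q \,\middle\|\, p\right) = \sum_{A \in \mathcal{A}} q(A) \log \frac{q(A)}{p(A)}, \qquad q(A) = \prod_{i} \Pr(A_i), \qquad p(A) \propto e^{Q(S,A)} \]

Because the sum runs over every joint action, a small divergence rewards accuracy on poor actions just as much as on good ones, which is exactly the mismatch with greedy action selection described above.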
The optimized Mean Field algorithm can be parallelized on the agent level in the same
way as Max Plus: passing messages to neighbors on the coordination graph and each agent
independently calculating new messages. For Mean Field, the messages would just be the
probabilities.
3.4 Approximate Coordination Through Local Search
As part of our contribution, we introduce a novel action selector that we believe is capable
of successfully navigating the problem of action selection even at high numbers of agents by
traversing the joint action space directly.
3.4.1 Action Graphs
In order to more easily conceptualize the process of traversing the joint action space, we define an action graph (Figure 3.3 shows a small example of an action graph). In this graph, each
joint action $A = \{A_i\}_{i=0}^{n-1}$ in the joint action space $\mathcal{A} = \{\mathcal{A}_i\}_{i=0}^{n-1}$ is a node. The neighbors
of each node in the action graph are $\Gamma_A(A) = \{B \mid B \in \mathcal{A} \;\&\; B \neq A \;\&\; \exists i\; A_{\setminus i} = B_{\setminus i}\}$,
so the edges connect each pair of joint actions that share all individual actions except one.
The graph is directed, with each pair of neighbors attached by an edge in each direction. The
weightings attached to this graph vary for different sub-methods, but in all cases, they are based
solely on the relative Q values of the joint actions that the edge connects. An algorithm that
traverses the action graph does so by moving from neighbor to neighbor with probability proportional to the edges' weights. The action graphs become extremely sparse as the number of
agents increases, making small examples somewhat misleading. While the number of edges
from each joint action is $n(m-1)$, the number of nodes to which there could be a connection
is far larger and faster scaling at $m^n - 1$.
In the case of greedy weightings, this graph takes on special properties. Each joint action
has an outbound edge only to the highest-valued of its neighbors, with some joint actions having no
outbound edges at all. This creates a set of trees (i.e., a forest). Each tree is rooted at a local
maximum, and the global maximum must be at the root of some tree. A traversal of this tree
will, therefore, always end at a maximum.
The actual use of the action graph is mostly theoretical. As the graph is traversed, the
algorithm computes edge weights to the immediate neighbors in a just-in-time manner from
the Q values of those neighbors. This is because, being as large as the exponential action space
[Figure: a 3 × 3 grid of nodes labeled A={0,0} through A={2,2}, with edges connecting joint actions that differ in exactly one agent's individual action.]
Figure 3.3 The structure of an action graph with two agents that each have three actions.
it describes, the action graph is intractable to generate. Indeed, the process of doing so would
be equal in effort to performing the Brute Force solution that we are trying to avoid. Further,
the action graph’s weights are valid for only a single state (the Q value functions depend on
both state and action, but the action graph only includes actions), so a new graph would be
needed for each state. Nevertheless, the action graph is a useful tool to describe a walk in the
joint action space.
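Because the graph is never materialized, neighbor generation happens on demand. A minimal Python sketch of Γ_A, under the assumption that a joint action is represented as a tuple of individual action indices, is:

def action_graph_neighbors(joint_action, m):
    """Yield the n*(m-1) neighbors: joint actions differing in one agent's action."""
    for i, a_i in enumerate(joint_action):
        for b in range(m):
            if b != a_i:
                yield joint_action[:i] + (b,) + joint_action[i + 1:]

print(list(action_graph_neighbors((0, 0), m=3)))
# [(1, 0), (2, 0), (0, 1), (0, 2)] -- the neighbors of A={0,0} in Figure 3.3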
3.4.2 Action Graph Traversal Action Selector
The Action Graph Traversal action selector is motivated from two angles, which, while
initially distinct lines of inquiry, can be considered to be two views of the same idea. The first
motivation comes from the observation that the Agent Elimination method is performing an
exact inference on a Markov network. We, therefore, sought to experiment with a variant of
Gibbs sampling[5], an approximate inference method. In Gibbs sampling, a multivariate distribution is approximated by examining a chain of samples generated by repeatedly following the
conditional distribution. The second motivation considers the joint action selection problem
as an optimization of several separate variables (the actions of individual agents) simultaneously. This approach is similar to a satisfiability problem, upon which local searches such as
GSAT[14] have had great success. These local searches repeatedly flip a single variable at a
time to its best value given that the others are unchanged. We decided to consider possible
action changes for all agents each iteration after some preliminary tests indicated that it might
be better than considering only one agent's actions per iteration.

Algorithm 3.7 Action Graph Traversal Algorithm
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
Require: agents: The set of agents
Require: A = {A_i}_1^n: The action space
Require: startingAction: A starting action, such as a random action or the previous state's chosen action
Require: successor: A successor function, such as the ε-greedy successor in Algorithm 3.8, the Greedy-With-Restarts successor in Algorithm 3.9, or the Boltzmann successor of Section 3.4.2.2
 1: currentA ← startingAction
 2: for each iteration do
 3:   bestA ← argmax_{A ∈ {currentA, bestA}} (Q(S, A))
 4:   currentA ← successor(currentA, Q)
 5: end for
 6: return bestA

Algorithm 3.8 ε-greedy Action Graph Traversal's Successor Algorithm
Require: ε: The Action Graph Traversal exploration parameter
Require: lastA: The previous step's chosen action
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
 1: r ← A uniform random variable in [0, 1]
 2: if r < ε then
 3:   return A random neighboring joint action
 4: else
 5:   return argmax_{A ∈ ΓA(lastA)} (Q(S, A))
 6: end if

Algorithm 3.9 Greedy-With-Restarts Action Graph Traversal's Successor Algorithm
Require: lastA: The action whose successor we are determining
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
 1: nextA ← argmax_{A ∈ ΓA(lastA)} (Q(S, A))
 2: if penultimateA = nextA then {Detect loops}
 3:   nextA ← A fully random joint action
 4: end if
 5: penultimateA ← lastA
 6: return nextA
These two approaches can both be achieved on the same action graph, albeit with different
weighting strategies, using the same kind of Markov Chain traversal of the action graph by
repeatedly following edges with probability proportional to that edge’s weight. Both of these
approaches will take $O(nkm)$ time and (with suitable implementations) constant space to calculate each iteration. The stated time complexity requires taking advantage of the coordination
structure in a similar way to Gray codes, counting on the fact that each examined neighbor
only differs from the previous iteration's joint action by a single individual action. The space
complexity is somewhat misleading due to the ever-present $O(nkm^2)$ needed to store the decomposed Q value functions.
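A minimal Python sketch of the traversal itself (Algorithm 3.7) over a pairwise-decomposed Q function follows. The representation, the helper names, and the greedy successor used in the driver are illustrative assumptions, not the thesis implementation.

# A minimal sketch of Action Graph Traversal over pairwise Q tables.
import random

def action_graph_neighbors(joint_action, m):
    """Γ_A(A): joint actions differing from A in exactly one agent's action."""
    for i, a_i in enumerate(joint_action):
        for b in range(m):
            if b != a_i:
                yield joint_action[:i] + (b,) + joint_action[i + 1:]

def make_q_value(Q):
    """Build Q(S, A) for the current state from edge tables Q[(i, j)][ai][aj]."""
    return lambda a: sum(tab[a[i]][a[j]] for (i, j), tab in Q.items())

def traverse(q_value, successor, start, iterations):
    """Follow successor steps on the action graph, tracking the best action seen."""
    current = best = start
    for _ in range(iterations):
        if q_value(current) > q_value(best):
            best = current
        current = successor(current)
    return best

if __name__ == "__main__":
    m, n = 2, 3
    Q = {(0, 1): [[1.0, 0.0], [0.0, 2.0]], (1, 2): [[0.5, 1.5], [2.0, 0.0]]}
    q_value = make_q_value(Q)
    greedy = lambda a: max(action_graph_neighbors(a, m), key=q_value)
    start = tuple(random.randrange(m) for _ in range(n))
    print(traverse(q_value, greedy, start, iterations=10))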
3.4.2.1 Epsilon-Greedy Successor
The local search approach (see Algorithm 3.8) is done with a successor function based on
an ε-greedy weight generator (not to be confused with the use of ε-greedy exploration on the
Q-learning level introduced in Section 2.2), in which an edge to the highest-value neighbor has
weight $1 - \frac{nm-1}{nm}\varepsilon$ and the rest have weight $\frac{\varepsilon}{nm}$. This means that it will move to the highest-value
neighboring action with probability $1 - \varepsilon$ and move to a uniformly random neighbor with the
remaining probability $\varepsilon$. An example traversal with the ε-greedy successor is shown in Figure 3.4.
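A sketch of this successor, meant to plug into the traverse() sketch above and reusing its assumed action_graph_neighbors and q_value helpers, might look like:

# A minimal sketch of the ε-greedy successor from Algorithm 3.8.
import random

def epsilon_greedy_successor(last_a, m, q_value, epsilon):
    neighbors = list(action_graph_neighbors(last_a, m))
    if random.random() < epsilon:
        return random.choice(neighbors)        # uniform random neighbor, probability ε
    return max(neighbors, key=q_value)         # highest-value neighbor, probability 1 − ε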
3.4.2.2 Boltzmann Successor
The approach based on Gibbs sampling, on the other hand, uses a
successor function based on a Boltzmann weight generator, in which an action graph edge to
the joint action A has weight equal to $e^{Q(S,A)/T}$, where T is a Boltzmann-specific “temperature”
parameter to alter the willingness of the algorithm to explore less optimal solutions (hotter
means more exploration). Having easy access to the value of the state and action gives us two
advantages over a typical Gibbs sampling algorithm. Firstly, the traditional burn-in period can
be skipped. Secondly, we don't need to count visits and use that to determine the highest value;
we can calculate it directly. These also push the Gibbs sampling variant even closer to the local
search. Due to the numerical restrictions of a real computer, these weights were implemented with log-likelihoods.
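A sketch of the Boltzmann successor under the same assumptions is below; shifting by the maximum value before exponentiating is one way to realize the log-likelihood handling just mentioned.

# A minimal sketch of a Boltzmann-weighted successor for Action Graph Traversal.
import math
import random

def boltzmann_successor(last_a, m, q_value, temperature):
    neighbors = list(action_graph_neighbors(last_a, m))
    values = [q_value(a) / temperature for a in neighbors]
    shift = max(values)                                    # numerical stability
    weights = [math.exp(v - shift) for v in values]
    return random.choices(neighbors, weights=weights, k=1)[0]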
3.4.2.3 Greedy-With-Restarts Successor
In addition to the weighted successors, we also investigated a Greedy-With-Restarts successor (Algorithm 3.9). This successor always chooses the highest-value neighboring action, just as the ε-greedy weight generator would with ε set to zero. It makes up for the lost exploration by restarting whenever it repeats, which always happens at the global optimum and happens at any local optimum unless the optimum's highest-valued neighbor is itself a neighbor to an even higher value. Action Graph Traversal with this successor becomes very similar to the Coordinate Ascent algorithm described in [19], the primary difference being that Coordinate Ascent considers a single agent at a time.

[Figure: a 3D surface of Reward over the First Agent's Action and the Second Agent's Action for a randomly generated state, with the traversal path described in the caption.]

Figure 3.4 An example of a greedy Action Graph Traversal running over a sample state with two agents and randomly generated feature values. It is restricted to only two agents and ε is set to zero for ease of understanding. It starts at (2,0) with a Q value of 1.52, picks the highest neighbor at (2,4) with a Q value of 2.31, then goes to (10,4). (10,4) has a lower Q value of 2.26, because (2,4) is a local optimum, but (10,4) is still the highest neighbor. From (10,4), it proceeds to (10,7) with a Q value of 2.42, then to (5,7) with a Q value of 2.28. Like (10,4), this is a decrease, but the next iteration would lead back to (10,7), which is, in fact, the global optimum. With restarts enabled, this would cause it to start again at a fully random action. Without restarts, it will bounce between these two actions until it runs out of time.
Among the three successor functions, the Greedy-With-Restarts has an advantage in its
robustness. Greedy-With-Restarts only needs to be tuned in the number of iterations, a simple
trade-off of performance against speed, one that can be determined by the agent’s required
reaction time. The Boltzmann temperature and the ε-greedy ε parameters must each be tuned
with trial runs for the domain or number of agents.
3.4.2.4 Advantages, Drawbacks, and Comparison
Though this should not be taken as an assertion of equivalent performance, the
Action Graph Traversal's method of approximation is best compared to the Brute Force approach. Whereas Brute Force examines each joint action in order without considering their
likely values, Action Graph Traversal chooses its own ordering. It uses the actions that it has
already seen to arrive quickly at joint actions that are particularly promising. Additionally, like
Brute Force, it cannot be sure that it has found the best possible action until all actions are
observed. This would happen eventually due to its stochastic nature but is infeasible to rely
upon due to the immense size of the joint action space.
Action Graph Traversal, because it moves from action to action stochastically, is the only
one of the action selectors that creates a probabilistic policy and is the only one that can get
multiple results with different values from the same state and Q value function. This comes
with the substantial upside that it can be parallelized by having multiple threads run independent traversals of the action graph, only interacting at the very end of the algorithm to do the
trivial maximization of all threads’ maxima. When parallelizing, only one thread should start
at the previous state’s action. This is a capability that scales to as many threads as are available,
rather than being limited by the number of agents like the other selectors.
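A minimal sketch of that parallelization, reusing the traverse() helper assumed earlier and with thread counts and seeding chosen purely for illustration:

# A minimal sketch of running independent traversals in parallel and taking
# the maximum of their results at the very end.
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_select(q_value, successor, n, m, iterations, workers=4, previous_a=None):
    starts = [previous_a] if previous_a is not None else []
    while len(starts) < workers:
        starts.append(tuple(random.randrange(m) for _ in range(n)))  # random starts
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda s: traverse(q_value, successor, s, iterations), starts))
    return max(results, key=q_value)          # trivial maximization over the maxima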
There is also a good synergy between the traversal-based method and the approximation
from coordination graphs. In describing the time complexity, we have noted that the many
shared components in neighboring joint actions allow Action Graph Traversal to save computation time, but that is not the only benefit that it gets. With so many shared components, we
can expect similar joint actions to yield similar rewards. The presence of features that vary
based on the action of a single agent will make this correlation even stronger. This benefits
Action Graph Traversal, because it relies on a series of local changes. The shared components
also mean that we can count on a consistent caching effect when computing the values of the
neighbors, as mentioned in reference to the time complexity. The conditional independence of
a Markov network like the coordination graph also provides an additional optimization, though
we did not use this in our experiments. Once the comparative values of changing the actions of
agent i are known, we don’t need to recompute them until one of the agents with which agent
i coordinates (those in ΓC (i)), has its action changed. While the first iteration will still require
computation of the values of actions for all n agents, subsequent iterations need only calculate
values for the k agents whose values have changed. Further, for the ǫ-greedy successor, once
we greedily change agent i’s action, we know that the comparative value of changing it again
will be non-positive until an agent in ΓC (i) changes action. We can, therefore, postpone evaluating or even considering agent i’s actions until that occurs. The sole exception to this is a
local optimum, where we might need to follow an edge that brings us to a joint action with a
lower Q value. This situation is easily identifiable by the fact that the other changes we are still
considering will not have positive Q value differences, so we can still postpone recalculation
until that time.
Another optimization available to Action Graph Traversal is, when starting the traversal, to
use the last chosen action (during learning, exploration may make this earlier than the last state)
rather than use a completely random joint action. When necessary, we randomly re-picked any
invalid individual actions. The goal of this is to start with an action that is likely to be above
average. This relies on there being a certain amount of continuity in the states and actions, so
it may not be suitable for all domains.
Chapter 4
Empirical Evaluation
We tested the performance of the action selection strategies in a series of experiments designed to determine the quality of Action Graph Traversal. We found that its capability to
select actions is comparable to that of an exact solution without sharing the exact solution's scaling difficulties: it is not only capable of equaling but can also surpass existing
approximation algorithms.
4.1 Domains
These tests were performed on a pair of experimental domains, one well tested in prior
work and the other created in part by the authors. Both domains can be freely expanded to
any number of agents, making them ideal testing grounds for the scaling of the various action
selection strategies. Details on their exact features and rewards can be found in Appendix A.
4.1.1 Predator-and-Prey Domain
The first domain we used was an adapted version of a predator-and-prey domain such as
those described in Panait and Luke[12]. The basis of the predator-and-prey domain is pursuit.
The goal of a predator, which is controlled by an agent, is to follow the prey, an environmental
feature, to “capture” it. Our version, which has been adapted to the multi-agent scenario, is
based on a grid world with movement in the four cardinal directions. In it, n agents (each controlling a different predator) attempt to catch p prey. Each agent chooses from four directional
actions, each of which moves the agent's predator deterministically in that direction, barring
collisions. Each prey moves in one of the four directions randomly, with constraints on movement to prevent it from moving into a predator. A prey is considered captured, and a reward
given, when two or more predators are next to the prey. A prey, once captured, is removed from
the state, so that the predators may proceed to another prey. A penalty is assessed upon collision between two predators, and the colliding predators do not move. The full reward/penalty
structure is available in Appendix A.1.1.

[Figure: an example grid-world map with predators (P) and prey (p) scattered across it.]

Figure 4.1 An example state in the predator-and-prey domain with sixteen agents and eight prey (P: Predator, p: prey).
Coordination is difficult in this domain. The coordination graph and the features are set
up to process coordination involving actions. In this case, however, the coordination that is
needed is not the coordination of which action to use, but, rather, the coordination of which
prey to chase. Since it cannot be represented in our action space or state space, this is not
represented in the features. As a result, this domain presents a challenge to the coordination-based approximation that we use.
4.1.1.1 State and Action Space
The state space contains the position (x,y on a grid world) of each predator and each (uncaught) prey. Predators are constrained to not share the same position with other predators, and
prey may similarly not share positions with other (uncaught) prey. Each prey may be caught or
uncaught. Appendix A.2.1 describes the features that we used to represent the state.
The individual action space A for this problem is just {NORTH, SOUTH, EAST, WEST},
so m is 4. Thus, the joint action space of this problem scales with $4^n$.
4.1.2 Sepia Domain
The second domain that we used is Sepia (Strategy Engine for Programming Intelligent
Agents) [15], a game designed from the ground up to be more friendly for reinforcement learning. It aims to provide the features of a standard real-time strategy game without sacrificing
the precision and control available in a custom domain, such as easy access to the current state,
access to internal calculations like path finding, and controlling the progression of the simulation. Being based on real-time strategy games, it has units (game pieces representing soldiers
and workers) moving around on a map, gathering valuable resources, raising civilizations by
constructing buildings and training units, and engaging in combat with each other. For this paper, we focused on the combat aspect. Sepia takes place on a grid world with movement in any
the four cardinal directions or the four diagonals. The distances are measured in terms of these
movements (Chebyshev distance). On these maps, we placed a number of archers (a deadly but
fragile ranged soldier), as shown in Figure 4.2. These archers were separated into two groups,
with one group being controlled by our multi-agent system (each unit’s behavior corresponding to a different agent’s action) and the opposing force being controlled by a simpler, static
AI separate from our learning agents. To prevent confusion, we will use the term “opposing”
only in this absolute sense to refer to the agents controlled by the static AI. Similarly, agent
refers only to those aspects of our multi-agent system and not to the static AI used to control
these opposing units. The static AI was set to control each opposing unit independently with
the same policy; it keeps the opposing unit still until an agent’s unit comes within a specific
sight radius, at which point the opposing unit pursues and attacks that agent’s unit until one of
them is killed. Each segment of the actions, such as moving one square or attacking another
unit, always takes a single time step. Configuring Sepia like this differs from a standard RTS,
but speeds learning in several ways. It reduces the complexity of the state space, increases the
number of episodes that can be run in the same time period, and gives the agents more reliable
feedback on the effects of their actions.
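Since distances in Sepia count these eight-direction moves, the distance between two squares is the Chebyshev distance. A minimal sketch of that computation follows; the class and method names are illustrative and not part of Sepia's API.

/** Chebyshev distance: the number of eight-directional moves between two grid squares. */
public final class GridDistance {
    // Illustrative helper; Sepia's own API is not assumed here.
    public static int chebyshev(int x1, int y1, int x2, int y2) {
        return Math.max(Math.abs(x1 - x2), Math.abs(y1 - y2));
    }

    public static void main(String[] args) {
        // A unit at (1, 1) and one at (4, 3) are 3 moves apart.
        System.out.println(chebyshev(1, 1, 4, 3)); // prints 3
    }
}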
Figure 4.2 The initial state in the Sepia domain with sixteen agents. Each team has sixteen archers arranged in two staggered columns.
As in many real-time strategy games, each unit has certain combat-relevant characteristics,
such as an attack value, which represents an ability to harm other units; a range value, which
represents the distance at which the unit may attack another; an armor value, which represents
defenses that work by mitigating incoming attacks; and a hit point/health value, which represents
defenses by enduring punishment. Also like many real-time strategy games, the values of
these statistics are clustered into unit types, so all “footmen” share specific values for each,
which differ from those values shared by all “archers”, though our experimentation is limited
to archers. A unit is “killed” (removed from the world) when attacks by other units have done
more armor-reduced damage than the unit had hit points. The episode ends when one team
has no remaining units. Other common characteristics, such as speed, attack frequency, ability
to build buildings or other units, upgrade things, carry things, or gather resources, have been
either disabled or normalized for this limited domain, and the given characteristics are assumed
not to change (with the exception of hit points, which may be divided into an unchanging base
hit point total and a changing current hit point total).
4.1.2.1
State and Action Space
Sepia’s state is made up of a set of units on the grid world. We used maps varying from 6x6
to 16x16 based on the number of agents. The maps contain equal numbers of agent units and
opposing units. In our combat-simplified domain the units are heterogeneous (and identifiable
by id) and each has the following differentiating characteristics: a team, a current health value,
and a position (x and y coordinates). The team does not change over the course of the learning
period, whereas the health value and position are regularly expected to change, even from one
time step to the next. The team corresponds to whether the unit is controlled by an agent or
an opposing unit that is part of the environment. The health values vary between 0 (dead) and
the base health total. The position on the grid world can be any available integer coordinates,
with the restriction that no two living units can occupy the same space at the same time. In the
approximation of the state space, we used a series of features based on relative strengths and
distances of the agent’s unit and opposing unit that is attacked by the specific action, as well
as for more general things about the targeted opposing unit, such as how many other agent-controlled units had recently attacked that opposing unit. Details of the features are located in
Appendix A.2.2.
Unlike the predator-and-prey domain, in which our action space was based on movement
along the grid, we base the Sepia individual actions upon the intention to attack specific opposing units. As such, the individual action space for each agent at time t (at which point the set of
living opposing units is E_t) is {ATTACK_e | e ∈ E_t}, the agent having an action for the unit
it controls to attack each living opposing unit. Each trial begins with one opposing unit for each
agent-controlled unit, so the size of E_0 is n. The size of the action space at the beginning of an
episode is, therefore, n^n. While this is far larger than predator-and-prey, as agent-controlled
and opposing units die, the joint action space plummets as both base and exponent are reduced,
such that even a joint action space with only one joint action is not inconceivable toward the
end of an episode: it requires only that a single opposing unit outlives all the others.
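A sketch of how each agent's individual action set could be rebuilt from the set of living opposing units each step is given below; the record and method names are our own illustration and not Sepia's API.

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: each agent's individual actions are "attack opposing unit e"
    for every living opposing unit e in E_t. Names here are not Sepia's API. */
public final class AttackActionSpace {
    /** An individual action: the intention to attack one opposing unit (by id). */
    public record Attack(int targetId) {}

    /** Builds {ATTACK_e | e in E_t} from the ids of the currently living opposing units. */
    public static List<Attack> individualActions(List<Integer> livingOpposingIds) {
        List<Attack> actions = new ArrayList<>();
        for (int id : livingOpposingIds) {
            actions.add(new Attack(id));
        }
        return actions;
    }

    public static void main(String[] args) {
        // With three living opposing units, each agent has three individual actions,
        // so n agents give a joint action space of size 3^n at this step.
        System.out.println(individualActions(List.of(4, 7, 9)));
    }
}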
Both the agents and the opposing AI depend on the same basic concept of attacking. When
a unit is ordered to attack another unit, one of two things happens. If the distance to the target
unit is greater than the ordered unit’s range, the ordered unit moves toward the target unit.
The exact path is calculated by the A∗ path-finding algorithm [13] with other units making their
positions impassable. If the ordered unit is in range of the target unit, it attacks, dealing damage
mitigated by armor against the target’s hit point amount.
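The following sketch illustrates this order of operations under the assumption that damage is the attack value reduced by armor (and never negative); the exact damage formula, the path-finding hookup, and all names below are assumptions rather than Sepia's actual implementation.

/** Illustrative sketch of the attack order described above; not Sepia's actual code. */
public final class AttackResolution {
    // Assumed damage rule: armor mitigates the attack value, never below zero.
    static int damage(int attack, int armor) {
        return Math.max(0, attack - armor);
    }

    static int chebyshev(int x1, int y1, int x2, int y2) {
        return Math.max(Math.abs(x1 - x2), Math.abs(y1 - y2));
    }

    /**
     * One step of an ATTACK order. Returns the target's remaining hit points.
     * If the target is out of range, the ordered unit would instead take one
     * A*-planned step toward it (path finding omitted from this sketch).
     */
    static int resolveAttackStep(int ox, int oy, int range, int attack,
                                 int tx, int ty, int targetArmor, int targetHp) {
        if (chebyshev(ox, oy, tx, ty) > range) {
            // Out of range: move one step along an A* path toward the target (not shown).
            return targetHp;
        }
        return Math.max(0, targetHp - damage(attack, targetArmor));
    }

    public static void main(String[] args) {
        // An archer with range 4 and attack 6 hits a target with 1 armor and 20 hp.
        System.out.println(resolveAttackStep(0, 0, 4, 6, 3, 2, 1, 20)); // prints 15
    }
}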
Because the action space explicitly supports the coordination that we expect to happen, coordination, unlike in predator-and-prey, is able to take place fairly
easily. However, with opposing units that can die, the individual action space is not only time
dependent, but is unpredictably so. Further, an agent-controlled unit may be destroyed at any
point, causing the actions of the associated agent to cease to have meaning. As we do not make
the simplifying assumption of homogeneity, this ability for agent-controlled units to die is dealt
with only by zeroing non-constant features containing those agents, which negates their ability
to contribute to the value and prevents them from learning. The dynamic nature of the action
space adds an additional challenge for the agents.
4.2
Algorithms
In our experiments, we test all of the action selection methods described in Chapter 3, with
the exception of Brute Force. Theoretical expectations and small scale testing with Brute Force
showed that it was completely overshadowed by Agent Elimination, which serves the same role
as an exact method. We also introduce Discoordinated Action Selection to act as an additional
baseline.
4.2.1 Discoordinated Action Selection
In Discoordinated Action Selection, each agent’s action is selected in complete isolation
from the other agents. It learns based on the same coordinated feedback as other action selectors, but, when performing action selection, it intentionally ignores the Q value’s dependence
on the actions being chosen by other agents for the same step. Instead, each agent calculates its
own action under the assumption that each other agent will choose its action at random. This
can be achieved by just summing up the possible values, as in Algorithm 4.1, but that isn’t very
efficient. Because an agent ignores the action selections of all other agents when it maximizes,
it does not require the complicated recursion of the Mean Field or Max Plus algorithms or
the iterations that they require. It can take full advantage of the computational speedup from the
decomposed structure to efficiently average the other agents’ actions without explicitly enumerating them, as in Algorithm 4.2. It can do this because, when comparing the values of
agent i’s actions, the contribution from any decomposed Q value function not containing agent
i will be the same for each action and thereby cancel out. With pairwise functions, this gives
it a time complexity of O(nkm^2), with no need to iterate. The space complexity is also low, at
O(nm) to store the average values of each action for each agent. As with the other approximate
algorithms, this is dominated by the space to store the Q value function.
Algorithm 4.1 Discoordinated Action Selection Algorithm (Naïve Method), O(m^n · n)
Require: S: The current state
Require: Q: The Q value function
Require: agents: The set of agents
Require: A = {A_i}_{i=1}^n: The action space
 1: for each A in A do
 2:   for i in agents do
 3:     value(i, A_i) ← value(i, A_i) + Q(S, A)
 4:   end for
 5: end for
 6: bestA ← ∅
 7: for i in agents do
 8:   bestA ← bestA ∪ arg max_{A_i ∈ A_i} value(i, A_i)
 9: end for
10: return bestA
4.2.1.1
Advantages, Drawbacks, and Comparison
The Discoordinated Action Selection is not expected to be a viable competitor to the other
action selection mechanisms. Instead, it acts as a valuable baseline of what is possible when
ignoring coordination. It has a sizable advantage in speed and scaling over the other action
selectors and can even be parallelized, with each agent’s action being calculated independently
on its own thread based on partially shared data from decomposed Q value functions. However,
it only achieves this advantage through a great sacrifice in its understanding of the underlying
problem. As a result, in any situation where it is capable of selecting near-optimal actions, we
would expect coordination to be unnecessary, which would allow exact methods to be used.
Discoordinated Action Selection is similar to the Mean Field action selector in that both of
them compute actions that are good against the average of the other agents’ actions. Mean Field,
however, refines this understanding through a series of iterations. By doing this, Mean Field
Algorithm 4.2 Discoordinated Action Selection Algorithm (Optimized Method), O(nm^2)
Require: S: The current state
Require: Q = Σ_{h=1}^{p} Q_h: The Q value function
Require: agents: The set of agents
Require: A = {A_i}_{i=1}^n: The action space
 1: for h ← 1 . . . p do
 2:   for each partial joint action A using the agents in Q_h do
 3:     for each agent i in Q_h do
 4:       value(i, A_i) ← value(i, A_i) + Q_h(S, A)
 5:     end for
 6:   end for
 7: end for
 8: bestA ← ∅  {Assemble best joint action as a set}
 9: for i in agents do  {Best action maximizes values for individual agents}
10:   bestA ← bestA ∪ arg max_{A_i ∈ A_i} value(i, A_i)
11: end for
12: return bestA
preserves some understanding of the coordinated nature of the problem. While it achieves its
approximation by finding a nearby problem that lacks the coordination structure, it still takes
the coordinated values into account to determine closeness, whereas the Discoordinated Action
Selection simply weights them equally with everything else.
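A compact sketch of the optimized procedure (Algorithm 4.2) for the pairwise case is shown below. The PairwiseQ interface and the surrounding names are illustrative stand-ins for the decomposed Q value function, not the thesis implementation; summing rather than averaging is used because every individual action is averaged over the same number of partner actions, so the arg max is unchanged.

import java.util.Arrays;

/** Illustrative sketch of the optimized discoordinated selection for pairwise
    decomposed Q functions; interfaces and names are assumptions, not the thesis code. */
public final class DiscoordinatedSelection {
    /** A pairwise decomposed Q value term Q_h(S, a_i, a_j) for a fixed state S. */
    public interface PairwiseQ {
        int agentI();
        int agentJ();
        double value(int actionI, int actionJ);
    }

    /** Returns one individual action per agent, maximizing its accumulated (unnormalized average) value. */
    public static int[] selectActions(int numAgents, int numActions, PairwiseQ[] terms) {
        double[][] value = new double[numAgents][numActions];
        // Accumulate each partial joint action's value onto both participating agents.
        for (PairwiseQ q : terms) {
            for (int ai = 0; ai < numActions; ai++) {
                for (int aj = 0; aj < numActions; aj++) {
                    double v = q.value(ai, aj);
                    value[q.agentI()][ai] += v;
                    value[q.agentJ()][aj] += v;
                }
            }
        }
        // Each agent independently takes the arg max of its accumulated values.
        int[] best = new int[numAgents];
        for (int i = 0; i < numAgents; i++) {
            int argmax = 0;
            for (int a = 1; a < numActions; a++) {
                if (value[i][a] > value[i][argmax]) argmax = a;
            }
            best[i] = argmax;
        }
        return best;
    }

    public static void main(String[] args) {
        // Two agents, two actions, one pairwise term that rewards both picking action 1.
        PairwiseQ term = new PairwiseQ() {
            public int agentI() { return 0; }
            public int agentJ() { return 1; }
            public double value(int ai, int aj) { return (ai == 1 ? 1 : 0) + (aj == 1 ? 1 : 0); }
        };
        System.out.println(Arrays.toString(selectActions(2, 2, new PairwiseQ[]{term}))); // [1, 1]
    }
}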
4.2.1.2
Optimistic Variant
We also created an optimistic variant of the Discoordinated Action Selection method, which
we call Max-Discoordinated Action Selection. Instead of an agent using the individual action
which participates in the joint actions with the highest average value, it uses the one corresponding to the highest maximum value. This variant is intended to provide potentially higher
value while replicating all of the important points of the Discoordinated Action Selection algorithm. As with the averaging, in order for this to be effective, the agents must be relatively
independent, not coordinated. It is also of similar speed, having the same time complexity but a
slightly worse constant factor.
Interestingly, this method is equivalent to the first iteration of Max Plus. Without the additional iterations, it is not able to refine its understanding of the value. However, because it does
not need to construct responses to be used in those further iterations, we would expect it to be
noticeably faster.
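In code, the change from the earlier sketch is a single line in the accumulation step: keep the maximum partial-joint-action value for each individual action instead of a running sum. A hedged illustration follows (again with made-up names, and with values initialized to negative infinity so that negative Q values are handled).

import java.util.Arrays;

/** Illustrative Max-Discoordinated update: keep the best partial-joint-action value
    seen for each individual action instead of accumulating a sum/average. */
public final class MaxDiscoordinatedUpdate {
    /** Replaces the "+=" accumulation in the earlier sketch for one pairwise term value v. */
    static void update(double[][] value, int agentI, int actionI, int agentJ, int actionJ, double v) {
        value[agentI][actionI] = Math.max(value[agentI][actionI], v);
        value[agentJ][actionJ] = Math.max(value[agentJ][actionJ], v);
    }

    public static void main(String[] args) {
        double[][] value = new double[2][2];
        for (double[] row : value) Arrays.fill(row, Double.NEGATIVE_INFINITY);
        update(value, 0, 1, 1, 0, 2.5);
        update(value, 0, 1, 1, 0, 1.0); // a lower value does not overwrite the stored maximum
        System.out.println(value[0][1]); // prints 2.5
    }
}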
4.3
Methodology
In testing the performance of the action selection strategies, we implemented a series of
experiments in the Java programming language and ran them on one core of a four-core 3.4
GHz PC with 16 GB of RAM. Our first set of experiments was based on the learning capabilities of the action selectors and used RLGLUE [17] as an experimental framework. In these
experiments, we used the coordinated, state-approximated SARSA algorithm with ǫ-greedy
exploration to learn a policy over the course of many episodes. These episodes were periodically interrupted with evaluation periods in which learning and exploration are temporarily
suspended to better judge performance. It is the cumulative (but not discounted) reward from
these evaluation periods that we show in many of our graphs. We conducted experiments with
eight, sixteen, and thirty-two agents across the two domains, varying action selectors, action
selector parameters, and coordination as the experiment demanded. For each configuration, we
ran multiple trials, each starting fresh with a random initial Q value function, and averaged the
results from these trials. This helped to mitigate serendipitous configurations that might occur
in any individual trial.
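A sketch of the experiment driver implied by this setup is given below: learning episodes with exploration and updates enabled, periodically interrupted by evaluation episodes in which both are suspended. The Learner interface is a placeholder of our own, not RLGLUE's or the thesis's actual classes.

/** Illustrative experiment driver: alternate learning episodes with evaluation
    periods in which exploration and updates are suspended. Names are placeholders. */
public final class ExperimentDriver {
    /** Minimal stand-in for a coordinated SARSA learner over one domain episode. */
    public interface Learner {
        /** Runs one episode; explores/updates only if learning is true. Returns cumulative reward. */
        double runEpisode(boolean learning);
    }

    /** Returns the cumulative evaluation reward recorded after each evaluation period. */
    public static double[] run(Learner learner, int totalEpisodes, int evalEvery, int evalEpisodes) {
        double[] evalRewards = new double[totalEpisodes / evalEvery];
        for (int episode = 1, k = 0; episode <= totalEpisodes; episode++) {
            learner.runEpisode(true); // learn with epsilon-greedy exploration
            if (episode % evalEvery == 0) {
                double total = 0;
                for (int e = 0; e < evalEpisodes; e++) {
                    total += learner.runEpisode(false); // greedy, no updates
                }
                evalRewards[k++] = total; // cumulative, undiscounted evaluation reward
            }
        }
        return evalRewards;
    }

    public static void main(String[] args) {
        // Dummy learner that returns a fixed reward, just to exercise the driver.
        double[] curve = run(learning -> learning ? 0.0 : 1.0, 100, 25, 25);
        System.out.println(java.util.Arrays.toString(curve)); // [25.0, 25.0, 25.0, 25.0]
    }
}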
When constructing the coordination graph, we use a static ring-based topology (see Figure 4.3a and Figure 4.3b for examples) in which the agents coordinate with half of their fellow
agents (rounded up such that in the eight agent case, each one coordinates with four others) in a
preset configuration. In Sepia, the initial positions of the agents correspond to this ring, though
Figure 4.3 Example coordination graphs with the same topology used in our testing: (a) eight agents coordinating with their two nearest neighbors; (b) eight agents coordinating with their four nearest neighbors. Each graph places Agents 0 through 7 in a ring.
the predator-and-prey domain is more dynamic, so it positions them at random. All of our testing
uses the SARSA update and learns with actions chosen with an ǫ-greedy exploration strategy
(described in Section 2.2 and not to be confused with the Action Graph Traversal successor
of the same name from Section 3.4.2.1). ǫ is set to 0.1. Thus, ten percent of the steps have a
random joint action chosen and the other ninety percent use a normal action selection strategy.
We chose a discount factor, γ, of 0.8. The evaluation periods consist of twenty-five episodes.
The number of trials being averaged in the results, as well as the learning rate (α), number of
episodes, and frequency of evaluation periods, vary according to the specifics of the domain
and/or the number of agents as shown in Table 4.1. We picked the learning rate and number
of episodes based on a series of preliminary tests with different learning rates, picking rates
that were fast-learning and consistent and a number of episodes to accommodate the learning
curve. The number of trials to average and the frequency of evaluation periods were chosen
in an attempt to moderate between data quality and time spent running the trials, which takes
away from running different parameterizations. In order to present the graphs with less noise,
rather than showing the rewards for each evaluation period, we show the maximum reward seen
in any evaluation period up to that time. This fits with the expected use case of taking the best
Domain: Agents          Learning Rate  Trials  Episodes  Map Size  ǫ    γ
Predator-and-Prey: 8    0.0003125      8       4200      8×8       0.1  0.8
Predator-and-Prey: 16   0.00015625     4       2000      11×11     0.1  0.8
Sepia: 8                0.0003125      8       4200      N/A       0.1  0.8
Sepia: 16               0.00015625     4       4000      N/A       0.1  0.8
Sepia: 32               0.00015625     1       2000      N/A       0.1  0.8

Table 4.1 Parameters for learning experiments. ǫ and γ are the overall exploration rate and the
discount factor, respectively. Size is not a relevant parameter for Sepia because units only
move toward each other.
discovered policy, not merely the last one. The maximization is done on each trial before they
are averaged.
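A sketch of this post-processing as we read it: within each trial, replace each evaluation-period reward with the best reward seen so far in that trial, then average the trials point-wise. The helper below is only an illustration of that description.

/** Post-processing sketch: per-trial running maximum over evaluation periods,
    then a point-wise average across trials. */
public final class LearningCurvePostProcess {
    public static double[] runningMaxThenAverage(double[][] trials) {
        int periods = trials[0].length;
        double[] averaged = new double[periods];
        for (double[] trial : trials) {
            double best = Double.NEGATIVE_INFINITY;
            for (int t = 0; t < periods; t++) {
                best = Math.max(best, trial[t]); // best reward seen in this trial so far
                averaged[t] += best / trials.length;
            }
        }
        return averaged;
    }

    public static void main(String[] args) {
        double[][] trials = { {1, 3, 2}, {0, 2, 5} };
        // Running maxima are {1,3,3} and {0,2,5}; their averages are {0.5, 2.5, 4.0}.
        System.out.println(java.util.Arrays.toString(runningMaxThenAverage(trials)));
    }
}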
4.3.1 Sample Generation
While the quality of learned policies is often used as a metric, it is not ideally suited for
our purposes. We are primarily concerned with the minutiae of the performance of the action
selectors. To better examine this, we need to compare the algorithms in identical situations,
ridding ourselves of confounding factors and noise. To do this we gathered pairs of states and
Q value functions, then tested the action selectors against each of them. The learned policies
provide a good source of reasonable Q value functions. We followed a greedy policy with these
Q value functions to get a series of states. These pairs of Q value functions and states formed
the basis of our comparison; the various action selectors can each be run over the same pairs
and have the resulting values directly compared.
We generated two distinct sets of pairs. The first, which we used to compare the different
parameterizations of Action Graph Traversal, was gathered from samples from the evaluation
periods of Action Graph Traversal as it learned. The second, which we used mostly to compare
the different selectors, we generated by running a small number of episodes after the learning
process, using only the best Q value functions from a variety of action selectors.
4.3.1.1
Sampling Directly from Learning Results
When we were comparing the different parameterizations of our Action Graph Traversal
algorithm, we gathered samples by running the Greedy-With-Restarts successor with 500 iterations on two separate trials. This parameterization was chosen because previous data had
suggested that it was a quite powerful action selector. Periodically (after learning for 25, 500,
975, etc. episodes), we recorded the Q value functions and the entire evaluation period of 25
episodes worth of states. We continued this for the 2000 learning episodes of predator-and-prey
and 4000 episodes of Sepia, to match our learning results. This gave us tens of thousands of
states (each with a corresponding Q value function), so we sampled 1000 of them at random,
without regard for when they happened. We did this for both domains with sixteen agents.
4.3.1.2
Sampling Indirectly from Multiple Learning Results
In order to compare the different action selectors against each other, as we did for predator-and-prey with sixteen agents and Sepia with sixteen and thirty-two agents, we began by finding
an example of each of four approximate selection algorithms (Action Graph Traversal, Mean
Field, Max Plus, and Discoordinated Action Selection). These examples were supposed to be
strong representatives, so we used the best of our learning results (the exact parameters are
reported later in Figures 4.30, 4.32, and 4.34). For each of these representatives, we found
the best evaluation period and saved the Q value function for that period. As previously noted
in Table 4.1, for both domains with sixteen agents, the action selectors in those graphs are
averages of four separate trials, so we used the Q value functions from the best evaluation
period of each. For each one of these Q value functions, we ran three episodes with greedy
policies where the representative action selector followed its best Q value function. In addition
to this, we also generated new Q value functions by assigning each feature weight a uniformly
random value. We generated four of these for the sixteen agent cases and one for Sepia with
thirty-two agents to match the representatives. For these random Q value functions, we ran
the same three episodes, but used a greedy policy powered by either Agent Elimination (when
possible) or the best Q value found by any of the four approximate representatives (when the
exact solution was too slow). The sample generated is made up of all of the states seen, each
paired with the specific Q value function used to see it. This ended up giving us 5565 state-Q
pairs for a sixteen-agent predator-and-prey domain. Sepia, with its much larger action space
and much faster episodes, has 1396 state-Q pairs with sixteen agents. Sepia with thirty-two
agents has only one trial per representative rather than four, but still ends up with a respectable
531 samples because of the longer episodes.
Because we could not count on states being consecutive for a sample, we ran all of the
sample-based tests with Action Graph Traversal starting from a random joint action, rather than
the previous state’s best joint action.
4.4
Metrics
For our learning results, we evaluate the action selectors using the average cumulative non-
discounted reward seen over the course of the evaluation periods, comparing them based on
how many episodes of learning were required to reach that performance.
For the direct action selection comparisons, we grade them based on the Q values of the
joint actions that they choose, rather than the rewards that they would see. This is important
because a Q value function learned with one action selector cannot be guaranteed to provide
accurate estimates of future reward for a policy based on another action selector. It also lets
us use Q value functions that are randomly generated and means that we don’t have to worry
about whether a learned function is fully converged. We can also judge the action selector on
whether the Q value it picked was as good as the best Q value for any action selector. When
Agent Elimination is available, the best Q value found will always belong to the best possible
joint action. Without Agent Elimination, this is less certain. We compare these measures of
action selection quality against the time that the selectors take to compute actions. We do not
check for convergence with these timing results. They are meant to be a measure of the quality
of the joint action that can be attained if the algorithm has that amount of time, rather than a
depiction of the average amount of time to get that action.
4.5
Results
Our experiments allow us to demonstrate a variety of things about approximate action se-
lection. These findings mostly revolve around our Action Graph Traversal selector, but also
include more general insights.
The algorithms used in these experiments are: Agent Elimination (Section 3.1.2); Max
Plus (Section 3.2); Mean Field (Section 3.3); Action Graph Traversal (Section 3.4.2) with an
ǫ-greedy successor (Section 3.4.2.1), a Boltzmann successor (Section 3.4.2.2), and the Greedy-With-Restarts successor (Section 3.4.2.3); and the Discoordinated Action Selection (Section
4.2.1) and its variant Max-Discoordinated Action Selection (Section 4.2.1.2).
4.5.1 Scaling Agent Elimination
We first demonstrated that the scaling of Agent Elimination makes it infeasible when the
amount of coordination is high. We expected poor performance from the exponential time
complexity of the algorithm, but a fast exact algorithm would call into question the need for
approximation in all but the most time-sensitive of situations, so it was important to demonstrate. Table
4.2 shows that Agent Elimination with little coordination can be quite fast, but when the coordination is increased, even with a comparatively small eight agents, the time taken can grow out
of hand. Unfortunately, it grew so quickly that we could not run the full learning trial with full
coordination and had to leave the table incomplete. As a result, we could not run Agent Elimination during our learning experiments with sixteen agents, in which each agent coordinated
with eight other agents (though non-learning experiments were faster and predator-and-prey is
small enough to barely be usable with sixteen agents).
4.5.2 Tuning Action Graph Traversal
Action Graph Traversal can be parameterized in many ways. We ran experiments on these
parameterizations, how they interact, and how they impact the ability of the algorithm to choose
good actions. Most of these experiments involve samples generated as described in Section
Action Selector                       Time per step (ms)  Time per episode (s)  Total Time (hr)  Total Episodes  Total Steps
AGT, No Coordination                  10.0                1.22                  1.02             3000            367830
AGT, 2 Other Agents                   19.0                2.28                  1.90             3000            359834
AGT, 4 Other Agents                   31.8                3.82                  3.18             3000            360194
AGT, 6 Other Agents                   47.9                5.68                  4.74             3000            355568
AGT, All 7 Other Agents               52.8                6.28                  5.23             3000            356734
Agent Elimination, No Coordination    3.8                 0.46                  0.38             3000            361784
Agent Elimination, 2 Other Agents     4.7                 0.57                  0.47             3000            361926
Agent Elimination, 4 Other Agents     92.3                11.22                 9.35             3000            364858
Agent Elimination, 6 Other Agents     5803.7              709.88                591.6            3000            366946

Table 4.2 A comparison on the eight-agent Sepia domain between 300 iterations of Action
Graph Traversal using a Greedy-With-Restarts successor and Agent Elimination, each with
the agents coordinating with different numbers of other agents. These numbers include a
consistent 2.5–3.5 ms of overhead per step for tasks other than action selection. Action
selection scales up as the graph becomes more complete, with more decomposed functions and
each decomposed function becoming more complex. Unfortunately, we were unable to run
Agent Elimination fully when coordinating with all seven other agents due to the large increase
in time.
Figure 4.4 The impact of greed on the action picking quality of the ǫ-greedy successor in Sepia with sixteen agents. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Epsilon, with curves for 10 to 1000 iterations.]
4.3.1.1, where different parameterizations of Action Graph Traversal are all run against the
same states and Q value functions.
4.5.2.1
Tuning Greediness
We started with the two weight-based successors, ǫ-greedy and Boltzmann. We expected
to find an ideal setting for their greediness (greediness is governed by the ǫ and temperature
parameters and is highest when they are low). Having more iterations gives more time to
be spent both on reaching high-reward areas of the action space and on exploring nearby,
potentially better actions. Being too greedy could lead to too little time spent exploring, but
insufficient greed may waste vital time. The intermediate optimal value for greed, slowly
shifting toward less greed when more iterations are available, is apparent on both domains with
sixteen agents: for ǫ-greedy in Figures 4.4 and 4.5 and for Boltzmann in Figures 4.6 and 4.7.
For smaller numbers of iterations (for example the 10 iteration lines shown), being completely
greedy is best, but still not very good because there may not be enough iterations to reliably
Figure 4.5 The impact of greed on the action picking quality of the ǫ-greedy successor in predator-and-prey with sixteen agents. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Epsilon, with curves for 10 to 1000 iterations.]
Figure 4.6 The impact of greed on the action picking quality of the Boltzmann successor in Sepia with sixteen agents. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Temperature, with curves for 10 to 500 iterations.]
Figure 4.7 The impact of greed on the action picking quality of the Boltzmann successor in predator-and-prey with sixteen agents. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Temperature, with curves for 10 to 500 iterations.]
reach a good action, let alone a local optimum that needs to be explored around. When more
iterations are used, it is not enough to just reach a local optimum; the algorithm must explore
around it to find better actions.
4.5.2.2
Comparing the Successor Algorithms
We expected the Boltzmann successor for Action Graph Traversal to be more effective than
the ǫ-greedy, because, instead of picking completely at random, its exploration is informed by
the Q values of its neighbors. We also anticipated that the Greedy-With-Restarts successor
might be able to make up for its lack of exploration by visiting more local optima after restarting. When we compare the various successors on the quality of their learned policies, the
performance of the Action Graph Traversal algorithm seemed to depend little on which successor was chosen (as long as we took the best parameterization of that successor). This can
be seen in Figures 4.8 (zoomed in as 4.9), 4.10, 4.11 (zoomed in as 4.12), 4.13 (zoomed in as
Figure 4.8 Learning results of predator-and-prey with eight agents for various Action Graph Traversal successor strategies. [Plot of Average Reward against Episodes of Learning.]
Figure 4.9 Learning results of predator-and-prey with eight agents for various Action Graph Traversal successor strategies, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
Figure 4.10 Learning results of Sepia with eight agents for various Action Graph Traversal successor strategies. [Plot of Average Reward against Episodes of Learning.]
Figure 4.11 Learning results of predator-and-prey with sixteen agents for various Action Graph Traversal successor strategies. [Plot of Average Reward against Episodes of Learning.]
Figure 4.12 Learning results of predator-and-prey with sixteen agents for various Action Graph Traversal successor strategies, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
Figure 4.13 Learning results of Sepia with sixteen agents for various Action Graph Traversal successor strategies. [Plot of Average Reward against Episodes of Learning.]
Figure 4.14 Learning results of Sepia with sixteen agents for various Action Graph Traversal successor strategies, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
4.14). In fact, there seemed to be a small, but not definitive, advantage for the ǫ-greedy successor. When we ran the various successors over the same Q value functions and states (again
using the sampling methodology in Section 4.3.1.1), however, the results were much clearer
(see Figures 4.15 and 4.16). After a short period of rough equivalence, Greedy-With-Restarts
proved to be superior to the others. The Boltzmann successor is better than ǫ-greedy at lower
numbers of iterations, especially on the predator-and-prey domain, though it falls behind at
higher iteration counts in both cases. This may be related to the ability of the ǫ-greedy successor to
escape from areas of relatively high reward that a Boltzmann successor would have a hard time
leaving, a capability whose value increases as larger numbers of iterations give
it more to do after escaping. As evidenced by the superior Greedy-With-Restarts performance,
the ability to restart is also a means of exploration, but a far more powerful one, especially
when enough iterations are available to reach several local optima.
All of these successors were extremely capable; the differences in value among them are small fractions of the total. Nevertheless, Greedy-With-Restarts would
Figure 4.15 Comparing the abilities of various Action Graph Traversal successor strategies to select good actions in Sepia with sixteen agents. [Plot of Best Value − Value (log scale) against Iterations for Greedy-With-Restarts, the best ǫ-greedy, and the best Boltzmann.]
Figure 4.16 Comparing the abilities of various Action Graph Traversal successor strategies to select good actions in predator-and-prey with sixteen agents. [Plot of Best Value − Value (log scale) against Iterations for Greedy-With-Restarts, the best ǫ-greedy, and the best Boltzmann.]
appear to be the best choice. The algorithms that it competed against are the best parameterizations of the many that we ran (the same wide variety used to explore the optimal greed
in Figures 4.4, 4.5, 4.6, and 4.7). While the results about the properties of the optimal greed
suggest that we could predict a good parameterization based on the number of iterations, there
seems to be little point in doing so when even the best of multiple parameterizations can’t
match the performance of Greedy-With-Restarts.
4.5.2.3
Restarting Epsilon-Greedy
We also performed experiments on the impact of restarts overall and how they interact with
the amount of greed, to see if restarts were worth doing more generally. We had expected
that allowing the algorithm to restart would enhance it, but that this effect would be stronger
when there is more greed, especially with more iterations. The reasoning for this is twofold.
Restarting is more useful when the algorithm would otherwise spend an extended period of
time repeatedly visiting the same small set of joint actions. Additionally, we determine when
Figure 4.17 Results of Sepia with 16 agents showing ǫ-greedy AGT with 20 iterations with and without restarts. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Epsilon.]
Figure 4.18 Results of predator-and-prey with 16 agents showing ǫ-greedy AGT with 20 iterations with and without restarts. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Epsilon.]
Figure 4.19 Results of Sepia with 16 agents showing ǫ-greedy AGT with 200 iterations with and without restarts. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Epsilon.]
Figure 4.20 Results of predator-and-prey with 16 agents showing ǫ-greedy AGT with 200 iterations with and without restarts. Smaller numbers (at the top) are better, since they are closest to the best seen. [Plot of Best Value − Value (log scale) against Epsilon.]
to restart based on the repetition of the state. This was a deliberate choice to allow AGT
to sometimes be able to continue even after reaching local optima, but, when mixed with high
exploration, it can lead to false positives. Indeed, we find that at lower iterations (see Figures 4.17
and 4.18), the restarts are a disadvantage. However, as the number of iterations is increased,
restarts begin to be an advantage. This begins at the fully greedy ǫ = 0, but as the number
of iterations grows, ever lower amounts of greed (higher ǫ values) are required to make
restarts beneficial (see Figures 4.19 and 4.20). Interestingly, these figures also show that with
very low greed, the restarts are no longer detrimental, and may even be a slight advantage. We
credit this to the meandering nature of those solutions. A completely random joint action is
almost guaranteed to be one that not only hasn’t been seen, but is surrounded by new actions.
With such low greed, not restarting remains close to a random walk of the action graph and
thus vulnerable to repeating previously seen joint actions.
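One way to implement the restart trigger described above is sketched below: restart from a fresh random joint action whenever the traversal proposes a joint action it has already visited. The data structures and names are our own illustration of that rule, not the thesis code.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Illustrative restart trigger: restart from a random joint action when the
    traversal revisits a joint action it has already seen. Not the thesis code. */
public final class RestartOnRepeat {
    private final Set<List<Integer>> seen = new HashSet<>();
    private final Random rng = new Random();

    /** Returns the joint action to evaluate next: the proposal, or a random restart on repeats. */
    public int[] nextOrRestart(int[] proposed, int numAgents, int numActions) {
        if (seen.add(toKey(proposed))) {
            return proposed; // first visit: keep traversing from here
        }
        // Repeat detected: restart the traversal at a fresh random joint action.
        int[] restart = new int[numAgents];
        for (int i = 0; i < numAgents; i++) restart[i] = rng.nextInt(numActions);
        seen.add(toKey(restart));
        return restart;
    }

    private static List<Integer> toKey(int[] action) {
        return Arrays.stream(action).boxed().toList();
    }

    public static void main(String[] args) {
        RestartOnRepeat r = new RestartOnRepeat();
        int[] a = {0, 1, 2};
        System.out.println(Arrays.toString(r.nextOrRestart(a, 3, 4))); // first visit: unchanged
        System.out.println(Arrays.toString(r.nextOrRestart(a, 3, 4))); // repeat: random restart
    }
}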
Figure 4.21 Results of predator-and-prey with 16 agents showing the effect of starting in different ways, then using the Greedy-With-Restarts Action Graph Traversal selector with various numbers of iterations. [Plot of Best Value − Value (log scale) against Iterations, comparing starting at a random joint action with starting at the last state’s picked action.]
Figure 4.22 Results of Sepia with 16 agents showing the effect of starting in different ways, then using the Greedy-With-Restarts Action Graph Traversal selector with various numbers of iterations. [Plot of Best Value − Value (log scale) against Iterations, comparing starting at a random joint action with starting at the last state’s picked action.]
Figure 4.23 Results of Sepia with 32 agents showing the effect of starting in different ways, then using the Greedy-With-Restarts Action Graph Traversal selector with various numbers of iterations. [Plot of Best Value − Value (log scale) against Iterations, comparing starting at a random joint action with starting at the last state’s picked action.]
4.5.2.4
Choosing a Starting Point
By using the sequential set of samples described in Section 4.3.1.2, we were able to test
whether seeding the Action Graph Traversal algorithm with the last state’s chosen action instead of a completely random action was beneficial. For both predator-and-prey (see Figure
4.21) and Sepia (see Figures 4.22 and 4.23), this provided a narrow, but distinct, advantage.
The advantage fades within a couple of dozen iterations, as subsequent restarts are based on
random joint actions. The larger domains take longer for this to occur. Interestingly, the advantage is not apparent from the start. In fact, for the two Sepia domains, the random action
is better on the first iteration or two. This implies that the last starting action, while closer to
good actions than an average action, is actually associated with a lower than average value.
For our domains, the last state’s best action seems worth using, but this is likely to vary by
domain (for example, in a domain with a lot of action switching, it is easy to see how it could
be counterproductive to do so). The fading of this effect with increased numbers of iterations
is likely to be common to all domains.
Figure 4.24 Results of predator-and-prey with eight agents for Action Graph Traversal vs. Agent Elimination. [Plot of Average Reward against Episodes of Learning.]
4.5.3 Action Graph Traversal vs. Other Action Selectors
We expected that the Action Graph Traversal action selector (as represented by the Greedy-With-Restarts successor that had proved to be superior) would be able to compete with an exact
action selector on smaller problems. We found that learning with actions from Action Graph
Traversal is competitive with the exact Agent Elimination method (see Figures 4.24, 4.25, and
4.26).
We expected the Action Graph Traversal action selector to be able to compete favorably
with other approximate methods in time and quality of reward. We expected this to be true
even on larger scales, where Agent Elimination falters. When we look at the learning results
in Figures 4.27 (zoomed in as 4.28), 4.29, 4.30 (zoomed in as 4.31), and 4.32 (zoomed in as
4.33), we find that Mean Field seems to be the most effective overall, though the algorithms
are fairly closely matched. When we extended to thirty-two agents and even to some degree
with sixteen agents in Sepia, Action Graph Traversal seems to fall behind (see Figure 4.34).
Figure 4.25 Results of predator-and-prey with eight agents for Action Graph Traversal vs. Agent Elimination, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
Figure 4.26 Results of Sepia with eight agents for Action Graph Traversal vs. Agent Elimination. [Plot of Average Reward against Episodes of Learning.]
As with the learning trials comparing the successors, we found these results to be unsatisfactory. Q-learning, by design, is fairly accommodating to differences in action selection
quality. An action selector that is good, but not excellent, can be the basis of accurate learning. This acts to keep the action selectors closer together in learning ability than their respective
capabilities to select actions might suggest. Additionally, despite the effect of averaging a number of trials, the learning results were extremely noisy, fluctuating significantly from evaluation
period to evaluation period and often overwhelming the differences between the algorithms or
parameterizations. This may have been due to the random starting Q values and the stochastic
transitions/rewards in the domains, as well as variance because the selected actions are approximate. The parameterizations chosen for the graphs were largely those that either did better
than all other parameterizations, even those with more iterations, or appeared to be
doing as well as the rest, but with fewer iterations. This means that they were largely outliers,
further blurring the difference between the algorithms. Meanwhile, the choice to maximize
all rewards, while necessary to smooth the graphs enough to be understandable, can also
lock in the rewards of rare confluences of good fortune as if they reflected truly good policy. Together, these
factors limit the usefulness of the rewards of learned policies as a means of comparing the
action selection algorithms.
In order to compare the algorithms more closely, we took samples of states and Q value
functions and ran each algorithm against the same samples. These samples were generated
as described in Section 4.3.1.2. When we did this, we found that, far from the similarity
we saw in the learning results, there were stark differences between the action selectors. On
predator-and-prey with sixteen agents, Max Plus and Action Graph Traversal both find actions
of similar qualities, putting clear distance between themselves and Mean Field, which is similar
in performance to Discoordinated Action Selection (see Figure 4.35). In fact, with its faster
start than Action Graph Traversal, Max Plus comes very close to dominating Mean Field by
providing better reward at all times. Because it reaches the very good actions faster than Action
Graph Traversal, it would not be a stretch to say that Max Plus is the preferable algorithm in
this case. When we zoom into the rewards closest to the best and put it in a log scale, as
Figure 4.27 Results of predator-and-prey with eight agents for Action Graph Traversal vs. other approximate methods. [Plot of Average Reward against Episodes of Learning.]
Figure 4.28 Results of predator-and-prey with eight agents for Action Graph Traversal vs. other approximate methods, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
Figure 4.29 Results of Sepia with eight agents for Action Graph Traversal vs. other approximate methods. [Plot of Average Reward against Episodes of Learning.]
Figure 4.30 Results of predator-and-prey with sixteen agents for Action Graph Traversal vs. other approximate methods. [Plot of Average Reward against Episodes of Learning.]
Figure 4.31 Results of predator-and-prey with sixteen agents for Action Graph Traversal vs. other approximate methods, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
Figure 4.32 Results of Sepia with sixteen agents for Action Graph Traversal vs. other approximate methods. [Plot of Average Reward against Episodes of Learning.]
Figure 4.33 Results of Sepia with sixteen agents for Action Graph Traversal vs. other approximate methods, zoomed in to show the differences among the best. [Plot of Average Reward against Episodes of Learning.]
Figure 4.34 Results of Sepia with thirty-two agents for Action Graph Traversal vs. other approximate methods. [Plot of Average Reward against Episodes of Learning.]
in Figure 4.36, it becomes clear that, while both Action Graph Traversal and Max Plus do
extremely well, Action Graph Traversal passes Max Plus early on in value and is the only non-exact method to reach the global optimum every time, reaching that level well before Agent
Elimination has finished its computation. While it is questionable whether such a minuscule
difference matters (Figure 4.39 even shows that about 99% of the time, Max Plus and Action
Graph Traversal both find the globally optimal action), on Sepia, Action Graph Traversal
begins to look even better (see Figures 4.37 and 4.38). It dominates both Max Plus and Mean
Field, providing better actions faster, and also puts a noticeable gap between itself and Max Plus
in terms of value. While part of this difference is due to an increasing gap in value, the poorer
per-iteration scaling of Max Plus with both the number of individual actions and the amount of
coordination makes the difference far more pronounced. In these larger action spaces, Action
Graph Traversal would seem to be the better algorithm.
We had expected that Max Plus would be better than Mean Field, but the sheer differences
in the values they achieve, despite their similar implementations, shows the power of focusing
Figure 4.35 The trade-off between values of actions picked and the amount of time required for each of the action selectors for predator-and-prey with sixteen agents. [Plot of Value against Time Per Action Selection (s, log scale) for Agent Elimination, Action Graph Traversal, Max Plus, Mean Field, Discoordinated Action Selection, and Max-Discoordinated Action Selection.]
Figure 4.36 The trade-off between values of actions picked and the amount of time required for each of the action selectors for predator-and-prey with sixteen agents, shown in terms of average difference from the optimum value for the state. This is shown in a modified “log scale” that shows minor differences among near-zero values like a true log scale, but with an unseen gap in the y-axis so that we can also show zero itself. Up is better, since it is the closest to the best. [Plot of Best Value − Value against Time Per Action Selection (s, log scale) for the same six selectors as Figure 4.35.]
on best responses to a given situation, rather than the average. The largest impact of this
seems to be in the convergence of the algorithms. Max Plus can take hundreds of iterations to
converge, but Mean Field converges (at least enough to have a consistent best action) within
only a few. When choosing the best action, relying on the values of best responses preserves
more detailed information than Mean Field’s averaging approach, allowing it to continue to
improve over a longer period.
The differences between Mean Field and Max Plus repeat themselves in miniature with
Discoordinated Action Selection and its optimistic variant (Max-Discoordinated Action Selection). Max-Discoordinated Action Selection is slightly better, being identical in all results to
the first iteration of Max Plus. Discoordinated Action Selection, however, is close to Mean
Field in the values it picks. Without the extra iterations to converge, the gaps are quite small,
Figure 4.37 The trade-off between values of actions picked and the amount of time required for each of the action selectors for Sepia with sixteen agents. [Plot of Value against Time Per Action Selection (s, log scale) for Action Graph Traversal, Max Plus, Mean Field, Discoordinated Action Selection, and Max-Discoordinated Action Selection.]
Figure 4.38 The trade-off between values of actions picked and the amount of time required for each of the action selectors for Sepia with thirty-two agents. [Plot of Value against Time Per Action Selection (s, log scale) for the same five selectors as Figure 4.37.]
Figure 4.39 The trade-off between values of actions picked and the amount of time required for each of the action selectors for predator-and-prey with sixteen agents, shown in terms of how frequently the selector finds the best action. [Plot of Percentage Best against Time Per Action Selection (s, log scale).]
though they grow larger as the action space increases. As suggested by their time complexity,
these algorithms are extremely fast, arriving at their answers in a fraction of a millisecond, even
on the 32-agent Sepia. As a result, though neither is able to come close to the action selection
quality of later iterations of Action Graph Traversal or Max Plus, Discoordinated Action Selection and Max-Discoordinated Action Selection are the best at the time that they complete
their calculations.
Despite the clear gap that developed between Action Graph Traversal and Max Plus and
between those selectors and Mean Field and Discoordinated Action Selection, we found that,
in all cases, the gap between a random action and any of the action selectors is far larger than
between the action selector and the best selector. This can be seen on any of the previous
figures, since Action Graph Traversal’s fastest data point is just the single random action that
we use to seed it. This means that even the Mean Field and Discoordinated Action Selection
baseline could be considered to be selecting actions that are good, though not great. This fits
with the earlier finding that all the selectors were able to power effective learning.
Interestingly, the ability to reach the best action was not fully correlated with the average
value of the action selected. Max Plus and Max-Discoordinated Action Selection were able
to reach the global optimum at rates quite disproportionate to their reward on predator-and-prey with sixteen agents (see Figure 4.39). The first iteration of Max Plus was almost
indistinguishable from Mean Field in value, but picks the best action 35% of the time to Mean
Field’s 7%. Likewise, in the same domain, Action Graph Traversal was able to consistently
pass Max Plus in value within 0.001 seconds, but took until 0.01 seconds to surpass it in the
number of times it reached the global optimum. These effects seem to be less prominent in
Sepia, however, with Action Graph Traversal perhaps holding a small advantage. The ability
to reach the best action in Sepia should be taken with a grain of salt, however, because of the
tendency for Sepia’s action space to shrink late in the episode. The ability of the first iteration
of Action Graph Traversal, which is just a single random joint action, to sometimes get the best action
in Sepia is evidence of this, as it was never able to do so in the predator-and-prey domain.
Without an exact action selector for the Sepia domain, we have no way of knowing if the
best value that we find is actually the global optimum. Nevertheless, at least for the sixteen-agent case, we can point to the strong overlap between Action Graph Traversal and Max Plus as
proof of their mutual quality (see Figure 4.40). With thirty-two agents (see Figure 4.41), this is
somewhat less clear because of the lack of similar quality from Max Plus, and when we focus
on the states generated with random Q value functions (see Figure 4.42), we did see similar
performance to the case with sixteen agents. The leveling off of the value for the thirty-two
agent case also suggests that Action Graph Traversal may be reaching, if not global optima,
then at least extremely high local optima.
4.5.4 Coordination Requirements
We expected there to be a great need for coordination. We had also expected that less coordination would learn faster due to the smaller size of its factorized state space, but that it
Figure 4.40 The trade-off between values of actions picked and the amount of time required for each of the action selectors for Sepia with sixteen agents, shown in terms of how frequently the selector finds the best action. [Plot of Percentage Best against Time Per Action Selection (s, log scale).]
Figure 4.41 The trade-off between values of actions picked and the amount of time required for each of the action selectors for Sepia with thirty-two agents, shown in terms of how frequently the selector finds the best action. [Plot of Percentage Best against Time Per Action Selection (s, log scale).]
Figure 4.42 The trade-off between values of actions picked and the amount of time required for each of the action selectors for Sepia with thirty-two agents, shown in terms of how frequently the selector finds the best action. This contains only the states generated by a path using a random Q value function. [Plot of Percentage Best against Time Per Action Selection (s, log scale).]
Figure 4.43 Results of predator-and-prey with eight agents at various coordination levels, showing how, at each level, more coordination is beneficial. [Plot of Average Reward against Episodes of Learning.]
Figure 4.44 A possible state in the predator-and-prey domain (P: predator, p: prey), with four predators surrounding two prey. This is an example of an advantage for coordination beyond pairs. Two of the agents should deliberately choose to pursue a prey that is not the closest for the sole reason that the other two agents will be pursuing it.
would then be eclipsed in later episodes by the larger but more descriptive state spaces available with more coordination, which would be more capable in the long term. Though we
never found a case where the lower amounts of coordination did well and then were passed, by
testing the performance of the same algorithm at varying levels of coordination we found that it is
possible, but not guaranteed, for domains to continue to benefit from additional coordination
with every increase. Unlike results from our other experiments, which tended to show lower
coordination levels (often coordinating with only two other agents) reaching the same or higher
optima faster, results from the predator-and-prey domain with eight agents (Figure 4.43) show
that every level of additional coordination is helpful. It may seem counterintuitive for a domain that revolves around rewards for pairs of agents to need that much coordination.
However, the most important piece of coordination, that of determining which pair of agents
will converge on a given prey, is aided by coordination with any or all other agents, to make
sure that agents are not pursuing the same prey in vain.
in Figure 4.44. Our results do prove the existence of problems that can benefit from techniques
like ours that ease the burden of coordination and that are cumulative with other techniques. If
these techniques can help an implementer to increase the coordination even a little bit more,
then better policies may result.
Chapter 5
Conclusion
In the course of our experimental work, we found that there are cases where additional coordination may be an increasing boon. Not all domains will benefit from such coordination,
though; we went through some effort to design a learnable feature set where coordination is
helpful, but ultimately produced a domain where coordination had only minor impact.
We found that an exact method, even when optimized for the coordinated structure of the
Q value, is unable to scale to more than a small number of agents when coordination is used
heavily. Approximate methods can match the performance of exact methods on these smaller
scales and are capable of quality action selection even after the problem is scaled up, delivering
good actions reliably and even the global optimal action with surprising regularity.
We also found that the Greedy-With-Restarts successor for the Action Graph Traversal algorithm was a superior parameterization. With this successor, Action Graph Traversal showed
itself to be a superior selector to Max Plus, Mean Field, and Discoordinated Action Selection
on large problems.
These results, while descriptive of the differences between the action selectors, are limited in their application during the learning phase due to the resilience of the Q-learning and
SARSA algorithms. Especially early in learning, reasonable actions are enough to learn effectively. We also ran into difficulties with the process of scaling up. Feature sets that were
capable of learning repeatedly needed enhancements when we added more agents.
5.1
Further Work
Parallel processing is an increasingly common feature of consumer hardware, and multi-agent domains often explicitly consist of multiple machines. Given the expected parallelizability of the Action Graph Traversal algorithm, it would also be useful to do some testing with threading (or with threading simulated by forced restarts).
One important improvement that could be made is to incorporate more advanced versions of the algorithms and components that we used. Better, more dynamic coordination structures, a more expressive way of representing feature value contributions, and enhancements to the action selectors could all provide better data.
The use of approximate action selectors also offers the opportunity to combine them in ways that do not make sense for exact action selectors, taking advantage of their different strengths or even just their tendency to produce different results. Simple versions of this could involve running multiple algorithms and keeping the highest-valued result found in the available time, or simply using the time remaining after the preferred algorithm has converged. Algorithms could also be changed on the fly, either by using another selector to seed an exploratory method like Action Graph Traversal or by adjusting the parameters of Action Graph Traversal mid-run. Perhaps the most interesting application of multiple action selectors is to add a layer with a classifier that examines the state and chooses the most appropriate action selector, or even the most appropriate order in which to run several action selectors. There were hints of at least some pattern to the states that do better with Max Plus versus Action Graph Traversal, which could be used to more reliably get the better of the two. Further investigation into this kind of hybrid action selector is needed to determine whether it produces a superior selector that selects higher-quality actions faster than its components, or merely an equivalent action selector with additional overhead.
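As a rough illustration of the portfolio idea above, the following sketch (hypothetical names and interfaces, not the implementation used in this work) runs several approximate selectors within a time budget and keeps the joint action that scores highest under the factored Q value, i.e., the sum of the pairwise Q functions over the coordination graph’s edges.

import time

def evaluate_joint_action(joint_action, edges, pairwise_q, state):
    # Factored Q value: sum the contribution of every coordination-graph
    # edge (i, j) under the chosen joint action.
    return sum(pairwise_q[(i, j)](state, joint_action[i], joint_action[j])
               for (i, j) in edges)

def portfolio_select(selectors, edges, pairwise_q, state, budget_seconds):
    # Run each approximate selector in turn until the time budget runs out,
    # keeping whichever proposed joint action scores highest.
    best_action, best_value = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    for selector in selectors:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        candidate = selector(state, remaining)  # each selector proposes a joint action
        value = evaluate_joint_action(candidate, edges, pairwise_q, state)
        if value > best_value:
            best_action, best_value = candidate, value
    return best_action

The same wrapper could be extended to seed one selector with another’s result, or to dispatch via a learned classifier, as discussed above.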
Roles, as described by Wilson et al. [21], offer possibilities for further extending the size of multi-agent domains. As a form of transfer learning, they open up the possibility of learning on easier (preferably smaller) problems first. This offers an additional speedup of the learning process for large problems, cumulative with approximation. The way that it combines learning from
different agents only within learned roles offers another potential speedup from homogeneity,
without fully compromising the heterogeneity we wanted. Balancing these would require a
new way of working with coordination graphs.
Appendix A: Domain Details
This appendix describes some of the inner workings of our domains.
A.1 Reward Structure
In both of our domains, we give the agents frequent feedback about the state, rewarding them further when subgoals are achieved. They are also given a constant negative reward to discourage inaction.
A.1.1 Predator-and-Prey
The predator-and-prey domain rewards approaching and capturing prey and punishes collisions and failure to capture prey quickly. All rewards are divided by the number of agents, n, to keep the total reward per step similar as the domain is scaled to different sizes. A sketch of this computation follows the list below.
• -1 per agent per time step
• $3 \cdot 0.25^{\text{distance from prey}}$ for each agent, for each uncaptured prey
• 5 for capturing a prey, in addition to the proximity rewards
• -9 for collisions between agents
• 50 for successful capture of the last uncaptured prey
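As referenced above, a minimal sketch of the per-step reward computation (hypothetical names and data layout; the rules themselves are taken from the list above):

def predator_prey_reward(n_agents, prey_distances, captures, collisions, all_captured):
    # prey_distances: for each agent, the distances from its predator to each
    # still-uncaptured prey; captures and collisions are counts for this step.
    reward = -1.0 * n_agents                # -1 per agent per time step
    for agent_distances in prey_distances:
        for d in agent_distances:
            reward += 3.0 * 0.25 ** d       # proximity reward per uncaptured prey
    reward += 5.0 * captures                # capture bonus, on top of proximity
    reward -= 9.0 * collisions              # penalty for predator collisions
    if all_captured:
        reward += 50.0                      # bonus for capturing the last prey
    return reward / n_agents                # all rewards divided by n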
A.1.2 Sepia
Sepia has somewhat lower rewards, based on damage and kills. As in predator-and-prey, we divide each reward by the number of agents. A sketch of this computation follows the list below.
• -0.03 per agent per time step
• 0.01 for each point of damage inflicted by an agent-controlled unit on an opposing unit
• 250% of the damage reward required to kill the most formidable unit for killing an opposing unit (1 when all units are archers)
• -0.01 for each point of damage inflicted by an opposing unit on an agent
• -250% of the damage reward required to kill the most formidable unit when an opposing
unit kills an agent (-1 when all units are archers)
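A minimal sketch of the Sepia reward follows, again with hypothetical names; the health value used to derive the kill reward is an assumption about how "250% of the damage reward required to kill the most formidable unit" would be computed.

def sepia_reward(n_agents, damage_dealt, damage_taken, kills, deaths, toughest_unit_health):
    # 250% of the damage reward needed to kill the most formidable unit;
    # the text states this works out to 1 when all units are archers.
    kill_value = 2.5 * (0.01 * toughest_unit_health)
    reward = -0.03 * n_agents          # constant penalty per agent per step
    reward += 0.01 * damage_dealt      # damage dealt to opposing units
    reward -= 0.01 * damage_taken      # damage taken from opposing units
    reward += kill_value * kills       # opposing units killed
    reward -= kill_value * deaths      # agent-controlled units killed
    return reward / n_agents           # divided by the number of agents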
A.2 Features
For each edge of our coordination graph, there is a pairwise Q value function, for a total of $\frac{kn}{2}$ functions, where $k$ is the number of other agents each of the $n$ agents coordinates with. Each Q value function is made up of individual features for each of the pair of agents involved and pairwise features relevant to the pair. These features can be binary or scalar.
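Written out (our notation; the linear weighting is the standard form for the feature representation described here, rather than a formula quoted from this appendix), the factored value of a joint action $\mathbf{a}$ in state $s$ is
$$Q(s,\mathbf{a}) = \sum_{(i,j)\in E} Q_{ij}(s, a_i, a_j), \qquad Q_{ij}(s, a_i, a_j) = \mathbf{w}_{ij} \cdot \mathbf{f}_{ij}(s, a_i, a_j),$$
where $E$ is the edge set of the coordination graph, $\mathbf{f}_{ij}$ collects the individual and pairwise features for edge $(i,j)$, and $\mathbf{w}_{ij}$ holds the learned weights.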
A.2.1 Predator-and-Prey
The predator-and-prey domain has a number of features devoted to tracking each of the p different prey. There are $2 + 3p$ individual features per agent, $1 + 2p$ pairwise features, and one always-on constant feature, for a total of $2(2 + 3p) + (1 + 2p) + 1 = 6 + 8p$ features per pairwise Q value function. The features are evaluated for each possible pair of actions the agents can take. For the individual features, the prey are sorted by closeness to the agent’s predator, to discourage the trap of agents learning to focus on one specific prey regardless of closeness.
• Individual Features
– Whether the agent’s action would cause the agent’s predator to hit the edge of the
map. This feature is 1 if it would hit and -1 if it would not.
– Whether the agent’s predator’s expected next position would be within one square
of another uncoordinated predator’s current position. This detects whether it could
collide with that other predator. Only the predators for agents that are not coordinating with this agent are counted. This feature is 1 if a collision is possible and -1
if it is not.
– For each prey in order of closeness
∗ The agent’s predator’s expected next position’s distance from the prey. This feature is 0 if the prey has already been captured and is otherwise weighted by the maximum possible distance on the map as $1 - \frac{\log(\mathrm{distance}+2) - \log(2)}{\log(\mathrm{maximumDistance}+2)}$. The formula is chosen to make a distant prey similar to a previously captured prey and to make the exact distance less important at close ranges. (A code sketch of this weighting follows this feature list.)
∗ Whether the action takes the agent’s predator closer to the prey. This feature
is 1 if the move takes it closer, -1 if the move takes it further, and 0 if it is
equidistant or if the prey is already captured.
∗ Whether the prey has already been captured. This feature is 1 if it has and -1
if it has not.
• Pairwise Features
– Whether the two agents’ predators would collide. Because it is based on both
agents’ actions, this feature is more certain than the individual feature. This feature
is 1 if they will collide and -1 if they will probably not (they can still collide if one
is moving into the other’s square and the other is trying to move away but collides
with a third predator).
– For each prey in order of index (which is not correlated with position, but is consistent across time steps)
∗ Whether both agents’ actions would move their respective predators toward the prey. This feature is 0 if the prey is already captured and is otherwise weighted using the maximum possible distance on the map as $1 - \frac{\log(\mathrm{distance}_1+\mathrm{distance}_2+2) - \log(2)}{\log(\mathrm{maximumDistance}\cdot 2+2)}$. The formula was chosen to promote moving together toward prey that are already close without making the exact distance overly important as long as it is close.
∗ Whether the action of at least one of the two agents would move its predator toward the prey while the other’s action would not move its predator away from the prey. This feature is 0 if the prey is already captured and is otherwise weighted using the maximum possible distance on the map as $1 - \frac{\log(\mathrm{distance}_1+\mathrm{distance}_2+2) - \log(2)}{\log(\mathrm{maximumDistance}\cdot 2+2)}$. The formula was chosen to promote moving together toward prey that are already close without making the exact distance overly important as long as it is close.
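As referenced above, a minimal sketch of the two distance weightings (hypothetical function names; distances and the map’s maximum distance are assumed to be plain non-negative numbers):

import math

def individual_distance_weight(distance, max_distance):
    # 1 - (log(d + 2) - log(2)) / log(D + 2): equals 1 when the predator is on
    # the prey and shrinks toward the captured-prey value of 0 as distance grows.
    return 1.0 - (math.log(distance + 2) - math.log(2)) / math.log(max_distance + 2)

def pairwise_distance_weight(distance1, distance2, max_distance):
    # The same shape applied to the summed distance of both predators to the prey.
    return 1.0 - ((math.log(distance1 + distance2 + 2) - math.log(2))
                  / math.log(2 * max_distance + 2))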
A.2.2 Sepia
In Sepia, each individual action corresponds to a particular opposing unit that will be attacked. As such, the features are mostly based on that opposing unit and the agent’s unit. There are 24 individual features per agent, 2 pairwise features, and one always-on constant feature, for $2 \cdot 24 + 2 + 1 = 51$ features in total per pairwise Q value function. All features except the constant are zeroed when a relevant agent is dead. A sketch of a few of the individual features follows the list below.
• Individual Features
– Distance from the agent’s unit to the opposing unit. This feature is normalized by
the maximum distance on the map.
– Whether the distance from the agent’s unit to the opposing unit is less than or equal
to the agent’s unit’s maximum range. This corresponds directly to whether the
attack action will cause damage (instead of moving the agent’s unit). This feature
is 1 if the opposing unit is in range and 0 if it is out of range.
– Whether the distance from the agent’s unit to the opposing unit is greater than the
agent’s unit’s maximum range. This corresponds directly to whether the attack
action will cause the agent’s unit to move and is the counterpart to the previous
feature. This feature is 1 if the opposing unit is out of range and 0 if it is in range.
– Seven separate features for whether the distance from the agent’s unit to the opposing unit is exactly 1, 2, 3, 4, 5, 6, or 7. These features are each 1 if the opposing unit is at that exact range and -1 if it is not.
– The difference between the agent’s unit’s distance from the opposing unit and the
agent’s unit’s maximum range. This can be negative if the distance is less than the
maximum range. This feature is normalized by the maximum distance by which it
is possible to be out of range on the map (the maximum distance minus the agent’s
unit’s range).
– Whether the agent’s unit hit the opposing unit during the previous step. Hitting the
opposing unit requires that the agent had used the same action during the previous
step while already having been in range. This feature is 1 if the unit hit the opposing
unit and -1 if it did not.
– Whether the opposing unit hit the agent’s unit during the previous step. This feature
is 1 if the opposing unit hit the agent’s unit and -1 if it did not.
– Difference in health values between the opposing unit and the agent’s unit. This
feature is normalized to the highest maximum health of any unit on either side.
– Five features based on the number of agent-controlled units that tried to attack the
opposing unit on the last time step. This is also the number of agents that used this
action on the last time step. There are binary features (1 or -1) for whether that
number was 1, 2, 3, 4, or greater than 4.
– The number of agent-controlled units that hit the opposing unit during the last step.
This is the number of agents that used this action on the last time step whose units
were in range of the opposing unit. This feature is normalized based on the number
of agent-controlled units at the start.
– The total health of all agent-controlled units whose agents used this action
during the last time step. This feature is normalized based on the highest maximum
health of any unit times the number of agent-controlled units that exist at the start
of an episode.
– The health value of the agent’s unit. This feature is normalized like the other health
values.
– The health value of the opposing unit. This feature is normalized like the other
health values.
– Whether the opposing unit is dead. This feature is unusual because, when the opposing unit is dead, the corresponding action has nothing to target, is invalid, and is therefore never considered. This feature is 1 if the opposing unit is dead and -1 if it is not.
• Pairwise Features
– Whether two agents’ units are trying to attack the same opposing unit. This feature
is 1 if the two agents are using the same action and -1 if they are not.
– The absolute value of the difference in range-reduced distance between each of the
two agents’ units and the opposing unit. A low value for this feature indicates that
they will be able to start doing damage at about the same time. This feature is
weighted by the maximum distance on the map.
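As referenced above, a minimal sketch of a handful of the individual features (hypothetical names; the unit objects, their max_range attribute, and the distance helper are assumptions, not part of any Sepia API described here):

def sepia_individual_features(agent_unit, opposing_unit, max_map_distance):
    d = distance(agent_unit, opposing_unit)       # assumed map-distance helper
    in_range = d <= agent_unit.max_range
    features = [
        d / max_map_distance,                     # normalized distance to the target
        1.0 if in_range else 0.0,                 # attack would deal damage
        0.0 if in_range else 1.0,                 # attack would move the unit instead
    ]
    for r in range(1, 8):                         # seven exact-range indicators
        features.append(1.0 if d == r else -1.0)
    # Distance beyond attack range, normalized by the farthest possible
    # out-of-range distance on the map (can be negative when in range).
    features.append((d - agent_unit.max_range)
                    / (max_map_distance - agent_unit.max_range))
    return features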
Bibliography
[1] Samuel Barrett and Peter Stone. Ad hoc teamwork in variations of the pursuit domain. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, 2011. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3517/4138.
[2] Richard Bellman. On the theory of dynamic programming. In Proceedings of the National
Academy of Sciences, volume 38, 1952.
[3] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman. The complexity of decentralized control of MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
[4] N. Gast, B. Gaujal, and J.-Y. Le Boudec. Mean field for Markov Decision Processes:
from Discrete to Continuous Optimization. ArXiv e-prints, April 2010.
[5] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, November 1984. ISSN 0162-8828. doi: 10.1109/TPAMI.1984.4767596. URL http://dx.doi.org/10.1109/TPAMI.1984.4767596.
[6] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In NIPS-14, 2001.
[7] Carlos Guestrin, Shobha Venkataraman, and Daphne Koller. Context-specific multiagent
coordination and planning. In Proceedings of the Eighteenth National Conference on
Artificial Intelligence, pages 253–259, 2002.
[8] Tommi S. Jaakkola. Tutorial on variational approximation methods. In Advanced Mean
Field Methods: Theory and Practice, pages 129–159. MIT Press, 2000.
[9] Hilbert J. Kappen and Wim J. Wiegerinck. Mean field theory for graphical models. In
Advanced Mean Field Methods: Theory and Practice, pages 38–49. MIT Press, 2000.
[10] Jelle R. Kok and Nikos Vlassis. Using the max-plus algorithm for multiagent decision
making in coordination graphs. In RoboCup-2005: Robot Soccer World Cup IX, 2005.
[11] Jelle R. Kok, Pieter Jan ’t Hoen, Bram Bakker, and Nikos Vlassis. Utile coordination: learning interdependencies among cooperative agents. In Proceedings of the IEEE
Symposium on Computational Intelligence and Games (CIG), pages 29–36, Colchester,
United Kingdom, April 2005.
[12] Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, 2005. ISSN 1387-2532. doi: 10.1007/s10458-005-2631-2. URL http://dx.doi.org/10.1007/s10458-005-2631-2.
[13] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (3rd internat. ed.). Pearson Education, 2010. ISBN 978-0-13-207148-2. URL http://vig.pearsoned.com/store/product/1,1207,store-12521_isbn-0136042597,00.html.
[14] Bart Selman, Hector Levesque, and David Mitchell. A new method for solving hard
satisfiability problems. In Proceedings of the Tenth National Conference on Artificial
Intelligence, AAAI’92, pages 440–446. AAAI Press, 1992. ISBN 0-262-51063-4. URL
http://dl.acm.org/citation.cfm?id=1867135.1867203.
[15] Scott Sosnowski, Tim Ernsberger, Feng Cao, and Soumya Ray. Sepia: A scalable game
environment for artificial intelligence teaching and research. In Proceedings of the Fourth
Educational Advances in Artificial Intelligence Symposium, pages 1592–1597, 2013.
[16] Peter Stone, Gal A. Kaminka, Sarit Kraus, and Jeffrey S. Rosenschein. Ad hoc autonomous agent teams: Collaboration without pre-coordination. In Proceedings of the
Twenty-Fourth Conference on Artificial Intelligence, July 2010.
[17] Brian Tanner and Adam White. RL-Glue: Language-independent software for reinforcement-learning experiments. Journal of Machine Learning Research, 10:2133–2136, September 2009.
[18] Nikos Vlassis. A Concise Introduction to Multiagent Systems and Distributed Artificial
Intelligence. Morgan and Claypool Publishers, 1st edition, 2007. ISBN 1598295268,
9781598295269.
[19] Nikos Vlassis, Reinoud Elhorst, and Jelle R. Kok. Anytime algorithms for multiagent decision making using coordination graphs. In Proceedings of the International Conference
on Systems, Man, and Cybernetics (SMC), The Hague, The Netherlands, October 2004.
[20] C.J.C.H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge,
England, 1989.
[21] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Learning and transferring roles in multi-agent reinforcement learning. In Association for the Advancement of
Artificial Intelligence 2008: Workshop on Transfer Learning for Complex Tasks, 2008.
[22] Chongjie Zhang and Victor Lesser. Coordinating multi-agent reinforcement learning with
limited communication. In Ito, Jonker, Gini, and Shehory, editors, Proceedings of the
12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS
2013), 2013.