No-regret Learning in Games
Hoda Heidari
Abstract
The study of learning dynamics in strategic environments has a long history in economic
theory. Many different classes of learning algorithms have been considered in the literature
and some have been shown to converge to equilibrium under certain conditions. In this
note, I focus on a particular class of learning processes, called no-regret learning. While
the no-regret framework was originally proposed for single-agent decision making and for
a long time was mostly studied in computer science and operations research community, a
series of relatively recent papers have pointed out interesting and significant implications of
these dynamics in strategic settings. I survey and discuss a few important results in this
area. The line of research on no-regret learning in games can be divided into three main
strands: The first set of results is concerned with equilibrium computation via no-regret
learning; the second set of results looks into the impact of no-regret dynamics on welfare
and efficiency; and in the third part, I briefly mention a recent trend in the algorithmic
economics community where as opposed to best-response, researchers assume players follow
some sort of no-regret algorithm and explore the market consequences of such behavior.
1 Introduction
Learning in strategic environments has received considerable attention in economic theory
(see [Young, 2004] for a survey). The study of this topic is important for several reasons:
First, most of the theoretical results in economics rely heavily on Nash equilibrium or its refinements as the solution concept, but it is not clear why players should be expected to reach
an equilibrium. Various explanations have been proposed to justify this assumption. One of
the leading arguments is that equilibrium may arise naturally as the result of players learning
over time to respond to their opponents optimally. For this reasoning to hold, players must be
equipped with natural and robust ways of adapting their actions over time that lead them to equilibrium. Second, a class of results known as Folk theorems asserts that any reasonable outcome can arise as a Nash equilibrium in an infinitely repeated game. In light of these theorems,
learning models can provide a way to model players’ behavior or even a prescription for how
they should learn as the game unfolds to maximize their long-term payoff.
Many different classes of learning algorithms have been considered in the literature and some
have been shown to converge to equilibrium under certain conditions. In this note, I focus on a
particular class of learning processes, called no-regret learning. Consider a setting in which an
agent is faced with the problem of choosing which action to take over a sequence of trials. Each
time the decision maker chooses an action, he receives a payoff, and his goal is to minimize his overall regret. Roughly speaking, regret compares the decision maker's total payoff to that of the best actions he could have taken in retrospect.
Past solutions to this problem can be divided into two main categories: first, solutions that
rely on statistical assumptions about the underlying payoff process (see [Robbins, 1952, Gittins
et al., 1989, Auer et al., 2002]); and second, solutions for the setting where an adversary,
who is capable of reacting to the choices of the decision maker, determines the sequence of
payoffs (see for instance [Auer et al., 2002]). For the purposes of this survey, we mainly focus
on the adversarial environments. While the no-regret framework was originally proposed for
single-agent decision making environments and for a long time was mostly studied in computer
science and operations research community, a series of relatively recent papers have pointed out
interesting and significant implications of these dynamics in strategic settings. I survey and
discuss a few important results in this area. The line of research on no-regret learning in games
can be divided into three main strands: The first set of results is concerned with equilibrium
computation via no-regret learning; the second set of results looks into the impact of no-regret
dynamics on welfare and efficiency; and the third set of results explores the market consequences
of no-regret behavior.
1.1 Background
The focus of the current note is on learning that takes place at the individual level in strategic environments involving repetition. Another important class of learning models, known as
evolutionary models, involve learning at the population level and study whether the agents’
aggregate behavior resembles a Nash equilibrium in the long-run. In these models, strategies
are represented by individuals in the population and the game is played among a subset of randomly chosen individuals. For more information see [Samuelson, 1998, Fudenberg and Levine,
1998, Pradelski and Young, 2012].
Many different classes of learning processes have been considered in the literature. Depending on whether the process directly models the opponent’s behavior, learning algorithms can be
divided into two main categories: model-based or model-free. In model-based learning, the player
starts with a model (belief) for their opponent’s strategy; at each round, they best-respond to
their current belief; then observe the opponent’s play at that round and update their belief.
The model-free approach avoids building an explicit model of the opponent’s behavior. Rather,
it relies solely on the player’s own possible actions and how they compare with one another in
terms of payoff [Shoham et al., 2007].
An important aspect of a learning model is whether it converges to some notion of equilibrium. The type of convergence result established varies considerably across different learning
models. To avoid unnecessary details and keep the survey at a high level, I will intentionally
be vague about the notion of convergence proven for each model. I will, however, distinguish
between the following two cases: one in which players are guaranteed to reach Nash Equilibria
within a finite number of rounds; another where the sequence of play is guaranteed to resemble
a Nash Equilibrium in the limit (i.e. an outside observer cannot reject the Nash Equilibrium
hypothesis).
One can also classify learning algorithms according to the information they require about
the opponents' play. A learning procedure is called coupled if it requires the player to observe the opponents' actions and payoffs in each round. A learning procedure is called uncoupled if it does not require any information about the opponents' payoffs. A learning procedure is called completely uncoupled if it does not require any information about the opponents' actions or payoffs. While convergence to Nash equilibria can be guaranteed for a broad subset of coupled dynamics (see for instance [Crawford and Haller, 1990, Kalai and Lehrer, 1993]), [Hart and Mas-Colell, 2003b, Hart and Mas-Colell, 2005] show that uncoupled dynamics cannot in general lead to Nash equilibrium.
What follows is a non-comprehensive list of some of the learning dynamics studied in the
literature and a brief overview of each. See Table 1 for a quick comparison.
• Bayesian learning: In this model-based approach, players start with a prior belief about
their opponents’ strategy and update their belief at each round according to the Bayes
rule (see for example [Crawford and Haller, 1990, Kalai and Lehrer, 1993]). Players are
assumed to be able to observe their opponent’s action and payoff in each round. [Kalai
and Lehrer, 1993] show that under certain conditions players reach a Nash equilibrium
with probability one. The condition required for this to hold is that players’ strategies
must be optimal given their beliefs; also their beliefs must put positive probability on all
events that receive positive probability under the truly chosen strategies. Unfortunately a
negative result by [Nachbar, 1997] shows that satisfying the latter condition is difficult.1
• Fictitious play: This model-based approach was first introduced in [Brown, 1951]. Each
player presumes their opponent follows a stationary strategy; and at each round best responds to the empirical frequency of their opponent’s play up to that round. Players are
assumed to be able to observe their opponent’s actions and payoffs. While fictitious play
does not in general converge to a Nash equilibrium of the stage game (see for instance [Fudenberg and Kreps, 1993, Fudenberg and Kreps, 1995, Monderer and Sela, 1996, Foster
and Young, 1998]), positive convergence results have been established for certain classes
of games, including zero-sum games [Robinson, 1951], non zero-sum 2×2 games [Miyasawa,
1961], games solvable via iterated elimination of strictly dominated strategies [Nachbar,
1990], potential games [Monderer and Shapley, 1996b, Monderer and Shapley, 1996a], supermodular games [Hofbauer and Sandholm, 2002], and generic 2×n games [Berger, 2005].
Also see [Fudenberg and Levine, 1995].
• Calibrated learning: A forecasting rule over an infinite sequence of observations is calibrated if asymptotically the empirical frequency of an event coincides with its predicted
probability. [Foster and Vohra, 1997] show that if each player follows a learning rule that is a calibrated forecast of the other player's play, and at each round picks a myopic best response to this forecast, the sequence of plays converges to a correlated equilibrium. [Foster and Vohra, 1998] present a randomized calibrated forecasting algorithm. Unfortunately, beyond binary outcomes no computationally efficient calibrated algorithm is known to exist [Abernethy and Mannor, 2011] and the problem has been shown to be
computationally hard in general [Hazan and Kakade, 2012].
More recently, researchers have proposed less demanding notions of calibration; see for instance smooth calibration in [Foster and Hart, 2015]. If all players follow the same smoothly calibrated procedure and at each round best-respond to the forecast of their opponents' action, then convergence to (approximate) Nash equilibria is guaranteed. This result is in line with the Conservation Coordination Law for game dynamics: some form of "coordination" is needed for convergence, either in the equilibrium concept or in the dynamic leading to it. For other instances of this general principle, see [Foster and Young, 2003, Kakade and Foster, 2004].
• Trial and error learning: In this approach players occasionally experiment with alternative strategies, keeping the current strategy if and only if it leads to a strict increase in
their payoff (see for instance [Young, 2009, Marden et al., 2009]). Another name for this
approach is “learning by hypothesis testing” [Foster and Young, 2003, Foster and Young,
2006]. Players can be thought of as starting with a hypothesis about the stage game
strategy used by their opponent. Each player continues playing a myopic best response
to their hypothesis until the evidence provided by past play leads them to reject it. After
rejecting one hypothesis another one is formed at random.
• Reinforcement learning: In this model-free approach, every player assigns each of her
strategies a positive weight; the weight is increased each time the strategy is played; the
player makes her choices in each period at random with probabilities proportional to
these weights. (see for example [Börgers and Sarin, 1997, Bush and Mosteller, 1955]).
Reinforcement learning is motivated by the theory of stimulus-response in psychology.
Q-learning [Watkins and Dayan, 1992] and its extensions are well-known instances of this
approach. Reinforcement learning approaches have been shown to outperform equilibrium
models in terms of predicting human behavior [Roth and Erev, 1995, Erev and Roth, 1998]. However, these dynamics do not converge to equilibrium in general, as they ignore the counterfactual performance of each action.
• No-regret learning: In this model-free approach, at each round the player chooses an action and then receives the corresponding payoff; his goal is to minimize his overall regret. The payoffs can be arbitrary or even adversarial. Nonetheless, if a player follows a no-regret learning procedure, they are guaranteed vanishing regret. If followed by all players, no-regret dynamics exhibit nice convergence properties: depending on the notion of regret, the average play converges to correlated or coarse correlated equilibria [Hart and Mas-Colell, 2000, Hart and Mas-Colell, 2001, Hart and Mas-Colell, 2003a, Hart, 2005].

1. More precisely, Nachbar shows that if players optimize with respect to beliefs that satisfy a certain diversity condition called neutrality, each player will choose a strategy that his opponent was certain would not be played.

Learning class     Observables             Model-based   Hardness   Convergence
Bayesian           Coupled                 Yes           Yes        NE
Fictitious play    Coupled                 Yes           No         NE
Calibrated         Uncoupled               Yes           Yes        NE, CE
Trial and error    Uncoupled               Yes           No         NE, CE
Reinforcement      Completely uncoupled    No            No         –
No-regret          Completely uncoupled    No            No         CE, CCE

Table 1: Comparison of different classes of learning algorithms.
1.2 Outline
In Section 2, I define the notation. I start with a brief overview of the basic concepts in game
theory and repeated games. I then present a summary of the no-regret framework, defining
different notions of regret and presenting a central no-regret algorithm, known as the weighted
majority algorithm. Sections 3, 4, 5 discuss a series of results that draw a connection between
the latter two areas of research, no-regret learning and game theory.
No-regret dynamics have been shown to converge to equilibrium under certain conditions.
Section 3 summarizes the results on equilibrium computation via no-regret learning algorithms.
We will see that depending on the notion of regret being minimized by players, the average play
over time converges to different generalizations of Nash equilibrium. Even if no-regret dynamics
do not converge to equilibrium, they can guarantee certain lower bounds on the social welfare.
Section 4 addresses the efficiency of no-regret dynamics, and states the conditions sufficient for
such bounds to be attainable.
In Section 5, I briefly mention a recent trend in the algorithmic economics community where
as opposed to best-response dynamics, researchers assume players follow some sort of no-regret
algorithm, and explore the market consequences of such behavior. As an instance of papers in
this area, I overview a framework proposed for estimating the valuations of regret-minimizing
agents in repeated auctions. I conclude in Section 6 by enumerating some of the caveats and
inadequacies of the previous work and present a list of potential directions for future work.
2 Notation and Preliminaries
In this section, I define the notation. I start with a brief overview of the basic concepts in
game theory and repeated games, then present a summary of the no-regret framework, defining
different notions of regret and demonstrate a central no-regret algorithm, known as the weighted
majority algorithm.
2.1 Game Theory Basics
A normal-form game G among n ≥ 2 players is defined as follows: G = {(S_i, u_i)}_{i=1}^n, where S_i is the set of all pure strategies for player i (i = 1, · · · , n). A strategy profile s = (s_1, · · · , s_n) specifies a strategy s_i ∈ S_i for each player i. Let S = S_1 × · · · × S_n be the set of all strategy profiles. For any s ∈ S, u_i(s) is the utility player i receives if all players follow s. We also define u(s) = Σ_{i=1}^n u_i(s); u(s) can be thought of as the social welfare associated with strategy
profile s. Throughout and unless otherwise specified, we assume all Si ’s are finite, and the game
structure is common knowledge among players.
Definition 1 (Pure Nash equilibrium) A strategy profile s is a Pure Nash equilibrium (PNE)
of G if no player can increase their payoff via a unilateral deviation:
u_i(s_i, s_{−i}) ≥ u_i(s′_i, s_{−i})

for all i = 1, · · · , n and any s′_i ∈ S_i.
A mixed strategy σi is a probability distribution over Si (i = 1, · · · , n) allowing the player
to randomize over their pure strategy space.
Definition 2 (Mixed Nash equilibrium) A set (σ1 , · · · , σn ) of independent probability distributions is a Mixed Nash equilibrium (MNE or simply NE) of G if no player can increase their
expected payoff under the product distribution σ = σ1 × · · · × σn via a unilateral deviation:
E_{s∼σ}[u_i(s_i, s_{−i})] ≥ E_{s_{−i}∼σ_{−i}}[u_i(s′_i, s_{−i})]

for all i = 1, · · · , n and any s′_i ∈ S_i.
A celebrated theorem by Nash shows that mixed Nash equilibria always exist in finite games [Nash
et al., 1950]. Complexity theory, however, provides compelling evidence for the computational
hardness of finding Nash Equilibria (see for example [Gilboa and Zemel, 1989, Daskalakis et al.,
2009, Fabrikant et al., 2004]).
While NE requires players’ strategies to be independent of one another, a generalization
of this notion, called correlated equilibrium [Aumann et al., 1974], allows players to observe a
correlating signal before making their choice.
Definition 3 (Correlated equilibrium) A joint probability distribution σ over S is a Correlated Equilibrium (CE) of G if:
E_{s∼σ}[u_i(s_i, s_{−i}) | s_i] ≥ E_{s_{−i}∼σ_{−i}}[u_i(s′_i, s_{−i}) | s_i]

for all i = 1, · · · , n and any s′_i ∈ S_i.
A classical interpretation of a correlated equilibrium is the following: suppose there is a mediator who draws an outcome s from the publicly known distribution σ and privately recommends strategy s_i to each player i. Correlated equilibrium requires that the expected payoff from playing the recommended strategy be at least as large as that from playing any other strategy. While the set of Nash equilibria of a game is a mathematically complex object (it is a set of fixed points), the set of correlated equilibria is a convex polytope, and as a result there are computationally efficient algorithms to compute correlated equilibria [Papadimitriou and Roughgarden, 2008].
See Figure 1.
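Because the set of correlated equilibria is a convex polytope, one can, for instance, compute a welfare-maximizing correlated equilibrium of a small two-player game with an off-the-shelf LP solver. The following Python sketch is my own illustration of this point (it is not the algorithm of [Papadimitriou and Roughgarden, 2008], which targets succinctly represented multi-player games); the example game is Aumann's game of Chicken.

import numpy as np
from scipy.optimize import linprog

def max_welfare_correlated_eq(U1, U2):
    # U1[a, b], U2[a, b]: payoffs of players 1 and 2 at the pure profile (a, b).
    m, n = U1.shape
    num_vars = m * n                      # one variable per entry of the joint distribution sigma

    def idx(a, b):
        return a * n + b

    A_ub, b_ub = [], []
    # Player 1's incentive constraints: for each recommended a and deviation a2,
    #   sum_b sigma[a, b] * (U1[a, b] - U1[a2, b]) >= 0.
    for a in range(m):
        for a2 in range(m):
            if a2 == a:
                continue
            row = np.zeros(num_vars)
            for b in range(n):
                row[idx(a, b)] = -(U1[a, b] - U1[a2, b])
            A_ub.append(row)
            b_ub.append(0.0)
    # Player 2's incentive constraints, symmetrically.
    for b in range(n):
        for b2 in range(n):
            if b2 == b:
                continue
            row = np.zeros(num_vars)
            for a in range(m):
                row[idx(a, b)] = -(U2[a, b] - U2[a, b2])
            A_ub.append(row)
            b_ub.append(0.0)

    A_eq = np.ones((1, num_vars))         # sigma must be a probability distribution
    b_eq = np.array([1.0])
    c = -(U1 + U2).reshape(-1)            # maximize total welfare = minimize its negation
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * num_vars)
    return res.x.reshape(m, n)

U1 = np.array([[6.0, 2.0], [7.0, 0.0]])   # Aumann's Chicken payoffs for player 1
U2 = np.array([[6.0, 7.0], [2.0, 0.0]])   # and for player 2
print(np.round(max_welfare_correlated_eq(U1, U2), 3))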
While correlated equilibrium requires that following the recommended strategy be a best
response in the interim stage, a generalization of this notion, called coarse correlated equilibrium,
only requires this to be the case at the ex-ante stage.
Definition 4 (Coarse correlated equilibrium) A joint probability distribution σ over S is
a Coarse Correlated Equilibrium (CCE) of G if:
E_{s∼σ}[u_i(s_i, s_{−i})] ≥ E_{s_{−i}∼σ_{−i}}[u_i(s′_i, s_{−i})]

for all i = 1, · · · , n and any s′_i ∈ S_i.
The set of all such distributions is sometimes called the Hannan set [Hannan, 1957]. See Figure 1
for a brief comparison of different equilibrium notions.
[Figure 1 depicts the nested equilibrium concepts PNE ⊆ MNE ⊆ CorEq ⊆ No-Regret (CCE): PNE need not exist and is hard to compute, MNE always exists but is hard to compute, and CorEq and CCE are easy to compute/learn.]

Figure 1: Generalizations of PNE, the computational efficiency of each, and their relationship with one another. Illustration from [Roughgarden, 2009].

2.2 Repeated Games

Repeated games are the most common class of models examined in economics to model repeated interactions among strategic agents. Repeated interactions give rise to incentives and opportunities that are fundamentally different from those observed in isolated interactions: for example, the promise of rewards and the threat of punishment in the future of the game can provide incentives for desirable behavior today [Mailath and Samuelson, 2006]. In addition, repetition provides the players with the opportunity to learn and adapt their strategies using their past observations.

At a high level, a repeated game consists of a stage game that is played for an infinite number of time periods by players who seek to maximize their average discounted payoffs. More formally, a repeated game among n players can be defined as follows: Let G = {(A_i, u_i)}_{i=1}^n be the (finite) stage game, where A_i is the action space for player i = 1, · · · , n and u_i(a_1, · · · , a_n) is the (stage) utility player i receives if the action profile a = (a_1, · · · , a_n) is played. Players repeatedly play G at time t = 1, 2, · · · . Let h_i^t denote the history observed by player i up to time t, and H_i^t the set of all such histories. Define H_i = ∪_t H_i^t. A strategy for player i is a function s_i : H_i → A_i. Let S_i be the set of all possible strategies for player i.

A strategy profile s = (s_1, · · · , s_n) induces an outcome path {a^t(s)}_t. Let δ be the common discount factor for all players. As we will see shortly, in the no-regret framework δ is assumed to be 1. The total payoff for player i in the repeated game is defined as follows:

U_i(s) = (1 − δ) Σ_{t=1}^∞ δ^t u_i(a^t(s))

This defines a normal-form game G(∞) = {(S_i, U_i)}_{i=1}^n. The notion of Nash equilibrium is defined for G(∞) the same way as it is defined for ordinary normal-form games.

Repeated games can be classified according to the information available to players after each round:
• Perfect public monitoring: At the end of each time period t, players can observe each others' actions and payoffs. With perfect public monitoring h_i^t = h^t for all i and h^t = (a^1, · · · , a^{t−1}).
• Imperfect public monitoring: At the end of each time period t, players can only observe a public signal that contains information about the other players' actions, but this signal does not necessarily determine them. Again here h_i^t = h^t for all i, but h^t is not necessarily equal to (a^1, · · · , a^{t−1}).
• Private monitoring: At the end of each time period t, players receive a possibly different
private signal containing information about what has happened in that round.
The above types of monitoring are in parallel with coupled, uncoupled, and completely uncoupled dynamics discussed earlier. In the no-regret framework, the decision maker only needs
to observe their own payoff, hence the information structure is closer to private monitoring in
repeated games.
A cluster of results known as Folk Theorems (see for example [Fudenberg and Maskin,
1986, Fudenberg et al., 1994]) asserts that with sufficient patience, any payoff vector that is not
obviously uninteresting can be an equilibrium payoff. More precisely:
Definition 5 (Minmax utility) Player i’s minmax utility in game G is equal to
v̲_i = min_{a_{−i} ∈ Π_{j≠i} Δ(A_j)} max_{a_i ∈ Δ(A_i)} u_i(a_i, a_{−i})

The set of individually rational payoffs is the set of feasible payoff vectors that offer each player a payoff at least equal to their minmax payoff:

{v ∈ R^n | v_i ≥ v̲_i for all i} ∩ conv{v ∈ R^n | ∃a s.t. v = u(a)}.
Theorem 1 (Folk Theorem) In a repeated game G(∞), if δ is large enough, then for any point v in the individually rational region there is a mixed Nash equilibrium that achieves that payoff combination.
While this might suggest that the problem of finding Nash equilibria is generally easier for repeated games, [Borgs et al., 2008] show that the problem remains hard for more than two players.
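For two-player games, the minmax utility of Definition 5 can be computed with a small linear program. The sketch below is a hedged illustration (the function name and the Matching Pennies example are my own choices, not taken from the surveyed literature):

import numpy as np
from scipy.optimize import linprog

def minmax_utility(U):
    # U[a, b]: player i's payoff when she plays a and her single opponent plays b.
    # Solve  min over opponent mixed strategies q  of  max over i's pure actions a  of  U[a, :] @ q.
    n_a, n_b = U.shape
    c = np.concatenate([np.zeros(n_b), [1.0]])                    # variables (q, v); minimize v
    A_ub = np.hstack([U, -np.ones((n_a, 1))])                     # U[a, :] @ q - v <= 0 for every a
    b_ub = np.zeros(n_a)
    A_eq = np.concatenate([np.ones(n_b), [0.0]]).reshape(1, -1)   # q sums to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_b + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

# Matching Pennies with payoffs in {0, 1}: the minmax utility is 0.5.
print(minmax_utility(np.array([[1.0, 0.0], [0.0, 1.0]])))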
2.3 No-regret Learning
In the adversarial setting, an agent has a number of actions or arms available to them. At each
time step the agent chooses one action and in return receives a payoff. The agent’s goal is to
achieve a high cumulative payoff. No statistical assumptions are made about how the payoffs
are generated and the agent must perform well even if the sequence of payoffs is chosen by an
adversary.
Formally, at every time step t ∈ {1, · · · , T} every action k ∈ {1, · · · , K} generates a payoff u_k^t ∈ [0, 1]. The cumulative payoff of action k at time T is defined as U_k^T = Σ_{t=1}^T u_k^t. Consider an algorithm A that picks action k at time t with probability w_k^t (that is, Σ_{k=1}^K w_k^t = 1). The algorithm's instantaneous payoff at time t is equal to u_A^t = Σ_{k=1}^K w_k^t u_k^t. The cumulative payoff of A up to time T is equal to

U_A^T = Σ_{t=1}^T u_A^t = Σ_{t=1}^T Σ_{k=1}^K w_k^t u_k^t.
The majority of previous studies aim to minimize the so-called external regret as their objective. External regret compares the online algorithm's payoff to the best single action in retrospect (see for instance [Foster and Vohra, 1993, Littlestone and Warmuth, 1994, Freund and Schapire, 1997, Freund and Schapire, 1999, Cesa-Bianchi et al., 1997]), and is the difference between the total reward collected by the agent and the maximum reward that could have been collected by taking a single action on the same sequence of rewards as the one generated by the algorithm. External regret is a sensible notion of regret if the adversary is oblivious. In many real-world scenarios, however, the adversary is adaptive, that is, the decision maker's choices can impact the sequence of payoffs the adversary generates.
Definition 6 (External regret) The external regret of an algorithm A up to time T is
U_A^T − max_{k ∈ {1,··· ,K}} U_k^T.
An algorithm is said to have no external regret if the average per time step external regret approaches 0 as T goes to infinity. The Weighted Majority (WM) algorithm [Littlestone and Warmuth, 1994, Freund and Schapire, 1997, Freund and Schapire, 1999] is an example of a no external regret algorithm. Weighted Majority uses weights w_k^t proportional to e^{µ U_k^t}, where µ > 0 is a tunable parameter known as the learning rate. If µ is set properly, the regret of WM after T trials is bounded by O(√(T log K)).
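A minimal Python sketch of this update may help fix ideas; the interface, the choice of µ, and the random payoff sequence are illustrative assumptions of mine, but the update itself is the exponential-weights scheme just described:

import numpy as np

def weighted_majority(payoffs, mu):
    # payoffs[t, k]: payoff action k would have produced at round t (full information).
    # Returns the probability vector the algorithm plays at every round.
    T, K = payoffs.shape
    cum = np.zeros(K)                         # cumulative payoffs U_k of each action
    plays = np.zeros((T, K))
    for t in range(T):
        w = np.exp(mu * (cum - cum.max()))    # weights proportional to e^{mu * U_k}
        plays[t] = w / w.sum()
        cum += payoffs[t]
    return plays

rng = np.random.default_rng(0)
payoffs = rng.random((1000, 5))
plays = weighted_majority(payoffs, mu=np.sqrt(np.log(5) / 1000))
regret = payoffs.sum(axis=0).max() - (plays * payoffs).sum()
print(regret / 1000)                          # average external regret per round; should be small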
A more general notion of regret, internal regret, allows one to modify A's action sequence by changing every occurrence of a given action k to an alternative action j. More precisely, a (memory-less) modification rule f is any mapping that takes as input an action k ∈ {1, · · · , K} and outputs a (possibly different) action f(k) ∈ {1, · · · , K}. For an algorithm A, f(A) denotes the action sequence resulting from applying f(·) to A's output sequence. Let F be the set of all K^K modification rules. Let F^int be the subset of F including the K(K − 1) modification rules f_{k,j}, where f_{k,j}(k) = j and f_{k,j}(k′) = k′ for k′ ≠ k.
Definition 7 (Internal regret) The internal regret of an algorithm A up to time T is
U_A^T − max_{f ∈ F^int} U_{f(A)}^T
The notion was introduced in [Foster and Vohra, 1997, Foster and Vohra, 1998], and several low
internal regret algorithms have been developed in the literature [Foster and Vohra, 1997, Foster
and Vohra, 1998, Foster and Vohra, 1999, Hart and Mas-Colell, 2000, Cesa-Bianchi and Lugosi,
2003, Blum and Mansour, 2005]. Blackwell’s approachability theorem2 plays an important role
in some of these algorithms.
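To make the distinction with external regret concrete, the following small sketch (my own illustration; it assumes the full table of counterfactual payoffs was recorded) computes both quantities for a realized action sequence:

import numpy as np

def external_and_internal_regret(actions, payoffs):
    # actions[t]: index actually played at round t; payoffs[t, k]: payoff action k
    # would have earned at round t.
    actions = np.asarray(actions)
    T, K = payoffs.shape
    realized = payoffs[np.arange(T), actions].sum()
    external = payoffs.sum(axis=0).max() - realized
    internal = 0.0
    # Internal regret: best single substitution "every time I played k, play j instead".
    for k in range(K):
        mask = actions == k
        if mask.any():
            gain = payoffs[mask].sum(axis=0) - payoffs[mask, k].sum()
            internal = max(internal, gain.max())
    return external, internal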
A somewhat stronger notion of regret, called swap regret, has also been studied in the
past [Blum and Mansour, 2005]. Swap regret allows one to simultaneously swap multiple pairs
of actions.
Definition 8 (Swap regret) The swap regret of an algorithm A up to time T is
U_A^T − max_{f ∈ F} U_{f(A)}^T
In [Blum and Mansour, 2005] authors present an efficient way to convert any low external regret
algorithm into a low swap regret algorithm.
When the adversary is adaptive, the strongest notion of regret is one that compares the
performance of A to the optimal algorithm in hindsight. This is called policy regret [Arora
et al., 2012].
2. Blackwell's approachability theorem [Blackwell et al., 1956] is a generalization of the minmax theorem to games with vector-valued payoffs.
Definition 9 (Policy regret) The policy regret of an algorithm A up to time T is
U_A^T − max_{A′} U_{A′}^T,

where A′ is an arbitrary algorithm that can possibly have full knowledge of the adversary's behavior.
In [Arora et al., 2012] authors show that if the adaptive adversary has bounded memory, then
a variant of traditional online learning algorithms can still guarantee no policy regret.
3 Equilibrium Computation via No-regret Learning
Suppose all players follow a no-regret algorithm; does the average play converge to an equilibrium? In this section, I survey several results that attempt to answer this question. We will see that if all players follow a no external regret algorithm, their average play converges to the set of coarse correlated equilibria (the Hannan set) of the stage game, and if they follow a no internal regret algorithm, it converges to the set of correlated equilibria (see [Hart and Mas-Colell, 2000, Lehrer, 2003, Cesa-Bianchi and Lugosi, 2006]).
3.1 Convergence to Coarse Correlated Equilibria
Theorem 2 If every player plays according to a no external regret algorithm that guarantees them a regret bounded by ε, then the empirical distribution of play is an ε-approximate coarse correlated equilibrium of the stage game G; hence, as the players' regret vanishes, the empirical distributions of play converge to the set of coarse correlated equilibria of G.

Proof [sketch] Suppose after T iterations of no-regret dynamics, every player has regret at most ε with respect to each of its actions. Let σ^t be the distribution over action profiles at time t and σ be the time average of these distributions (i.e., σ = (1/T) Σ_{t=1}^T σ^t). For every player i and any deviation s′_i ∈ S_i we have that:

E_{s∼σ}[u_i(s)] = (1/T) Σ_{t=1}^T E_{s∼σ^t}[u_i(s)]
               ≥ (1/T) Σ_{t=1}^T E_{s∼σ^t}[u_i(s′_i, s_{−i})] − ε
               = E_{s∼σ_{−i}}[u_i(s′_i, s_{−i})] − ε,

where the first and last lines follow from the definition of σ and the second line uses the no-regret guarantee. So overall we have that

E_{s∼σ}[u_i(s)] ≥ E_{s∼σ_{−i}}[u_i(s′_i, s_{−i})] − ε,

which means that the empirical distribution of play is an ε-approximate coarse correlated equilibrium of the stage game G.
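The following Python sketch makes this concrete by letting two players run the weighted-majority update against each other and then measuring how far the time-averaged joint distribution is from being a coarse correlated equilibrium; the stage game and the learning rate are illustrative choices of mine rather than examples from the cited papers:

import numpy as np

U1 = np.array([[3.0, 0.0], [5.0, 1.0]])    # player 1's payoffs (a Prisoner's-Dilemma-like game)
U2 = np.array([[3.0, 5.0], [0.0, 1.0]])    # player 2's payoffs

T = 20000
eta = np.sqrt(np.log(2) / T)               # learning rate
w1, w2 = np.zeros(2), np.zeros(2)          # cumulative payoffs of each player's actions
sigma_bar = np.zeros((2, 2))               # running average of the joint play distribution

for t in range(T):
    p = np.exp(eta * (w1 - w1.max())); p /= p.sum()
    q = np.exp(eta * (w2 - w2.max())); q /= q.sum()
    sigma_bar += np.outer(p, q) / T
    w1 += U1 @ q                           # full-information (expected) payoff feedback
    w2 += U2.T @ p

marg1, marg2 = sigma_bar.sum(axis=1), sigma_bar.sum(axis=0)
viol1 = (U1 @ marg2).max() - (sigma_bar * U1).sum()    # player 1's best fixed deviation gain
viol2 = (U2.T @ marg1).max() - (sigma_bar * U2).sum()
print("largest CCE violation:", max(viol1, viol2))     # should be small, per Theorem 2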
3.2 Convergence to Correlated Equilibria
The regret matching algorithm [Hart and Mas-Colell, 2000] is a simple example of a no internal regret algorithm guaranteeing convergence to correlated equilibria. At a high level the algorithm works as follows: at each period, a player may either continue playing the same action as they were playing in the previous round, or switch to a new action with probabilities proportional to the relative regret of each action. The relative regret of an action is the difference between the payoff the player would have accumulated had they always played that action in place of their current one in the past, and the payoff they actually accumulated. See Algorithm 1 for the details.
Theorem 3 If every player plays according to the regret matching algorithm, then the empirical
distributions of play converge to the set of correlated equilibria of the stage game G.
Algorithm 1 The Regret Matching Algorithm
Pick a ∈ Ai; % a is the current action.
t = 0; % time index
R = (0, · · · , 0); % R is the relative regret of each action.
U = 0; % U is the total payoff collected so far.
repeat
t = t + 1;
Play a at time t and receive u_a^t;
U = U + u_a^t;
for each a′ ∈ Ai, a′ ≠ a do
let V be the total payoff player i would have received if in the past rounds he had played a′ in place of a;
R_{a′} = V − U;
end for
Pick a new action a ∈ Ai with probabilities proportional to the positive part of R (i.e., [R]^+ / ‖[R]^+‖);
until false
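A compact Python version of the same idea is given below. It follows the Hart and Mas-Colell form of the rule (switch away from the current action with probability proportional to the positive average regret, otherwise stay), which is slightly more explicit than the simplified pseudocode above; payoffs are assumed to lie in [0, 1] and the full counterfactual payoff vector is assumed to be observable each round:

import numpy as np

class RegretMatcher:
    def __init__(self, num_actions, mu=None, rng=None):
        self.K = num_actions
        self.R = np.zeros((num_actions, num_actions))    # R[k, j]: cumulative regret for not
                                                          # having swapped k -> j in the past
        self.t = 0
        self.mu = mu if mu is not None else 2.0 * num_actions
        self.rng = rng if rng is not None else np.random.default_rng()
        self.last = int(self.rng.integers(num_actions))  # arbitrary initial action

    def act(self):
        k = self.last
        probs = np.maximum(self.R[k], 0.0) / (self.mu * max(self.t, 1))
        probs[k] = 0.0
        probs[k] = max(1.0 - probs.sum(), 0.0)           # stay with the remaining probability
        probs /= probs.sum()
        self.last = int(self.rng.choice(self.K, p=probs))
        return self.last

    def observe(self, counterfactual):
        # counterfactual[j]: payoff action j would have earned this round.
        k = self.last
        self.R[k] += np.asarray(counterfactual) - counterfactual[k]
        self.t += 1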
4 Efficiency of No-regret Dynamics
The price of anarchy is the ratio of the worst-case social welfare of an equilibrium and that of
an optimal outcome of a game. Price of anarchy quantifies the suboptimality caused by selfish
behavior. Compelling upper bounds on price of anarchy have been established in a wide range
of applications.
Definition 10 (Price of Anarchy (PoA)) Let E(G) be the set of all strategy profiles that
form an equilibrium and s∗ be the social welfare maximizing strategy profile. The price of
anarchy for G is defined as:
min_{s ∈ E(G)} u(s) / u(s*)
An upper bound on PoA is meaningful only if participants can successfully reach an equilibrium.
For Nash equilibrium, it is not clear why this should be expected: players might fail to coordinate
on one particular equilibrium; or fail to compute a Nash equilibrium [Fabrikant et al., 2004].
These have motivated the study of bounds on PoA that apply more broadly to weaker notions
of equilibrium and no-regret behavior [Roughgarden, 2009, Blum et al., 2008].
In [Blum et al., 2008] rather than assuming self-interested players will play according to
a Nash equilibrium, authors assume only that selfish players play so as to minimize their own
regret. The price of total anarchy measures the suboptimality caused by this no-regret behavior.
More precisely, suppose players repeatedly play G, each minimizing their external regret. A
sequence of action profiles s1 , s2 , · · · , sT is called a no external regret sequence of play if for
every player i
Σ_{t=1}^T u_i(s^t) ≥ [ max_{s′_i ∈ S_i} Σ_{t=1}^T u_i(s′_i, s^t_{−i}) ] + o(T)     (1)
The price of total anarchy is then defined as follows:
Definition 11 (Price of Total Anarchy (PoTA)) Suppose s1 , s2 , · · · , sT is a no external
regret sequence of play. Then the price of total anarchy for G is defined as:
min_{s^1,··· ,s^T} [ (1/T) Σ_{t=1}^T u(s^t) ] / u(s*)
[Blum et al., 2008] prove that in several broad classes of games, price of total anarchy matches
the price of anarchy, even though play may never converge to Nash equilibrium.
4.1 Smoothness
[Roughgarden, 2009] identifies a general sufficient condition, called smoothness, for deriving a bound on the price of anarchy over the pure Nash equilibria of a game, and shows that every bound on the price of anarchy derived this way extends automatically to mixed Nash equilibria, correlated equilibria, coarse correlated equilibria, and every no external regret sequence of play.
Smoothness controls the payoff of a set of one-dimensional deviations as a function of both the
initial and resulting strategy profile. More precisely,
Definition 12 A game G is (λ, µ)-smooth (for λ > 0 and µ < 1) if for every two strategy
profiles s and s′,

Σ_{i=1}^n u_i(s′_i, s_{−i}) ≥ λ·u(s′) + µ·u(s)
Note that smoothness is a requirement for every pair of strategies, not just Nash equilibria or
social welfare maximizing outcomes.
It is easy to see that if G is (λ, µ)-smooth, then G's pure strategy price of anarchy is at least λ/(1 − µ). Suppose s is the pure strategy equilibrium with the lowest social welfare and s* is the strategy profile that maximizes the social welfare. We have that:

u(s) = Σ_{i=1}^n u_i(s_i, s_{−i})
     ≥ Σ_{i=1}^n u_i(s*_i, s_{−i})
     ≥ λ·u(s*) + µ·u(s),

where the first inequality uses the Nash equilibrium condition (player i does not benefit from deviating to s*_i) and the second uses smoothness; or equivalently u(s)/u(s*) ≥ λ/(1 − µ).
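For games given explicitly in normal form, the smoothness condition can be checked directly by enumerating all pairs of strategy profiles. The helper below is my own illustration of Definition 12 (the representation of the game as one payoff array per player is an assumption of this sketch):

import numpy as np
from itertools import product

def is_smooth(utils, lam, mu, tol=1e-9):
    # utils[i][s]: player i's payoff at the pure strategy profile s (a tuple of action indices).
    utils = [np.asarray(u) for u in utils]
    players = range(len(utils))
    profiles = list(product(*[range(d) for d in utils[0].shape]))

    def welfare(s):
        return sum(utils[i][s] for i in players)

    for s in profiles:
        for s_prime in profiles:
            # Sum of payoffs when each player i unilaterally deviates from s to s_prime[i].
            deviation_sum = sum(utils[i][s[:i] + (s_prime[i],) + s[i + 1:]] for i in players)
            if deviation_sum < lam * welfare(s_prime) + mu * welfare(s) - tol:
                return False
    return True

# Scanning a grid of (lam, mu) values and keeping the largest lam / (1 - mu) among the pairs
# for which is_smooth(...) returns True gives the best smoothness bound rho(G) for the game.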
4.2 Extension Results
[Roughgarden, 2009] next shows that every bound on the price of anarchy derived via a smoothness argument extends automatically to weaker notions of equilibria and no external regret
sequences of play. For static notions of equilibria, the reasoning is similar to above.
Theorem 4 Let ρ(G) be the largest lower bound of PoA that can be derived via a smoothness
argument on G. The coarse correlated price of anarchy of G is lower bounded by ρ(G).
For no external regret sequences of play, the proof sketch is as follows:
Theorem 5 Suppose s1 , s2 , · · · , sT is a no external regret sequence of play. Then
(1/T) Σ_{t=1}^T u(s^t) ≥ [ρ(G) + o(1)] · u(s*)
Proof [sketch] Suppose ρ(G) = λ/(1 − µ) and G is (λ, µ)-smooth. We have that:
Σ_{t=1}^T u(s^t) = Σ_{t=1}^T Σ_{i=1}^n u_i(s^t)
               = Σ_{i=1}^n Σ_{t=1}^T u_i(s^t)
               ≥ Σ_{i=1}^n [ Σ_{t=1}^T u_i(s*_i, s^t_{−i}) ] + o(T)
               = o(T) + Σ_{t=1}^T Σ_{i=1}^n u_i(s*_i, s^t_{−i})
               ≥ o(T) + Σ_{t=1}^T [ λ·u(s*) + µ·u(s^t) ]
               = o(T) + T·λ·u(s*) + µ Σ_{t=1}^T u(s^t),

where the third line applies the no external regret property (1) to each player i with the hypothetical deviation s*_i, and the fifth line applies the smoothness condition to each time step. This is equivalent to

(1/T) Σ_{t=1}^T u(s^t) ≥ (λ/(1 − µ)) · u(s*) + o(1).
5 Implications of No-regret Dynamics in Markets
In recent years, learning outcomes have emerged as an important alternative to Nash equilibria.
The notion of a no-regret learning outcome generalizes Nash equilibrium by requiring the no-regret property of a Nash equilibrium, in an approximate sense, but without the assumption that
player strategies are stable. More precisely, choosing a best response implies that the players
have no regret for alternate strategy options; no-regret is a generalization of this concept, where
in a sequence of play players have small regret for any fixed strategy.
An increasing number of studies in the algorithmic game theory community assume players
follow some form of no-regret algorithm—as opposed to best response—and explore the market
consequences of such behavior. For instance, [Krichene et al., 2014] study no-regret behavior in
selfish routing; [Kleinberg et al., 2009] investigates it in the context of congestion games; and
[Nekipelov et al., 2015] look at no-regret bidders in sponsored search auctions. As an instance
of such work, in this section we overview [Nekipelov et al., 2015] which studies the problem of
inferring no-regret agents’ values.
[Nekipelov et al., 2015] focus on a model of sponsored search auctions where bidders compete
for advertising spaces and propose a theory of inference of agent values from observed data just
based on the assumption that the agent’s learning strategies are smart enough that they have
minimal regret. The authors do not make any assumptions about the particular no-regret algorithms players employ.
When inferring player values from data, one needs to always accommodate small errors. When players use learning strategies, a small error ε > 0 means that the player can have regret up to ε.
Definition 13 (Rationalizable set) The rationalizable set (NR) consists of the set of value and error pairs (v, ε) such that, if the player's value is equal to v, the sequence of bids he submitted would have at most ε regret.
[Nekipelov et al., 2015] next characterize NR and show that:
Theorem 6 NR is a closed convex set and can be computed and estimated efficiently from data.
Needless to say being able to infer values from data has numerous empirical advantages for the
auctioneer.
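As a rough illustration of how the set NR can be traced out from data, the following sketch computes, for each candidate value v, the smallest ε such that (v, ε) is rationalizable. The data layout (a grid of candidate bids with counterfactual click rates and payments per round) is my own simplification of the model in [Nekipelov et al., 2015]:

import numpy as np

def regret_curve(values, bids_idx, clicks, payments):
    # clicks[t, b], payments[t, b]: click rate and payment the bidder would have obtained in
    # round t by submitting the b-th bid on a grid; bids_idx[t]: index of the bid actually used.
    T = clicks.shape[0]
    realized_clicks = clicks[np.arange(T), bids_idx].mean()
    realized_pay = payments[np.arange(T), bids_idx].mean()
    avg_clicks = clicks.mean(axis=0)          # average clicks/payment of each fixed bid
    avg_pay = payments.mean(axis=0)
    eps = []
    for v in values:
        best_fixed = (v * avg_clicks - avg_pay).max()      # best fixed bid in hindsight
        realized = v * realized_clicks - realized_pay      # average realized utility at value v
        eps.append(max(best_fixed - realized, 0.0))
    return np.array(eps)

The set NR is then {(v, ε) : ε ≥ regret_curve(v)}; since the minimal ε for each v is a maximum of affine functions of v, this region is closed and convex, consistent with Theorem 6.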
6 Discussion and Future Directions
There are multiple reasons to question the validity of the regret minimization behavior in the
context of games.
Adversarial opponent assumption. The assumption that the opponent is adversarial is
typically too strong in strategic environments. As a result upper bounds on regret are possible
only for weak notions of regret.
Weakness of the regret notion for strategic environments. The notions of regret
studied in the literature do not take into account the effect of the players' actions on the subsequent behavior of their opponents. While this is a sensible assumption in a decision-making
environment, in a game by definition players respond to their opponents’ behavior.
Impossibility of no-regret with responsive opponents. The no-regret property relies
on a strong assumption that each player treats her opponents as unresponsive and fully ignores
the opponents’ possible reactions to her actions. [Schlag and Zapechelnyuk, 2012] show that if
at least one player is slightly responsive, it is impossible to achieve no regrets.
Justification for no-regret behavior. No-regret learning is only sensible in settings with
very low information available to players. Simply knowing the type of game being played can
make no-regret behavior suboptimal.
The discount factor. No-regret learning assumes players are infinitely patient, and does
not allow for discounting of future payoffs.
Some of the possible avenues for future work are:
Policy regret. The notions of external and internal regret fail to capture the actual regret
of an online algorithm compared to the optimal sequence of actions it could have taken when
the adversary is adaptive. Policy regret is defined to address this counterfactual setting [Arora
et al., 2012]. Can one design no policy regret algorithms for the players in a game? Of course
this requires proper restrictions imposed on the opponent’s behavior.
Modeling the opponent. According to the impossibility result mentioned above, without
any restriction on the opponent’s behavior, no policy regret algorithms do not exist. Are there
realistic restrictions (e.g., computational limitations) on the opponent's behavior that allow the
design of no policy regret algorithms? For example in [Arora et al., 2012] authors show that
if the adaptive adversary has bounded memory, then a variant of traditional online learning
algorithms can still guarantee no policy regret. Can similar results be obtained in strategic
environments?
Best response to a no-regret opponent. In what class of games is no-regret learning an
(approximate) best response to no-regret behavior? This condition is the minimum requirement
for no-regret dynamics to be self-fulfilling.
Moving from uncoupled to coupled dynamics. As opposed to private monitoring,
consider a situation where the payoff to each player is a noisy signal of the action profile and
the state of the world at each step. Does this new setting allow for positive convergence results?
Acknowledgments
This report was written as part of my Written Preliminary Examination (WPE) II. I would
like to thank Rakesh Vohra, Sanjeev Khanna, and Michael Kearns for serving on my WPE
committee.
References
[Abernethy and Mannor, 2011] Abernethy, J. and Mannor, S. (2011). Does an efficient calibrated forecasting strategy exist? In Conference on Learning Theory, pages 809–812.
[Arora et al., 2012] Arora, R., Dekel, O., and Tewari, A. (2012). Online bandit learning against
an adaptive adversary: from regret to policy regret. In International Conference on Machine
Learning.
[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The
nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.
[Aumann et al., 1974] Aumann, R. J. et al. (1974). Subjectivity and correlation in randomized
strategies. Journal of mathematical Economics, 1(1):67–96.
[Balcan et al., 2015] Balcan, M.-F., Blum, A., Haghtalab, N., and Procaccia, A. D. (2015).
Commitment without regrets: Online learning in stackelberg security games. In Proceedings
of the Sixteenth ACM Conference on Economics and Computation, pages 61–78. ACM.
[Berger, 2005] Berger, U. (2005). Fictitious play in 2× n games. Journal of Economic Theory,
120(2):139–154.
[Blackwell et al., 1956] Blackwell, D. et al. (1956). An analog of the minimax theorem for vector
payoffs. Pacific Journal of Mathematics, 6(1):1–8.
[Blum et al., 2008] Blum, A., Hajiaghayi, M., Ligett, K., and Roth, A. (2008). Regret minimization and the price of total anarchy. In Proceedings of the fortieth annual ACM Symposium
on Theory of Computing, pages 373–382. ACM.
[Blum et al., 2004] Blum, A., Kumar, V., Rudra, A., and Wu, F. (2004). Online learning in
online auctions. Theoretical Computer Science, 324(2):137–146.
[Blum and Mansour, 2005] Blum, A. and Mansour, Y. (2005). From external to internal regret.
In Conference on Learning Theory.
[Börgers and Sarin, 1997] Börgers, T. and Sarin, R. (1997). Learning through reinforcement
and replicator dynamics. Journal of Economic Theory, 77(1):1–14.
[Borgs et al., 2008] Borgs, C., Chayes, J., Immorlica, N., Kalai, A. T., Mirrokni, V., and Papadimitriou, C. (2008). The myth of the folk theorem. In Proceedings of the fortieth annual
ACM Symposium on Theory of Computing, pages 365–372. ACM.
[Brown, 1951] Brown, G. W. (1951). Iterative solution of games by fictitious play. Activity
analysis of production and allocation, 13(1):374–376.
[Bush and Mosteller, 1955] Bush, R. R. and Mosteller, F. (1955). Stochastic models for learning.
John Wiley & Sons, Inc.
[Cesa-Bianchi et al., 1997] Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P.,
Schapire, R. E., and Warmuth, M. K. (1997). How to use expert advice. Journal of the
ACM (JACM), 44(3):427–485.
[Cesa-Bianchi and Lugosi, 2003] Cesa-Bianchi, N. and Lugosi, G. (2003). Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261.
[Cesa-Bianchi and Lugosi, 2006] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning,
and games. Cambridge University Press.
[Crawford and Haller, 1990] Crawford, V. P. and Haller, H. (1990). Learning how to cooperate:
Optimal play in repeated coordination games. Econometrica: Journal of the Econometric
Society, pages 571–595.
[Daskalakis et al., 2009] Daskalakis, C., Goldberg, P. W., and Papadimitriou, C. H. (2009). The
complexity of computing a nash equilibrium. SIAM Journal on Computing, 39(1):195–259.
[Erev and Roth, 1998] Erev, I. and Roth, A. E. (1998). Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American
economic review, pages 848–881.
[Fabrikant et al., 2004] Fabrikant, A., Papadimitriou, C., and Talwar, K. (2004). The complexity of pure nash equilibria. In Proceedings of the thirty-sixth annual ACM Symposium on
Theory of Computing, pages 604–612. ACM.
[Foster and Hart, 2015] Foster, D. P. and Hart, S. (2015). Smooth calibration, leaky forecasts,
finite recall, and nash dynamics.
[Foster and Vohra, 1999] Foster, D. P. and Vohra, R. (1999). Regret in the on-line decision
problem. Games and Economic Behavior, 29(1):7–35.
[Foster and Vohra, 1993] Foster, D. P. and Vohra, R. V. (1993). A randomization rule for
selecting forecasts. Operations Research, 41(4):704–709.
[Foster and Vohra, 1997] Foster, D. P. and Vohra, R. V. (1997). Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21(1):40–55.
[Foster and Vohra, 1998] Foster, D. P. and Vohra, R. V. (1998). Asymptotic calibration. Biometrika, 85(2):379–390.
[Foster and Young, 1998] Foster, D. P. and Young, H. P. (1998). On the nonconvergence of
fictitious play in coordination games. Games and Economic Behavior, 25(1):79–96.
[Foster and Young, 2003] Foster, D. P. and Young, H. P. (2003). Learning, hypothesis testing,
and nash equilibrium. Games and Economic Behavior, 45(1):73–96.
[Foster and Young, 2006] Foster, D. P. and Young, H. P. (2006). Regret testing: Learning to
play nash equilibrium without knowing you have an opponent.
[Freund and Schapire, 1997] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system
sciences, 55(1):119–139.
[Freund and Schapire, 1999] Freund, Y. and Schapire, R. E. (1999). Adaptive game playing
using multiplicative weights. Games and Economic Behavior, 29(1):79–103.
[Fudenberg and Kreps, 1993] Fudenberg, D. and Kreps, D. M. (1993). Learning mixed equilibria. Games and Economic Behavior, 5(3):320–367.
[Fudenberg and Kreps, 1995] Fudenberg, D. and Kreps, D. M. (1995). Learning in extensiveform games i. self-confirming equilibria. Games and Economic Behavior, 8(1):20–55.
[Fudenberg et al., 1994] Fudenberg, D., Levine, D., and Maskin, E. (1994). The folk theorem
with imperfect public information. Econometrica: Journal of the Econometric Society, pages
997–1039.
[Fudenberg and Levine, 1995] Fudenberg, D. and Levine, D. K. (1995). Consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19(5):1065–1089.
[Fudenberg and Levine, 1998] Fudenberg, D. and Levine, D. K. (1998). The theory of learning
in games, volume 2. MIT Press.
[Fudenberg and Maskin, 1986] Fudenberg, D. and Maskin, E. (1986). The folk theorem in
repeated games with discounting or with incomplete information. Econometrica: Journal of
the Econometric Society, pages 533–554.
[Gilboa and Zemel, 1989] Gilboa, I. and Zemel, E. (1989). Nash and correlated equilibria: Some
complexity considerations. Games and Economic Behavior, 1(1):80–93.
[Gittins et al., 1989] Gittins, J., Glazebrook, K., and Weber, R. (1989). Multi-armed bandit
allocation indices. John Wiley & Sons.
[Hannan, 1957] Hannan, J. (1957). Approximation to bayes risk in repeated play. Contributions
to the Theory of Games, 3(97-139):2.
[Hart, 2005] Hart, S. (2005). Adaptive heuristics. Econometrica, 73(5):1401–1430.
[Hart and Mas-Colell, 2000] Hart, S. and Mas-Colell, A. (2000). A simple adaptive procedure
leading to correlated equilibrium. Econometrica, 68(5):1127–1150.
[Hart and Mas-Colell, 2001] Hart, S. and Mas-Colell, A. (2001). A general class of adaptive
strategies. Journal of Economic Theory, 98(1):26–54.
[Hart and Mas-Colell, 2003a] Hart, S. and Mas-Colell, A. (2003a). Continuous-time regretbased dynamics. Games and Economic Behavior, 45:375–394.
[Hart and Mas-Colell, 2003b] Hart, S. and Mas-Colell, A. (2003b). Uncoupled dynamics do not
lead to nash equilibrium. The American Economic Review, 93(5):1830–1836.
[Hart and Mas-Colell, 2005] Hart, S. and Mas-Colell, A. (2005). Stochastic uncoupled dynamics and nash equilibrium. In Proceedings of the 10th conference on Theoretical aspects of
rationality and knowledge, pages 52–61. National University of Singapore.
[Hazan and Kakade, 2012] Hazan, E. and Kakade, S. (2012). (weak) calibration is computationally hard. arXiv preprint arXiv:1202.4478.
[Hofbauer and Sandholm, 2002] Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294.
[Kakade and Foster, 2004] Kakade, S. M. and Foster, D. P. (2004). Deterministic calibration
and nash equilibrium. In Learning theory, pages 33–48. Springer.
[Kalai and Lehrer, 1993] Kalai, E. and Lehrer, E. (1993). Rational learning leads to nash equilibrium. Econometrica: Journal of the Econometric Society, pages 1019–1045.
[Kleinberg et al., 2009] Kleinberg, R., Piliouras, G., and Tardos, E. (2009). Multiplicative updates outperform generic no-regret learning in congestion games. In Proceedings of the fortyfirst annual ACM symposium on Theory of computing, pages 533–542. ACM.
[Krichene et al., 2014] Krichene, W., Drighès, B., and Bayen, A. (2014). On the convergence
of no-regret learning in selfish routing. In Proceedings of the 31st International Conference
on Machine Learning (ICML-14), pages 163–171.
[Lehrer, 2003] Lehrer, E. (2003). A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115.
[Lehrer and Solan, 2009] Lehrer, E. and Solan, E. (2009). Approachability with bounded memory. Games and Economic Behavior, 66(2):995–1004.
[Littlestone and Warmuth, 1994] Littlestone, N. and Warmuth, M. K. (1994). The weighted
majority algorithm. Information and Computation, 108(2):212–261.
[Mailath and Samuelson, 2006] Mailath, G. J. and Samuelson, L. (2006). Repeated games and
reputations, volume 2. Oxford University Press.
[Marden et al., 2007] Marden, J. R., Arslan, G., and Shamma, J. S. (2007). Regret based
dynamics: convergence in weakly acyclic games. In Proceedings of the 6th international joint
conference on Autonomous agents and multiagent systems, page 42. ACM.
[Marden et al., 2009] Marden, J. R., Young, H. P., Arslan, G., and Shamma, J. S. (2009).
Payoff-based dynamics for multiplayer weakly acyclic games. SIAM Journal on Control and
Optimization, 48(1):373–396.
[Miyasawa, 1961] Miyasawa, K. (1961). On the convergence of the learning process in a 2 x 2
non-zero-sum two-person game. Technical report, DTIC Document.
[Monderer and Sela, 1996] Monderer, D. and Sela, A. (1996). A 2×2 game without the fictitious play property. Games and Economic Behavior, 14(1):144–148.
[Monderer and Shapley, 1996a] Monderer, D. and Shapley, L. S. (1996a). Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1):258–265.
[Monderer and Shapley, 1996b] Monderer, D. and Shapley, L. S. (1996b). Potential games.
Games and Economic Behavior, 14(1):124–143.
[Nachbar, 1990] Nachbar, J. H. (1990). “evolutionary” selection dynamics in games: Convergence and limit properties. International Journal of Game Theory, 19(1):59–89.
[Nachbar, 1997] Nachbar, J. H. (1997). Prediction, optimization, and learning in repeated
games. Econometrica: Journal of the Econometric Society, pages 275–309.
[Nash et al., 1950] Nash, J. F. et al. (1950). Equilibrium points in n-person games. Proc. Nat.
Acad. Sci. USA, 36(1):48–49.
[Nekipelov et al., 2015] Nekipelov, D., Syrgkanis, V., and Tardos, E. (2015). Econometrics
for learning agents. In Proceedings of the Sixteenth ACM Conference on Economics and
Computation, pages 1–18. ACM.
[Papadimitriou and Roughgarden, 2008] Papadimitriou, C. H. and Roughgarden, T. (2008).
Computing correlated equilibria in multi-player games. Journal of the ACM (JACM),
55(3):14.
[Powers and Shoham, 2005] Powers, R. and Shoham, Y. (2005). Learning against opponents
with bounded memory. In IJCAI, volume 5, pages 817–822.
[Pradelski and Young, 2012] Pradelski, B. S. and Young, H. P. (2012). Learning efficient nash
equilibria in distributed systems. Games and Economic behavior, 75(2):882–897.
[Robbins, 1952] Robbins, H. (1952). Some aspects of the sequential design of experiments.
Bulletin of the American Mathematical Society, 58(5):527–535.
[Robinson, 1951] Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, pages 296–301.
[Roth and Erev, 1995] Roth, A. E. and Erev, I. (1995). Learning in extensive-form games:
Experimental data and simple dynamic models in the intermediate term. Games and economic
behavior, 8(1):164–212.
[Roughgarden, 2009] Roughgarden, T. (2009). Intrinsic robustness of the price of anarchy. In
Proceedings of the forty-first annual ACM Symposium on Theory of Computing, pages 513–
522. ACM.
[Samuelson, 1998] Samuelson, L. (1998). Evolutionary games and equilibrium selection, volume 1. MIT Press.
[Schlag and Zapechelnyuk, 2012] Schlag, K. and Zapechelnyuk, A. (2012). On the impossibility
of achieving no regrets in repeated games. Journal of Economic Behavior & Organization,
81(1):153–158.
[Shoham et al., 2007] Shoham, Y., Powers, R., and Grenager, T. (2007). If multi-agent learning
is the answer, what is the question? Artificial Intelligence, 171(7):365–377.
[Watkins and Dayan, 1992] Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
[Young, 2004] Young, H. P. (2004). Strategic learning and its limits. Oxford University Press.
[Young, 2009] Young, H. P. (2009). Learning by trial and error. Games and economic behavior,
65(2):626–643.