Circumventing the Price of Anarchy: Leading Dynamics to Good Behavior

Maria-Florina Balcan (1)    Avrim Blum (2)    Yishay Mansour (3)

(1) School of Computer Science, College of Computing, Georgia Institute of Technology
(2) Department of Computer Science, Carnegie Mellon University
(3) Blavatnik School of Computer Science, Tel Aviv University

[email protected]    [email protected]    [email protected]
Abstract: Many natural games can have a dramatic difference between the quality of their best and
worst Nash equilibria, even in pure strategies. Yet, nearly all work to date on dynamics shows only
convergence to some equilibrium, especially within a polynomial number of steps. In this work we
study how agents with some knowledge of the game might be able to quickly (within a polynomial
number of steps) find their way to states of quality close to the best equilibrium. We consider two
natural learning models in which players choose between greedy behavior and following a proposed
good but untrusted strategy and analyze two important classes of games in this context, fair cost-sharing
and consensus games. Both games have extremely high Price of Anarchy and yet we show that behavior
in these models can efficiently reach low-cost states.
Keywords: Dynamics in Games, Price of Anarchy, Price of Stability, Cost-sharing games, Consensus
games, Learning from untrusted experts
1 Introduction
There has been substantial work in the machine learning, game theory, and (more recently) algorithmic game theory communities on understanding the overall behavior of multi-agent systems in which agents follow natural learning dynamics such as (randomized) best/better response and no-regret learning. For example, it is well known that in potential games, best-response dynamics, in which players take turns each making a best-response move to the current state of all the others, is guaranteed to converge to a pure-strategy Nash equilibrium [24, 27]. Significant effort has been spent recently on analyzing various properties of these dynamics and their variations, in particular on their convergence time [1, 26]. No-regret dynamics have also been long studied. For example, a well known general result that applies to any finite game is that if players each follow a "no-internal-regret" strategy, then the empirical distribution of play is guaranteed to approach the set of correlated equilibria of the game [17–19]. There has also been a lot of attention recently on fast convergence of both best response and regret minimization dynamics to states with cost comparable to the Price of Anarchy of the game [4, 15, 26, 28], whether or not that state is an equilibrium. This line of work is justified by the realization that in many cases the equilibrium nature of the system itself is less important, and what we care more about is having the dynamics reach a low-cost state; this is a position we adopt as well.

However, while the above results are quite general, the behavior or equilibrium reached by the given dynamics could be as bad as the worst equilibrium in the game (the worst pure-strategy Nash equilibrium in the case of best-response dynamics in potential games, and the worst correlated equilibrium for no-regret algorithms in general games). Even for specific efficient algorithms and even for natural potential games, in general no bounds better than the pure-strategy price of anarchy are known. On the other hand, many important potential games, including fair cost-sharing and consensus games, can have very high-cost Nash equilibria even though they always have low-cost equilibria as well; that is, their Price of Anarchy is extremely high and yet their Price of Stability is low (we discuss these games more in Section 1.1). Thus, a guarantee comparable with the cost of the worst Nash equilibrium can be quite unsatisfactory.

Unfortunately, in general there have been very few results showing natural dynamics that lead to low cost equilibria or behavior in games of this type, especially within a polynomial number of steps. For potential games, it is known that noisy best-response, in which players act in a "simulated-annealing" style manner, also known as log-linear learning, does have the property that the states of highest probability in the limit are those minimizing the global potential [8, 9, 23]. In fact, as temperature approaches 0, these are the only stochastically-stable states. However, as we show in Appendix A.2, reaching such states with non-negligible probability mass may easily require exponential time. For polynomial-time processes, Charikar et al. [11] show natural dynamics rapidly reaching low-cost behavior in a special case of undirected fair cost-sharing games; however, these results require specialized initial conditions and also do not apply to directed graphs.

In this paper we initiate a study of how to aid dynamics, beginning from arbitrary initial conditions, to reach states of cost close to the best Nash equilibrium. We propose a novel angle on this problem by considering whether providing more information to simple learning algorithms about the game being played can allow natural dynamics to reach such high quality states. At a high level, there are two barriers to simple dynamics performing well. One is computational: for directed cost-sharing games, for instance, we do not even know of efficient centralized procedures for finding low-cost states in general. As an optimization problem this is the Directed Steiner Forest problem, and the best approximation factor known is min(n^{1/2+ε}, N^{4/5+ε}, m^{2/3} N^{ε}), where n is the number of players, N is the number of vertices, and m is the number of edges in the graph [12, 16]. The other barrier is incentive-based: even if a low-cost solution were known, there would still be the issue of whether players would individually want to play it, especially without knowing what other players will choose to do. For example, low-cost solutions might be known because people analyzing the specific instance being played might discover and publish low cost global behaviors. In this case, individual players might then occasionally test out their parts of these behaviors, using them as extra inputs to their own learning algorithm or adaptive dynamics to see if they do in fact provide benefit to themselves. The question then is can this process allow low cost states to be reached and in what kinds of games?

Motivated by this question, in this paper we develop techniques for understanding and influencing the behavior of natural dynamics in games with multiple equilibria, some of which may be of much higher social quality than others. In particular we consider a model in which a low cost global behavior is proposed, but individual players do not necessarily trust it (it may not be an equilibrium, and even if it is they do not know if others will follow it). Instead, players use some form of experts learning algorithm where one expert says to play the best response to the current state and another says to follow the proposed strategy. Our model imposes only very mild conditions on the learning algorithm each player uses: different players may use completely different algorithms for deciding among or updating probabilities on their two high-level choices. Assume that the players move in a random order. Will this produce a low-cost solution (even if it doesn't converge to anything)? We consider two variations of this model: a learn-then-decide model, where players initially follow an "exploration" phase in which they put roughly equal probability on each expert, followed by a "commitment" phase where based on their experience they choose one to use from then on, and a smoothly adaptive model, where they slowly change their probabilities over time. Within each we analyze several important classes of games. In particular, our main results are that for both fair cost-sharing and consensus games, these processes lead to good quality behavior.

Our study is motivated by several lines of work. One is the above-mentioned work on noisy best-response dynamics, which reach high-quality states but only after exponentially many steps. Another is work on the value of altruism [29], which considers how certain players acting altruistically can help the system reach a better state. The last is the work in [5], which considers a central authority who can temporarily control a random subset of the players in order to get the system into a good state. That model is related to the model we consider here, but is much more rigid because it posits two classes of players (one that follows the given instructions and one that doesn't) and does not allow players to adaptively decide for themselves. (See Section 1.2 for more discussion.)
Figure 1: A directed cost-sharing game that models a setting where each player can choose either to drive its own car to work at a cost of 1, or share public transportation with others, splitting an overall cost of k. For any 1 < k < n, if players arrive one at a time and each greedily chooses a path minimizing its cost, then the cost of the equilibrium obtained is n, whereas OPT has cost only k.
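The arithmetic behind this example is easy to verify directly. The short Python sketch below is purely illustrative (the instance size n and shared cost k are arbitrary choices, not values from the paper): it simulates greedy one-at-a-time arrivals in the "cars versus public transportation" instance and compares the resulting cost with OPT.

```python
# Sketch of the Figure 1 example: each of n players either drives alone
# (a private edge of cost 1) or joins a shared public-transportation edge
# of total cost k, which is split evenly among its users (fair cost sharing).

def greedy_arrival_outcome(n, k):
    """Players arrive one at a time and pick the myopically cheaper option."""
    sharing = 0  # number of players currently on the shared edge
    for _ in range(n):
        share_cost = k / (sharing + 1)   # this player's share if it joins the shared edge
        drive_cost = 1.0                 # cost of driving alone
        if share_cost < drive_cost:
            sharing += 1
    drivers = n - sharing
    return drivers * 1.0 + (k if sharing > 0 else 0.0)

n, k = 100, 10                            # any 1 < k < n
print(greedy_arrival_outcome(n, k))       # n: every player ends up driving
print(k)                                  # OPT: everyone shares the edge of cost k
```

Since the first arrival would have to pay the full cost k > 1 to use the shared edge, every player ends up driving, which reproduces the factor-n gap described in the caption.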
1.1 Our Results
As described above we introduce and analyze
two models for guiding dynamics to good equilibria, and within these models we prove strong
positive results for two classes of games, fair cost
sharing and consensus games. In n-player fair cost
sharing games, players choose routes in a network
and split the cost of edges they take with others using the same edge; see Section 3 for a formal definition. These games can model scenarios such as
whether to drive one’s own car to work or to share
public transportation with others (see Figure 1 for
a simple example) and can have equilibria as much
as a factor of n worse than optimal even though
they are always guaranteed to have low-cost equilibria that are only O(log n) worse than optimal as
well. In consensus games, players are nodes in a
network who each need to choose a color, and they
pay a cost for each neighbor of different color than
their own (see Section 4). These again can have a
wide gap between worst and best Nash equilibrium
(in this case, cost Ω(n^2) for the worst versus 0 for
the best).
Our main results for these games are the
following.
For fair cost-sharing, we show
that in the learn-then-decide model, so long
as the exploration phase has sufficient (polynomial) length and the proposed strategy is nearoptimal, the expected cost of the system reached
is O(log(n) log(nm)OPT). Thus, this is only
slightly larger than that given by the price of stability of the game. For the smoothly-adaptive
model, if there are many players of each type
(i.e., associated to each (si , ti ) pair) we can do
even better, with high probability achieving cost
O(log(nm)OPT), or even O(OPT) if the number of players of each type is high enough. Note
that with many players of each type the price of
anarchy remains Ω(n) though the price of stability becomes O(1). For consensus games, we
show that so long as players place probability
β > 1/2 on the proposed optimal strategy, then
with high probability play will reach the exact optimal behavior within a polynomial number of steps.
Moreover for certain natural graphs such as the
line and the grid, any β > 0 is sufficient. Note that
our results are actually stronger than those achievable in the more centralized model of [5], where
one needs to make certain minimal degree assumptions on the graph as well.
In both our models it is an easy observation that
for any game, if the proposed solution is a good
equilibrium, then in the limit the system will eventually reach the equilibrium and stay there indefinitely. Our interest, however, is in polynomialtime behavior.
1.2 Related Work
Dynamics and Convergence to Equilibria: It is
well known that in potential games, best-response
dynamics, in which players take turns each making a best-response move to the current state of
all the others, is guaranteed to converge to a purestrategy Nash equilibrium [24, 27]. Significant effort has been spent recently on the convergence
time of these dynamics [1, 26], with both examples of games in which such dynamics can take
exponential time to converge, and results on fast
convergence to states with cost comparable to the
Price of Anarchy of the game [4, 15].
Another well known general result that applies to any finite game is that if players each follow a "no-internal-regret" strategy, then the empirical distribution of play is guaranteed to approach the set of correlated equilibria of the game [17–19]. In particular, for good no-regret algorithms, the empirical distribution will be an ε-correlated equilibrium after O(1/ε^2) rounds of play [7]. A recent result of [20] analyzes a version of the weighted-majority algorithm in congestion games, showing that behavior converges to the set of weakly-stable equilibria. This implies performance better than the worst correlated equilibrium, but even in this case the guarantee is no better than the worst pure-strategy Nash equilibrium.

Noisy best-response has been shown in the limit to reach states of minimum global potential [8, 9, 23], and thus provides strong positive results in games with a small gap between potential and cost. However, this convergence may take exponential time, even for fair cost-sharing games. In particular, we show in Appendix A.2 that if one makes many copies of the edges of cost 1 in the example of Figure 1, reaching a state with sufficiently many players on the shared path to induce others to follow along would take time exponential in k, even if the algorithm can arbitrarily vary its temperature parameter. For details see Appendix A.2.

An alternative line of work assumes that the system starts empty and players join one at a time. Charikar et al. [11] analyze fair cost sharing in this setting on an undirected graph where all players have a common sink. Their model has two phases: in the first, players enter one at a time and use a greedy algorithm to connect to the current tree, and in the second phase the players undergo best-response dynamics. They show that in this case a good equilibrium (one that is within only a polylog(n) factor of optimal) is reached. We discuss this result further in Appendix A. We remark here that for directed graphs, it is easy to construct simple examples where this process reaches an equilibrium that is Ω(n) from optimal (see Figure 1). For scheduling on unrelated machines with makespan cost function, the natural greedy algorithm to assign incoming jobs is an O(m) approximation and a more sophisticated online algorithm guarantees an O(log m) approximation [3]. This implies that a two-phase model, in which the jobs first join one at a time and use the greedy algorithm (or that of [3]) and then perform best-response dynamics, achieves cost within a factor O(m) (or O(log m)) of optimal. This is in contrast to the unbounded Price of Anarchy in general.

Taxation: There has also been work on using taxes to improve the quality of behavior [13, 14]. Here the aim is via taxes to adjust the utilities of each player, such that the only Nash equilibria in the new game correspond to optimal or near-optimal behavior in the original game. In contrast, our focus is to identify how dynamics can be made to reach a good result without changing the game or adjusting utilities, but rather by injecting more information into the system.

Public service advertising: Finally, the public service advertising model of [5] also uses the idea of a proposed strategy in order to move players into a good equilibrium. In the model of [5], the players select only once (randomly) between following the proposed strategy or not. Players then stick with their decision while those that decided not to follow the proposed strategy settle on some equilibrium for themselves (given that the other players' actions are fixed). Then, in the last phase all players perform best-response dynamics to converge to a final equilibrium (the convergence is guaranteed since the discussion is limited to potential games). In contrast, in our models the players continuously randomly re-select between following the proposed strategy or performing a best response. This continuous randomization makes the analysis of our dynamics technically more challenging. Conceptually, the benefit of our model is that the players are "symmetric" and can continuously switch between the two alternatives, which better models selfish behavior. This continuous process is what enables the players to both explore and exploit the two alternative actions.

2 A Formal Framework

2.1 Notation and Definitions

We start by providing general notations and definitions. A game is denoted by a tuple

G = ⟨N, (S_i), (cost_i)⟩,
where N is a set of n players, S_i is the finite action space of player i ∈ N, and cost_i is the cost function of player i. The joint action space of the players is S = S_1 × . . . × S_n. For a joint action s ∈ S we denote by s_{-i} the actions of players j ≠ i, i.e., s_{-i} = (s_1, ..., s_{i-1}, s_{i+1}, ..., s_n). The cost function of player i maps a joint action s ∈ S to a real non-negative number, i.e., cost_i : S → R_+. Every game has an associated social cost function cost : S → R that maps a joint action to a real value. In the cases discussed in this paper the social cost is simply the sum of players' costs, i.e.,

cost(s) = Σ_{i=1}^{n} cost_i(s).

The optimal social cost is

OPT(G) = min_{s ∈ S} cost(s).

We sometimes overload notation and use OPT for a joint action s that achieves cost OPT(G).

Given a joint action s, the Best Response (BR) of player i is the set of actions BR_i(s) that minimizes its cost, given the other players' actions s_{-i}, i.e., BR_i(s_{-i}) = argmin_{a ∈ S_i} cost_i(a, s_{-i}).

A joint action s ∈ S is a pure Nash Equilibrium (NE) if no player i ∈ N can benefit from unilaterally deviating to another action; namely, every player is playing a best response action in s, i.e., s_i ∈ BR_i(s_{-i}) for every i ∈ N. A best response dynamics is a process in which at each time step, some player which is not playing a best response switches its action to a best response action, given the current joint action. In this paper we focus on potential games, which have the property that any best response dynamics converges to a pure Nash equilibrium [24].

Let N(G) be the set of Nash equilibria of the game G. The Price of Anarchy (PoA) is defined as the ratio between the maximum cost of a Nash equilibrium and the social optimum, i.e.,

max_{s ∈ N(G)} cost(s) / OPT(G).

The Price of Stability (PoS) is the ratio between the minimum cost of a Nash equilibrium and the social optimum, i.e.,

min_{s ∈ N(G)} cost(s) / OPT(G).

For a class of games, the PoA and PoS are the maximum of these ratios over all games in the class.

2.2 The Model

We now describe the formal model that we consider. Initially, players begin in some arbitrary state, which could be a high-cost equilibrium or even a state that is not an equilibrium at all. Next, an entity, perhaps the designer of the system or a player who has studied the system well, proposes some better global behavior B. B may or may not be an equilibrium behavior; our only assumption is that it have low overall cost. Now, players move one at a time in a random order. Each player, when it is their turn to move, chooses among two options. The first is to simply behave greedily and make a best-response move to the current configuration. The second option is to follow their part of the given behavior B. These two high-level strategies (best-response or follow B) are viewed as two "experts" and the player then runs some learning algorithm aiming to learn which of these is most suitable for himself. Note that best-response is an abstract option: the specific action it corresponds to may change over time.

Because moves occur in an asynchronous manner, there are multiple reasonable ways to model the feedback each player gives to its learning algorithm: for instance, does it consider the average cost since the player's previous move or just the cost when it is the player's turn to go, or something in between? In addition, does it get to observe the cost of the action it did not take (the full information model) or only the action it chose (the bandit model)? To abstract away these issues, and even allow different players to address them differently, we consider here two models that make only very mild assumptions on the kind of learning and adaptation made by players.

Learn then Decide model: In this model, players follow an "exploration" phase where each time it is their turn to move, they flip a coin to decide whether to follow the proposed behavior B or to do a best-response move to the current configuration. We assume that the coin gives probability at least β to B, for some constant β > 0. Finally, after some common time T*, all players switch to an "exploitation" phase where they each commit, in an arbitrary way based on their past experience, to follow B or perform best response from then on. (The time T* is selected in advance.)
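To make the Learn-then-Decide dynamic concrete, here is a minimal simulation sketch. It is not part of the paper's formal model: the Game interface (num_players, best_response, cost_if), the commitment rule used at time T_star, and the particular coin bias beta are illustrative assumptions.

```python
import random

# A minimal sketch of Learn-then-Decide dynamics against an abstract game.
# `game` is assumed to expose num_players, best_response(i, state), and
# cost_if(i, action, state); the proposed behavior B maps players to actions.

def learn_then_decide(game, B, state, beta=0.5, T_star=10_000, T_end=20_000, rng=random):
    n = game.num_players
    committed = {}                       # player -> "B" or "BR", fixed after T_star
    for t in range(T_end):
        i = rng.randrange(n)             # players move one at a time in random order
        if t < T_star:                   # exploration phase: flip a coin each move
            follow_B = rng.random() < beta
        else:                            # exploitation phase: stick with one expert
            if i not in committed:
                # commit based on past experience; here we simply compare current costs
                cost_B = game.cost_if(i, B[i], state)
                cost_BR = game.cost_if(i, game.best_response(i, state), state)
                committed[i] = "B" if cost_B <= cost_BR else "BR"
            follow_B = committed[i] == "B"
        state[i] = B[i] if follow_B else game.best_response(i, state)
    return state
```

Any rule for committing at T_star based on past experience would fit the model; comparing the two experts' current costs is just one simple choice.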
The above model assumes some degree of coordination: a fixed time T* after which all players make their decisions. One could imagine instead each player i having its own time T_i* at which it commits to one of the experts, perhaps with the time itself depending on the player's experience. In the Smoothly-Adaptive model below we even more generally allow players to smoothly adjust their probabilities over time as they like, subject only to a constraint on the amount by which probabilities may change between time steps.

Smoothly Adaptive model: In this model, there are no separate exploration and exploitation phases. Instead, each player i maintains and adjusts a value p_i over time. When player i is chosen to move, it flips a coin of bias p_i to select between the proposed behavior or a best-response move, choosing B with probability p_i. We allow the players to use arbitrary adaptive learning algorithms to adjust these probabilities, with the sole requirement that learning proceed slowly. Specifically, using p_i^t to denote the value of p_i at time t, we require that

|p_i^t - p_i^{t+1}| ≤ ∆

for a sufficiently (polynomially) small quantity ∆, and furthermore that for all i, the initial probability p_i^0 ≥ p^0 for some overall constant 0 < p^0 < 1. Note that the algorithm may update p_i even in time steps at which it does not move. The learning algorithm may use any kind of feedback or weight-updating strategy it wants to (e.g., gradient descent [30, 31], multiplicative updating [10, 21, 22]) subject to this bounded step-size requirement.

We say that the probabilities are (T, β)-good if for any time t ≤ T we have p_i^t > β for all i. (Note that if ∆ < (p^0 - β)/T then clearly the probabilities are (T, β)-good.)

We point out that while one might at first think that any natural adaptive algorithm would learn to favor best-response (always decreasing p_i), this depends on the kind of feedback it uses. For instance, if the algorithm considers only its cost immediately after it moves, then indeed by definition best-response will appear better. However, if it considers its cost immediately before it moves (comparing that to what its cost would have been had it chosen the other alternative) or even the sum total cost since its previous move, then B might appear better. Our model allows users to update in any way they wish, so long as the updates are sufficiently gradual.

Finally, as mentioned in the introduction, in both our models it is an easy observation that for any game, if the proposed solution is a good equilibrium, then in the limit (as T* → ∞ in the Learn-then-Decide model or as ∆ → 0 in the Smoothly Adaptive model) the system will eventually reach the equilibrium and stay there indefinitely. Our interest, however, is in polynomial-time behavior.

3 Fair Cost Sharing

The first class of games we study in this paper, because of its rich structure and wide gap between price of anarchy and price of stability, is that of fair cost sharing games. These games are defined as follows. We are given a graph G = (V, E), which can be directed or undirected, where each edge e ∈ E has a nonnegative cost c_e ≥ 0. There is a set N = {1, ..., n} of n players, where player i is associated with a source s_i and a sink t_i. The strategy set of player i is the set S_i of s_i-t_i paths. In an outcome of the game, each player i chooses a single path P_i ∈ S_i. Given a vector of players' strategies s = (P_1, . . . , P_n), let x_e be the number of agents whose strategy contains edge e. In the fair cost sharing game the cost to agent i is

cost_i(s) = Σ_{e ∈ P_i} c_e / x_e,

and the goal of each agent is to connect its terminals with minimum total cost. The social cost of an outcome s = (P_1, ..., P_n) is defined to be

cost(P_1, ..., P_n) = Σ_{e ∈ ∪_i P_i} c_e.

It is well known that fair cost sharing games are potential games [2, 24], and that the price of anarchy in these games is Θ(n) while the price of stability is H(n) [2], where H(n) = Σ_{i=1}^{n} 1/i = Θ(log n). In particular, the potential function for these games is

Φ(s) = Σ_{e ∈ E} Σ_{x=1}^{x_e} c_e / x,

which satisfies the following inequality:
Fact 1 In fair cost sharing, for any s ∈ S we
have: cost(s) ≤ Φ(s) ≤ H(n) · cost(s).
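The definitions above are straightforward to compute. The following Python sketch builds a toy fair cost-sharing outcome (the edge costs and chosen paths are illustrative, not from the paper), computes the players' cost shares, the social cost, and the potential Φ(s), and checks the inequality of Fact 1 numerically.

```python
from collections import Counter
from math import isclose

# Illustrative toy instance: edge costs and each player's chosen path (a set of edges).
edge_cost = {"e_shared": 10.0, "e1": 1.0, "e2": 1.0, "e3": 1.0}
paths = {1: {"e_shared"}, 2: {"e_shared"}, 3: {"e3"}}

def usage(paths):
    # x_e = number of players whose path contains edge e
    return Counter(e for p in paths.values() for e in p)

def player_cost(i, paths):
    x = usage(paths)
    return sum(edge_cost[e] / x[e] for e in paths[i])        # cost_i(s) = sum c_e / x_e

def social_cost(paths):
    return sum(edge_cost[e] for e in usage(paths))           # sum of c_e over edges in use

def potential(paths):
    x = usage(paths)
    return sum(edge_cost[e] * sum(1.0 / j for j in range(1, x[e] + 1)) for e in x)

H_n = sum(1.0 / i for i in range(1, len(paths) + 1))          # harmonic number H(n)
cost_s, phi = social_cost(paths), potential(paths)
assert isclose(sum(player_cost(i, paths) for i in paths), cost_s)  # shares add up to social cost
assert cost_s <= phi <= H_n * cost_s                               # Fact 1
print(cost_s, phi)
```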
Overview of the Results and Analysis: Our main results for fair cost sharing are the following. If we have many players of each type (the type of a player is determined by its source s_i and destination t_i) then in both the learn-then-decide and smoothly adaptive models we can show that with high probability, behavior will reach a state of cost within a logarithmic factor of OPT within a polynomial number of steps. Moreover, for the learn-then-decide model, even if we do not have many players of each type we can show that the expected cost at the end of the process will be low.

The high level idea of the analysis is that we first prove that so long as each player randomizes with probability near to 50/50, with high probability the overall cost of the system will drop to within a logarithmic factor of OPT in a polynomial number of steps; moreover, at that point both the best response and the proposed actions are pretty good strategies from the individual players' point of view. To finish the analysis in the "Learn then Decide" model, we show that in the remaining steps of the exploration phase the expected cost does not increase by much; using properties of the potential function, we then show that in the final "decision" round T*, the overall potential cannot increase substantially either, which in turn implies a bound on the increase in overall cost.

For the adaptive model, one key difficulty in the analysis is to show that if the system reaches a state where the social cost is low and both abstract actions are pretty good for most players, the cost never goes high again. We are able to show that this indeed is the case as long as there are many players of each type, no matter how the players adjust their probabilities p_i^t or make their choice between best-response and following the proposed behavior.

3.1 The Main Arguments

We begin with the following key lemma that is useful in the analysis of both the learn-then-decide and smoothly adaptive models. For ease of notation, we assume in this section that the proposed strategy B is the socially optimal behavior OPT, so we can identify P_i^OPT = P_i^B as the behavior proposed by B to player i. If B is different from OPT, then we simply lose the corresponding approximation factor.

Lemma 2 Consider a fair cost sharing game. There exists a T = poly(n) such that if the probabilities are (T, β)-good for constant β > 0, then with high probability the cost at time T will be at most O(OPT log(mn)). Moreover, if we have at least c log(nm) players of each (s_i, t_i) pair for sufficiently large constant c, then with high probability the cost at time T will be O(OPT).

Proof: We begin with the general case. Let n_e^opt denote the number of players who use edge e in OPT. We partition edges into two classes. We say an edge is a "high traffic" edge if n_e^opt > c log(nm), where c is a sufficiently large constant (c = 32/β suffices for the argument below). We say it is a "low traffic" edge otherwise.

Define T_0 = 2n log n. With high probability, by time T_0 each player has had a chance to move at least once. We assume in the following that this indeed is the case. Note that as a crude bound, at this point the cost of the system is at most n^2 · OPT (each player will move to a path of cost-share at most OPT and therefore of actual cost at most n · OPT). Next, by Chernoff bounds and the union bound, our choice of c implies that with high probability each high-traffic edge e has at least βn_e^opt/2 players on it at all time steps T ∈ [T_0, T_0 + n^3]; in particular, Chernoff bounds imply that each of the at most mn^3 events has probability at least 1 - e^{-βn_e^opt/8} ≥ 1 - 1/(mn)^4. In the remaining analysis, we assume this indeed is the case as well.

Let OPT_i denote the cost of player i in OPT, so that OPT = Σ_i OPT_i. Our assumption above implies that for any time step T under consideration, if player i follows the proposed strategy P_i^OPT, its cost will be at most c log(nm)OPT_i. In particular, its cost on the low-traffic edges in P_i^OPT can be at most a factor c log(nm) larger than its cost on those edges under OPT, and its cost on high-traffic edges is at most a 2/β factor larger than its cost on those edges under OPT.
We now argue as follows. Let cost_T denote the cost of the system at time T. If

cost_T ≥ 2c log(nm)OPT,

then the expected cost of a random player is at least 2c log(nm)OPT/n. On the other hand, if player i is chosen to move at time T, from the above analysis its cost after the move (whether it chooses B or best response) will be at most c log(nm)OPT_i. The expected value of this quantity over players i chosen at random is at most c log(nm)OPT/n. Therefore, if

cost_T ≥ 2c log(nm)OPT,

the expected drop in potential at time T is at least c log(nm)OPT/n.

Finally, the cost at time T_0 was at most n^2 · OPT, which implies by Fact 1 that the value of the potential was at most n^2 (1 + log(n))OPT; with high probability this situation cannot continue for more than O(n^3) steps. Formally, we can apply Hoeffding-Azuma bounds for supermartingales as follows: let us define Q = c log(nm)OPT and

∆_T = max(Φ_T - Φ_{T-1} + Q/n, -2Q),

and consider running this process stopping when cost_T < 2Q. Let

X_T = Φ_0 + ∆_1 + . . . + ∆_T.

Then throughout the process we have

E[X_T | X_1, . . . , X_{T-1}] ≤ X_{T-1}

and

|X_T - X_{T-1}| ≤ 2Q,

where the first inequality holds because our analysis showing an expected decrease in potential of at least Q/n is true even if we cap all decreases to a maximum of 2Q as in the definition of ∆_T. So, by Hoeffding-Azuma (see Theorem 10 in Appendix B), after n^3 steps in the non-stopped process with high probability we would have

X_T - X_0 ≤ (1/2) n^2 Q,

which is not possible, since by definition of X_T we have

Φ_T ≤ Φ_0 + (X_T - X_0) - TQ/n,

which would be negative. Therefore, with high probability stopping must occur before this time, as desired.

Finally, if we have at least c log(mn) players of each type, then there are no low-traffic edges and so we do not need to lose the c log(mn) factor in the argument.

We now present a second lemma which will be used to analyze the "Learn then Decide" model.

Lemma 3 Consider a fair cost sharing game in the Learn-then-Decide model. If the cost of the system at time T_1 is O(OPT log(mn)), and T = T_1 + poly(n) < T*, then the expected value of the potential at time T is O(OPT log(mn) log(n)).

Proof: First, as argued in the proof of Lemma 2, with high probability for any player i and any time t ∈ [T_1, T], the cost for player i to follow the proposed strategy at time t is at most c log(nm)OPT_i for some constant c. Let us assume below that this is indeed the case.

Next, the above bound implies that if the cost at time t ∈ [T_1, T] is cost_t, then the expected decrease in potential caused by a random player moving (whether following the proposed strategy or performing best response) is at least

(cost_t - c log(nm)OPT)/n;

in particular, cost_t/n is the expected cost of a random player before its move, and c log(nm)OPT/n is an upper bound on the expected cost of a random player after its move. Note that this is an expectation over randomness in the choice of player at time t, conditioned on the value of cost_t. In particular, since this holds true for any value of cost_t, we can take expectation over the entire process from time T_1 up to t, and we have that if

E[cost_t] ≥ c log(nm)OPT

then

E[Φ_{t+1}] ≤ E[Φ_t],

where Φ_t is the value of the potential at time step t.

Finally, since Φ_t ≤ log(n) cost_t, this implies that if

E[Φ_t] ≥ c log(nm) log(n)OPT

then

E[Φ_{t+1}] ≤ E[Φ_t].

Since for any value of cost_t we always have

E[Φ_{t+1}] ≤ E[Φ_t] + c log(nm)OPT/n,

this in turn implies by induction on t that

E[Φ_t] ≤ c log(nm) log(n)OPT + c log(nm)OPT/n

for all t ∈ [T_1, T], as desired.

Our main result in the "Learn then Decide" model is the following:

Theorem 4 For fair cost sharing games in the Learn then Decide model, a polynomial number of exploration steps T* is sufficient so that the expected cost at any time T' ≥ T* is O(log(n) log(nm)OPT).

Proof: From Lemma 2, there exists T = poly(n) such that with high probability the cost of the system will be at most O(OPT log(mn)) at some time T_1 ∈ [T* - T, T*]. From Lemma 3, this implies the expected value of the potential at time T*, right before the final exploitation phase, is O(OPT log(mn) log(n)). Finally, we consider the decisions made at time T*. When a player chooses to make a best-response move, this can only decrease potential. When a player chooses B, this could increase potential. However, for any edge e in the proposed solution B, the total increase in potential caused by edge e over all players who have e in their proposed solution is at most

c_e · H(n_e^opt) = O(c_e log n).

This is because whenever a new player makes a decision to commit to following the proposed strategy and using edge e, all previous players who made that commitment after time T* and whose proposed strategy uses edge e are still there. Thus, the total increase in potential after time T* is at most O(OPT log n).

Since after all players have committed, potential can only decrease, this implies that the expected value of the potential at any time T' ≥ T* is O(OPT log(mn) log(n)). Therefore, the expected cost at time T' is at most this much as well, as desired.

We now use Lemma 2 to analyze fair cost sharing games in the adaptive learning model when the number of players n_i of each type (i.e., associated to each (s_i, t_i) pair) is large.

Theorem 5 Consider a fair cost sharing game in the adaptive learning model satisfying n_i = Ω(m) for all i. There exists a T_1 = poly(n) such that if the probabilities are (T_1, β)-good for constant β > 0, then with high probability, for all T ≥ T_1 the cost at time T is O(log(nm)OPT). Moreover, there exists a constant c such that if n_i ≥ max[m, c log(mn)] then with high probability, for all T ≥ T_1 the cost at time T is O(OPT).

Proof: First, by Lemma 2 we have that with high probability, at some time T_0 ≤ T_1 the cost of the system reaches O(OPT log(mn)). The key to the argument now is to prove that once the cost becomes low, it will never become high again. To do this we use the fact that n_i is large for all i. Our argument, which follows the proof of Theorem 4.3 in [6], is as follows.

Let U be the set of all edges in use at time T_0 along with all edges used in B. In general, we will insert an edge into U if it is ever used from then on in the process, and we never remove any edge from U, even if it later becomes unused. Let c* be the total cost of all edges in U. So, c* is an upper bound on the cost of the current state, and the only way c* can increase is by some player choosing a best-response path that includes some edge not in U. Now, notice that any time a best-response path for some (s_i, t_i) player uses such an edge, the total cost of all edges inserted into U is at most c*/n_i, because the current player can always choose to take the path used by the least-cost player of his type, and those n_i players are currently sharing a total cost of at most c*. Thus, any time new edges are added into U, c* increases by at most a multiplicative (1 + 1/n_i) factor. We can insert edges into U at most m times, so the final cost is at most

cost(T_0)(1 + 1/n_{i*})^m,

where n_{i*} = min_i n_i. This implies that as long as n_{i*} = Ω(m) we have cost(T) = O(cost(T_0)) for all T ≥ T_0. Thus, overall the total cost remains O(OPT log(mn)).

If n_i ≥ max[m, c log(mn)], we simply use the improved guarantee provided by Lemma 2 to say that with high probability the cost of the system reaches O(OPT) within T_0 ≤ T_1 time steps, and then apply the charging argument as above in order to get the desired result.

Variations: One could imagine a variation on our framework where rather than proposing a fixed behavior B, one can propose a more complex strategy such as "follow B until time step T* and then follow B'" or "follow B until time step T* and then perform best-response". In the latter case, the Smoothly-Adaptive model with (T*, β)-good probabilities becomes essentially a special case of the Learn-then-Decide model and all results above for the Learn-then-Decide model go through immediately. On the other hand, this type of strategy requires a form of global coordination that one would prefer to avoid.

4 Consensus Games

Another interesting class of games we consider is that of consensus games. Here, players are vertices in an n-vertex graph G, and each has two choices of action, red or blue. Players incur a cost equal to the number of neighbors they have of different color, and the overall social cost is the sum of the costs of all players. The potential function for consensus games is simply half the social cost function, or equivalently the number of edges having endpoints of different color. While in these games optimal behavior is trivial to describe (all red or all blue, for a total cost of 0) and is an equilibrium, they also have extremely poor equilibria as well. For example, consider two cliques of n/2 vertices each, with each node also having αn/2 neighbors in the other clique for some 0 < α < 1. In this case, there is a Nash equilibrium with one clique red and one clique blue, for an overall cost of Ω(n^2). (Intuitively, one can think of this example as two countries, one using English units and one using the metric system, neither wanting to switch.) This is substantially worse than optimal (either by an Ω(n^2) or infinite factor, depending on whether one allows an additive offset or not).

Unlike in the case of fair cost-sharing games, for consensus games there is no hope of quickly achieving near-optimal behavior for all β > 0. In particular, for any β < 1/2 there exists α > 0 in the example above such that if players choose the proposed strategy B with probability β, with high probability all nodes have more neighbors performing best response than following B. Thus, it is easy to see by induction that no matter what strategy B is, all best-response players will remain with their initial color. Moreover, this remains true for exponentially in n many steps. (Of course, in the limit as the number of time steps T → ∞ we will eventually observe a sequence in which all players select B, and if B is an equilibrium the system will then remain there forever. Thus if B is "all blue" the system will in the limit converge to optimal behavior. However, our interest is in polynomial-time convergence.) On the other hand, this argument breaks down for β > 1/2. In fact, we will show that for β > 1/2, for any graph G, the system will in polynomial time reach optimal behavior (if B = OPT).

It is interesting to compare this with the centralized model of [5], in which a central authority takes control of a random constant fraction of the players and aims to use them to guide the selfish behavior of the others. In that setting, even simple low-degree graphs can cause a problem. For instance, consider a graph G consisting of n/4 4-cycles, each in the initial equilibrium configuration red, red, blue, blue. If the authority controls a random β fraction of players (for any constant β < 1), with high probability a constant fraction of the 4-cycles contain no players under the authority's control and will remain in a high-cost state. On the other hand, it is easy to see that this specific example will perform well in both our learn-then-decide and smoothly-adaptive models. We show that in fact all graphs G perform well, though this requires care in the argument due to correlations that may arise.

We assume that the proposed behavior B is the optimal behavior "all blue" and prove that with high probability the configuration will reach this state after O(n log^2 n) steps. We begin with a simple preliminary lemma.
Lemma 6 If X_1, . . . , X_d are {0, 1}-valued random variables with Pr(X_i = 1) ≥ p, then no matter how they are correlated,

Pr(MAJORITY(X_1, . . . , X_d) = 1) ≥ 2p - 1.

Proof: The expected number of 1's is at least pd. It is also at most

d · p_MAJ + (d/2) · (1 - p_MAJ),

where

p_MAJ = Pr(MAJORITY(X_1, . . . , X_d) = 1).

Solving, we get p_MAJ ≥ 2p - 1.

We now present our main result of this section.

Theorem 7 In both the Learn-then-Decide and Smoothly-Adaptive models, for any constant β > 1/2, if the proposed behavior B is optimal then with high probability play will become optimal in O(n log^2 n) steps. For the Smoothly-Adaptive model we assume behavior is (cn log^2 n, β)-good for sufficiently large constant c.

Proof: In both models the dynamics contain two random processes: the random order in which players move and the random coins flipped by the players to determine how they want to move. In order to prove the theorem it will be helpful to separate these two processes and consider them each in turn.

First, consider and fix a random sequence S of players to move. Let T_1 denote the time by which all players have moved at least once according to S, and more generally let T_{t+1} denote the time by which all players have moved at least once since time T_t. Since players move i.i.d. in a random order, with high probability T_{t+1} ≤ T_t + 3n log(n) for all t = 1, . . . , n. We assume in the following that this indeed is the case; in fact, we can allow S to be an arbitrary, adversarially-chosen sequence subject to this constraint.

Fixing S, we now consider the coin flips of the individual players. We will prove by induction that each player has probability at least q_t of being blue at time T_t (not necessarily independently) for

q_t = (1 - γ)q_{t-1} + γ,   for γ = 2β - 1.

Equivalently, we can write this as

1 - q_t = (1 - q_{t-1})(1 - γ).

Since γ > 0 (because β > 1/2), this in turn implies that t = O(log n) is sufficient to reach 1 - q_t ≤ 1/n^2, meaning that with high probability all nodes are blue as desired.

We prove this bound on q_t as follows. Consider the nodes who move at times T_{t-1} + 1, T_{t-1} + 2, . . . , T_t in order. When some node v moves, by induction each neighbor w of v has probability at least q_{t-1} of being blue (though these may not be independent). By assumption, v chooses B with some probability β' ≥ β > 1/2 and chooses best-response with probability 1 - β'. Therefore, the probability v becomes blue is at least:

β' + (1 - β') Pr(majority of neighbors of v are blue)
  ≥ β' + (1 - β')(2q_{t-1} - 1)        (by Lemma 6)
  = (1 - (2β' - 1))q_{t-1} + (2β' - 1)
  ≥ (1 - γ)q_{t-1} + γ                 (for γ = 2β - 1)

as desired. Thus, with high probability all nodes are blue by time T_t for t = O(log n), and by our assumption on S this occurs within O(n log^2 n) steps.

The general result above requires β > 1/2, and as noted earlier there exist graphs G and initial configurations such that the process will fail for any β < 1/2. On the other hand, for several "nice" graphs such as the line or grid, any constant β > 0 is sufficient. For this we assume that best-response will only ask to switch color if the new color is strictly better than the current color.

Theorem 8 For the line and d-dimensional grid graphs (constant d), for any β > 0, if the proposed action B is optimal then with high probability play will reach optimal in poly(n) steps. For the case of the Smoothly-Adaptive model we assume behavior is (T, β)-good for T a sufficiently large polynomial in n.

Proof: Assume B is "all blue". On the line, if any two neighbors become blue, they will remain blue indefinitely. Similarly in the grid, if any d-dimensional cube becomes blue, the nodes in the cube will also remain blue indefinitely. On the line, any pair of neighbors, in any two consecutive steps, have
probability at least β^2/n^2 of becoming blue, and on the d-dimensional grid, any cube, in any 2^d consecutive steps, has probability at least (β/n)^{2^d} of becoming all blue. Therefore with high probability all nodes become blue in a polynomial number of steps.
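As an illustration of the consensus results, the sketch below simulates coin-flip dynamics on a line graph with the proposed behavior B = "all blue". The graph size, the constant probability beta placed on B, and the step budget are illustrative choices; with ties broken in favor of the current color (as assumed before Theorem 8), even a small beta drives the line to consensus, in line with Theorem 8.

```python
import random

# Consensus game on a line: vertices 0..n-1 colored red (0) or blue (1).
# A player's cost is its number of differently-colored neighbors; B proposes "all blue".

def neighbors(i, n):
    return [j for j in (i - 1, i + 1) if 0 <= j < n]

def best_response(i, colors, n):
    nbrs = [colors[j] for j in neighbors(i, n)]
    blue, red = nbrs.count(1), nbrs.count(0)
    if blue > red:
        return 1
    if red > blue:
        return 0
    return colors[i]          # switch only if the new color is strictly better

def simulate(n=100, beta=0.3, steps=500_000, seed=0):
    rng = random.Random(seed)
    colors = [i % 2 for i in range(n)]        # alternating start, far from consensus
    num_blue = sum(colors)
    for t in range(steps):
        i = rng.randrange(n)                  # a random player moves
        new = 1 if rng.random() < beta else best_response(i, colors, n)
        num_blue += new - colors[i]
        colors[i] = new
        if num_blue == n:
            return t + 1                      # moves until "all blue" is reached
    return None

print("moves to all-blue consensus:", simulate())
```

On general graphs the same loop with beta > 1/2 illustrates Theorem 7, while beta < 1/2 can stall on examples such as the two cliques described at the start of this section.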
5 Conclusions and Future Directions
In this paper we initiate a study of how to aid
dynamics, beginning from arbitrary initial conditions, to reach states of cost close to the best
Nash equilibrium. We propose a novel angle on
this problem by considering how providing more
information to simple learning algorithms about
the game being played can allow the dynamics
to reach such low-cost states. We show that for
fair cost-sharing and consensus games, proposing a good solution and allowing users to adaptively decide for themselves between that solution
and best-response behavior will efficiently lead
to near-optimal configurations, so long as users
adapt sufficiently slowly. Both of these games
have the property that all by itself, random best response may easily end up in a high-cost state: cost
Ω(n · OPT) for fair cost-sharing and cost Ω(n^2)
for consensus, whereas with this additional aid the
states reached have cost only polylog(n) · OPT
for cost-sharing and 0 for consensus.
Open questions and future directions
Our results for fair cost sharing games in the
adaptive learning model hold only for the case
when the number of players ni of each type is
large. One natural open question is whether similar positive results can be shown for the general
case, i.e., when the number of players ni of each
type is arbitrary. It would also be interesting to
broaden the types of learning algorithms the players can use. In particular, for the Smoothly Adaptive model, how large a learning rate ∆ (or for the
Learn-then-Decide model, how small a cutoff time
T ∗ ) can one allow and still produce comparable
positive results on the quality of the final outcome?
It would also be interesting to extend this model
to the case of multiple proposed solutions, some
of potentially much higher quality than others, and
still show that these dynamics reach good states.
In this case, one would need to make additional
assumptions on the learning algorithms the players
are using. In particular, if players are trying to decide between various learning dynamics (best response, regret minimization, noisy best response,
etc.), and if they use an experts learning algorithm
to decide between these actions, can we show that
if following the best of the dynamics is good
for all players and if the experts learning algorithms have appropriate properties, then the players will do nearly as well as if they had used the
best dynamics?
Finally, in both of the games we consider players with the same profile (the same source and destination in the fair cost sharing case) receive the
same global advice, which can also be easily communicated as a single message. It would be interesting to also analyze games where players with
the same profile might receive different global advice.
One way to view our model is as follows. Starting from the original game, we create a meta-game
in which each player is playing one of the two
abstract actions, best response and the proposed
strategy. We then show that if the players learn in
the new meta-game, with restrictions only on the
learning rate, then this results in good behavior in
the original game. More generally, it would be interesting to further explore the idea of designing
“meta games” on the top of the actual games and
showing that natural behavior in these meta games
induces good behavior in the original game of interest.
Acknowledgements
This work was supported in part by ONR grant
N00014-09-1-0751, by AFOSR grant FA9550-09-1-0538, by NSF grant CCF-0830540, by the IST
Programme of the European Community under
the PASCAL2 Network of Excellence IST-2007-216886, by a grant from the Israel Science Foundation (grant No. 709/09), and by grant No. 2008-321 from the United States-Israel Binational Science Foundation (BSF). This publication reflects
the authors’ views only.
References
[1] H. Ackermann, H. Röglin, and B. Vöcking. On
the impact of combinatorial structure on conges-
tion games. In Proc. 47th FOCS, 2006.
[2] E. Anshelevich, A. Dasgupta, J. Kleinberg, E. Tardos, T. Wexler, and T. Roughgarden. The Price of
Stability for Network Design with Fair Cost Allocation. In FOCS, 2004.
[3] J. Aspnes, Y. Azar, A. Fiat, S. A. Plotkin, and
O. Waarts. On-line routing of virtual circuits
with applications to load balancing and machine
scheduling. J. ACM (JACM), 44(3):486–504,
1997.
[4] B. Awerbuch, Y. Azar, A. Epstein, and A. Skopalik. Fast convergence to nearly optimal solutions
in potential games. In ACM Conference on Electronic Commerce, 2008.
[5] M.-F. Balcan, A. Blum, and Y. Mansour. Improved Equilibria via Public Service Advertising.
In ACM-SIAM Symposium on Discrete Algorithms,
2009.
[6] M.-F. Balcan, A. Blum, and Y. Mansour. The Price
of Uncertainty. In Tenth ACM Conference on Electronic Commerce, 2009.
[7] A. Blum and Y. Mansour. Chapter 4: Learning,
Regret-Minimization, and Equilibria. In N. Nisan,
T. Roughgarden, E. Tardos, and V. V. Vazirani, editors, Algorithmic Game Theory. Cambridge University Press, 2007.
[8] L. Blume. The statistical mechanics of strategic interaction. Games and Economic Behavior, 5:387–
424, 1993.
[9] L. Blume. How noise matters. Games and Economic Behavior, 44:251–271, 2003.
[10] N. Cesa-Bianchi, Y. Freund, D.P. Helmbold,
D. Haussler, R.E. Schapire, and M.K. Warmuth.
How to use expert advice. Journal of the ACM,
44(3):427–485, 1997.
[11] M. Charikar, H. Karloff, C. Mathieu, J. Naor, and
M. Saks. Online multicast with egalitarian cost
sharing. In Proc. 20th SPAA, 2008.
[12] C. Chekuri, G. Even, A. Gupta, and D. Segev. Set
Connectivity Problems in Undirected Graphs and
the Directed Steiner Network Problem. In ACMSIAM Symposium on Discrete Algorithms, 2008.
[13] R. Cole, Y. Dodis, and T. Roughgarden. How much
can taxes help selfish routing? In ACM-EC, 2003.
[14] R. Cole, Y. Dodis, and T. Roughgarden. Pricing
network edges for heterogeneous selfish users. In
STOC, 2003.
[15] A. Fanelli, M. Flammini, and L. Moscardelli.
Speed of convergence in congestion games under
best response dynamics. In Proc. 35th ICALP,
2008.
[16] M. Feldman, G. Kortsarz, and Z. Nutov. Improved
approximation algorithms for directed steiner forest. In ACM-SIAM Symposium on Discrete Algo-
rithms, 2009.
[17] D. Foster and R. Vohra. Calibrated learning and
correlated equilibrium. Games and Economic Behavior, 21:40–55, 1997.
[18] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior,
29:7–36, 1999.
[19] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
[20] R. Kleinberg, G. Piliouras, and E. Tardos. Multiplicative updates outperform generic no-regret
learning in congestion games. In Proc. 41th STOC,
2009.
[21] N. Littlestone, P. M. Long, and M. K. Warmuth.
On-line learning of linear functions. In Proc. of the
23rd Symposium on Theory of Computing, pages
465–475. ACM Press, New York, NY, 1991. See
also UCSC-CRL-91-29.
[22] N. Littlestone and M. K. Warmuth. The weighted
majority algorithm. Information and Computation,
108(2):212–261, 1994.
[23] J.R. Marden and J.S. Shamma. Revisiting log-linear learning: asynchrony, completeness, and
payoff-based implementation. Submitted, September 2008.
[24] D. Monderer and L.S. Shapley. Potential games.
Games and Economic Behavior, 14:124–143,
1996.
[25] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press,
1997.
[26] N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, editors. Algorithmic Game Theory. Cambridge, 2007.
[27] R.W. Rosenthal. A class of games possessing pure-strategy Nash equilibria. Int. Journal Game Theory, 2(1), 1973.
[28] T. Roughgarden. Intrinsic Robustness of the Price
of Anarchy. In STOC, 2009.
[29] Y. Sharma and D. P. Williamson. Stackelberg
thresholds in network routing games or the value
of altruism. In ACM Conference on Electronic
Commerce, 2007.
[30] B. Widrow and M. E. Hoff. Adaptive switching
circuits. 1960 IRE WESCON Convention Record,
pages 96–104, 1960. Reprinted in Neurocomputing (MIT Press, 1988).
[31] M. Zinkevich. Online convex programming and
generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on
Machine Learning, 2003.
A Further Discussion of Other Dynamics

A.1 Players Entering One at a Time

As mentioned in Section 1.2, one form of natural dynamics that has been studied in potential games is where the system starts empty and players join one at a time. Charikar et al. [11] analyze this setting for fair cost sharing on an undirected graph where all players have a common sink. They consider a two-phase process. In Phase 1, the players arrive one by one and each connects to the root by greedily choosing a path minimizing its cost, i.e., each selects a greedy (best response) path relative to the selection of paths by the previous players. In Phase 2, players are allowed to change their paths in order to decrease their costs; namely, in the second phase players play best response dynamics. Charikar et al. [11] show that, interestingly, the sum of the players' costs at the end of the first phase will be within an O(log^2 n) factor of the cost of a socially optimal solution (which in this case is defined to be a minimum Steiner tree connecting the players to the root). This then implies that the cost of the Nash equilibrium achieved in the second phase, as well as all states reached along the way, are O(log^3 n) close to OPT.

Note that in the directed case the result above does not hold; in fact such dynamics can lead to very poor equilibria. Figure 1 shows an example where, if the players arrive one by one and each connects to the root by greedily choosing a path minimizing its cost, then the cost of the equilibrium obtained can be much worse than the cost of OPT. The optimal solution, which is also a Nash equilibrium in this example, is (P_1, . . . , P_n) where P_i = s_i → v → t for each i; however, the solution obtained if the players arrive one at a time and each connects to the root by greedily choosing a path minimizing its cost is (P'_1, . . . , P'_n) where P'_i = s_i → t for each player i. Clearly, cost(P'_1, . . . , P'_n) = n, which is much worse than cost(OPT) = k. Moreover, if one modifies the example by making many copies of the edge of cost k, then even if behavior begins in a random initial configuration, with high probability each edge of cost k will have few players on it and so best-response behavior will lead to the equilibrium of cost n.

A.2 A Lower Bound for Noisy Best-Response

In noisy best-response dynamics (also called log-linear learning [23]), when it is player i's turn to move, it probabilistically chooses an action with a probability that decreases exponentially with the gap between the cost of that action and the cost of the best-response action. The rate of decrease is controlled by a temperature term τ, much like in simulated annealing. In fact, the dynamics can be viewed as a form of simulated annealing with the global potential as the objective function. At zero temperature, the dynamics is equivalent to standard (non-noisy) best response, and at infinite temperature the dynamics is completely random. While it is known that with appropriate temperature control this process in the limit will stabilize at states of optimum global potential, we show here that there exist cost sharing instances such that no dynamics of this form can achieve expected cost o(n · OPT/ log n) within a polynomial number of time steps.

We begin with a definition capturing a broad range of dynamics of this form.

Definition 1 Let generalized noisy best response dynamics be any dynamics where players move one at a time in a random order, and when it is a given player i's turn to move, it probabilistically selects among its available actions. The sole requirement is that if action a is a worse response for player i than action b to the current state S, and furthermore a has been a worse response than b to all past states S', then the probability of choosing a should be at most the probability of choosing b.

The above definition captures almost any natural individual learning-based dynamics. We now show that no dynamics of this form can achieve expected cost o(n · OPT/ log n) within a polynomial number of time steps for fair cost-sharing.

Theorem 9 For the fair cost sharing game, no generalized noisy best response dynamics can achieve expected cost o(n · OPT/ log n) within a polynomial number of time steps.

Proof: We consider a version of the "cars or public transit" example of Figure 1, but where each player has n cars (options of cost 1 that cannot be shared by others). For this problem, we can describe the evolution of the system as a random walk on a line, where the current position t indicates the number of players currently using the public transit (the shared edge of cost k). The exact probabilities in this walk depend on specifics of the dynamics and may even change over time, but one immediate fact is that so long as the walk has never reached t ≥ k, the shared edge of cost k is a worse response for any player than its edges of cost 1. Therefore, by definition of generalized noisy best response, each player has at most a 1/n chance of choosing the shared edge, and at least a 1 - 1/n chance of choosing a private edge, when it is that player's turn to move. Since position t corresponds to a t/n fraction of players on the shared edge and a 1 - t/n fraction on private edges, this in turn implies that given that the random walk is in position 1 ≤ t ≤ k - 1,

1. the probability p_{t,t+1} of moving to position t + 1 is at most (1/n)(1 - t/n), and

2. the probability p_{t,t-1} of moving to position t - 1 is at least (1 - 1/n)(t/n)

(with the remaining probability 1 - p_{t,t+1} - p_{t,t-1} the walk remains in position t). In particular,

p_{t,t+1} / p_{t,t-1} ≤ (n - t) / (t(n - 1)) ≤ 1/t.
We now argue that the expected time for this
walk to reach position k is superpolynomial in
n for k = log n. In particular, consider a simplified version of the above Markov chain where
pt,t+1 /pt,t−1 = 1/t (rather than ≤ 1/t) and
we delete all self-loops except at the origin (so
pt,t+1 = 1/(t + 1) and pt,t−1 = t/(t + 1) for 1 ≤
t ≤ k − 1). Deleting self-loops can only decrease
the expected time to reach position k since it corresponds to simply ignoring time spent in self-loops,
and the same for setting pt,t+1 /pt,t−1 = 1/t. So,
it suffices to show the expected time for this simplified walk to reach position k is superpolynomial
in n.
For convenience, set pk,k−1 = 1. We can
now solve for the stationary distribution π of this
chain. In particular, the simplified Markov chain is
now equivalent to a random walk on an undirected
multigraph with vertices v0 , v1 , . . . , vk having one
edge between vk and vk−1 , (k − 1) edges between
vk−1 and vk−2 , (k −1)(k −2) edges between vk−2
and vk−3 , and in general (k − 1)(k − 2) · · · t edges
between vt and vt−1 for 1 ≤ t ≤ k−1. In addition,
node v0 has (n − 1) · (k − 1)! edges in self-loops
since the probability p0,0 is at least (n − 1)/n.
Therefore, since the stationary distribution of an
undirected random walk is proportional to the degree of each node [25], we have that
πk ≤ 1/(n · (k − 1)!) < 1/k!.
Lastly, since the expected time hkk between consecutive visits to node vk satisfies hkk = 1/πk by
the Fundamental Theorem of Markov Chains, the
expected time h0k to reach vk from v0 is at least
1/πk as well. So, the expected time to reach vk
is at least k! which is superpolynomial in n for
k = log(n) (or even k = ω(log n/ log log n)).
Finally, the fact that the expected time to reach
position k is superpolynomial in n implies that the
probability of reaching position k within a polynomial number of time steps is less than 1/poly(n):
specifically, if the walk has probability p of reaching position k in T time steps starting from position 0, then by the Markov property the expected
time to reach position k is at most T /p. Moreover, so long as the walk has not yet reached position k, the cost of the system is Θ(n) = Θ(n ·
OPT/k). Thus, the expected cost of the system within a polynomial number of time steps is
Ω(n · OPT/ log n) as desired.
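The hitting-time bound in this proof can also be checked by direct simulation. The sketch below implements the simplified chain described above (up-probability 1/(t+1), down-probability t/(t+1), and probability 1/n of leaving the origin); the parameters n, k, the step budget, and the number of trials are illustrative.

```python
import random

# Simplified walk from the proof of Theorem 9: position t = number of players
# on the shared edge; from 1 <= t <= k-1 move up with prob 1/(t+1) and down
# with prob t/(t+1); from t = 0 move up with prob 1/n, else stay.

def reaches_k(n, k, max_steps, rng):
    t = 0
    for _ in range(max_steps):
        if t == 0:
            t = 1 if rng.random() < 1.0 / n else 0
        else:
            t = t + 1 if rng.random() < 1.0 / (t + 1) else t - 1
        if t == k:
            return True
    return False

def estimate(n=1000, k=10, max_steps=50_000, trials=100, seed=0):
    rng = random.Random(seed)
    hits = sum(reaches_k(n, k, max_steps, rng) for _ in range(trials))
    return hits / trials

print(estimate())   # stays near 0: the walk essentially never reaches position k
```

With k = 10 the stationary-distribution argument already gives an expected hitting time on the order of n · (k − 1)!, so essentially no trial reaches position k within this step budget.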
B Hoeffding-Azuma

For completeness we present here the Hoeffding-Azuma concentration bound for supermartingales.

Theorem 10 Suppose X_0, X_1, X_2, . . . is a supermartingale, namely that

E[X_k | X_1, . . . , X_{k-1}] ≤ X_{k-1}

for all k. Suppose also that for all k we have

|X_k - X_{k-1}| ≤ C.

Then, for any λ > 0 and any n ≥ 1,

Pr[X_n ≥ X_0 + λ] ≤ e^{-λ^2 / (2 C^2 n)}.
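As a quick numerical illustration of Theorem 10, the sketch below compares the empirical tail of a simple bounded-increment supermartingale with the Hoeffding-Azuma bound; the random walk used here (±1 steps with a slight downward drift) and the parameter choices are illustrative, not taken from the paper.

```python
import math
import random

# Empirical check of Theorem 10 on a simple supermartingale: X_k is a sum of
# independent +1/-1 steps with a slight downward drift, so
# E[X_k | history] <= X_{k-1} and |X_k - X_{k-1}| <= C = 1.

def tail_probability(n=1000, lam=80.0, trials=5000, seed=0):
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        x = 0.0
        for _ in range(n):
            x += 1.0 if rng.random() < 0.49 else -1.0   # mean step -0.02 <= 0
        if x >= lam:                                     # event {X_n >= X_0 + lam}
            count += 1
    return count / trials

n, lam, C = 1000, 80.0, 1.0
print("empirical tail:        ", tail_probability(n, lam))
print("Hoeffding-Azuma bound: ", math.exp(-lam**2 / (2 * C**2 * n)))
```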