A Simple Learning Rule in Games and Its Convergence to
Pure-Strategy Nash Equilibria
Siddharth Pal and Richard J. La
Abstract— We propose a simple learning rule in games. The
proposed rule only requires that (i) if there exists at least one
strictly better reply, an agent switches its action to each strictly
better reply with positive probability or stays with the same
action (with positive probability), and (ii) when there is no
strictly better reply, the agent either stays with the previous
action or switches to another action that yields the same payoff.
We first show that some existing algorithms (or simple
modifications thereof) are special cases of our proposed algorithm.
Secondly, we demonstrate that this intuitive rule guarantees
almost sure convergence to a pure-strategy Nash equilibrium in
a large class of games we call generalized weakly acyclic games.
Finally, we show that the probability that the action profile
does not converge to a pure-strategy Nash equilibrium decreases
geometrically fast in the aforementioned class of games.
I. INTRODUCTION
There has been an increasing interest in game theory as
a tool to solve complex engineering problems that involve
interactions among (many) distributed agents or subsystems.
As these agents are expected to interact many times over
time, researchers have proposed a variety of learning rules with
desired properties. One example of such desired properties
is convergence to some form of equilibrium. By now, there
exist many different types of learning rules, e.g., [2], [3], [4],
[5], [7], [8], [10], [11], [12], [15].
We propose a simple learning rule in (repeated) games.
The proposed rule only requires the following: (i) if there is
no strictly better reply, the agent stays with the same action,1
and (ii) if there exists at least one strictly better reply, it
switches to each strictly better reply with positive probability
and also sticks with the previous action with positive probability (called inertia). This inertia may, for instance, model
the scenario where an agent waits at least one more period
and tries the same action before switching/committing to a
strictly better reply.
We believe that the proposed rule is very natural; it
only requires the agents to choose actions among the better
replies, making sure that all strictly better replies get a chance
to be played along with the previous action. Once an agent
observes that there are no strictly better replies and only
the played action yields the highest payoff, it stays with it.
However, if the agent finds that there are no strictly better
replies but several actions offer the same highest payoff,
it could play the same action as before, hence maintaining
the status quo, or pick any new action among the best replies.
Moreover, the proposed rule is computationally inexpensive,
as it demands only simple comparisons.
*This work was supported in part by the National Institute of Standards and
Technology and the National Science Foundation.
The authors are with the Department of Electrical and Computer Engineering and
the Institute for Systems Research, University of Maryland, College Park,
MD 20742. Email: {spal, hyongla}@umd.edu.
1 In a slightly modified version of the original learning rule, this can be
relaxed to the requirement that the agent chooses any of the best replies that
yield the same payoff as the current action.
We first show that some existing learning rules or their
minor modifications are special cases of our proposed rule.
Secondly, we demonstrate that the proposed rule guarantees
almost sure convergence to a pure-strategy Nash equilibrium
(PSNE) in a broad class of games, which we call generalized
weakly acyclic games (GWAGs). This class of games includes the well-known
weakly acyclic games (WAGs), which have received much attention recently (e.g., [1], [7]). We also
provide an example of a GWAG that is not a WAG, thereby
illustrating that the class of GWAGs is strictly larger than
the set of WAGs.
The goal of our study is two-fold: (a) We aim to identify a
simple, generic learning rule for agents in distributed settings
and to demonstrate that even a simple, intuitive learning rule
can ensure convergence to a PSNE in a large class of games.
(b) In addition to proving convergence to a PSNE, we show
that the probability that the proposed rule does not converge
to a PSNE decreases geometrically fast.
The rest of the paper is organized as follows. Section II
provides a short summary of studies that are closely related to
ours. Section III describes the strategic-form repeated game
setting we adopt, defines GWAGs, and offers an example
of a GWAG that is not a WAG. We describe both the proposed
rule and two special cases in Section IV. The main results
are presented in Section V. We conclude in Section VI.
II. RELATED LITERATURE
There is already a large volume of literature on learning
in games, and some of the earlier models and results can be
found in, for instance, a manuscript by Fudenberg and
Levine [2]. Learning rules in games in general aim to help
agents learn equilibrium conditions in games, oftentimes
with special structure, e.g., identical interest games, potential
games (PGs), WAGs, congestion games, just to name a few.
While a PSNE is not guaranteed to exist in an arbitrary
game, it is shown to exist for the class of PGs [16]. This
led to further research in this field with problems being
formulated as PGs. For instance, Arslan et al. [1] model the
autonomous vehicle-target assignment problem as a PG and
discuss procedures for vehicle utility design. They
present two learning rules and show their convergence to
PSNEs.
In [9], Marden et al. study large-scale games with many
players and large strategy spaces. They generalize the notion
of PGs and propose a learning rule that ensures convergence
to a PSNE in the class of games they consider, with applications to congestion games.
The WAGs were identified and first studied in a systematic
fashion in [17]. Since then, there has been considerable
interest in such games. Marden et al. [6] establish the relations between cooperative control problems (e.g., consensus
problem) and game theoretic models; they discuss the better
reply with inertia dynamics (described in Section IV-B),
which motivate our proposed rule, and apply it to a class
of games, which they call sometimes weakly acyclic games,
to address time-varying objective functions and action sets.
Another class of learning rules is based on regrets, which
capture the additional payoff an agent could have received
by playing a different action. There are already many regret-based learning rules, e.g., [3], [4], [5], which guarantee
convergence to either Nash equilibria or correlated equilibria
in an appropriate sense.
In [7], Marden et al. propose regret-based dynamics, which
achieve almost sure convergence to a strict PSNE in WAGs,
and apply them to distributed traffic routing problems. A
special case of our proposed rule (Section IV-B) is largely
inspired by the regret-based dynamics proposed in [7].
However, our proposed rule ensures almost sure convergence to a
PSNE even when there is no strict PSNE in a broader class
of GWAGs (which are defined in the following section).
We also note that there are other payoff-based learning
rules with provable convergence to Nash equilibria (with high
probability) in WAGs, e.g., [8], [10].
III. LANGUAGE OF GAME THEORY
In this section, we first describe the strategic-form repeated
games we will adopt for analysis. Then, we define generalized weakly acyclic games, which include WAGs as special
cases.
A. Strategic-form repeated games
Finite stage game: Let [n] := {1, 2, . . . , n} be the set of
agents of a game G. The pure action space of agent i ∈ [n]
is denoted by Ai = {1, 2, . . . , Ai}, where Ai is the number of
available pure actions for agent i, and the joint action space
of all agents is denoted by A := ∏_{i ∈ [n]} Ai.
We assume Ai < ∞ for all i ∈ [n]. The payoff function of
agent i is given by Ui : A → IR.
An agent i ∈ [n] chooses its action according to a probability distribution pi ∈ ∆(Ai ), where ∆(Ai ) denotes the
probability simplex over Ai . If the probability distribution pi
is degenerate, it is called a pure strategy. Otherwise, agent i
is said to play a mixed strategy.
Given an action profile a = (a1 , a2 , . . . , an ) ∈ A, a−i
denotes the action profile of all the agents other than agent
i, i.e., a−i = (a1 , . . . , ai−1 , ai+1 , . . . , an ). An action profile
a⋆ ∈ A is a PSNE if, for every agent i ∈ [n],
Ui(a⋆i, a⋆−i) ≥ Ui(ai, a⋆−i) for all ai ∈ Ai \ {a⋆i}.   (1)
The PSNE is said to be strict if the inequality in (1) is strict
for all agents i ∈ [n]. Otherwise, it is called a weak PSNE.
Repeated game: In an (infinitely) repeated game, the
above stage game is repeated at each time k ∈ IN :=
{1, 2, . . .}. At time k, agent i ∈ [n] chooses its action ai (k)
according to a mixed strategy pi (k) ∈ ∆(Ai ).
We denote the action profile chosen by the agents at time
k by a(k) = (ai (k); i ∈ [n]). Based on the selected action
profile a(k), each agent i ∈ [n] receives a payoff Ui (a(k))
at time k.
The agents are allowed to adapt their strategies based on
the history of payoffs. In this paper, we first propose a simple
strategy update rule for the agents, which we believe is very
natural in many settings. Then, we demonstrate that when
all agents adopt the rule, the action profile of the agents,
namely a(k), converges to a PSNE of the stage game with
probability 1 as k → ∞ in a large class of games described
in the following subsection.
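As a concrete illustration of this protocol (ours, not part of the paper), the sketch below simulates the repeated game for a fixed horizon in Python; the payoff table, the mixed strategies, and the update_strategy hook are placeholder names, with the hook standing in for a learning rule such as the ones introduced in Section IV.

    import numpy as np

    def play_repeated_game(payoffs, strategies, update_strategy, horizon=100, rng=None):
        """Simulate the repeated game described above.

        payoffs[i][a]   : U_i(a) for a joint action profile a (a tuple)
        strategies[i]   : mixed strategy p_i, an array over A_i
        update_strategy : hook (i, a, u, p_i) -> new p_i, standing in for a learning rule
        """
        rng = rng or np.random.default_rng()
        n = len(strategies)
        a = None
        for k in range(1, horizon + 1):
            # each agent i draws a_i(k) from its current mixed strategy p_i(k)
            a = tuple(int(rng.choice(len(p), p=p)) for p in strategies)
            u = [payoffs[i][a] for i in range(n)]     # realized payoffs U_i(a(k))
            strategies = [update_strategy(i, a, u, strategies[i]) for i in range(n)]
        return a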
B. Generalized weakly acyclic games
In order to define the GWAGs, we first need to introduce
the notion of generalized better reply paths (GBRPs). A
GBRP is a sequence of action profiles (a^1, a^2, . . . , a^L) such
that, for every 1 ≤ ℓ ≤ L − 1, there exists a set of agents
Iℓ ⊆ [n] that satisfies
i. a^ℓ_i ≠ a^{ℓ+1}_i and Ui(a^ℓ) < Ui(a^{ℓ+1}_i, a^ℓ_{−i}) for all i ∈ Iℓ;
ii. a^ℓ_i = a^{ℓ+1}_i for all i ∈ [n] \ Iℓ.
These conditions mean that a GBRP consists of transitions
from one action profile to another action profile, in each
of which a set of agents that can achieve a higher payoff
via unilateral deviation switch their actions simultaneously,
while the remaining agents stay with their previous actions.
A better reply path (BRP) used to define WAGs [6], [7] is a
special case of GBRPs with |Iℓ| = 1 for all ℓ = 1, 2, . . . , L − 1.
Definition 3.1: A game is a GWAG if (i) the set of PSNEs
is nonempty and (ii) for every action profile a∗ ∈ A, there
exists a GBRP (a^1, a^2, . . . , a^L) such that a^1 = a∗ and a^L =
a^{NE} for some PSNE a^{NE}.
It is clear from the definition that WAGs are special cases
of GWAGs where only BRPs are allowed, i.e., only a single
agent is allowed to switch or deviate at a time. Due to this
constraint, we suspect that there is a large class of games that
are GWAGs, but not WAGs as illustrated by the following
simple game.
Example of a GWAG that is not a WAG: Consider a
three-player game shown in Fig. 1 in normal form. All three
players have the same action space {0, 1}. The unique
PSNE of the game is a^{NE} = (1, 1, 1). The solid red arrows
in Fig. 1(b) indicate all possible unilateral deviations that
would improve the payoff of the agent that deviates. From
the figure, we can verify that there does not exist a BRP from
any action profile that is not the PSNE or a† = (0, 1, 1); the
PSNE is reachable only from a† , and it is not possible to
reach either the PSNE or a† from any other action profile.
For this reason, this game is not a WAG.
(a) The stage game in normal form; each cell lists the payoffs (U1, U2, U3) at the action profile (a1, a2, a3):

  a3 = 0:             a1 = 0       a1 = 1
         a2 = 0      (5, 5, 5)    (4, 8, 5)
         a2 = 1      (5, 7, 5)    (6, 7, 5)

  a3 = 1:             a1 = 0       a1 = 1
         a2 = 0      (0, 0, 6)    (1, 10, 0)
         a2 = 1      (0, 0, 0)    (10, 10, 5)

(b) [A directed graph over the eight action profiles, with solid arrows marking all payoff-improving unilateral deviations and a second arrow style marking payoff-improving simultaneous deviations by multiple players; not reproduced here.]

Fig. 1. An example of a GWAG that is not a WAG. (a) Game in normal form, (b) all possible unilateral deviations and simultaneous deviations.
On the other hand, this game is a GWAG. To see this,
note that both players 2 and 3 have an incentive to deviate
at a∗ = (0, 0, 0). Therefore, from the action profile a∗, we can
find a GBRP given by (a^1, a^2, a^3) = (a∗, a†, a^{NE}) with
I1 = {2, 3} and I2 = {1}. Since it is possible to reach
a∗ from the remaining action profiles, we can construct a
GBRP from them as well. Therefore, this example illustrates
that the GWAGs form a strictly larger class of games
than the WAGs.
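To make the example concrete, the following Python sketch (ours, not from the paper) encodes the payoffs as read off Fig. 1(a) and searches for better reply paths by brute force; with these payoffs, a BRP to the PSNE exists only from (0, 1, 1) and from the PSNE itself, whereas a GBRP to the PSNE exists from every action profile. The payoff table and all names in the sketch should be treated as illustrative assumptions.

    from itertools import product, combinations

    U = {  # action profile (a1, a2, a3) -> payoff vector (U1, U2, U3), as read off Fig. 1(a)
        (0, 0, 0): (5, 5, 5),  (1, 0, 0): (4, 8, 5),
        (0, 1, 0): (5, 7, 5),  (1, 1, 0): (6, 7, 5),
        (0, 0, 1): (0, 0, 6),  (1, 0, 1): (1, 10, 0),
        (0, 1, 1): (0, 0, 0),  (1, 1, 1): (10, 10, 5),
    }
    N = 3
    PROFILES = list(product((0, 1), repeat=N))

    def improves(a, i):
        """True if agent i strictly gains by unilaterally switching its (binary) action at a."""
        b = list(a)
        b[i] = 1 - b[i]
        return U[tuple(b)][i] > U[a][i]

    def successors(a, single_agent_only):
        """Action profiles reachable from a in one (G)BRP transition."""
        deviators = [i for i in range(N) if improves(a, i)]
        sizes = [1] if single_agent_only else range(1, len(deviators) + 1)
        succ = []
        for r in sizes:
            for group in combinations(deviators, r):
                b = list(a)
                for i in group:
                    b[i] = 1 - b[i]          # all agents in the group switch simultaneously
                succ.append(tuple(b))
        return succ

    def reaches_psne(a, single_agent_only, psne=(1, 1, 1)):
        """Breadth-first search for a (G)BRP from a to the PSNE."""
        seen, frontier = {a}, [a]
        while frontier:
            if psne in frontier:
                return True
            nxt = []
            for b in frontier:
                for c in successors(b, single_agent_only):
                    if c not in seen:
                        seen.add(c)
                        nxt.append(c)
            frontier = nxt
        return False

    for a in PROFILES:
        print(a, "BRP to PSNE:", reaches_psne(a, True),
              "GBRP to PSNE:", reaches_psne(a, False))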
IV. PROPOSED ALGORITHM
In this section, we describe the proposed rule for updating
agents’ actions. The convergence property of the proposed
rule in the GWAGs is established in the following section.
A. Generalized Better Reply Path Algorithm (GBRPA)
We assume that, at time k, agent i ∈ [n] observes the
payoff vector Ui (k) = (Ui (ai , a−i (k)); ai ∈ Ai ).2 A
similar assumption is also made in [4], [6], [7]. For each
agent i ∈ [n], we define two mappings BRi : A → 2^{Ai} and
IRi : A → 2^{Ai}, where
BRi(a) = {a′i ∈ Ai | Ui(a′i, a−i) > Ui(a)}, and
IRi(a) = {a′i ∈ Ai | Ui(a′i, a−i) = Ui(a)}.
Clearly, BRi (a) is the set of strictly better replies for agent
i given the action profile of the other agents, namely a−i .
Note that, for all a ∈ A, the set IRi (a) is nonempty because
it always contains ai .
Suppose that ε and ε̄ are two constants satisfying
0 < ε ≤ ε̄ < 1. Under the proposed rule, each agent makes
a decision at time k + 1 ∈ IN on the basis of (i) a(k) and
(ii) BRi(a(k)) and IRi(a(k)), independently of each other, as follows:
2 As we will see shortly, this assumption can be relaxed to a less
stringent one: (i) the agent sees its payoff Ui (a(k)) and (ii)
it can determine if Ui (ai , a−i (k)) >, =, or < Ui (a(k)) for all ai ∈ Ai .
GBRPA-I Algorithm:
• if BRi(a(k)) = ∅ and |IRi(a(k))| = 1
  – choose ai(k + 1) = ai(k);
• elseif BRi(a(k)) = ∅ and |IRi(a(k))| > 1
  – choose ai(k + 1) = ai(k);   (∗)
• else (i.e., BRi(a(k)) ≠ ∅)
  – choose ai(k + 1) = ai with some probability β(ai; a(k)) ∈ [ε, ε̄] for each ai ∈ BRi(a(k)), and ai(k + 1) = ai(k) with the remaining probability.
We assume
  max_{i∈[n]} max_{a∗∈A} Σ_{ai ∈ BRi(a∗)} β(ai; a∗) =: µ < 1.
Note that when BRi(a(k)) = ∅, agent i stays with the
action ai(k). On the other hand, if BRi(a(k)) ≠ ∅, i.e.,
there exists a strictly better reply, agent i (a) picks each of
the strictly better replies in BRi(a(k)) with probability at
least ε > 0 and (b) continues to play the same action ai(k)
at time k + 1 with probability at least 1 − µ > 0.
In the GBRPA-I rule, the same action ai (k) is chosen at
time k + 1 whenever BRi (a(k)) = ∅. In another slightly
modified rule, which we call the GBRPA-II rule, we change
(∗) to the following:
– choose each action ai ∈ IRi(a(k)) with probability
at least ε⋆ > 0 (such that only actions in IRi(a(k))
are selected) and continue to play it until time k⋆ :=
inf{ℓ > k | BRi(a(ℓ)) ≠ ∅}.
These rules are quite intuitive; under these rules, an agent
picks every strictly better reply with positive probability
(assuming that the actions of other agents remain the same).
However, if no strictly better reply exists, the agent either
continues playing the same action (GBRPA-I) or selects an
action that yields the same payoff as the last action ai (k)
(GBRPA-II).
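A minimal sketch (ours) of one GBRPA-I update for a single agent is given below; it assumes access to the payoff row Ui(·, a−i(k)) through a payoff callback and fixes one particular admissible choice of β(ai; a(k)) inside [ε, ε̄]. The names eps_lo and eps_hi are placeholders for ε and ε̄.

    import random

    def gbrpa1_step(i, a, payoff, actions, eps_lo=0.05, eps_hi=0.3):
        """One GBRPA-I update for agent i at the joint action profile a.

        payoff(i, ai, a): agent i's payoff when it plays ai against a_{-i}
        actions[i]      : agent i's action set A_i
        """
        current = a[i]
        u_now = payoff(i, current, a)
        better = [ai for ai in actions[i] if payoff(i, ai, a) > u_now]   # BR_i(a(k))
        if not better:
            return current                    # no strictly better reply: keep a_i(k)
        # one admissible choice of beta: equal mass on each strictly better reply,
        # capped so that the total switching probability (mu) stays below 1
        beta = min(eps_hi, 0.9 / len(better))
        assert beta >= eps_lo, "eps_lo is too large for this many better replies"
        r, acc = random.random(), 0.0
        for ai in better:
            acc += beta
            if r < acc:
                return ai
        return current                        # inertia: stay with probability 1 - mu

Under GBRPA-II, only the no-better-reply branch changes: when |IRi(a(k))| > 1, the agent would instead pick an action from IRi(a(k)), each with probability at least ε⋆, and keep playing it until a strictly better reply appears.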
B. Special cases of GBRPA rules
The previous subsection presents the generic GBRPA rules
that can be adapted in different forms. In this subsection, we
describe two special cases. The first algorithm is a special
case of GBRPA-I and converges to a PSNE almost surely in
WAGs [6]. The second algorithm is a special case of GBRPA-II, in which the regrets are used to determine the actions to be
played. In particular, the second algorithm can be viewed as
a minor modification of the regret-based algorithm proposed
by Marden et al. [7], which is shown to converge to a strict
PSNE (assuming one exists) almost surely in WAGs.
a) Better reply with inertia dynamics in [6]: At each
time k ∈ IN, each agent i ∈ [n] selects an action according
to the following strategy:
• if BRi (a(k)) = ∅
– ai (k + 1) = ai (k);
• else (i.e., BRi(a(k)) ≠ ∅)
– choose ai (k + 1) = ai (k) with probability αi (k)
and, for every a∗i ∈ BRi (a(k)), ai (k + 1) = a∗i
with probability (1 − αi (k))/|BRi (a(k))|;
The constant αi (k) is referred to as the agent’s inertia at time
k and is assumed to satisfy 0 < αmin < αi (k) < αmax < 1
for all k ∈ IN.
Observe that when there is no strictly better reply, the
agent stays with its previous action. Otherwise, the agent
chooses every strictly better reply with the same probability
and the previous action with probability αi(k).
One can verify that this algorithm is a special case of GBRPA-I
with ε = (1 − αmax)/max_{i∈[n]} Ai, ε̄ = 1 − αmin, and
1 − µ ≥ αmin > 0.
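For comparison, a sketch (ours) of the better reply with inertia dynamics in the same style; alpha plays the role of the inertia αi(k), and the uniform split over BRi(a(k)) realizes the (1 − αi(k))/|BRi(a(k))| rule stated above.

    import random

    def better_reply_with_inertia_step(i, a, payoff, actions, alpha=0.5):
        """One step of the better reply with inertia dynamics of [6] for agent i."""
        u_now = payoff(i, a[i], a)
        better = [ai for ai in actions[i] if payoff(i, ai, a) > u_now]   # BR_i(a(k))
        if not better:
            return a[i]                       # no strictly better reply: stay put
        if random.random() < alpha:
            return a[i]                       # inertia: keep the previous action
        return random.choice(better)          # uniform over the strictly better replies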
b) Modified regret-based dynamics with inertia
(MRBDI): The MRBDI consists of three key components:
The first updates the regrets maintained by each agent,
based on the previous regrets and a new payoff vector.
The second specifies how regrets are used to compute the
probability with which each action will be chosen by an
agent without inertia, i.e., selection of the previous action.
The third determines the actual probabilities with inertia.
i. Regret updates – Initially, the regret vector is set to the zero
vector for every agent i ∈ [n], i.e., Ri(1) = (R_i^{ai}(1); ai ∈
Ai) = 0. Starting at time k = 1, each agent updates its
regret vector according to the following rule:
Let us first define the following notation.
• Umax = max(Ui(a); i ∈ [n], a ∈ A), i.e., the largest
possible payoff among all agents over all action profiles;
• Umin = min(Ui(a); i ∈ [n], a ∈ A), i.e., the smallest
possible payoff among all agents over all action profiles;
• ∆max = Umax − Umin .
For each k ∈ IN, denote the payoff difference vector
of agent i at time k as ∆Ui (k) = (Ui (ai , a−i (k)) −
Ui (a(k)); ai ∈ Ai ). Initially, the payoff difference vector
∆Ui (0) = λ · 1, where λ ∈ (−∆max , ∆max ) and 1 =
(1, 1, . . . , 1).
Here, we describe the simplest form of regret update
rule, but it can be modified in different manners, including
discounting of the regret vector similar to that in [7]. Suppose
that c is some positive constant.
Regret update rule:
• if ∆Ui(k) ≠ ∆Ui(k − 1)
  – set R̃i(k + 1) = ∆Ui(k);
  – if R̃_i^{ai}(k + 1) = 0 for all ai ∈ IRi(a(k)) and R̃_i^{ai}(k + 1) < 0 for all ai ∈ Ai \ IRi(a(k))
    ∗ choose an action a∗i ∈ IRi(a(k)) according to a uniform distribution over IRi(a(k)) and set
      R_i^{ai}(k + 1) = c if ai = a∗i, 0 if ai ∈ IRi(a(k)) \ {a∗i}, and R_i^{ai}(k) otherwise;
  – else
    ∗ set Ri(k + 1) = R̃i(k + 1);
• else
  – set Ri(k + 1) = Ri(k);
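The update above can be written compactly as in the following sketch (ours), which uses numpy arrays for Ri(k) and ∆Ui(k); the tolerance handling and variable names are illustrative assumptions.

    import random
    import numpy as np

    def update_regret(R, dU_new, dU_old, c=1.0, tol=1e-12):
        """One regret update of MRBDI in the undiscounted form described above.

        R      : current regret vector R_i(k) over agent i's actions
        dU_new : payoff difference vector Delta U_i(k)
        dU_old : payoff difference vector Delta U_i(k-1)
        Returns R_i(k+1).
        """
        if np.allclose(dU_new, dU_old):
            return np.asarray(R, dtype=float).copy()   # nothing changed: keep the old regrets
        R_tilde = np.asarray(dU_new, dtype=float).copy()
        ir = np.isclose(R_tilde, 0.0, atol=tol)        # actions tied with the current one, IR_i(a(k))
        if np.all(R_tilde[~ir] < 0):                   # no strictly better reply, only ties
            new_R = np.asarray(R, dtype=float).copy()
            new_R[ir] = 0.0
            star = random.choice(np.flatnonzero(ir).tolist())
            new_R[star] = c                            # one tied action, chosen uniformly, gets regret c
            return new_R
        return R_tilde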
ii. Regret-based dynamics – In order to determine the
action to play at the next time k + 1, each agent makes use
of the following mappings. Suppose RB^i : IR^{Ai} → ∆(Ai)
is any continuous function that satisfies the following: Let
x = (xj; j ∈ {1, . . . , Ai}) ∈ IR^{Ai}.
1) If there exists j ∈ {1, . . . , Ai} with xj > 0, then RB^i_j(x) > 0.
2) Suppose that there exists I ⊆ {1, . . . , Ai} such that xj = 0 for all j ∈ I and xj < 0 otherwise. Then,
   RB^i_j(x) = 1/|I| if j ∈ I, and RB^i_j(x) = 0 otherwise.
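One simple choice of RB^i, assumed here purely for illustration, normalizes the positive parts of the regret vector and falls back to a uniform distribution over its zero entries; it satisfies properties 1) and 2) above.

    import numpy as np

    def rb(x, tol=1e-12):
        """A candidate RB^i mapping a regret vector x in IR^{A_i} to Delta(A_i)."""
        x = np.asarray(x, dtype=float)
        pos = np.clip(x, 0.0, None)                # positive parts of the regrets
        if pos.sum() > tol:
            return pos / pos.sum()                 # property 1): positive regret gets positive probability
        # property 2): uniform over the zero-regret actions
        # (assumes some entry of x is zero when none is positive, as in the MRBDI updates)
        zero = np.isclose(x, 0.0, atol=tol)
        return zero.astype(float) / zero.sum()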
iii. Agent action selection – Assume that ξ and ξ̄ are two
constants such that 0 < ξ < ξ̄ < 1. Let ξi = (ξi(k); k ∈ IN)
be a sequence of real numbers satisfying ξi(k) ∈ (ξ, ξ̄) for
all k ∈ IN.
Each agent selects its action at time k + 1 based on (i)
ai(k) and (ii) its regret vector Ri(k + 1) as follows:
Action selection rule: For each ai ∈ Ai, agent i chooses
ai(k + 1) = ai with probability ξi(k + 1) · RB^i_{ai}(Ri(k + 1)) + (1 − ξi(k + 1)) · 1[ai = ai(k)].
Note that the action ai (k) is selected with probability
at least 1 − ξi (k + 1). Hence, there is some inertia to
continue to play the same action. In this sense, ξi (k) can
be viewed as a parameter that represents agent i’s desire to
optimize or increase its payoff at time k. The other actions
are picked with probability that is given by the mapping RB i
as a function of the regret vector Ri (k + 1) multiplied by
ξi (k + 1).
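Putting parts ii and iii together, one step of the action selection can be sketched as follows (ours); rb is a mapping of the kind sketched after part ii, and xi stands for ξi(k + 1).

    import numpy as np

    def mrbdi_select(R_next, a_prev, rb, xi=0.3, rng=None):
        """Sample a_i(k+1) from xi * RB(R_i(k+1)) + (1 - xi) * (point mass at a_i(k))."""
        rng = rng or np.random.default_rng()
        p = xi * np.asarray(rb(R_next), dtype=float)
        p[a_prev] += 1.0 - xi                      # inertia: extra mass on the previous action
        return int(rng.choice(len(p), p=p))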
We denote by R̃i(a∗) the regret vector of agent i when
the action profile is a∗ ∈ A. For each i ∈ [n] and a∗ ∈ A,
define A⋆i : A → 2^{Ai}, where
A⋆i(a∗) := {ai ∈ Ai | R̃_i^{ai}(a∗) > 0}.
Let
τ̄ := max_{i∈[n]} max_{a∗∈A} max_{ai∈Ai} RB^i_{ai}(R̃i(a∗)), and
τ := min_{i∈[n]} min_{a∗∈A} min_{ai∈A⋆i(a∗)} RB^i_{ai}(R̃i(a∗)).
Then, one can show that MRBDI is a special case of
GBRPA-II with (i) ε = ξ · τ, (ii) ε̄ = ξ̄ · τ̄, and (iii)
ε⋆ = ξ · (max_{i∈[n]} Ai)^{−1}.
V. MAIN RESULTS
In this section, we state the main results and provide a
sketch of the proof for the convergence property under the
GBRPA-I rule.
Theorem 5.1: Suppose that the game G is a GWAG.
Denote the nonempty set of PSNE(s) by NE. Let a(k),
k ∈ IN, be the sequence of action profiles generated by a
GBRPA (either GBRPA-I or GBRPA-II) rule. Then,
lim_{k→∞} P[a(m) ∈ NE for all m ≥ k] = 1.
In other words, the action profile converges to a PSNE with
probability 1.
We state the following theorem without a proof. Its proof
for the GBRPA-I rule can be obtained from the proof of
Theorem 5.1 provided in the subsequent subsection.
Theorem 5.2: Suppose that the game G is a GWAG.
Then, under a GBRPA (either GBRPA-I or GBRPA-II) rule,
there exist c1 < ∞ and c2 ∈ (0, 1) such that, for all k ∈ IN,
1 − P[a(m) ∈ NE for all m ≥ k] ≤ c1 · (c2)^k.
Note that Theorem 5.2 implies that the probability that
the action profile does not converge to a PSNE decreases
geometrically fast. Hence, we can view − log(c2) as a lower
bound on the convergence rate of the rule.
A. Proof of Theorem 5.1
Due to a space constraint, we only provide the proof for
the GBRPA-I rule. However, the proof for the GBRPA-II
rule only requires minor modifications, and we will provide
some pointers as to how the proof differs for GBRPA-II.
We first state several lemmas that are used in the proof
of Theorem 5.1, which establish the following. First, starting
from any action profile a∗ ∈ A, the probability of following
a GBRP from a∗ to a PSNE is strictly positive. Second, once
the agents reach a PSNE at some time, the probability that
they will stay at the PSNE for good is strictly positive.
Consider a mapping Γ : A∗ → Z+ := {0, 1, . . .}, where
A∗ denotes the Kleene star on A. Given a sequence of ℓ
action profiles, say (a^1, a^2, . . . , a^ℓ), Γ((a^1, a^2, . . . , a^ℓ)) =
ℓ − 1. In other words, the mapping Γ tells us the length of
the sequence of action profiles minus 1, i.e., the number of
transitions in action profiles.
Since we assume that the game G is generalized weakly
acyclic, for every action profile a∗ ∉ NE, there exists
at least one GBRP that starts with a∗ and leads to some
PSNE. For each non-PSNE action profile a∗, we randomly
choose a GBRP with the shortest length according to the
mapping Γ and denote the GBRP by p(a∗). For a PSNE
a^{NE}, p(a^{NE}) = (a^{NE}), i.e., the GBRP only consists of the
PSNE. Let L : A → Z+, where L(a∗) = Γ(p(a∗)) is the
length of the GBRP starting at action profile a∗.
Although it is not necessary, in order to facilitate our proof,
we assume that the GBRPs {p(a∗), a∗ ∈ A} satisfy the
following consistency condition: Suppose that a non-PSNE
action profile a† appears in the GBRP of another action
profile a⋆ (a⋆ ≠ a†), namely p(a⋆). Then, the subsequence
in p(a⋆) that starts with a† is identical to p(a†). Such
GBRPs can be constructed easily in a manner similar to
Dijkstra's algorithm, starting with NE.
Lemma 5.3: Let θ := min(1 − µ, ε⋆) > 0. For every
a∗ ∈ A, we have
P[a(L(a∗) + 1) ∈ NE | a(1) = a∗] ≥ (θ · ε)^{n·L(a∗)}.
Proof: To simplify the notation, we introduce a mapping Φ : A → A, where Φ(a∗) denotes the second action
profile in p(a∗) following a∗. Also, define another mapping
I : A → 2^{[n]}, where I(a∗) gives us the set of agents that
change their actions going from action profile a∗ to Φ(a∗),
i.e., I(a∗) = {i ∈ [n] | a∗i ≠ Φi(a∗)}.
Fix k ∈ IN, and let A(k) := (a(ℓ); ℓ ∈ {1, . . . , k}) denote
the history of action profiles till time k. By the definition of
GBRP in Section III,
Φi(a(k)) ∈ BRi(a(k)) for all i ∈ I(a(k)), and Φi(a(k)) = ai(k) for all i ∉ I(a(k)).   (2)
For every a^{(k)} ∈ A^k, we have
P[a(k + 1) = Φ(a(k)) | A(k) = a^{(k)}]
  = ∏_{i ∈ I(a(k))} P[ai(k + 1) = Φi(a(k)) | A(k) = a^{(k)}] × ∏_{i ∉ I(a(k))} P[ai(k + 1) = ai(k) | A(k) = a^{(k)}]
  ≥ ε^{|I(a(k))|} · θ^{n − |I(a(k))|} ≥ (θ · ε)^n,   (3)
where the first inequality is a consequence of the facts that
(i) each strictly better reply in BRi(a(k)) is selected with
probability at least ε and (ii) ai(k) is chosen at time k + 1 with
probability at least min(1 − µ, ε⋆) = θ according to the
GBRPA rules.
Denote the mth order composition of the mapping Φ by
Φ^m. By the definition of the mapping L and the earlier
assumption on the GBRPs {p(a∗), a∗ ∈ A},
Φ^{L(a∗)}(a∗) ∈ NE for every a∗ ∈ A.
Note that, starting from any non-PSNE a∗, if the
action profile follows the GBRP p(a∗) at each time,
the agents will reach a PSNE at time L(a∗) + 1.
Therefore, we have the following lower bound for
P[a(L(a∗) + 1) ∈ NE | a(1) = a∗]:
P[a(L(a∗) + 1) ∈ NE | a(1) = a∗]
  ≥ P[a(k) = Φ^{k−1}(a∗) for all 1 < k ≤ L(a∗) + 1 | a(1) = a∗]
  = ∏_{k=2}^{L(a∗)+1} P[a(k) = Φ^{k−1}(a∗) | a(ℓ) = Φ^{ℓ−1}(a∗) for all 1 ≤ ℓ < k].   (4)
Using the lower bound in (3) for each conditional probability
in (4), we obtain the desired inequality
P[a(L(a∗) + 1) ∈ NE | a(1) = a∗] ≥ (θ · ε)^{n·L(a∗)}.
We state the following observation as a lemma. It follows
directly from the description of the GBRPA-I rule, in particular the first two cases of the rule. For the GBRPA-II rule, we
can prove a strictly positive lower bound for the probability.
Lemma 5.4: For every a^{NE} ∈ NE and k ∈ IN, we have
P[a(m) = a^{NE} for all m ≥ k + 1 | a(k) = a^{NE}] = 1.   (5)
Let Lmax := max_{a∗∈A} L(a∗). Lemmas 5.3 and 5.4
together lead to the following corollary, which is essential
for proving Theorem 5.1.
Corollary 5.5: For every a∗ ∈ A,
P[a(k) ∈ NE for all k ≥ Lmax + 1 | a(1) = a∗] ≥ (θ · ε)^{n·Lmax}.
Proof: First, note that, for all non-PSNE a∗, we have
P[a(k) ∈ NE for all k ≥ Lmax + 1 | a(1) = a∗]
  ≥ P[a(k) ∈ NE for all k ≥ L(a∗) + 1 | a(1) = a∗]
  = P[a(L(a∗) + 1) ∈ NE | a(1) = a∗],
where the equality follows from Lemma 5.4. Now, the
corollary is a consequence of Lemma 5.3 and the inequality
(θ · ε)^{n·L(a∗)} ≥ (θ · ε)^{n·Lmax}.
The corollary also holds for GBRPA-II with a slightly
different positive lower bound.
Proof of Theorem 5.1: First, define ρ := (θ · ε)^{n·Lmax}.
Because θ > 0, ε > 0, and Lmax < ∞, we have ρ > 0.
We define a sequence of random variables {Nm, m ∈
Z+} as follows: First, let N0 = 1. We define the remaining
random variables recursively: for every m ∈ IN,
Nm = inf{k > Nm−1 | Φ(a(k − 1)) ≠ a(k)},
with the usual convention that the infimum of the empty set
is ∞. In other words, Nm is the first time after Nm−1 that the
current action profile deviates from the GBRP of the previous
action profile.
From the above sequence of random variables Nm, m ∈
Z+, we define the following sequence of inter-deviation
times: Let ∆N0 = 1 and, for each m ∈ IN,
∆Nm := Nm − Nm−1.
Note that the action profile a(k), k ∈ IN, does not converge
to a PSNE if and only if ∆Nm < ∞ for all m ∈ IN or,
equivalently, Nm < ∞ for all m ∈ IN.
For notational simplicity, for each m ∈ IN, we define
E(m) := {∆Nℓ < ∞ for all ℓ = 1, 2, . . . , m} = {Nm < ∞}.
Assume that the initial action profile a(1) is chosen according to some distribution over A. Then, for each m ∈ IN,
P[E(m)] = ∏_{ℓ=1}^{m} P[E(ℓ) | E(ℓ − 1)] = ∏_{ℓ=1}^{m} P[∆Nℓ < ∞ | E(ℓ − 1)].   (6)
We now consider the conditional probabilities in (6). Using
the law of total probability,
P[∆Nℓ < ∞ | E(ℓ − 1)] = Σ_{a∗∈A} P[∆Nℓ < ∞ | E(ℓ − 1), a(Nℓ−1) = a∗] × P[a(Nℓ−1) = a∗ | E(ℓ − 1)].   (7)
We can show that Corollary 5.5 implies that, for all a∗ ∈ A,
P[∆Nℓ < ∞ | E(ℓ − 1), a(Nℓ−1) = a∗] ≤ 1 − ρ.   (8)
Substituting the inequality (8) in (7), we obtain
P[∆Nℓ < ∞ | E(ℓ − 1)] ≤ 1 − ρ.   (9)
Using (9) in (6) and letting m → ∞, we get
lim_{k→∞} P[a(ℓ) ∈ NE for all ℓ ≥ k] = 1 − P[∆Nℓ < ∞ for all ℓ ∈ IN] ≥ 1 − lim_{m→∞} (1 − ρ)^m = 1.
This completes the proof of Theorem 5.1.
VI. CONCLUSIONS
We proposed a new learning rule in games and proved
the almost sure convergence of the action profile to a PSNE in
GWAGs under the proposed rule. In addition, we showed
that the probability that the action profile does not converge
to a PSNE diminishes geometrically fast in GWAGs.
We are currently working to identify a tight lower bound
on the convergence rate of the proposed algorithm. Furthermore, we are interested in computing good bounds for the
expected convergence time. Finally, we are investigating how
we can relax the assumption that the agent knows the entire
payoff vector, including the payoffs for actions that were not
played, at each time.
REFERENCES
[1] G. Arslan, J.R. Marden, and J.S. Shamma, “Autonomous vehicle-target assignment: A game-theoretical formulation,” Journal of Dynamic Systems, Measurement, and Control, 129(5):584-596, Apr. 2007.
[2] D. Fudenberg and D.K. Levine, The Theory of Learning in Games,
The MIT Press, 1998.
[3] F. Germano and G. Lugosi, “Global Nash convergence of Foster and Young’s regret testing,” Games and Economic Behavior, 60(1):135-154, Jul. 2007.
[4] S. Hart and A. Mas-Colell, “A simple adaptive procedure leading to
correlated equilibrium,” Econometrica 68(5):1127-1150, Sep. 2000.
[5] S. Hart and A. Mas-Colell, “A reinforcement procedure leading to
correlated equilibrium,” Economics Essays, ed. by G. Debreu, W.
Neuefeind and W. Trockel, pp.181-200, Springer Berlin Heidelberg,
2001.
[6] J.R. Marden, G. Arslan and J.S. Shamma, “Connections between
cooperative control and potential games illustrated on the consensus
problem,” Proceedings of European Control Conference (ECC), 2007.
[7] J.R. Marden, G. Arslan and J.S. Shamma, “Regret based dynamics:
convergence in weakly acyclic games,” Proceedings of the 6th International Conference on Autonomous Agents and Multiagent Systems
(AAMAS), 2007.
[8] J.R. Marden, H.P. Young, G. Arslan and J.S. Shamma, “Payoff-based
dynamics for multiplayer weakly acyclic games,” SIAM Journal on
Control and Optimization, 48(1):373-396, 2009.
[9] J.R. Marden, G. Arslan and J.S. Shamma, “Joint strategy fictitious play
with inertia for potential games,” IEEE Trans. on Automatic Control,
54(2):208-220, Feb. 2009.
[10] J.R. Marden and J.S. Shamma, “Revisiting log-linear learning: Asynchrony, completeness and payoff-based implementation,” Proceedings
of the 48th Annual Allerton Conference on Communication, Control,
and Computing, Monticello (IL), Oct. 2010.
[11] J.R. Marden, H.P. Young and L.Y. Pao, “Achieving Pareto optimality through distributed learning,” Proceedings of the 51st IEEE Conference on Decision and Control, Maui (HI), Dec. 2014.
[12] A. Menon and J.S. Baras, “Convergence guarantees for a decentralized algorithm achieving Pareto optimality,” Proceedings of the American Control Conference, Jun. 2013.
[13] D. Monderer and L.S. Shapley, “Potential games,” Games and Economic Behavior, 14(1):124-143, 1996.
[14] D. Monderer and L.S. Shapley, “Fictitious play property for games
with identical interests,” Journal of Economic Theory, 68(1):258-265,
Jan. 1996.
[15] B.S.R. Pradelski and H.P. Young, “Learning efficient Nash equilibria
in distributed systems,” Games and Economic Behavior, 75:882-897,
2012.
[16] R.W. Rosenthal, “A class of games possessing pure-strategy Nash
equilibria,” International Journal of Game Theory, 2(1):65-67, 1973.
[17] H.P. Young, “The evolution of conventions,” Econometrica: Journal
of the Econometric Society, 61(1):57-84, Jan. 1993.