Convergence of “Best-response
Dynamics” in Zero-sum Stochastic Games
David Leslie, Steven Perkins, and Zibo Xu
April 3, 2015
Abstract
Given a two-player zero-sum discounted-payoff stochastic game, we introduce three
classes of continuous-time best-response dynamics: stopping-time best-response
dynamics, closed-loop best-response dynamics, and open-loop best-response dynamics. We show the global convergence of the first two classes to the set of
minimax strategy profiles, and the convergence of the last class when the players
are not patient. We also show that the payoffs in a modified closed-loop best-response dynamic converge to the asymptotic value of the zero-sum stochastic
game.
1 Introduction
The continuous-time best-response dynamic is a well-known evolutionary dynamic. It
takes the form of a differential inclusion in which players myopically revise
towards current best responses at a constant rate; see Matsui (1989), Gilboa and Matsui (1991), Hofbauer (1995), and
Balkenborg et al. (2013). A state of the dynamic specifies the strategy profile of
all players, and the frequency of a strategy increases only if it is a best response
to the current state.
The continuous-time best-response dynamic has been analyzed in various classes
of games; see Hofbauer and Sigmund (1998) and Sandholm (2010). In particular,
the convergence of a continuous-time best-response dynamic to the set of Nash
equilibria has been shown in Harris (1994), Hofbauer (1995), and Hofbauer and
Sorin (2006) for two-player zero-sum games, in Harris (1994) for weighted-potential
games, and in Berger (2005) for 2 × n games.
A zero-sum stochastic game with discounted payoff was introduced by Shapley
(1953). The study of non-zero-sum stochastic games builds on zero-sum
stochastic games: by the Folk Theorem, the analysis of the former often requires
the values of related zero-sum stochastic games, which serve as punishments for any
player who deviates from the strategy agreed beforehand; see Dutta (1995).
In the present paper, we introduce several candidates for a continuous-time
best-response dynamic in two-player zero-sum stochastic games with discounted
payoff. We should first point out that it is “tricky” to define a best-response
dynamic in a stochastic game, and indeed no established notion is available in
the literature yet. We construct a few dynamics in the present paper, and the
convergence results depend heavily on the conditions of the dynamic. The key question
is what information, and how much rationality, the players have in the game.
For instance, given the stationary strategy of the opponent, can a player find
her best stationary strategy against it? This innocent-looking question is not easy
to answer, or at least the search takes quite a long time, even if the player is
equipped with a modern computer. Note that the best stationary strategy may not
be a pure strategy, due to the concavity or convexity of the discounted payoff in a
stochastic game. Perhaps a more suitable question for myopic players is whether,
given a stationary strategy profile, a player can compute the discounted
payoff, or whether a referee is available to provide such information.
For the convergence of best-response dynamics, one may wonder why we do not view a
zero-sum stochastic game as a general zero-sum game with a compact and convex
strategy space for each player, and apply results such as those shown in Hofbauer
and Sorin (2006). Our answer is that the transition of states after each stage
in a stochastic game makes the dynamic more complicated than those studied in
Hofbauer and Sorin (2006). In particular, Hofbauer and Sorin (2006) consider
only games whose payoff is concave in player 1’s strategy and convex in
player 2’s strategy. In a stochastic game, however, it can be the opposite,
i.e., convex for player 1 and concave for player 2: player 1’s strategy in the
current state may give her not only a better payoff today but also a better state
position tomorrow.
Another natural attempt is to approximate the stochastic game by a “super”
normal-form game where each player’s strategy is a bounded sequence of actions
in the state games. However, one can only find an approximate value of the game this way.
Furthermore, the stationary minimax strategy in the stochastic game may still
look elusive, as we can only see how players best respond in a truncated stochastic
game. Finally, as the discount factor increases, i.e., the players become
more and more patient, we need a bigger and bigger super normal-form game to
approximate the stochastic game.
For an evolutionary dynamic in a stochastic game, we assume that a player
does not have the full rationality to compute the global best response given the
strategy profiles of all players. However, the player should be able to see the
difference between her state-game payoff and the continuation payoff in that state.
For stopping-time best-response dynamics and closed-loop best-response dynamics
defined in this paper, we specify how the continuation payoff vector evolves in each
state game in the dynamic. We then study how to apply the convergence result
of continuous-time best-response dynamic in normal-form zero-sum games in the
context of evolving state games.
We first follow the original idea in Shapley (1953), define the stopping-time best-response dynamic in a zero-sum stochastic game, and show the convergence of
the dynamic to the minimax strategy profile. This dynamic requires only that both
players play best responses against each other as in a standard continuous-time
best-response dynamic. The continuation payoff vector in each state game is
updated to the payoff vector in that state game only at countably many times, and the
time interval between two consecutive updates is sufficiently long.
The stopping-time best-response dynamic reminds us that, for a state game
with a fixed continuation payoff vector, the best-response dynamic converges to the minimax strategy profile of that zero-sum game; but how can this be applied to learning
in the original stochastic game? To this end, for each state s, we introduce a feedback loop from the current payoff in the state game at s to the continuation payoff
of state s assumed in each state game. We further require that players play best
responses against each other in all state games simultaneously, and the commonly
shared continuation payoff vector approaches the state-game payoff vector as in
a continuous-time fictitious play. As time passes, the continuation payoffs change
more and more slowly, but the players can still adjust their strategies at the same
speed as before. The key to the convergence of the closed-loop best-response dynamic
in a zero-sum stochastic game is simply this difference in adjustment speed between
the best-response dynamics on players’ strategies and the fictitious play on the continuation
payoffs.
In the literature, there are a few papers concerning algorithms for computing
the value of a zero-sum stochastic game with discounted payoff; see, e.g., Vrieze
and Tijs (1982) and Borkar (2002). However, they do not study the problem from the perspective of an evolutionary or learning process. The closed-loop best-response dynamic
instead follows a rudimentary approach found in the real world: when the state is changing very slowly, players may simply best respond in the current state, even if
they know that the future states depend on their current behavior and that the
currently assumed continuation payoffs may not match the payoffs generated in
the state games. Familiar examples are production with consumption of
natural resources and the issue of global warming. In the long run, what one obtains in a state game must match the corresponding continuation payoff, and the
players will finally learn what behavior is best suited and how much payoff can
be sustained in each state.
We can progress further and propose a variant of the closed-loop best-response
dynamic such that the value in each state game converges to the asymptotic value
of the zero-sum stochastic game as the discount factor increases to 1. For this
purpose, we simply make the discount factor change even more slowly than the
continuation payoffs. Note that we can only establish convergence in value here,
not necessarily in stationary strategies. Similarly to Harris (1994), we can
show that the rate of convergence in payoff terms is 1/t for the closed-loop
best-response dynamic and 1/ln t for the discount-factor-converging best-response dynamic.
In the case that the feedback loop is unavailable, we introduce open-loop best-response dynamics, and assume that a referee tells each player her discounted
payoff given the current stationary strategy profile of both players in the zero-sum
stochastic game. In this dynamic, a player is unable to understand the infinitely
long stochastic state-transition process, but naively reduces the stochastic game to
a zero-sum state game with the current discounted payoff vector as the continuation
payoff vector. When players best respond against each other in such a state
game, each player assumes that the other player will use the same
current strategy from stage 2 on, even if she is aware that the opponent is adjusting
her strategy at stage 1 of the stochastic game. We regard the open-loop best-response dynamic as a primary model for studying myopic behavior in state games
that approximate the stochastic game. We are also interested in whether this
dynamic can serve as an alternative algorithm to compute the value of the zero-sum stochastic game. In Section 3.3, we show that when the discount factor is
not too big, i.e., when the players are not patient, the open-loop best-response
dynamic converges to the set of stationary minimax strategy profiles.
2 The Model
We begin by reviewing two-player normal-form zero-sum games.
2.1 Normal-form Zero-sum Games
In a zero-sum game G where player 1’s and player 2’s pure strategy sets are A1 and A2,
respectively, the (a1, a2) element u(a1, a2) of the payoff matrix denotes the payoff
to player 1 when player 1 plays a1 and player 2 plays a2. We can then linearly extend
the payoff function of player 1 to mixed strategies, i.e. u(x1 , x2 ) is defined for any
x1 ∈ ∆(A1 ) and x2 ∈ ∆(A2 ). Recall the value of the game is
v(G) = \max_{x^1 \in \Delta(A^1)} \min_{x^2 \in \Delta(A^2)} u(x^1, x^2) = \min_{x^2 \in \Delta(A^2)} \max_{x^1 \in \Delta(A^1)} u(x^1, x^2).
A minimax strategy of player 1 guarantees her a payoff of no less than v(G), regardless
of the strategy of player 2; similarly, a minimax strategy of player 2 guarantees that the
payoff to player 1 is no more than v(G). A minimax strategy profile is also a Nash equilibrium
in G.
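Although the analysis below is entirely in continuous time, it may help to see how v(G) and a minimax strategy can be computed in practice. The following minimal Python sketch is an illustration added here, not part of the paper’s argument; the function name matrix_game_value, the LP formulation, and the use of scipy.optimize.linprog are assumptions made for the sketch.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(U):
    """Value v(G) and a minimax strategy of player 1 for payoff matrix U (to player 1).

    LP: maximise v subject to  sum_i x[i] * U[i, j] >= v  for every column j,
    with x a probability vector.  Variables are (x, v); linprog minimises, so we
    minimise -v.
    """
    m, n = U.shape
    c = np.concatenate([np.zeros(m), [-1.0]])
    # Row j encodes  v - sum_i x[i] U[i, j] <= 0.
    A_ub = np.hstack([-U.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)   # sum_i x[i] = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]                  # v is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[m], res.x[:m]

# Matching pennies: value 0, minimax strategy (1/2, 1/2).
print(matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```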
2.1.1 Preliminary Results
Lemma 2.1. Given a positive finite number c, if we modify the payoff function u
to u0 with the property |u0 (a1 , a2 ) − u(a1 , a2 )| ≤ c for all (a1 , a2 ) ∈ A1 × A2 , then
for any (mixed) strategy profile (x1 , x2 ), |u0 (x1 , x2 ) − u(x1 , x2 )| ≤ c.
Proof. This follows from the linear property of u.
Lemma 2.2. Given a positive finite number c, if we modify the payoff function u
to u0 with the property |u0 (a1 , a2 ) − u(a1 , a2 )| ≤ c for all (a1 , a2 ) ∈ A1 × A2 , then
|v(G) − v(G0 )| ≤ c, where G0 is the game with the modified payoff function u0 .
Proof. For any minimax strategy profile (x1 , x2 ) in G and any minimax strategy
profile (x̄1 , x̄2 ) in G0 , we have
u(x1 , x̄2 ) ≥ u(x1 , x2 ) and u0 (x̄1 , x̄2 ) ≥ u0 (x1 , x̄2 ).
Thus
u(x^1, x^2) - u'(\bar x^1, \bar x^2) \le u(x^1, \bar x^2) - u'(x^1, \bar x^2) \le \max_{a^1, a^2} |u'(a^1, a^2) - u(a^1, a^2)| \le c,

by Lemma 2.1. Similarly, we can show that

u'(\bar x^1, \bar x^2) - u(x^1, x^2) \le \max_{a^1, a^2} |u'(a^1, a^2) - u(a^1, a^2)| \le c.
2.1.2 Best-response Dynamics in Normal-form Zero-sum Games
In continuous-time best-response dynamics, player 1 revises her strategy according
to the set of current best-response strategies br^1(x^2) := \operatorname{argmax}_{\rho^1 \in \Delta(A^1)} u(\rho^1, x^2); similarly,
for player 2, br^2(x^1) := \operatorname{argmin}_{\rho^2 \in \Delta(A^2)} u(x^1, \rho^2). A best-response dynamic in a normal-form zero-sum game G satisfies

\dot x^i \in br^i(x^{-i}) - x^i, \quad i = 1, 2,    (2.1)

where the dot represents the derivative with respect to time, and we have suppressed
the time argument. Since best-response strategies are in general not unique, this is, rigorously speaking, a differential inclusion, which does not always reduce to a differential equation. In normal-form zero-sum games, the set br^i(x^{-i}) is upper semi-continuous in x^{-i}. Hence, from any initial strategy profile (x^1(0), x^2(0)), a solution
trajectory (x(t))_{t \ge 0} exists, and x(t) is Lipschitz continuous and satisfies (2.1) for almost all t ≥ 0; see Aubin and Cellina (1984).
Given a strategy profile (x^1, x^2), we define

H(x^2) := \max_{y^1 \in \Delta(A^1)} u(y^1, x^2),    (2.2)

and

L(x^1) := \min_{y^2 \in \Delta(A^2)} u(x^1, y^2).    (2.3)

For any t ≥ 0, we define in a solution trajectory (x(t))_{t \ge 0}

w(t) := H(x^2(t)) - L(x^1(t)).    (2.4)
We call w(t) the energy of the dynamic at time t. It is straightforward to see that
|u(x1 (t), x2 (t)) − v(G)| ≤ w(t), ∀t ≥ 0,
(2.5)
and that w(t) = 0 if and only if x1 (t) and x2 (t) are minimax strategies of player
1 and 2, respectively. Thus, at that time t, u(x1 (t), x2 (t)) = v(G).
Harris (1994) and Hofbauer and Sorin (2006) show the following result.
Theorem 2.3. Given a normal-form zero-sum game G, along every solution trajectory (x(t))_{t \ge 0} of (2.1), w(t) satisfies

\dot w(t) = -w(t) \ \text{for almost all } t,    (2.6)

hence

w(t) = e^{-t} w(0).    (2.7)

Thus, every solution trajectory of (2.1) converges to the set of minimax strategy
profiles in G.
Sketch proof:

\dot w = H'(x^2; \dot x^2) - L'(x^1; \dot x^1),

where f'(x^i; \dot x^i) denotes the one-sided directional derivative of f at x^i in the direction \dot x^i. Denote the best-response strategies picked along the solution trajectory at time t by b^1_t \in br^1(x^2_t) and b^2_t \in br^2(x^1_t). Then, by the envelope theorem, we have

\dot w = u(b^1, \dot x^2) - u(\dot x^1, b^2).

From (2.1), it follows that

\dot w = u(b^1, b^2 - x^2) - u(b^1 - x^1, b^2) = -u(b^1, x^2) + u(x^1, b^2) = -w.
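The exponential decay of the energy can also be observed numerically. The sketch below is an illustration added here (the game, step size, and variable names are assumptions): it runs a crude Euler discretisation of (2.1) on matching pennies and prints w(t), which should decay roughly like e^{-t} until the discretisation floor is reached.

```python
import numpy as np

U = np.array([[1.0, -1.0], [-1.0, 1.0]])        # matching pennies, v(G) = 0
rng = np.random.default_rng(0)
x1 = rng.dirichlet(np.ones(U.shape[0]))          # initial mixed strategies
x2 = rng.dirichlet(np.ones(U.shape[1]))
dt, T = 0.01, 8.0

def energy(x1, x2):
    # w = H(x2) - L(x1); maxima/minima over pure actions suffice by linearity.
    return (U @ x2).max() - (x1 @ U).min()

for k in range(int(T / dt)):
    b1 = np.eye(U.shape[0])[np.argmax(U @ x2)]   # a best response of player 1
    b2 = np.eye(U.shape[1])[np.argmin(x1 @ U)]   # a best response of player 2
    x1, x2 = x1 + dt * (b1 - x1), x2 + dt * (b2 - x2)   # Euler step of (2.1)
    if k % 100 == 0:
        print(f"t = {k * dt:4.1f}   w(t) = {energy(x1, x2):.4f}")
```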
Hofbauer and Sorin (2006) have extended this convergence result to continuous
concave-convex zero-sum games, i.e., games in which u(x^1, x^2) is concave in x^1 for each fixed x^2 and
convex in x^2 for each fixed x^1. Barron et al. (2009) further extend the convergence
result to all continuous quasiconcave-quasiconvex zero-sum games, for best-response dynamics on the convex/concave envelopes of the payoff function.
2.2 Zero-sum Stochastic Games
A two-player zero-sum discounted-payoff stochastic game is a tuple Γ = ⟨I, S, A, P, r, ω⟩
constructed as follows.
• Let I = {1, 2} be the set of players.
• Let S be a finite set of states.
• For each player i in state s, Ais denotes the set of finitely many actions. For
each state s, we put the set of action pairs As := A1s × A2s .
• For each pair of states (s, s0 ) and each action pair a ∈ As , we define Ps,s0 (a)
to be the transition probability from state s to state s0 given the action pair
a.
• We define rs (·) to be the stage payoff function for player 1. That is, when
the process is at a state s, rs (a) is the stage payoff to player 1 for the action
pair a ∈ As . Note that, in a zero-sum game, player 2 always receives stage
payoff −rs (a).
• ω is a discount factor that affects the importance of future stage payoffs
relative to current stage payoff.
In any state s, player i can play a mixed action πsi ∈ ∆(Ais ): πsi (ai ) denotes
the probability that when in state s, player i selects action ai ∈ Ais . In this paper,
we only consider stationary strategies for both players. A stationary strategy
π i ∈ ∆i := ×s∈S ∆(Ais ) of player i specifies for each state s a mixed action πsi to
be played whenever the state is s. For convenience, we may write ∆is for ∆(Ais ).
We denote a strategy profile by π = (π 1 , π 2 ) = ((πs1 )s∈S , (πs2 )s∈S ), and the set of
strategy profiles by ∆ := ∆1 × ∆2 . Given a strategy profile π, for any state s, we
may write
r_s(\pi) = r_s(\pi^1_s, \pi^2_s) = \sum_{a \in A_s} \pi^1_s(a^1)\, \pi^2_s(a^2)\, r_s(a),
and similar treatment applies for the transition probability Ps,s0 (π) from state s
to s0 . We can then define the expected discounted payoff for player 1 starting in
state s under the strategy profile π as
u_s(\pi) := E\Bigl[(1 - \omega) \sum_{n=0}^{\infty} \omega^n\, r_{s_n}(\pi) \,\Big|\, s_0 = s\Bigr],    (2.8)
where {sn }n∈N is a stochastic process representing the state of the process at each
iteration, and (1 − ω) is to normalize the discounted payoff. Of course, player 2
has an expected discounted payoff −us (π). Define
b_1 := \min_{s \in S,\, a \in A_s} r_s(a), \quad b_2 := \max_{s \in S,\, a \in A_s} r_s(a), \quad \text{and} \quad B := [b_1, b_2].    (2.9)
Then B is the set of achievable discounted payoffs in Γ, and us (π) is in B for any
strategy profile π starting at any state s.
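Because us(π) satisfies the linear system u = (1 − ω) r(π) + ω P(π) u, it can be computed by a single linear solve. The Python sketch below is an illustration added here; the data layout (a payoff matrix r[s] and transition matrices P[s][s2] indexed by action pairs) is an assumption, not the paper’s notation.

```python
import numpy as np

def discounted_payoffs(r, P, pi1, pi2, omega):
    """u_s(pi) of (2.8): r[s] is an |A1_s| x |A2_s| payoff matrix, P[s][s2] the
    matching matrix of transition probabilities from s to s2, and pi1[s], pi2[s]
    are the mixed actions of the stationary profile pi at state s."""
    S = len(r)
    r_pi = np.array([pi1[s] @ r[s] @ pi2[s] for s in range(S)])
    P_pi = np.array([[pi1[s] @ P[s][s2] @ pi2[s] for s2 in range(S)]
                     for s in range(S)])
    return np.linalg.solve(np.eye(S) - omega * P_pi, (1 - omega) * r_pi)

# Two states, two actions each; every action pair moves to either state with
# probability 1/2 (numbers are arbitrary).
r = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[2.0, -1.0], [-1.0, 2.0]])]
half = np.full((2, 2), 0.5)
P = [[half, half], [half, half]]                 # P[s][s2][a1, a2]
pi = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]
print(discounted_payoffs(r, P, pi, pi, omega=0.9))
```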
A Nash equilibrium π̃ ∈ ∆ requires, for both players i = 1, 2 and for all states s in S,

u^i_s(\tilde\pi) \ge u^i_s(\pi^i, \tilde\pi^{-i}), \quad \forall \pi^i \in \Delta^i.    (2.10)
Note that in a zero-sum stochastic game Γ, any strategy of any player i in
a Nash equilibrium is a minimax strategy of that player. Shapley (1953) proves
that for all two-player zero-sum discounted-payoff stochastic games Γ starting at
any state s, there exists a unique optimal value Vs , called the value of state s,
equal to the expected discounted payoff of player 1 which she can guarantee by
any minimax strategy. Shapley (1953) further shows the existence of a stationary
minimax strategy profile, and that for any stationary minimax strategy profile π̃,
Vs satisfies the equations

V_s = (1 - \omega)\, r_s(\tilde\pi) + \omega \sum_{s' \in S} P_{s,s'}(\tilde\pi)\, V_{s'} \quad \forall s \in S.    (2.11)
We can also study the asymptotic behavior in a stochastic game Γ(ω) where
ω increases to 1. Given a finite stochastic game, at each state s ∈ S, the asymptotic value limω→1 Vs (ω) exists; see Bewley and Kohlberg (1976) and Mertens and
Neyman (1981).
Denote the set of stationary minimax strategies of player 1 and player 2 in the
stochastic game by X̃ 1 and X̃ 2 , respectively, and the set of stationary minimax
strategy profiles by X̃.
2.3 A State Game
If every stage payoff after the initial stage is a constant, then the stochastic game
reduces to a one-shot normal-form game. Given a zero-sum stochastic game Γ,
for any payoff vector ~u = (us)s∈S with finite entries, we define for each state s a
normal-form zero-sum state game Gs(~u) on the action sets A^1_s and A^2_s of players 1
and 2, respectively: the payoff function of player 1 in Gs(~u) is
z_s^{~u}(a) := (1 - \omega)\, r_s(a) + \omega \sum_{s' \in S} P_{s,s'}(a)\, u_{s'}, \quad \forall a \in A_s.    (2.12)
As Gs (~u) is a zero-sum game, player 2 receives payoff −zs~u (a). We view an action
at state s in Γ as a strategy in the state game Gs (~u). The payoff function above
can be linearly extended to mixed strategy profiles in ∆(As ). We call the payoff
vector ~u in (2.12) the continuation payoff vector in the state game. We denote
the value of this state game by vs (~u) and a minimax strategy of player i by x̃is (~u)
for player i = 1, 2.
We call Gs(~V) the value state game at state s, where ~V = (Vs)s∈S is defined in (2.11).
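Numerically, the value vector (Vs)s∈S can be approximated by iterating the map ~u ↦ (vs(~u))s∈S, which is an ω-contraction whose unique fixed point satisfies (2.11); this is essentially Shapley’s original argument. The sketch below is an illustration added here (data layout as in the earlier sketches; the LP-based matrix-game solver is repeated so that the block is self-contained).

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(U):
    # LP solver for the value of a matrix game, as in the Section 2.1 sketch.
    m, n = U.shape
    c = np.concatenate([np.zeros(m), [-1.0]])
    A_ub = np.hstack([-U.T, np.ones((n, 1))])
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=bounds)
    return res.x[m]

def state_game(s, u, r, P, omega):
    # Payoff matrix z_s^u of (2.12).
    return (1 - omega) * r[s] + omega * sum(P[s][s2] * u[s2] for s2 in range(len(u)))

def shapley_values(r, P, omega, tol=1e-8):
    # Iterate u -> (v_s(u))_s; the map is an omega-contraction, so this converges
    # to the value vector of (2.11).
    u = np.zeros(len(r))
    while True:
        new = np.array([matrix_game_value(state_game(s, u, r, P, omega))
                        for s in range(len(r))])
        if np.max(np.abs(new - u)) < tol:
            return new
        u = new
```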
To define a best-response dynamic in a stochastic game and to show the convergence, we will apply continuous-time best-response dynamics in state games. We
are now ready to study three classes of continuous-time best-response dynamics
in zero-sum stochastic games with discounted payoff.
3 Best-response Dynamics in Zero-sum Stochastic Games
All three classes of continuous-time best-response dynamics below may be viewed
as a variation of an agent-form best-response dynamic; see Appendix B. In these
dynamics, each player in the stochastic game is represented by an agent in each
state, and two agents continuously play best response against each other in the
state game under some condition, along with all other agents playing simultaneously in all other states. The profile of the strategies of agents in all state games
for player i at any time t is the evolving (stationary) strategy of player i in the
stochastic game at t in the dynamic.
3.1 Stopping-time Best-response Dynamics
For each state in the zero-sum stochastic game Γ, we construct below a continuous-time best-response dynamic in that state game whose continuation payoff vector
is updated at countably many discrete times. To be specific, the continuation
payoffs in each state game are updated at a sequence of stopping times defined on
the energy ws in that state game.
We choose an arbitrary positive number µ ∈ (0, 1), and we first pick an arbitrary payoff vector ~u0 with each element us,0 ∈ B. Given any initial condition
(xs (0))s∈S , for each state s, consider a continuous-time best-response dynamic
(xs (t))0≤t≤T1 defined in (2.1) in the state game Gs (~u0 ), where T1 is to be defined
in (3.2) later.
If there is no ambiguity, we abbreviate the payoff function in Gs (~u0 ) to zs0 (·).
We denote ws (t) to be the energy in Gs (~u0 ) at time t ∈ (0, T1 ]. Note that when
t ∈ (0, T1 ], xs (t) is a minimax strategy profile in the state game Gs (~u0 ) if and only
if ws (t) = 0. Moreover, from (2.5), it follows that
|zs0 (xs (t)) − vs (~u0 )| ≤ ws (t).
(3.1)
We stop the best-response dynamic at time
T_1 := \min\{\, t \ge 0 : \max_{s \in S} w_s(t) \le \mu \,\},    (3.2)
and record ~u1 := (zs0 (xs (T1 )))s∈S . We then run the best-response dynamics in
state games Gs (~u1 ) at all states s for all t ∈ (T1 , T2 ] with
T_2 := \min\{\, t \ge T_1 : \max_{s \in S} w_s(t) \le \mu/2 \,\},
where ws (t) is defined in Gs (~u1 ). After recording ~u2 := (zs1 (xs (T2 )))s∈S , we then
run best-response dynamics in state games Gs (~u2 )...
For completeness, we let T0 = 0. In the best-response dynamics (xs (t))s∈S thus defined,
there is an increasing sequence (Tn )n∈N , possibly with Tn = Tn+1 at
some n, such that for each n ≥ 0,
T_{n+1} := \min\Bigl\{\, t \ge T_n : \max_{s \in S} w_s(t) \le \frac{\mu}{2^n} \,\Bigr\},    (3.3)
where ws (t) is defined in Gs (~un ), and ~un+1 := (zsn (xs (Tn+1 )))s∈S is defined recursively.
For every finite time t ≥ 0, there is n ≥ 0 such that
Tn ≤ t ≤ Tn+1 .
(3.4)
Note that we run the best-response dynamic in all state games Gs (~un ) with such
n at this t. If there is no ambiguity about n, we further denote
ys (t) := zsn (xs (t)) ∀s ∈ S
at each t ∈ (Tn , Tn+1 ].
Under the stopping-time best-response dynamics (xs (t))s∈S,t≥0 thus defined, at each
state s, xs (t) is Lipschitz continuous except at countably many times, and ws (t) is
continuous on every interval (Tn , Tn+1 ] for n ≥ 0, but possibly discontinuous at
some Tn .
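A discrete-time caricature of this scheme may help fix ideas: within phase n the state games Gs(~un) are held fixed while every agent takes Euler steps of the best-response dynamic, and the continuation payoffs are re-recorded once all energies fall below µ/2^n. The sketch below is an illustration added here; the step size, the step cap (which guards against the discretisation floor of the Euler scheme), and the data layout are assumptions.

```python
import numpy as np

def state_game(s, u, r, P, omega):
    return (1 - omega) * r[s] + omega * sum(P[s][s2] * u[s2] for s2 in range(len(u)))

def energy(Z, x1, x2):
    return (Z @ x2).max() - (x1 @ Z).min()

def br_step(Z, x1, x2, dt):
    b1 = np.eye(Z.shape[0])[np.argmax(Z @ x2)]
    b2 = np.eye(Z.shape[1])[np.argmin(x1 @ Z)]
    return x1 + dt * (b1 - x1), x2 + dt * (b2 - x2)

def stopping_time_dynamic(r, P, omega, mu=0.5, phases=10, dt=0.01):
    S = len(r)
    u = np.zeros(S)                                    # initial continuation payoffs
    x1 = [np.ones(r[s].shape[0]) / r[s].shape[0] for s in range(S)]
    x2 = [np.ones(r[s].shape[1]) / r[s].shape[1] for s in range(S)]
    for n in range(phases):
        Z = [state_game(s, u, r, P, omega) for s in range(S)]   # G_s(u_n), held fixed
        steps, cap = 0, int(50 / dt)                   # cap guards the Euler floor
        while max(energy(Z[s], x1[s], x2[s]) for s in range(S)) > mu / 2 ** n \
                and steps < cap:
            for s in range(S):
                x1[s], x2[s] = br_step(Z[s], x1[s], x2[s], dt)
            steps += 1
        u = np.array([x1[s] @ Z[s] @ x2[s] for s in range(S)])  # u_{n+1,s} = z_s^n(x_s(T_{n+1}))
    return u, x1, x2
```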
Lemma 3.1. Each Tn is bounded.
Proof. From the definition of b1 , b2 , and B in (2.9), it follows that a continuation
payoff us,n is always in B. Thus, at any state s and at any time t ≥ 0,
0 ≤ ws (t) ≤ b2 − b1 .
(3.5)
From Theorem 2.3, it follows that ẇs (t) = −ws (t) for almost all t in [Tn , Tn+1 ) for
any n ≥ 0. Therefore, at any state s,
ws (t) = ws (Tn ) exp(−(t − Tn )) ∀Tn < t ≤ Tn+1 , ∀n ≥ 0.
(3.6)
From (3.5) and (3.3), we have
\lim_{t \downarrow 0} w_s(T_n + t) \le b_2 - b_1 \quad \text{and} \quad \max_{s \in S} w_s(T_{n+1}) = \frac{\mu}{2^n},
when Tn < Tn+1 . For such a pair (Tn , Tn+1 ), we may then further infer from (3.6)
that

T_{n+1} - T_n \le \ln(b_2 - b_1) - \ln\frac{\mu}{2^n}.    (3.7)
Theorem 3.2. For each state s, as t → ∞, ys (t) → Vs , and x1 (t) and x2 (t)
converge to the set of stationary minimax strategies of player 1 and 2, respectively,
in the stochastic game Γ.
Proof. The proof is analogous to the argument used in Shapley (1953). For each
s ∈ S and n ≥ 0, denote by vs,n the value of the state game Gs (~un ). Note that
us,n+1 is the payoff to player 1 in the state game Gs (~un ) at time t = Tn+1 . After
time Tn+1 , the state game Gs (~un ) transforms to Gs (~un+1 ) with continuation payoffs
ys0 (Tn+1 ) = us0 ,n+1 for each state s0 in S.
From (3.3), it follows that at each state s, for each n ≥ 0,

|u_{s,n+1} - v_{s,n}| \le \frac{\mu}{2^n}.    (3.8)

Therefore, for all n > 0,

|u_{s,n+1} - u_{s,n}| \le \frac{\mu}{2^n} + \frac{\mu}{2^{n-1}} + |v_{s,n} - v_{s,n-1}|.    (3.9)
Comparing the state games Gs (~un−1 ) and Gs (~un ), we find that each element of the
payoff matrix changes by at most ω maxs∈S |us,n−1 − us,n |. Hence, by Lemma 2.2,

\max_{s \in S} |v_{s,n-1} - v_{s,n}| \le \omega \max_{s \in S} |u_{s,n-1} - u_{s,n}|.    (3.10)
From (3.9) and (3.10), it follows that
\max_{s \in S} |u_{s,n+1} - u_{s,n}| \le \frac{3\mu}{2^n} + \omega \max_{s \in S} |u_{s,n-1} - u_{s,n}|.
By iteration, we find that for any n > 0,
\max_{s \in S} |u_{s,n+1} - u_{s,n}| \le \omega^n \max_{s \in S} |u_{s,1} - u_{s,0}| + \sum_{k=1}^{n} \omega^{n-k} \cdot \frac{3\mu}{2^k}.    (3.11)
Note that for fixed n, \sum_{k=1}^{n} (\omega^{n-k}/2^k) is increasing with respect to ω ∈ (0, 1). It
then follows
\sum_{k=1}^{n} \frac{\omega^{n-k}}{2^k} \;<\; \begin{cases} \dfrac{\omega^n}{2\omega - 1}, & \text{when } \omega > 0.75,\\[4pt] 2\,(0.75^n), & \text{when } \omega \le 0.75. \end{cases}
Thus, from (3.11), as n increases to ∞, maxs∈S |us,n+1 − us,n | decreases to 0. From
(3.8), it then follows that
\max_{s \in S} |u_{s,n} - v_{s,n}| \to 0.    (3.12)
For each state s, as the state game Gs (~un−1 ) transforms to Gs (~un ), ws (Tn )
jumps by at most 2ω maxs∈S |us,n − us,n−1 |, i.e.,

\Bigl|\lim_{t \downarrow 0} w_s(T_n + t) - w_s(T_n)\Bigr| \le 2\omega \max_{s \in S} |u_{s,n} - u_{s,n-1}|.

From (3.3), it then follows that

w_s(t) \le 2\omega \max_{s \in S} |u_{s,n} - u_{s,n-1}| + \mu/2^n
for all t ∈ (Tn , Tn+1 ]. Thus, ws (t) decreases to 0, as t increases to ∞. Hence
maxs∈S |ys (t) − us,n | decreases to 0, where n is defined with respect to t in (3.4).
Therefore, by (3.12) and (2.11),
∀s ∈ S, ys (t) → Vs ,
as t → ∞. The convergence of xi (t) to X̃ i with i = 1, 2 follows the standard
arguments in Shapley (1953).
Comment: By (3.7), we can define for the best-response dynamic a sequence
of bounded stopping times independent of (ws (t))s∈S such that Theorem 3.2 still
holds.
3.2 Closed-loop Best-response Dynamics
Inspired by Shapley (1953) and the stopping-time best-response dynamics, we
study a continuous-time dynamic system where the continuation payoff vector is
slowly and continuously affected by the current payoffs in all state games. There
is a closed loop between the continuation payoff and the state-game payoff, and
each player’s strategy in each state game is always moving towards her current
best response there.
Given a zero-sum stochastic game Γ, we adapt continuous-time best-response
dynamics (xs (t))t≥0 in each evolving state game in the following way.
Pick an arbitrary ~u(0) = (us (0))s∈S with us (0) ∈ B for every s ∈ S, where B is
defined in (2.9). Suppose that the initial stationary strategy profile (xs (0))s∈S is
given. At each time t ≥ 0, for each state s ∈ S, we consider the state game Gs (t)
with continuation payoff vector ~u(t) defined in the following dynamic system
\dot u_s(t) = \frac{y_s(t) - u_s(t)}{1 + t},    (3.13)

\dot x^i_s \in br^i(x^{-i}_s) - x^i_s, \quad i = 1, 2,    (3.14)
where ys(t) := z_s^{~u(t)}(xs(t)) is the payoff to player 1 in the state game Gs(t). We call
the dynamic system thus defined a closed-loop best-response dynamic. (3.14) says
that the best-response dynamics defined in (2.1) are played in the state game Gs(t)
at every time t ≥ 0, while the continuation payoff vector in Gs(t) continuously
evolves according to (3.13). Thus, there is a feedback from ys to us such that us
is always moving towards ys, though more and more slowly. We may view this
as a fictitious play applied to the continuation payoff vector with respect to the
state-game payoff vector.
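A crude Euler discretisation of (3.13)-(3.14) is sketched below as an illustration added here (step size, horizon, and data layout are assumptions); it only mimics the feedback between ~u and the state-game payoffs and is not a substitute for the continuous-time analysis.

```python
import numpy as np

def state_game(s, u, r, P, omega):
    return (1 - omega) * r[s] + omega * sum(P[s][s2] * u[s2] for s2 in range(len(u)))

def closed_loop(r, P, omega, T=200.0, dt=0.01):
    S = len(r)
    u = np.zeros(S)
    x1 = [np.ones(r[s].shape[0]) / r[s].shape[0] for s in range(S)]
    x2 = [np.ones(r[s].shape[1]) / r[s].shape[1] for s in range(S)]
    t = 0.0
    while t < T:
        Z = [state_game(s, u, r, P, omega) for s in range(S)]
        y = np.array([x1[s] @ Z[s] @ x2[s] for s in range(S)])   # state-game payoffs y_s(t)
        for s in range(S):
            b1 = np.eye(Z[s].shape[0])[np.argmax(Z[s] @ x2[s])]
            b2 = np.eye(Z[s].shape[1])[np.argmin(x1[s] @ Z[s])]
            x1[s] = x1[s] + dt * (b1 - x1[s])                    # (3.14)
            x2[s] = x2[s] + dt * (b2 - x2[s])
        u = u + dt * (y - u) / (1.0 + t)                         # (3.13)
        t += dt
    return u, x1, x2
```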
By a similar argument to that for (2.1), from any initial condition (xs (0), us (0))s∈S ,
there exists a solution trajectory (xs (t), us (t))s∈S,t≥0 , where xs (t) and us (t) are
Lipschitz continuous and satisfy (3.13)–(3.14) for almost all t ≥ 0 at all states s; see Aubin and Cellina (1984).
For each state s in S, at each time t ≥ 0, denote the value of the state game
Gs (t) by vs (t). We still define ws (t) to be the energy in Gs (t), as in (2.4).
Lemma 3.3. For each state s in S, as t increases to infinity, ys (t) converges to
vs (t) in the state game Gs (t) and ws (t) decreases to 0.
Proof. We prove it by the standard results of the convergence of best-response
dynamics in normal-form zero-sum games.
Suppose that a finite number ε > 0 is given. The definition of b1 and b2 in
(2.9) implies that at any state s, |ys (t) − us (t)| ≤ b2 − b1 for all t ≥ 0. Therefore, it
follows from (3.13) that there exists t_ε > 0 such that

|\dot u_s(t)| \le \varepsilon, \quad \forall t \ge t_\varepsilon, \ \forall s \in S.    (3.15)

On the one hand, from Harris (1994) and Hofbauer and Sorin (2006), we know
that if us (t) and ws (t) are differentiable and u̇s (t) = 0 holds for all states s at
some time t, then ẇs (t) = −ws (t) at all s. On the other hand, from Lemma
2.1, it follows that given any period of time [t1 , t2 ], if maxs∈S |us (t1 ) − us (t2 )| ≤ c
for some c > 0 and x(t1 ) = x(t2 ) for the state games Gs (t1 ) and Gs (t2 ), then
|ws (t1 ) − ws (t2 )| ≤ 2ωc. From these two observations, it follows that the total
derivative of ws satisfies

\dot w_s \le -w_s + 2\omega \max_{s' \in S} |\dot u_{s'}|, \quad \forall s \in S,

for almost all time t ≥ 0.
Taking (3.15) into account, we can find a time T_ε > t_ε such that ws (t) ≤ 2ε
for all states s and all times t ≥ T_ε . Note that ε in (3.15) can be arbitrarily
small. Hence, ys (t) converges to vs (t) in the evolving state game Gs (t), and ws (t)
converges to 0.
Recall that for all states s, us (t) and vs (t) are Lipschitz continuous and hence
differentiable almost everywhere. For convenience, we mean in the lemmata below
the right derivative whenever the derivative of us or vs does not exist.
We take an arbitrarily small ε > 0. Here are some preparations for the next
lemma.
• For any time t ≥ 0, we mark a state
s(t) \in \arg\max_{s \in S} |y_s(t) - u_s(t)|.    (3.16)
• For any time t ≥ 0, we mark a state
\bar s(t) \in \arg\max_{s \in S} |v_s(t) - u_s(t)|.    (3.17)
• From Lemma 3.3, there exists a time t1 such that for all t ≥ t1 and all states
s in S,
|y_s(t) - v_s(t)| \le \varepsilon(1 - \omega)/64.    (3.18)
Lemma 3.4. For any time t ≥ t1 , if

|u_{s(t)}(t) - y_{s(t)}(t)| \ge \frac{\varepsilon}{4},    (3.19)

then for any state s with the property

|u_{s(t)}(t) - v_{s(t)}(t)| - |u_s(t) - v_s(t)| \le \frac{\varepsilon(1-\omega)}{32},    (3.20)

we have

\frac{d|u_s - v_s|}{dt} \le -\frac{1}{2}(1-\omega)\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr| \le -\frac{\varepsilon(1-\omega)}{8(1+t)}.    (3.21)
This lemma says that under the condition (3.19), at any state s with the
property (3.20), the distance |us − vs | is always decreasing at a speed at least
proportional to 1/(1 + t).
Proof. From Lemma 2.2, the definition of s(t), and the differential equation
(3.13), it follows that
\forall s \in S, \quad \Bigl|\frac{dv_s}{dt}\Bigr| \le \omega\,\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr|.    (3.22)

For a state with the property (3.20) at time t ≥ t1 , we may infer from the definition
of t1 that
|u_s(t) - y_s(t)| - |u_{s(t)}(t) - y_{s(t)}(t)| \ge -\frac{\varepsilon(1-\omega)}{16}.
Thus, from (3.19), it follows that

\Bigl|\frac{du_s}{dt}\Bigr| \ge \frac{\frac{\varepsilon}{4} - \frac{\varepsilon(1-\omega)}{16}}{\frac{\varepsilon}{4}}\,\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr| \ge \Bigl(\omega + \frac{1-\omega}{2}\Bigr)\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr|.    (3.23)
Note that us is moving to vs regardless of the movement of vs . By (3.22), we have
\frac{d|u_s - v_s|}{dt} \le -\Bigl(\omega + \frac{1-\omega}{2}\Bigr)\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr| + \omega\,\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr|.

We complete the proof by (3.19).
Lemma 3.5. There exists a time t(ε) such that for all t ≥ t(ε), |u_{s(t)}(t) − y_{s(t)}(t)| ≤ ε.
Proof. For any time t ≥ t1 ,
|u_{s(t)}(t) - v_{s(t)}(t)| - |u_{\bar s(t)}(t) - v_{\bar s(t)}(t)| \le \frac{\varepsilon(1-\omega)}{32}.
From Lemma 3.4, it follows that for all t ≥ t1 with the property (3.19)
\frac{d|u_{\bar s(\cdot)} - v_{\bar s(\cdot)}|}{dt} \le -\frac{\varepsilon(1-\omega)}{8(1+t)}.    (3.24)
Hence, there exists a time period (t1 , t2 ) such that at any t between t1 and t2 ,
(3.24) holds and |us(t2 ) (t2 ) − ys(t2 ) (t2 )| ≤ ε/4.
Recall that X̃ i is the set of stationary minimax strategies of player i in Γ.
Theorem 3.6. For each state s and for both players, as time t increases to infinity,
both ys (t) and us (t) converge to vs (t); xis (t) converges to the set of stationary
minimax strategies of player i = 1, 2 in the state game Gs (t), and hence xi (t)
converges to X̃ i in the stochastic game Γ.
Proof. Note that ys (t) → vs (t) by Lemma 3.3. We take the sequence (ε/2^n )n≥0 in
Lemma 3.5 and find that us (t) also converges to vs (t). We complete the proof by
the results in Shapley (1953), in particular the equations (2.11).
We now show that, in the zero-sum stochastic game with discount factor increasing
to 1, us (t) along any solution trajectory of the following system converges to the
asymptotic value.
\dot u_s(t) = \frac{y_s(t) - u_s(t)}{t + 2},    (3.25)

\dot x^i_s \in br^i(x^{-i}_s) - x^i_s, \quad i = 1, 2,    (3.26)

\dot\omega(t) = \frac{1 - \omega(t)}{(2 + t)\ln(t + 2)}.    (3.27)
We call such a dynamic an ω-converging best-response dynamic. Again, one can
show the existence of a solution trajectory to the dynamic system. Note that it is
straightforward to see from (3.27) that
\omega(t) = 1 - \frac{e^{c}}{\ln(t + 2)},    (3.28)

where 1 − e^c/ln 2 = ω(0).
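For completeness, (3.28) can be verified by separating variables in (3.27):

\frac{d}{dt}\ln\bigl(1-\omega(t)\bigr) = -\frac{\dot\omega(t)}{1-\omega(t)} = -\frac{1}{(2+t)\ln(t+2)} = -\frac{d}{dt}\ln\ln(t+2),

so that (1-\omega(t))\ln(t+2) is constant along the trajectory and equal to e^{c}.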
Lemma 3.7. In any ω-converging best-response dynamic, for all ε > 0, there
exists t(ε) such that for all t ≥ t(ε), |u_{s(t)} − y_{s(t)}| < ε.
Proof. We first observe that Lemma 3.3 still holds for any ω-converging best-response dynamic and that both us (t) and vs (t) are still Lipschitz continuous. We
can then find a time t1 with the property

\forall t \ge t_1, \ \forall s \in S, \quad |y_s(t) - v_s(t)| \le \frac{\varepsilon}{64}.    (3.29)
We define δ := max{|b1 |, |b2 |}, where b1 and b2 are defined in (2.9). We then
take a time t2 ≥ t1 such that
\forall t \ge t_2, \quad \frac{4\delta}{\ln(t+2)} \le \frac{\varepsilon}{8}.    (3.30)
We still use the notation of s(t) and s̄(t) introduced in (3.16) and (3.17), respectively. Suppose that at a time t ≥ t2
|y_{s(t)}(t) - u_{s(t)}(t)| \ge \frac{\varepsilon}{4}.    (3.31)

Then, by (3.25),

\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr| \ge \frac{\varepsilon}{4(2+t)},    (3.32)
at this t. We now consider both vs̄(·) and us̄(·) as functions of ω and t. (When in
closed-loop best-response dynamics, they are functions of t only.) From Lemma
2.2 and (3.30), it follows that at this t
\frac{\Bigl|\frac{\partial v_{\bar s(\cdot)}}{\partial\omega}\cdot\frac{d\omega}{dt}\Bigr|}{\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr|} \le \frac{\frac{\delta(1-\omega(t))}{(t+2)\ln(t+2)}}{\frac{\varepsilon}{4(2+t)}} = \frac{4\delta(1-\omega(t))}{\varepsilon\ln(t+2)} \le \frac{1-\omega(t)}{8}.    (3.33)
On the other hand, (3.21) implies that
\frac{\partial|v_{\bar s(\cdot)} - u_{\bar s(\cdot)}|}{\partial t} \le -\frac{1-\omega(t)}{2}\,\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr|.    (3.34)
Note that us̄(·) is moving to vs̄(·) regardless of the movement of vs̄(·) . Thus, from
(3.33), (3.34), and (3.32), it follows that
\frac{d|v_{\bar s(\cdot)} - u_{\bar s(\cdot)}|}{dt} \le -\frac{3(1-\omega(t))}{8}\,\Bigl|\frac{du_{s(\cdot)}}{dt}\Bigr| \le -\frac{3\varepsilon(1-\omega(t))}{32(t+2)}.    (3.35)
We may further deduce from (3.28) that
\frac{d|v_{\bar s(\cdot)} - u_{\bar s(\cdot)}|}{dt} \le -\frac{3\varepsilon\, e^{c}}{32(t+2)\ln(t+2)},    (3.36)
where c is defined in (3.28).
Thus, there exists time t3 ≥ t2 such that (3.36) holds for all t between t2 and
t3 , and |ys(t3 ) (t3 ) − us(t3 ) (t3 )| ≤ ε/4.
Theorem 3.8. For each state s, as time t increases to infinity in the ω-converging
best-response dynamic, ys (t), us (t), and vs (t) all converge to the asymptotic value
at state s in the stochastic game where the discount factor ω increases to 1.
Proof. This follows from Lemma 3.7 and Theorem 3.6.
3.3 Open-loop Best-response Dynamics
In contrast to the closed-loop best-response dynamics, for open-loop ones in zero-sum stochastic games, the continuation payoff vector in each state game is equal
to the expected discounted payoff generated by the current stationary strategy
profile in the stochastic game starting from that state.
3.3.1 The Dynamics
Given a stationary strategy profile π, recall the expected discounted payoff us (π)
at each state s defined in (2.8). Denote the vector (us (π))s∈S by ~u(π). For each
state s, we denote the payoff function of player 1 in the state game Gs (~u(π)) by
Qs , i.e., for a joint action a ∈ As and the current strategy profile π
Q_s(\pi, a) := (1 - \omega)\, r_s(a) + \omega \sum_{s' \in S} P_{s,s'}(a)\, u_{s'}(\pi).    (3.37)
We can linearly extend this payoff function to Qs (π, ρs ) for a strategy profile π
and a ρs ∈ ∆(As ) at state s. Note that Qs (π, πs ) = us (π).
Given the current strategy profile π, the best-response sets of player 1 and
player 2 for Q at state s are

BR^1_s(\pi) := \operatorname{argmax}_{\rho^1_s \in \Delta^1_s} Q_s(\pi, \rho^1_s, \pi^2_s)    (3.38)

and

BR^2_s(\pi) := \operatorname{argmin}_{\rho^2_s \in \Delta^2_s} Q_s(\pi, \pi^1_s, \rho^2_s),    (3.39)

respectively. We denote {BR^1_s(π), BR^2_s(π)} by BR_s(π).
The open-loop best-response dynamic in the stochastic game is defined by the
differential inclusions:
π̇si ∈ BRsi (π) − πsi , ∀i, ∀s.
(3.40)
A solution to the open-loop best-response inclusion, also called an open-loop best-response trajectory, is an absolutely continuous function π(·) : R+ → ∆ that
satisfies (3.40) for almost all t in R+ .
An open-loop best-response trajectory starts from the initial strategy profile
π(0). At any time t in the trajectory, given the current strategy profile π(t), each
player i calculates the expected discounted payoff uis (t) and the set BRsi (t) at
every state s, based on (2.8) and (3.38) or (3.39). Each player then chooses an
element in BRsi (t) to generate π̇si (·) at t, which specifies the adjustment direction
and rate of the mixed action πsi (t).
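One such update step can be sketched as follows (an illustration added here; the data layout matches the earlier sketches, and the returned indices are merely pure actions contained in the best-response sets, which are always attained at pure actions): given π, compute u(π) by the linear solve behind (2.8), form Qs of (3.37), and read off best responses.

```python
import numpy as np

def open_loop_br(r, P, pi1, pi2, omega):
    S = len(r)
    # u_s(pi) by the linear solve behind (2.8).
    r_pi = np.array([pi1[s] @ r[s] @ pi2[s] for s in range(S)])
    P_pi = np.array([[pi1[s] @ P[s][s2] @ pi2[s] for s2 in range(S)]
                     for s in range(S)])
    u = np.linalg.solve(np.eye(S) - omega * P_pi, (1 - omega) * r_pi)
    br1, br2 = [], []
    for s in range(S):
        # Q_s of (3.37) as a matrix over action pairs.
        Q = (1 - omega) * r[s] + omega * sum(P[s][s2] * u[s2] for s2 in range(S))
        br1.append(int(np.argmax(Q @ pi2[s])))   # a pure action in BR^1_s(pi)
        br2.append(int(np.argmin(pi1[s] @ Q)))   # a pure action in BR^2_s(pi)
    return u, br1, br2
```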
3.3.2 The Result
As we can see, the adjustment of a continuation payoff in the open-loop best-response dynamic depends on the movement of the current strategy profile, while
in the closed-loop dynamic, it depends on the distance between the continuation
payoff and the current state-game payoff, as well as the current time.
Perkins (2013) shows the following theorem.
Theorem 3.9. Given any two-player zero-sum stochastic game with
\omega \le \frac{1}{1 + \max_{s \in S} \sum_{s' \in S} \max_{a \in A_s} P_{s,s'}(a)},    (3.41)
from any initial strategy profile π0 ∈ ∆, any open-loop best-response trajectory
converges to the set of stationary minimax strategy profiles. That is,
\lim_{t \to \infty} \pi(t) \in \tilde X \quad \text{and} \quad \lim_{t \to \infty} u_s(\pi(t)) = V_s \quad \forall s \in S.
The convergence result holds when the discount factor ω is not too big. In
particular, for a zero-sum stochastic game with |S| states, a sufficient condition for
the convergence of an open-loop best-response trajectory to X̃ is ω < 1/(1 + |S|).
Comment: The proof in Perkins (2013) elaborates on the Lyapunov-function approach used in the analysis of best-response dynamics in normal-form zero-sum games. To be specific, if the continuation payoffs were fixed in each state
game, then the technique of Lyapunov function could be applied in all state games,
as in normal-form zero-sum games. We also know that in the dynamic, if the
continuation payoff us changes δ, then the energy ws in the Lyapunov function
changes at most ωδ. However, the problem is that the contribution to ws due to
the magnitude of u̇s may overpower the decline tendency of ws in the one-shot
state game Gs . We have so far only found an upper bound of u̇s with respect to
Q̇s , the transition probabilities, and ω; see Lemma 4.4.4 in Perkins (2013).
In short, at each s, us in a stochastic game may be convex due to the discount
factor and the transition probabilities, while the Lyapunov-function argument can only be applied to a concave (or linear) payoff function; see Hofbauer and Sorin (2006). The closer ω is to 1 in a stochastic game and the more diverse the transition probabilities, the more convex us can be.

Figure 1: BR may not be a global best response.
3.3.3 Discussion
We would like to point out that in a stochastic game Γ, for player 1, BRs1 (π) in
(3.38) is a best-response set for the payoff Qs (π, ·) and a better-response subset for the expected discounted payoff us (π) (see Lemma 3.10), but
BRs1 (π) is not necessarily the best-response set for us (π). The example below (see
Figure 2) is a one-player stochastic game, a so-called Markov decision process.
We may view it as a trivial two-player zero-sum game if we let player 2 not
affect any r(s) by any means.
to an adjacent state, or stay at the current state, both with probability 1. The
stage payoffs are independent of the player’s moves: r(s1 ) = 1, r(s5 ) = 100,
r(s2 ) = r(s3 ) = r(s4 ) = 0. Suppose that the initial strategy satisfies π(s3 , s1 ) = 1
and π(s4 , s2 ) = 1, while at all other states the player just stays there at time 0.
It follows that Qs3 (π(0), (s3 , s1 )) = ω > Qs3 (π(0), (s3 , s4 )) = 0, and hence (s3 , s4 )
is not included in BRs3 (π0 ). However, when ω is close to 1, the best-response
strategy for the expected discounted payoff u at any time t is a constant one,
denoted as π̃, where the player is always moving towards state s5 , i.e.,
π̃(s1 , s3 ) = π̃(s3 , s4 ) = π̃(s4 , s5 ) = π̃(s2 , s4 ) = π̃(s5 , s5 ) = 1.
The difference between BRs3 (π0 ) and π̃ arises from the fact that BRs3 (π0 ) is
optimal for the payoff in the state game Gs3 (~u(0)), while π̃ is optimal for the
expected discounted payoff of the player in the whole game. The best-response
dynamic in (3.40) is in the agent-form setting: the player has one agent in each
state, and each agent chooses an action independently. In contrast, a strategy of
the player in the whole game is a sequence of correlated actions. (See Appendix B
for more discussion on the agent-form best-response dynamic.) However, we can
still show that BRs1 (π) is a subset of better-response strategies for player 1 at s.
The proof of the following lemma is straightforward.
Lemma 3.10. Given a strategy profile π in a zero-sum stochastic game Γ, for
player 1 at any state s and any action x1s ∈ ∆(A1s ),
Qs (π, x1s , πs2 ) ≥ us (π) ⇔ us (π−s , x1s , πs2 ) ≥ Qs (π, x1s , πs2 )
and
Qs (π, x1s , πs2 ) ≤ us (π) ⇔ us (π−s , x1s , πs2 ) ≤ Qs (π, x1s , πs2 )
where π−s = ×_{s' \ne s} π_{s'} .
Appendix A  Minor results on open-loop best-response dynamics
For further research, we present additional results regarding players’ behavior in
the open-loop best-response dynamics in a zero-sum stochastic game Γ where each
player has at most two actions in each state and the minimax strategy of each
player in each value state game is a mixed strategy.
To ease the exposition, we consider in this toy game the case |S| = 2 for the
stochastic game Γ. The result can be generalized to the case of any finite S.
We define the maximum payoff to player 1 given the strategy profile π as
U^1_s(\pi) := \max_{\rho^1_s \in \Delta^1_s} Q_s(\pi, \rho^1_s, \pi^2_s),    (A.1)
and similarly the minimum payoff to player 1 as
U^2_s(\pi) := \min_{\rho^2_s \in \Delta^2_s} Q_s(\pi, \pi^1_s, \rho^2_s).    (A.2)
For convenience, given a best-response trajectory (π(t))t≥0 , we may sometimes
write us (t) for us (π(t)), BRsi (t) for BRsi (π(t)), Usi (t) for Usi (π(t)), and given a
mixed action ai ∈ ∆is , Qs (t, ai ) for Qs (π(t), ai , πs−i (t)).
We denote two states in S by α and β. When we are referring to continuation
payoffs (u, v) in a state game Gs at any state s, we mean the continuation payoff
to state α is u and the payoff to β is v. We pick a strategy π̃si in each X̃si for
each state s and each player i. Given a best-response trajectory (π(t))t≥0 , we
put δs (t) := Vs − us (t) for s ∈ {α, β}. The lemmata in this section concern the
behavior of player 1, but the dual results hold for player 2.
Lemma A.1. If at time t, δα (t) = maxs∈S δs (t), then
Uα1 (t) ≥ Qα (t, π̃α1 ) ≥ Vα − ωδα (t).
If this δα (t) > 0, then for πα (t), π̃α1 is a better-response strategy for player 1 in the
state game Gα with continuation payoffs (Vα − δα (t), Vβ − δα (t)).
Proof. Since π̃α1 is a component of the minimax strategy of player 1, we may infer
that
Q_\alpha(t, \tilde\pi^1_\alpha) = r_\alpha(\tilde\pi^1_\alpha, \pi^2_\alpha(t)) + \omega \sum_{s \in S} P_{\alpha s}(\tilde\pi^1_\alpha, \pi^2_\alpha(t))\, u_s(t)
= r_\alpha(\tilde\pi^1_\alpha, \pi^2_\alpha(t)) + \omega \sum_{s \in S} P_{\alpha s}(\tilde\pi^1_\alpha, \pi^2_\alpha(t))\,\bigl(V_s - \delta_s(t)\bigr)
\ge V_\alpha - \omega\,\delta_\alpha(t).
Here is a dual lemma.
Lemma A.2. If at time t, δα (t) = mins∈S δs (t), then
Qα (t, π̃α1 ) ≤ Vα − ωδα (t).
If this δα (t) < 0, then for πα (t), π̃α1 is not a better-response strategy for player 1
in the state game Gα with continuation payoffs (Vα − δα (t), Vβ − δα (t)).
Recall that in (3.40) the strategy of a player is always moving towards a current
best-response strategy in the best-response dynamic, and that |A1s | = 2 for all s in
Γ. At time t, if BRs1 (π(t)) = {a1s }, i.e., the best-response strategy for player 1 in
the state game Gs (t) is a pure strategy a1s , and if the strategy π̃s1 ∈ X̃s1 is a convex
combination of a1s and πs1 (t), then the strategy πs1 (t) is also moving towards π̃s1 .
In fact, πs1 (t) is moving towards any better-response strategy when BRs1 (π(t)) is
a singleton. In this case, Qs (t, π̃si ) > us (t).
Lemma A.3. If uα (t) < Vα and δβ (t) ≤ δα (t), then Qα (t, π̃α1 ) > uα (t), πα1 (t) is
moving towards π̃α1 , and BRα1 (t) is the same as the set of best-response strategies
of player 1 in the state game Gα with continuation payoffs (Vα − δα (t), Vβ − δα (t)).
If it also satisfies that
δα (t) < δβ (t)/ω,
(A.3)
then Qβ (t, π̃β1 ) > uβ (t), πβ1 (t) is also moving towards π̃β1 , and BRβ1 (t) is the same as
the set of best-response strategies of player 1 in the state game Gβ with continuation
payoffs (Vα − δβ (t), Vβ − δβ (t)).
Note that this result is independent of player 2’s strategy π 2 (t).
Proof. The conclusion for state α follows from Lemma A.1. At state β, we first
observe that 0 < δβ (t) ≤ δα (t) and
Q_\beta(t, \tilde\pi^1_\beta) = r_\beta(\tilde\pi^1_\beta, \pi^2_\beta(t)) + \omega P_{\beta\alpha}(\tilde\pi^1_\beta, \pi^2_\beta(t))\bigl(V_\alpha - \delta_\alpha(t)\bigr) + \omega P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t))\bigl(V_\beta - \delta_\beta(t)\bigr)
= V_\beta - \omega P_{\beta\alpha}(\tilde\pi^1_\beta, \pi^2_\beta(t))\,\delta_\alpha(t) - \omega P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t))\,\delta_\beta(t).    (A.4)
From (A.3), it follows that for any P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t)),

\delta_\alpha(t) < \frac{\bigl(1 - \omega P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t))\bigr)\,\delta_\beta(t)}{\omega\bigl(1 - P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t))\bigr)},

and thus,

\delta_\beta(t) > \frac{\omega P_{\beta\alpha}(\tilde\pi^1_\beta, \pi^2_\beta(t))\,\delta_\alpha(t)}{1 - \omega P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t))}.

Therefore,

\delta_\beta(t) > \omega P_{\beta\alpha}(\tilde\pi^1_\beta, \pi^2_\beta(t))\,\delta_\alpha(t) + \omega P_{\beta\beta}(\tilde\pi^1_\beta, \pi^2_\beta(t))\,\delta_\beta(t).
Combined with (A.4), it follows that π̃β1 is a better-response strategy for player 1
in the state game Gβ (t). Since there are only two pure strategies for each player
in Gβ (t), πβ1 (t) moving to the best-response strategy is equivalent to moving to a
better-response strategy. Finally, by a similar argument to Lemma A.1 and the
case of state α above, we reach the conclusion that BRβ1 (t) is the same as the
set of best-response strategies of player 1 in the state game Gβ with continuation
payoffs (Vα − δβ (t), Vβ − δβ (t)).
Here is a dual lemma.
Lemma A.4. If uα (t) > Vα and −δβ (t) ≤ −δα (t), then Qα (t, π̃α1 ) < uα (t), πα1 (t) is
moving away from π̃α1 , and BRα1 (t) is the same as the set of best-response strategies
of player 1 in the state game Gα with continuation payoffs (Vα − δα (t), Vβ − δα (t)).
If it also satisfies that
−δα (t) < −δβ (t)/ω,
then Qβ (t, π̃β1 ) < uβ (t), πβ1 (t) is also moving away from π̃β1 , and BRβ1 (t) is the
same as the set of best-response strategies of player 1 in the state game Gβ with
continuation payoffs (Vα − δβ (t), Vβ − δβ (t)).
The next lemma concerns the behavior in the best-response dynamic when
δα δβ < 0.
Lemma A.5. If uα (t) < Vα and uβ (t) ≥ Vβ , then regardless of player 2’s current
strategy π 2 (t), it follows that πα1 (t) 6= π̃α1 and πβ1 (t) 6= π̃β1 . Moreover, player 1
is moving towards π̃α1 at state α and moving away from π̃β1 at state β at time t.
BRα1 (t) and BRβ1 (t) are the same as the sets of best-response strategies of
player 1 in the state games Gα and Gβ with continuation payoffs (Vα − δα (t), Vβ − δα (t))
and (Vα − δβ (t), Vβ − δβ (t)), respectively.
Proof. This is a corollary of Lemma A.3 and Lemma A.4.
Appendix B  Agent-form Best-response Dynamics
Given a zero-sum stochastic game Γ with discounted payoff, for any stationary strategy π^i of player i and any state s, we denote player i’s actions except the one
at state s by π^i_{−s}. Given a strategy profile π and a state s, the agent-form best-response sets of player 1 and player 2 for the expected discounted payoff us are
defined as

ABR^1_s(\pi) := \operatorname{argmax}_{\rho^1_s \in \Delta^1_s} u_s(\rho^1_s, \pi^1_{-s}, \pi^2)    (B.1)

and

ABR^2_s(\pi) := \operatorname{argmin}_{\rho^2_s \in \Delta^2_s} u_s(\pi^1, \rho^2_s, \pi^2_{-s}),    (B.2)
respectively.
We can then define player i’s agent-form best-response differential inclusion
as
∀s ∈ S, π̇si ∈ ABRsi (π) − πsi , i = 1, 2.
(B.3)
Again, as in normal-form zero-sum games, the set ABRsi (π) is upper semi-continuous.
Hence, from any initial strategy profile π(0), a solution trajectory exists, and π(t)
is Lipschitz continuous satisfying (B.3) for almost all t ≥ 0.
We conjecture that not every agent-form best-response trajectory in every Γ
converges to the set of minimax strategy profiles. As for the general convergence
results in Barron et al. (2009), in a stochastic game we need to consider the
expected discounted payoff function u : X → R rather than us at only one state s.
It would be interesting to study the characterization of which Γ has a quasiconcave
u in π 1 for a fixed π 2 (or the weaker condition in that paper).
By the one-player game example in Section 3.3.3, we can show that ABRs1 (π)
may still not be the set of best-response strategies of the agent for player 1 to
us (π) at state s: when players best reply against each other in the state game
Gs (~u(πt )) at time t, they do not take into account that the continuation payoff
vector ~u(πt ) is adapting under current strategies.
References
Aubin, J.-P. and A. Cellina, Differential Inclusions, Springer, Berlin, 1984.
Balkenborg, D., C. Kuzmics, and J. Hofbauer (2013): “Refined Best-Response
Correspondence and Dynamics,” Theoretical Economics, 8 (1), 165–192.
Barron, E.N., R. Goebel, and R.R. Jensen (2010): “Best Response Dynamics for
Continuous Games,” Proceedings of the American Mathematical Society, 138
(3), 1069–1083.
Berger, U. (2005): “Fictitious play in 2×n games,” Journal of Economic Theory,
120, 139–154.
Bewley, T. and E. Kohlberg (1976): “The asymptotic theory of stochastic games,”
Mathematics of Operations Research, 1, 197–208.
Borkar, V. (2002): “Reinforcement learning in Markovian evolutionary games,”
Advances in Complex Systems, 5, 55–72.
Dutta, P.K. (1995): “A Folk Theorem for Stochastic Games,” Journal of Economic
Theory, 66, 1–32.
Harris, C. (1998): “On the Rate of Convergence of Continuous-Time Fictitious
Play”, Games and Economic Behavior, 22, 238–259.
Hofbauer, J. (1995): “Stability for the Best Response Dynamics,” mimeo, University of Vienna.
Hofbauer, J., and K. Sigmund (1998): Evolutionary Games and Population Dynamics. Cambridge University Press.
Hofbauer, J., and S. Sorin (2006): “Best Response Dynamics for Continuous Zero-sum Games,” Discrete and Continuous Dynamical Systems – Series B, 6 (1), 215–
224.
Gilboa, I., and A. Matsui (1991): “Social Stability and Equilibrium,” Econometrica, 59, 859–867.
Matsui, A. (1989): “Social Stability and Equilibrium,” CMS-DMS No. 819, Northwestern University.
Mertens, J.-F. and A. Neyman, (1981): “Stochastic games,” International Journal
of Game Theory, 10, 53–66.
Perkins, S., Advanced Stochastic Approximation Frameworks and their Applications, PhD thesis, University of Bristol, 2013.
Sandholm, W. H. (2010): Population Games and Evolutionary Dynamics. MIT Press.
Shapley, L. (1953): “Stochastic Games,” Proceedings of National Academy of Sciences of the United States of America, 39, 1095–1100.
Vrieze, O. and S. Tijs, (1982): “Fictitious play applied to sequences of games
and discounted stochastic games,” International Journal of Game Theory, 11,
71–85.