
Electronic Notes in Discrete Mathematics 55 (2016) 155–159
Determining the Optimal Strategies for
Zero-Sum Average Stochastic Positional Games
Dmitrii Lozovanu
Applied Mathematics
Institute of Mathematics and Computer Science of Moldova Academy of Sciences
Chisinau, Moldova
Stefan Pickl
Institute for Theoretical Computer Science, Mathematics and Operations Research
Universität der Bundeswehr München
Neubiberg-München, Germany
Abstract
We consider a class of zero-sum stochastic positional games that generalizes the
deterministic antagonistic positional games with average payoffs. We prove the
existence of saddle points for the considered class of games and propose an approach
for determining the optimal stationary strategies of the players.
Keywords: Average Markov decision process, Zero-sum stochastic positional
games, Optimal strategies, Saddle points.
Email: [email protected]
Email: [email protected]
http://dx.doi.org/10.1016/j.endm.2016.10.039
1571-0653/© 2016 The Author(s). Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1 Introduction and Problem Formulation
The aim of this paper is to prove the existence of saddle points for a class of antagonistic stochastic games that extends the deterministic positional games with average payoffs from [1,2,5]. We formulate the considered class of games using the framework of a Markov decision process $(X, A, c, p, x_0)$ with a finite set of states $X$, a finite set of actions $A = \cup_{x \in X} A(x)$, where $A(x)$ is the set of actions in the state $x \in X$, a transition probability function $p : X \times A \times X \to [0,1]$ that satisfies the condition $\sum_{y \in X} p^a_{x,y} = 1$, $\forall x \in X$, $a \in A(x)$, and the cost function $c : X \times X \to \mathbb{R}$. We assume that the Markov process describes the evolution of a dynamical system that is controlled by two players as follows: The set of states is divided into two subsets $X = X_1 \cup X_2$ with $X_1 \cap X_2 = \emptyset$, where $X_1$ represents the position set of the first player and $X_2$ represents the position set of the second player. At the time moment $t = 0$ the dynamical system is in the state $x_0$. If this state belongs to the position set of the first player, then the action $a_0 \in A(x_0)$ in this state is selected by the first player; otherwise the action $a_0 \in A(x_0)$ is chosen by the second player. After that, the dynamical system passes randomly to another state according to the probability distribution $\{p^{a_0}_{x_0,y}\}$. At the time moment $t = 1$ the players observe the state $x_1 \in X$ of the system. If $x_1$ belongs to the position set of the first player, then the action $a_1 \in A(x_1)$ is chosen by the first player; otherwise the action is chosen by the second one, and so on, indefinitely.
In this process the first player intends to maximize
$$\liminf_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} \mu(x_{\tau-1}), \quad \text{where } \mu(x_\tau) = \sum_{y \in X} p^{a_\tau}_{x_\tau,y}\, c_{x_\tau,y},$$
and the second player intends to minimize
$$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=1}^{t} \mu(x_{\tau-1}).$$
We assume
that the players use stationary strategies for selecting the actions in their position sets. We define the stationary strategies of the players as maps $s^1 : x \to a \in A(x)$ for $x \in X_1$; $s^2 : x \to a \in A(x)$ for $x \in X_2$. Let $s^1, s^2$ be arbitrary strategies of the players. Then $s = (s^1, s^2)$ determines a Markov process induced by the probability distributions $\{p^{s^i(x)}_{x,y}\}$ in the states $x \in X_i$, $i = 1, 2$, and a given starting state $x_0$. For this Markov process with transition costs $c_{x,y}$, $x, y \in X$, we can determine the average cost per transition $F_{x_0}(s^1, s^2)$. The function $F_{x_0}(s^1, s^2)$ on $S = S^1 \times S^2$, where $S^1$ and $S^2$ represent the corresponding sets of stationary strategies of player 1 and player 2, defines an antagonistic game. This game is determined by the position sets $X_1, X_2$, the set of actions $A$, the probability function $p$, the cost function $c$, and the starting state $x_0$. We denote this game by $(X_1, X_2, A, c, p, x_0)$ and call it a zero-sum average stochastic positional game. In this game we are seeking the saddle points.
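As a concrete illustration (not taken from the paper), the play described above and the average cost per transition $F_{x_0}(s^1, s^2)$ for a fixed pair of stationary strategies can be sketched in Python. The two-state game below, its transition probabilities, its costs, and all identifiers are hypothetical, chosen only for the sketch:

```python
import random

# Hypothetical two-state game: X = {0, 1}, X1 = {0} (first player's position),
# X2 = {1} (second player's position); all numbers are made up for illustration.
p = {  # p[x][a][y] = probability of moving from state x to state y under action a
    0: {"a": [0.5, 0.5], "b": [0.9, 0.1]},
    1: {"a": [0.2, 0.8], "b": [0.7, 0.3]},
}
c = [[1.0, 4.0],  # c[x][y]: cost of the transition from x to y
     [2.0, 0.5]]

def mu(x, a):
    """Immediate expected cost mu(x) = sum_y p^a_{x,y} * c_{x,y} for action a."""
    return sum(p[x][a][y] * c[x][y] for y in range(len(c)))

def average_cost(s1, s2, x0=0, steps=200_000, seed=1):
    """Simulate the play and estimate the average cost per transition F_{x0}(s1, s2).

    s1 and s2 are stationary strategies: maps from each player's positions to actions.
    """
    rng = random.Random(seed)
    x, total = x0, 0.0
    for _ in range(steps):
        a = s1[x] if x in s1 else s2[x]  # the owner of position x chooses the action
        total += mu(x, a)
        x = rng.choices(range(len(c)), weights=p[x][a])[0]
    return total / steps
```

For this toy instance, `average_cost({0: "a"}, {1: "a"})` returns an estimate close to $9/7 \approx 1.286$, which enumeration of all four stationary strategy pairs confirms to be the saddle-point value of this particular game.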
2 The Main Results
We show that for the considered zero-sum game there exists a saddle point, i.e. there exist $s^{1*}, s^{2*}$ for which
$$F_x(s^{1*}, s^{2*}) = \max_{s^1 \in S^1} \min_{s^2 \in S^2} F_x(s^1, s^2) = \min_{s^2 \in S^2} \max_{s^1 \in S^1} F_x(s^1, s^2).$$
Theorem 2.1 Let $(X_1, X_2, A, c, p, x)$ be an antagonistic stochastic positional game with average payoff $F_x(s^1, s^2)$. Then the system of equations
$$
\begin{cases}
\varepsilon_x + \omega_x = \max\limits_{a \in A(x)} \Big\{ \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y \Big\}, & \forall x \in X_1; \\[2mm]
\varepsilon_x + \omega_x = \min\limits_{a \in A(x)} \Big\{ \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y \Big\}, & \forall x \in X_2
\end{cases}
\tag{1}
$$
has a solution on the set of solutions of the system of equations
$$
\begin{cases}
\omega_x = \max\limits_{a \in A(x)} \sum\limits_{y \in X} p^a_{x,y} \omega_y, & \forall x \in X_1; \\[2mm]
\omega_x = \min\limits_{a \in A(x)} \sum\limits_{y \in X} p^a_{x,y} \omega_y, & \forall x \in X_2,
\end{cases}
\tag{2}
$$
i.e. the system of equations (2) has such a solution $\omega^*_x$, $x \in X$, for which there exists a solution $\varepsilon^*_x$, $x \in X$, of the system of equations
$$
\begin{cases}
\varepsilon_x + \omega^*_x = \max\limits_{a \in A(x)} \Big\{ \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y \Big\}, & \forall x \in X_1; \\[2mm]
\varepsilon_x + \omega^*_x = \min\limits_{a \in A(x)} \Big\{ \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y \Big\}, & \forall x \in X_2.
\end{cases}
$$
The optimal stationary strategies $s^{1*}, s^{2*}$ of the players can be found by fixing arbitrary maps $s^{1*}(x) \in A(x)$ for $x \in X_1$ and $s^{2*}(x) \in A(x)$ for $x \in X_2$ such that
$$
s^{1*}(x) \in \operatorname{Arg\,max}_{a \in A(x)} \Big\{ \sum_{y \in X} p^a_{x,y} \omega^*_y \Big\} \cap \operatorname{Arg\,max}_{a \in A(x)} \Big\{ \mu_{x,a} + \sum_{y \in X} p^a_{x,y} \varepsilon^*_y \Big\}, \quad x \in X_1,
$$
$$
s^{2*}(x) \in \operatorname{Arg\,min}_{a \in A(x)} \Big\{ \sum_{y \in X} p^a_{x,y} \omega^*_y \Big\} \cap \operatorname{Arg\,min}_{a \in A(x)} \Big\{ \mu_{x,a} + \sum_{y \in X} p^a_{x,y} \varepsilon^*_y \Big\}, \quad x \in X_2.
$$
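Under the stated assumptions, the strategy-extraction rule at the end of the theorem can be sketched in Python. The two-state instance below, the solution values `omega` and `eps`, and all identifiers are hypothetical, chosen only so that they solve the toy instance's systems (1) and (2):

```python
# Hypothetical two-state instance: X1 = {0}, X2 = {1}; the numbers are
# illustrative only and are not taken from the paper.
p = {  # p[x][a][y] = probability of moving from x to y under action a
    0: {"a": [0.5, 0.5], "b": [0.9, 0.1]},
    1: {"a": [0.2, 0.8], "b": [0.7, 0.3]},
}
c = [[1.0, 4.0], [2.0, 0.5]]  # transition costs c[x][y]
omega = [9/7, 9/7]            # omega*_x: constant here, as the game is unichain
eps = [0.0, -17/7]            # eps*_x: bias values solving the eps-system for omega*

def mu(x, a):
    """Immediate expected cost mu_{x,a} = sum_y p^a_{x,y} * c_{x,y}."""
    return sum(p[x][a][y] * c[x][y] for y in range(len(c)))

def extract(x, maximize):
    """Pick an action in Arg-opt_a sum_y p^a_{x,y} omega*_y intersected with
    Arg-opt_a {mu_{x,a} + sum_y p^a_{x,y} eps*_y} (opt = max on X1, min on X2)."""
    opt = max if maximize else min
    w = {a: sum(p[x][a][y] * omega[y] for y in range(len(omega))) for a in p[x]}
    best_w = opt(w.values())
    ties = [a for a in p[x] if abs(w[a] - best_w) < 1e-9]  # first Arg-set
    # among those, optimize the bias expression (second Arg-set)
    return opt(ties, key=lambda a: mu(x, a) + sum(p[x][a][y] * eps[y]
                                                  for y in range(len(eps))))

s1_star = {0: extract(0, maximize=True)}    # first player's optimal map
s2_star = {1: extract(1, maximize=False)}   # second player's optimal map
```

For this instance both optimal maps select action `"a"`, which agrees with checking the four stationary strategy pairs of the toy game by hand.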
Proof (Sketch) Let $x \in X$ be an arbitrary state and consider the stationary strategies $s^1 \in S^1$, $s^2 \in S^2$ for which $F_x(s^1, s^2) = \min_{s^2 \in S^2} \max_{s^1 \in S^1} F_x(s^1, s^2)$. We show that $F_x(s^1, s^2) = \max_{s^1 \in S^1} \min_{s^2 \in S^2} F_x(s^1, s^2)$, i.e. we show that $s^1 = s^{1*}$, $s^2 = s^{2*}$.
If we consider the Markov decision process induced by the strategies $s^1, s^2$, then, according to the properties of the bias equations for this decision process, the system of linear equations
$$
\begin{cases}
\varepsilon_x + \omega_x = \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y, & \forall x \in X_1,\ a = s^1(x); \\[2mm]
\varepsilon_x + \omega_x = \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y, & \forall x \in X_2,\ a = s^2(x); \\[2mm]
\omega_x = \sum\limits_{y \in X} p^a_{x,y} \omega_y, & \forall x \in X_1,\ a = s^1(x); \\[2mm]
\omega_x = \sum\limits_{y \in X} p^a_{x,y} \omega_y, & \forall x \in X_2,\ a = s^2(x)
\end{cases}
\tag{3}
$$
has the solution $\varepsilon^*_x, \omega^*_x$ ($x \in X$), which for the fixed strategy $s^2 \in S^2$ satisfies the condition
$$
\begin{cases}
\varepsilon^*_x + \omega^*_x \ge \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon^*_y, & \forall x \in X_1,\ a \in A(x); \\[2mm]
\varepsilon^*_x + \omega^*_x = \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon^*_y, & \forall x \in X_2,\ a = s^2(x); \\[2mm]
\omega^*_x \ge \sum\limits_{y \in X} p^a_{x,y} \omega^*_y, & \forall x \in X_1,\ a \in A(x); \\[2mm]
\omega^*_x = \sum\limits_{y \in X} p^a_{x,y} \omega^*_y, & \forall x \in X_2,\ a = s^2(x)
\end{cases}
$$
and $F_x(s^1, s^2) = \omega^*_x$, $\forall x \in X$.
Taking into account that $F_x(s^1, s^2) = \min_{s^2 \in S^2} F_x(s^1, s^2)$, then for the fixed strategy $s^1 \in S^1$ the solution $\varepsilon^*_x, \omega^*_x$ ($x \in X$) satisfies the condition
$$
\begin{cases}
\varepsilon^*_x + \omega^*_x = \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon^*_y, & \forall x \in X_1,\ a = s^1(x); \\[2mm]
\varepsilon^*_x + \omega^*_x \le \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon^*_y, & \forall x \in X_2,\ a \in A(x); \\[2mm]
\omega^*_x = \sum\limits_{y \in X} p^a_{x,y} \omega^*_y, & \forall x \in X_1,\ a = s^1(x); \\[2mm]
\omega^*_x \le \sum\limits_{y \in X} p^a_{x,y} \omega^*_y, & \forall x \in X_2,\ a \in A(x).
\end{cases}
$$
So, the following system
$$
\begin{cases}
\varepsilon_x + \omega_x \ge \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y, & \forall x \in X_1,\ a \in A(x); \\[2mm]
\varepsilon_x + \omega_x \le \mu_{x,a} + \sum\limits_{y \in X} p^a_{x,y} \varepsilon_y, & \forall x \in X_2,\ a \in A(x); \\[2mm]
\omega_x \ge \sum\limits_{y \in X} p^a_{x,y} \omega_y, & \forall x \in X_1,\ a \in A(x); \\[2mm]
\omega_x \le \sum\limits_{y \in X} p^a_{x,y} \omega_y, & \forall x \in X_2,\ a \in A(x)
\end{cases}
$$
has a solution which satisfies condition (3). This means that $s^1 = s^{1*}$, $s^2 = s^{2*}$ and
$$F_x(s^{1*}, s^{2*}) = \max_{s^1 \in S^1} \min_{s^2 \in S^2} F_x(s^1, s^2) = \min_{s^2 \in S^2} \max_{s^1 \in S^1} F_x(s^1, s^2), \quad \forall x \in X,$$
i.e. the theorem holds. $\Box$
The obtained saddle point condition for zero-sum stochastic games generalizes the saddle point condition for deterministic average positional games from [1,5]. Based on Theorem 2.1 we may conclude that the optimal strategies of the players in the considered game can be found if we determine a solution of equations (1), (2). We have shown that a solution of these equations can be determined using iterative algorithms similar to those for determining the optimal solutions of an average Markov decision problem [3,4].
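To make this remark concrete, the following Python sketch shows one such iterative scheme: a relative value iteration that applies a max-operator on the first player's positions and a min-operator on the second player's, in the spirit of the average-MDP algorithms referenced in [3,4]. It is not the authors' exact algorithm; the toy data and all identifiers are hypothetical, and convergence is assumed only for well-behaved (unichain, aperiodic) instances such as this one:

```python
# Hypothetical two-state instance (X1 = {0}, X2 = {1}); illustrative numbers only.
p = {  # p[x][a][y] = probability of moving from x to y under action a
    0: {"a": [0.5, 0.5], "b": [0.9, 0.1]},
    1: {"a": [0.2, 0.8], "b": [0.7, 0.3]},
}
c = [[1.0, 4.0], [2.0, 0.5]]  # transition costs c[x][y]
X1 = {0}                      # positions where the maximizing player moves

def mu(x, a):
    """Immediate expected cost mu_{x,a} = sum_y p^a_{x,y} * c_{x,y}."""
    return sum(p[x][a][y] * c[x][y] for y in range(len(c)))

def relative_value_iteration(iters=200):
    """Iterate h <- T(h) - T(h)(x_ref), where T takes a max over actions on X1
    and a min on X2; the subtracted constant converges to the average value
    omega when the induced chains are unichain and aperiodic (assumed here)."""
    states = range(len(c))
    h = [0.0] * len(c)
    gain = 0.0
    for _ in range(iters):
        t = []
        for x in states:
            vals = [mu(x, a) + sum(p[x][a][y] * h[y] for y in states)
                    for a in p[x]]
            t.append(max(vals) if x in X1 else min(vals))
        gain = t[0]                    # normalize at the reference state 0
        h = [v - gain for v in t]      # relative (bias-like) values
    return gain, h
```

For this instance the gain converges to $9/7 \approx 1.286$, which matches the saddle-point value obtained by enumerating the four stationary strategy pairs of the toy game directly.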
References
[1] Ehrenfeucht, A., Mycielski, J., Positional strategies for mean payoff games, International Journal of Game Theory 8 (1979), 109–113.
[2] Lozovanu, D., Pickl, S., "Optimization and Multiobjective Control of Time-Discrete Systems", Springer, 2009.
[3] Lozovanu, D., Pickl, S., ”Optimization of Stochastic Discrete Systems and
Control on Complex Networks”, Springer, 2015.
[4] Puterman, M., ”Markov Decision Processes: Discrete Stochastic Dynamic
Programming”, John Wiley, New Jersey, 2005.
[5] Zwick, U., Paterson, M., The complexity of mean payoff games on graphs, Theoretical Computer Science 158 (1996), 343–359.