Multigrid methods for two-player zero-sum stochastic games

Sylvie Detournay and Marianne Akian
INRIA Saclay and CMAP, École Polytechnique (France)
European Multi-Grid Conference
September 19-23, 2010
Ischia, Italy
Sylvie Detournay (INRIA and CMAP)
MG for zero-sum stochastic games
EMG 2010
1 / 22
DP for zero-sum stochastic games
Dynamic programming equation of zero-sum
two-player stochastic games
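\[
v(x) = \max_{\alpha \in A(x)} \min_{\beta \in B(x,\alpha)} \Big[ \sum_{y \in X} \gamma P(y \mid x, \alpha, \beta)\, v(y) + r(x, \alpha, \beta) \Big] \quad \forall x \in X \qquad \text{(DP)}
\]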
v(x): the value of the game starting at x ∈ X, where X is the state space
α, β: actions of the first and second players, MAX and MIN
r(x, α, β): reward paid by MIN to MAX
P(y | x, α, β): transition probability from x to y given the actions α, β
γ ≤ 1: discount factor
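To make the (DP) equation concrete, here is a minimal sketch that applies its right-hand side to a vector v and iterates to the fixed point. It uses plain value iteration, not the policy iteration discussed in these slides, and it assumes γ < 1, that every state has the same number of actions, and a dense array layout (`P[x, a, b, y]`, `r[x, a, b]`); the names `shapley_operator` and `value_iteration` and the random test game are illustrative only.

```python
import numpy as np

def shapley_operator(v, P, r, gamma):
    """Apply the right-hand side of (DP) to v.

    P has shape (nX, nA, nB, nX): P[x, a, b, y] = P(y | x, a, b).
    r has shape (nX, nA, nB):     r[x, a, b]    = r(x, a, b).
    """
    F = gamma * (P @ v) + r            # F[x, a, b] = F(v; (x, a), b)
    return F.min(axis=2).max(axis=1)   # min over MIN's action, then max over MAX's

def value_iteration(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v <- T(v); a contraction in the sup norm when gamma < 1."""
    v = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v_new = shapley_operator(v, P, r, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Small random game (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
nX, nA, nB = 5, 3, 4
P = rng.random((nX, nA, nB, nX))
P /= P.sum(axis=-1, keepdims=True)     # each P[x, a, b, :] is a probability vector
r = rng.standard_normal((nX, nA, nB))
gamma = 0.9
v = value_iteration(P, r, gamma)
```

The returned `v` satisfies (DP) up to the tolerance; value iteration converges only linearly with rate γ, which is one reason to prefer policy iteration when γ is close to 1.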
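Value of the game
\[
v(x) = \max_{(\alpha_k)_{k \ge 0}} \min_{(\beta_k)_{k \ge 0}} E\Big[ \sum_{k=0}^{\infty} \gamma^k r(X_k, \alpha_k, \beta_k) \Big]
\]
where the state dynamics is given by a process X_k:
\[
P(X_{k+1} = y \mid X_k = x, \alpha_k = \alpha, \beta_k = \beta) = P(y \mid x, \alpha, \beta)
\]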
A deterministic zero-sum game
Deterministic zero-sum two-player game
The circles (resp. squares) represent the nodes at which
Max (resp. Min) can play.
[Figure: directed graph of the game, with Max nodes drawn as circles, Min nodes drawn as squares (labelled 1′ to 4′), and a weight on each edge.]
Values in the (DP) equation:
X = {Max nodes}
A(x) = {Min nodes accessible from x}
B(x, α) = {Max nodes accessible from α}
r(x, α, β) = weight(x, α) + weight(α, β)
y = β
γ = 1
If Max initially moves to 2′, he eventually loses 5 per turn.
But if Max initially moves to 1′, he eventually loses only (1 + 0 + 2 + 3)/2 = 3 per turn.
When γ < 1 is close to 1, the strategy for Max starting from node 1 is α(1) = 1′.
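The link between the per-turn loss and the discounted game with γ close to 1 can be checked numerically. The sketch below takes the four edge losses 1, 0, 2, 3 of the cycle quoted above, groups them into two turns (the grouping is hypothetical, since a turn is one Max move plus one Min move and the exact edge order is not recoverable here), and compares the mean loss per turn with the normalized discounted loss (1 − γ) Σ γ^k ℓ_k. The function name is illustrative.

```python
def normalized_discounted_loss(cycle_losses, gamma):
    """(1 - gamma) * sum_k gamma^k * loss_k for a loss sequence that repeats
    cycle_losses forever (closed form of the geometric series)."""
    period = len(cycle_losses)
    one_period = sum(gamma**j * loss for j, loss in enumerate(cycle_losses))
    return (1.0 - gamma) * one_period / (1.0 - gamma**period)

# Hypothetical grouping of the edge losses 1, 0, 2, 3 into two turns.
turn_losses = [1 + 0, 2 + 3]
mean_loss_per_turn = sum(turn_losses) / len(turn_losses)   # (1 + 0 + 2 + 3) / 2

# As gamma approaches 1, the normalized discounted loss approaches the mean.
near_one = normalized_discounted_loss(turn_losses, 0.9999)
```

The limit does not depend on how the four edges are grouped into turns, only on the total loss over the cycle and the number of turns.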
Optimal strategies and dynamic programming
\[
v(x) = \max_{\alpha \in A(x)} \min_{\beta \in B(x,\alpha)} \underbrace{\sum_{y \in X} \gamma P(y \mid x, \alpha, \beta)\, v(y) + r(x, \alpha, \beta)}_{F(v;\,(x,\alpha),\,\beta)}
\]
ᾱ(x): policy maximizing the (DP) equation for MAX
β̄(x, α): policy minimizing F(v; (x, α), β) for MIN
α_k(X_k, α_{k−1}, β_{k−1}, ...) = ᾱ(X_k)
β_k(X_k, α_k, α_{k−1}, β_{k−1}, ...) = β̄(X_k, ᾱ(X_k))
Method to solve the (DP) equations: policy iteration algorithm
[Howard, 60 (one-player game)], [Hoffman and Karp, 66 (two-player, ergodic case)]
Dynamic programming equation of zero-sum
two-player stochastic differential games
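PDE of Hamilton-Jacobi-Bellman or Isaacs (diffusion problems)
\[
-\lambda v(x) + \max_{\alpha \in A(x)} \min_{\beta \in B(x,\alpha)} \Big[ \sum_{i} g_i(x, \alpha, \beta) \frac{\partial v(x)}{\partial x_i} + \frac{1}{2} \sum_{ij} a_{ij}(x, \alpha, \beta) \frac{\partial^2 v(x)}{\partial x_i \partial x_j} + r(x, \alpha, \beta) \Big] = 0 \qquad \text{(I)}
\]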
Discretisation with monotone schemes of (I) yields (DP)
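As an illustration of why a monotone scheme leads to an equation of the form (DP), consider a one-dimensional analogue of (I) for fixed actions (α, β), with a ≥ 0 and mesh size h. The symbols h, g^±, c_± and c are introduced here for the sketch only; upwind differencing is one possible monotone scheme, not necessarily the one used by the authors.

\[
-\lambda v(x) + g\, v'(x) + \tfrac{a}{2}\, v''(x) + r = 0, \qquad g^{+} = \max(g, 0), \quad g^{-} = \max(-g, 0).
\]
Upwind differences for the first-order term and centred differences for the second-order term give
\[
g\, v'(x) \approx g^{+}\, \frac{v(x+h) - v(x)}{h} + g^{-}\, \frac{v(x-h) - v(x)}{h}, \qquad v''(x) \approx \frac{v(x+h) - 2 v(x) + v(x-h)}{h^2}.
\]
With \( c_{\pm} = g^{\pm}/h + a/(2h^2) \) and \( c = c_{+} + c_{-} > 0 \), the discrete equation reads
\[
(\lambda + c)\, v(x) = c_{+}\, v(x+h) + c_{-}\, v(x-h) + r,
\]
that is,
\[
v(x) = \gamma \Big[ \frac{c_{+}}{c}\, v(x+h) + \frac{c_{-}}{c}\, v(x-h) \Big] + \frac{r}{\lambda + c}, \qquad \gamma = \frac{c}{\lambda + c} \le 1 .
\]
The weights c_±/c are nonnegative and sum to 1, so they play the role of the transition probabilities P(y | x, α, β), and r/(λ + c) plays the role of the reward. In this sketch γ and the rescaled reward depend on (x, α, β); taking the max over α and the min over β of the right-hand side recovers the structure of (DP).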
Motivation
Solving dynamic programming equations arising from the
discretization of Hamilton-Jacobi-Bellman or Isaacs
equations
for example, variational inequalities, diffusion problems, finance, ...
Solving large-scale zero-sum stochastic games (with
discrete state space)
for example, problems arising from the web, problems in verification of
programs in computer science, ...
→ Use the policy iteration algorithm combined with
multigrid to solve the dynamic programming equations
Policy Iteration (PI) Algorithm
Policy Iteration (PI) Algorithm for games
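\[
v(x) = \max_{\alpha \in A(x)} \underbrace{\min_{\beta \in B(x,\alpha)} \Big[ \sum_{y \in X} \gamma P(y \mid x, \alpha, \beta)\, v(y) + r(x, \alpha, \beta) \Big]}_{F(v;\,x,\,\alpha)}
\]
Start with ᾱ_0
1. The value v^{n+1} of policy ᾱ_n is the solution of
\[ v^{n+1}(x) = F(v^{n+1}; x, \bar\alpha_n(x)) \quad \forall x \in X. \]
2. Improve the policy: find ᾱ_{n+1} optimal for v^{n+1}:
\[ \bar\alpha_{n+1}(x) \in \operatorname*{argmax}_{\alpha \in A(x)} F(v^{n+1}; x, \alpha) \quad \forall x \in X. \]
Step 1 is solved by PI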
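The two nested loops can be sketched as follows for a finite game with γ < 1. This is a minimal illustration, not the authors' implementation: each linear system is solved with a dense direct solver (`np.linalg.solve`) where AMGπ would use AMG, all states are assumed to share the same action sets, and the array layout, function names, tie-breaking tolerance and random test game are assumptions made for the example.

```python
import numpy as np

def evaluate(P, r, gamma, alpha, beta):
    """Solve v = gamma * P_{alpha,beta} v + r_{alpha,beta} for fixed policies."""
    n = P.shape[0]
    idx = np.arange(n)
    P_ab = P[idx, alpha, beta]                 # shape (n, n)
    r_ab = r[idx, alpha, beta]                 # shape (n,)
    return np.linalg.solve(np.eye(n) - gamma * P_ab, r_ab)

def inner_pi(P, r, gamma, alpha, beta, max_iter=1000):
    """Internal loop: MAX's policy alpha is frozen, MIN improves beta."""
    idx = np.arange(P.shape[0])
    for _ in range(max_iter):
        v = evaluate(P, r, gamma, alpha, beta)
        F = gamma * (P[idx, alpha] @ v) + r[idx, alpha]    # F[x, b]
        new_beta = F.argmin(axis=1)
        # Keep the current action unless another one is strictly better.
        keep = F[idx, beta] <= F[idx, new_beta] + 1e-12
        new_beta = np.where(keep, beta, new_beta)
        if np.array_equal(new_beta, beta):
            break
        beta = new_beta
    return v, beta

def policy_iteration(P, r, gamma, max_iter=1000):
    """External loop: step 1 (inner PI) then step 2 (improve MAX's policy)."""
    n = P.shape[0]
    idx = np.arange(n)
    alpha = np.zeros(n, dtype=int)
    beta = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        v, beta = inner_pi(P, r, gamma, alpha, beta)       # step 1
        F = (gamma * (P @ v) + r).min(axis=2)              # F[x, a] = F(v; x, a)
        new_alpha = F.argmax(axis=1)                       # step 2
        keep = F[idx, alpha] >= F[idx, new_alpha] - 1e-12
        new_alpha = np.where(keep, alpha, new_alpha)
        if np.array_equal(new_alpha, alpha):
            break
        alpha = new_alpha
    return v, alpha, beta

# Small random game (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
nX, nA, nB = 5, 3, 4
P = rng.random((nX, nA, nB, nX))
P /= P.sum(axis=-1, keepdims=True)
r = rng.standard_normal((nX, nA, nB))
gamma = 0.9
v, alpha, beta = policy_iteration(P, r, gamma)
```

On exit, `v` satisfies the (DP) equation up to the tie-breaking tolerance, and `alpha`, `beta` are the corresponding stationary policies.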
\[
v(x) = \max_{\alpha \in A(x)} \min_{\beta \in B(x,\alpha)} \underbrace{\sum_{y \in X} \gamma P(y \mid x, \alpha, \beta)\, v(y) + r(x, \alpha, \beta)}_{F(v;\,(x,\alpha(x)),\,\beta)}
\]
Start with β̄_{n,0}
1. The value v^{n,k+1} of policy β̄_{n,k} is the solution of
\[ v^{n,k+1}(x) = F(v^{n,k+1}; (x, \bar\alpha_n(x)), \bar\beta_{n,k}(x)) \quad \forall x \in X. \]
2. Improve the policy: find β̄_{n,k+1} optimal for v^{n,k+1}
[Diagram: three nested loops. The external policy iteration (PIext) produces α_0, ..., α_n; for each of them the internal policy iteration (PIint) produces β_{0,0}, ..., β_{0,k}; for each of those, AMG produces the iterates v^{0,0,0}, ..., v^{0,0,l}.]
(v^n)_{n≥1} is nondecreasing; (v^{n,k})_{k≥1} is nonincreasing
PI stops after finitely many iterations when the sets of actions are finite
Internal loop (one-player game): PI ≈ Newton algorithm
where differentials are replaced by superdifferentials of the
(DP) operator
External loop (two-player game): PI ≈ Newton algorithm
where the (DP) operator is approximated by piecewise
affine and concave maps
→ expect superlinear convergence in good cases
Policy Iteration Algorithm and AMG - (AMGπ)
AMG for a linear system Ax = b
Setup phase:
construct “grids” based on the elements of the matrix A
define the interpolation I, with (I)_ij ≈ a_ij / (some factor), and the restriction R = I^T
Solving phase (two grids):
v ← apply ν1 relaxations on the fine level to v
v ← v + Iw, where w is the solution of RAIw = R(b − Av) (on the coarse grid)
v ← apply ν2 relaxations on the fine level to v
When applied recursively → V-cycle, W-cycle, ...
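The solving phase above can be sketched as a two-grid cycle on a small model problem. To keep the example short, the setup phase is replaced by geometric linear interpolation on a 1D Poisson matrix; a true AMG setup would instead build the coarse grid and I from the entries a_ij. The weighted Jacobi smoother, the problem size and all function names are assumptions made for illustration.

```python
import numpy as np

def poisson_1d(n):
    """Tridiagonal matrix of -u'' on n interior points of (0, 1) (an M-matrix)."""
    h = 1.0 / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def linear_interpolation(n_coarse):
    """Interpolation I from n_coarse coarse points to 2 * n_coarse + 1 fine points."""
    I = np.zeros((2 * n_coarse + 1, n_coarse))
    for j in range(n_coarse):
        I[2 * j : 2 * j + 3, j] = [0.5, 1.0, 0.5]   # coarse point j sits at fine index 2j + 1
    return I

def relax(A, v, b, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi relaxation."""
    d = np.diag(A)
    for _ in range(sweeps):
        v = v + omega * (b - A @ v) / d
    return v

def two_grid(A, b, v, I, nu1=2, nu2=2):
    R = I.T                                          # restriction R = I^T
    v = relax(A, v, b, nu1)                          # nu1 pre-relaxations
    w = np.linalg.solve(R @ A @ I, R @ (b - A @ v))  # coarse problem RAIw = R(b - Av)
    v = v + I @ w                                    # coarse-grid correction
    return relax(A, v, b, nu2)                       # nu2 post-relaxations

n_coarse = 31
I = linear_interpolation(n_coarse)
A = poisson_1d(2 * n_coarse + 1)
b = np.ones(A.shape[0])
v = np.zeros_like(b)
residuals = [np.linalg.norm(b - A @ v)]
for _ in range(8):
    v = two_grid(A, b, v, I)
    residuals.append(np.linalg.norm(b - A @ v))
```

Replacing the direct coarse solve by a recursive call to the same cycle on RAI gives a V-cycle; calling it twice per level gives a W-cycle.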
AMGπ
Combine PI for two-player games and AMG:
Apply AMG to v = γPv + r in the internal loop of PI
→ AMGπ
Previous work in stochastic control (one-player games):
MG + PI in [Hoppe, 86-87], [Akian, 88-90]
AMG + learning methods [Ziv and Shimkin, 05]
→ two-player games have not been considered before
AMG for the discrete case
Interesting for zero-sum stochastic games with discrete
state space
But AMG does not always work efficiently for problems with
random M-matrices. An adaptation of AMG is needed.
Current numerical tests are on Hamilton-Jacobi-Bellman
or Isaacs equations; discretisation of these problems =
discretisation of linear elliptic equations → good for AMG
Numerical tests for AMGπ
Example on an Isaacs equation
Dynamic programming equation
\[
\Delta v + \|\nabla v\|_2 - 0.5\, \|\nabla v\|_2^2 + f = 0 \ \text{in } \Omega, \qquad v = g \ \text{on } \partial\Omega,
\]
where
\[
\|\nabla v\|_2 = \max_{\|\alpha\|_2 \le 1} (\alpha \cdot \nabla v), \qquad -\frac{\|\nabla v\|_2^2}{2} = \min_{\beta} \Big( \beta \cdot \nabla v + \frac{\|\beta\|_2^2}{2} \Big),
\]
with v(x_1, x_2) = sin(x_1) × sin(x_2) on Ω = [0, 1] × [0, 1].
[Figure: surface plot of v over Ω.]
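The two identities that turn the nonlinear terms into a max over α and a min over β can be checked numerically for a given gradient vector p = ∇v. In this sketch the maximizer α = p/‖p‖₂ and the minimizer β = −p follow from the Cauchy-Schwarz inequality and from completing the square; the sample vector and the random sampling are illustrative assumptions.

```python
import numpy as np

def max_player_term(p):
    """max over ||alpha||_2 <= 1 of alpha . p, attained at alpha = p / ||p||_2."""
    norm = np.linalg.norm(p)
    alpha = p / norm if norm > 0 else np.zeros_like(p)
    return alpha @ p, alpha

def min_player_term(p):
    """min over beta of beta . p + ||beta||_2^2 / 2, attained at beta = -p."""
    beta = -p
    return beta @ p + 0.5 * (beta @ beta), beta

p = np.array([0.3, -1.2])                 # hypothetical value of the gradient
val_max, alpha_star = max_player_term(p)  # equals ||p||_2
val_min, beta_star = min_player_term(p)   # equals -||p||_2^2 / 2

# Random candidate actions never beat the closed-form optimizers.
rng = np.random.default_rng(0)
alphas = rng.standard_normal((10_000, 2))
alphas /= np.maximum(1.0, np.linalg.norm(alphas, axis=1, keepdims=True))  # scale into the unit ball
betas = 3.0 * rng.standard_normal((10_000, 2))
sampled_max = (alphas @ p).max()
sampled_min = (betas @ p + 0.5 * (betas**2).sum(axis=1)).min()
```

In a discretised version of the equation these optimizers give the policy improvement step in closed form at every grid node, once ∇v is replaced by its finite-difference approximation.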
Numerical results with a 1025 × 1025 points grid
n = current iteration for MAX, k = number of iterations for MIN

AMGπ
n   k   ‖r‖∞       ‖r‖L2      ‖e‖∞      ‖e‖L2     cpu time (s)
1   3   8.51e-7    5.96e-7    4.47e-2   2.48e-2   2.65e+1
2   2   2.44e-8    6.16e-9    1.84e-4   1.05e-4   4.59e+1
3   1   7.92e-13   2.02e-13   4.13e-6   2.16e-6   5.56e+1

Policy iteration with LU
n   k   ‖r‖∞       ‖r‖L2      ‖e‖∞      ‖e‖L2     cpu time (s)
1   3   8.51e-7    5.96e-7    4.47e-2   2.48e-2   1.40e+2
2   2   2.44e-8    6.16e-9    1.84e-4   1.05e-4   2.31e+2
3   1   7.38e-13   2.03e-13   4.13e-6   2.16e-6   2.77e+2

Only 6 linear systems are solved.
AMG parameters: C/F relaxation, W-cycles
Numerical tests for AMGπ, VI problems
Variational inequalities problem (VI)
optimal stopping time for first player
\[
\begin{cases}
\max\big[ \Delta v - 0.5\, \|\nabla v\|_2^2 + f,\ \phi - v \big] = 0 & \text{in } \Omega \\
v = \phi & \text{on } \partial\Omega
\end{cases}
\]
MAX chooses between play or stop (#A(x) = 2) and receives φ when he stops.
MIN controls the ‖∇v‖₂² term.
The solution on Ω = [0, 1] × [0, 1] is given by:
[Figure: surface plot of the solution over Ω.]
VI with 129 × 129 points grid
[Figures: snapshots at iterations = 100, 200, 300, 400, 500 and 600.]
Iteration 700 is reached in ≈ 8148 seconds: slow convergence
The number of policy iterations is bounded by #{possible policies}
→ can be exponential in #X
Like Newton → improve with a good initial guess? → FMG
Full Multilevel AMGπ
Full Multilevel AMGπ
define the problem on each coarse grid ΩH
Interpolation of strategies and value
AMG Policy Iterations
interpolation of value v and strategies α, β
stopping criterion for AMGπ: ‖r‖_{L2} < cH² (with c = 0.1)
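The idea of starting policy iteration on each grid from the interpolated coarse-grid strategy can be sketched on a deliberately simplified stand-in: a one-player, one-dimensional optimal stopping problem max(v'' + f, φ − v) = 0 on (0, 1) with v = 0 at both ends. This is not the two-player solver of the slides: linear systems are solved directly instead of with AMG (so the cH² stopping criterion is not exercised, and only the strategy, not the value, needs interpolating), and the obstacle φ, the grid sizes and all function names are assumptions made for the example.

```python
import numpy as np

def laplacian(n):
    """Matrix A of -v'' on n interior points of (0, 1), zero boundary values."""
    h = 1.0 / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

def prolong(n_coarse):
    """Linear interpolation from n_coarse to 2 * n_coarse + 1 interior points."""
    I = np.zeros((2 * n_coarse + 1, n_coarse))
    for j in range(n_coarse):
        I[2 * j : 2 * j + 3, j] = [0.5, 1.0, 0.5]
    return I

def howard(A, f, phi, stop, max_iter=500):
    """Policy iteration for max(-A v + f, phi - v) = 0; stop[i] is MAX's action at node i."""
    n = len(f)
    for it in range(1, max_iter + 1):
        M = np.where(stop[:, None], np.eye(n), A)    # row i: v_i = phi_i, or (A v)_i = f_i
        v = np.linalg.solve(M, np.where(stop, phi, f))
        play, quit_ = f - A @ v, phi - v
        # Change an action only if the other one is strictly better.
        new_stop = np.where(stop, quit_ >= play - 1e-9, quit_ > play + 1e-9)
        if np.array_equal(new_stop, stop):
            return v, stop, it
        stop = new_stop
    return v, stop, max_iter

def full_multilevel(n_coarsest, levels, f_fun, phi_fun):
    """Solve on the coarsest grid, then warm-start each finer grid."""
    n, stop, history = n_coarsest, np.zeros(n_coarsest, dtype=bool), []
    for level in range(levels):
        if level > 0:
            stop = prolong(n) @ stop.astype(float) > 0.75   # interpolate the strategy
            n = 2 * n + 1
        x = np.linspace(0.0, 1.0, n + 2)[1:-1]
        v, stop, it = howard(laplacian(n), f_fun(x), phi_fun(x), stop)
        history.append((n, it))
    return x, v, history

f_fun = lambda x: np.zeros_like(x)
phi_fun = lambda x: 0.2 - 2.0 * (x - 0.5) ** 2           # hypothetical obstacle
x, v, history = full_multilevel(3, 6, f_fun, phi_fun)    # grids with 3, 7, ..., 127 points
A = laplacian(len(x))
_, _, cold_iterations = howard(A, f_fun(x), phi_fun(x), np.zeros(len(x), dtype=bool))
```

Comparing the last entry of `history` with `cold_iterations` shows how many policy iterations the finest grid needs with and without the coarse-grid initial guess.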
Numerical test for Full multilevel AMGπ
Full multilevel AMGπ
Ω = [0, 1] × [0, 1], 1025 nodes in each direction
ΩH coarse grids (number of nodes in each direction)
n = current iteration for MAX, k = number of iterations for MIN

ΩH     n   k   ‖r‖∞      ‖r‖L2      ‖e‖∞      ‖e‖L2     cpu time (s)
3      1   1   2.17e-1   2.17e-1    1.53e-1   1.53e-1   << 1
3      2   1   1.14e-2   1.14e-2    3.30e-2   3.30e-2   << 1
5      1   2   2.17e-4   8.26e-5    3.02e-2   1.71e-2   << 1
9      1   2   4.99e-3   1.06e-3    1.65e-2   7.99e-3   << 1
9      2   1   2.68e-3   5.41e-4    1.66e-2   8.15e-3   << 1
9      3   1   2.72e-4   5.49e-5    1.68e-2   8.30e-3   << 1
513    1   1   2.57e-7   4.04e-9    3.15e-4   1.33e-4   2.62
1025   1   1   1.31e-7   1.90e-9    1.57e-4   6.63e-5   1.17e+1
1025   2   1   6.77e-8   5.83e-10   1.57e-4   6.62e-5   2.11e+1
Conclusion
The full multilevel scheme can make policy
iteration faster and more efficient!
Can we prove it?
(see [Akian, 90] for smooth HJB equations)
Can we generalize it to stochastic games with finite
state space?
Finding a polynomial-time algorithm for zero-sum
games is an open problem
Thank you!