Why Value Iteration Runs in Pseudo-Polynomial Time for
Discounted-Payoff Games
Axel Haddad¹ and Benjamin Monmege²
¹ Université de Mons, Belgium
thomas.brihaye,[email protected]
² Université libre de Bruxelles, Belgium
gigeerae,[email protected]
Abstract. In this note, we give the technical proof showing why the value iteration algorithm of Zwick and Paterson for discounted-payoff games runs in pseudo-polynomial time (and
not in polynomial time), i.e., in time polynomial in the size of the game and exponential in
the size of the representation of the discount factor.
We fix a λ-discounted-payoff game $G = \langle V, E, \omega, \mathrm{DP}_\lambda \rangle$ with $0 < \lambda < 1$ a rational. Vertices are
partitioned into those of player 1 ($V_1$) and those of player 2 ($V_2$). We only consider memoryless strategies
$\sigma_1$ and $\sigma_2$ for the two players, since they suffice for optimal behaviour, as shown in [1]: recall that such a
strategy simply maps each vertex of the player to one of its successors. Such a pair of strategies,
when started in an initial vertex $v_0$, defines a unique infinite play $\pi = \mathrm{Play}(v_0, \sigma_1, \sigma_2)$. The payoff of this
play $\pi = v_0 v_1 v_2 \cdots$ is defined as the λ-discounted sum of the weights encountered along the play, normalised by the factor $1 - \lambda$ that is used throughout the proofs of this note, i.e.
\[
\mathrm{DP}_\lambda(\pi) = (1 - \lambda) \sum_{i=0}^{\infty} \lambda^i \, \omega(v_i, v_{i+1}) .
\]
Then, the optimal value for player 1 from a vertex v is the minimal payoff that he can guarantee starting
in v no matter what strategy is chosen by player 2. Symmetrically, the optimal value for player 2 from
a vertex v is the maximal payoff that he can guarantee starting in v no matter what strategy is chosen
by player 1. Discounted-payoff games are known to be determined, meaning that these two values are
identical, henceforth denoted by Val(v).
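As an illustration of the payoff definition, here is a minimal Python sketch that evaluates the discounted payoff of an ultimately periodic play exactly, using rational arithmetic. It assumes the $(1 - \lambda)$-normalised sum used in the derivations of this note; the function name `lasso_payoff` and the encoding of a play as two lists of weights are hypothetical choices made for this example only.

```python
from fractions import Fraction

def lasso_payoff(lam, prefix_weights, cycle_weights):
    """(1 - lam)-normalised discounted sum of a play whose edges carry
    prefix_weights once, then cycle_weights repeated forever."""
    prefix = sum(lam**i * w for i, w in enumerate(prefix_weights))
    cycle = sum(lam**i * w for i, w in enumerate(cycle_weights))
    # Geometric series over the infinitely many repetitions of the cycle.
    tail = lam**len(prefix_weights) * cycle / (1 - lam**len(cycle_weights))
    return (1 - lam) * (prefix + tail)

# Hypothetical play: one edge of weight 3, then a cycle with weights 1 and 2.
print(lasso_payoff(Fraction(1, 2), [3], [1, 2]))
```

With this normalisation, a play that only sees the constant weight w has payoff exactly w, whatever the discount factor.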
We now take the discount factor λ to be the rational $a/b$ for two integers $0 < a < b$. We give the
sequence of arguments showing the correctness, and then the complexity, of the value iteration algorithm
of [1], which computes the optimal value Val(v) of every vertex v of the game G.
Lemma 1. For all vertices $v \in V$, there exists $N \in \mathbb{Z}$ such that $\mathrm{Val}(v) = \frac{N}{D}$, with $D = b^{|V|} \prod_{j=1}^{|V|} (b^j - a^j)$.
Proof. Let $\sigma_1$ and $\sigma_2$ be memoryless optimal strategies for both players. Thus, if we let $\pi = \mathrm{Play}(v, \sigma_1, \sigma_2)$,
we have $\mathrm{Val}(v) = \mathrm{DP}_\lambda(\pi)$. Furthermore, we know that π is of the form $\pi = v_0 \cdots v_k (u_0 \cdots u_\ell)^\omega$ with
$0 \le k, \ell < |V|$. Therefore the following holds:
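\[
\begin{aligned}
\mathrm{Val}(v) &= \mathrm{DP}_\lambda(\pi) \\
&= (1 - \lambda) \Big[ \sum_{i=0}^{k-1} \lambda^i \omega(v_i, v_{i+1}) + \lambda^k \omega(v_k, u_0)
   + \lambda^{k+1} \sum_{r=0}^{\infty} \lambda^{(\ell+1) r} \Big( \sum_{i=0}^{\ell-1} \lambda^i \omega(u_i, u_{i+1}) + \lambda^\ell \omega(u_\ell, u_0) \Big) \Big] \\
&= \frac{b-a}{b} \Big[ \sum_{i=0}^{k-1} \frac{b^{k-i} a^i \, \omega(v_i, v_{i+1})}{b^k} + \frac{a^k}{b^k} \omega(v_k, u_0)
   + \frac{\lambda^{k+1}}{1 - \lambda^{\ell+1}} \Big( \sum_{i=0}^{\ell-1} \frac{b^{\ell-i} a^i \, \omega(u_i, u_{i+1})}{b^\ell} + \frac{a^\ell}{b^\ell} \omega(u_\ell, u_0) \Big) \Big] \\
&= \frac{N_1}{b^{k+1}} + \frac{a^{k+1}}{b^{k+1}} \cdot \frac{b^{\ell+1}}{b^{\ell+1} - a^{\ell+1}} \cdot \frac{N_2}{b^{\ell+1}}
\end{aligned}
\]
where
\[
N_1 = (b - a) \Big[ \sum_{i=0}^{k-1} b^{k-i} a^i \, \omega(v_i, v_{i+1}) + a^k \omega(v_k, u_0) \Big] ,
\qquad
N_2 = (b - a) \Big[ \sum_{i=0}^{\ell-1} b^{\ell-i} a^i \, \omega(u_i, u_{i+1}) + a^\ell \omega(u_\ell, u_0) \Big]
\]
are two integers. We further simplify the formula:
\[
\begin{aligned}
\mathrm{Val}(v) &= \frac{N_1}{b^{k+1}} + \frac{a^{k+1} N_2}{b^{k+1} (b^{\ell+1} - a^{\ell+1})} \\
&= \frac{N_3}{b^{k+1} (b^{\ell+1} - a^{\ell+1})} \\
&= \frac{N}{b^{|V|} \prod_{j=1}^{|V|} (b^j - a^j)}
\end{aligned}
\]
where
\[
N_3 = N_1 (b^{\ell+1} - a^{\ell+1}) + a^{k+1} N_2
\qquad \text{and} \qquad
N = N_3 \, b^{|V| - k - 1} \prod_{1 \le j \le |V|,\; j \ne \ell+1} (b^j - a^j)
\]
are also integers (recall that $k < |V|$ and $\ell + 1 \le |V|$). □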
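Lemma 1 can be checked numerically on a single play. The following Python sketch, under the assumption of integer weights and a hand-picked play (the names `lemma1_denominator` and `lasso_value` and the example data are hypothetical), computes the payoff exactly and verifies that multiplying it by D gives an integer.

```python
from fractions import Fraction
from math import prod

def lemma1_denominator(a, b, n_vertices):
    """D = b^|V| * prod_{j=1..|V|} (b^j - a^j), as in Lemma 1."""
    return b**n_vertices * prod(b**j - a**j for j in range(1, n_vertices + 1))

def lasso_value(a, b, prefix_weights, cycle_weights):
    """(1 - lam)-normalised discounted sum of prefix_weights followed by
    cycle_weights repeated forever, with lam = a / b."""
    lam = Fraction(a, b)
    prefix = sum(lam**i * w for i, w in enumerate(prefix_weights))
    cycle = sum(lam**i * w for i, w in enumerate(cycle_weights))
    tail = lam**len(prefix_weights) * cycle / (1 - lam**len(cycle_weights))
    return (1 - lam) * (prefix + tail)

# Hypothetical play on |V| = 3 vertices: one prefix edge (k = 0),
# then a cycle made of two edges (l = 1).
a, b, n_vertices = 2, 3, 3
value = lasso_value(a, b, [4], [1, -5])
D = lemma1_denominator(a, b, n_vertices)
N = D * value
print(value, D, N)
assert N.denominator == 1  # D * Val(v) is an integer, as the lemma states
```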
Lemma 2. For all vertices $v \in V$ and $x \in \mathbb{R}$ such that $\mathrm{Val}(v) - \frac{1}{2D} < x < \mathrm{Val}(v) + \frac{1}{2D}$,
\[
\mathrm{Val}(v) = \frac{\lfloor D x + \frac{1}{2} \rfloor}{D} .
\]
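Proof. We have
\[
\mathrm{Val}(v) - \frac{1}{2D} < x < \mathrm{Val}(v) + \frac{1}{2D} ,
\]
which implies
\[
\mathrm{Val}(v) < x + \frac{1}{2D} < \mathrm{Val}(v) + \frac{1}{D} .
\]
Multiplying by D, which is positive, we obtain
\[
D \, \mathrm{Val}(v) < D x + \frac{1}{2} < D \, \mathrm{Val}(v) + 1 .
\]
Since $D \, \mathrm{Val}(v) = N$ is an integer, this proves that
\[
D \, \mathrm{Val}(v) = \Big\lfloor D x + \frac{1}{2} \Big\rfloor ,
\]
which allows us to conclude. □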
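The rounding step of Lemma 2 is easy to try out. The sketch below uses illustrative numbers ($D = 12$ and an exact value of $4/3$, both assumptions of this example) and a hypothetical helper `round_to_multiple`; it shows that any approximation strictly within $1/(2D)$ of the exact value is rounded back to it, and that the guarantee is lost outside that window.

```python
from fractions import Fraction
from math import floor

def round_to_multiple(x, D):
    """Return floor(D*x + 1/2) / D, i.e. x rounded to a multiple of 1/D."""
    return Fraction(floor(D * x + Fraction(1, 2)), D)

D = 12
exact = Fraction(4, 3)  # a multiple of 1/D, as guaranteed by Lemma 1

# Errors strictly smaller than 1/(2D) = 1/24 are corrected by the rounding.
for error in (Fraction(1, 25), Fraction(-1, 25), Fraction(0)):
    assert round_to_multiple(exact + error, D) == exact

# An error larger than 1/24 may land on a neighbouring multiple of 1/D.
print(round_to_multiple(exact + Fraction(1, 20), D))
```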
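Now define a functional $F : \mathbb{R}^V \to \mathbb{R}^V$ as follows: for all $X \in \mathbb{R}^V$ and $v \in V$,
\[
F(X)(v) =
\begin{cases}
\min_{v' \in E(v)} \big[ (1 - \lambda) \, \omega(v, v') + \lambda X(v') \big] & \text{if } v \in V_1 \\
\max_{v' \in E(v)} \big[ (1 - \lambda) \, \omega(v, v') + \lambda X(v') \big] & \text{if } v \in V_2
\end{cases}
\]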
By [1], we know that Val is the unique fixed point of F. Moreover, the operator F is λ-contracting, that is,
\[
\| F(X) - F(Y) \| \le \lambda \| X - Y \|
\]
for all vectors $X, Y \in \mathbb{R}^V$, where $\| X \| = \sup_{v \in V} |X(v)|$.
For all i, we inductively define $\mathrm{Val}_i \in \mathbb{R}^V$ as follows: $\mathrm{Val}_0(v) = 0$ for all $v \in V$ and $\mathrm{Val}_{i+1} = F(\mathrm{Val}_i)$.
Let W be the maximal absolute value of the weights of the arena.
Lemma 3. For all i, $\| \mathrm{Val}_i - \mathrm{Val} \| \le \lambda^i W$.
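Proof. Since F is λ-contracting and has Val as a fixed point, we have, for all $i \ge 1$,
\[
\| \mathrm{Val}_i - \mathrm{Val} \| = \| F(\mathrm{Val}_{i-1}) - F(\mathrm{Val}) \| \le \lambda \| \mathrm{Val}_{i-1} - \mathrm{Val} \| .
\]
By induction, knowing that $\mathrm{Val}_0 = 0$, we have
\[
\| \mathrm{Val}_i - \mathrm{Val} \| \le \lambda^i \| \mathrm{Val} \| .
\]
Finally, using that Val(v) is obtained as an infinite sum of weights bounded by W in absolute value,
discounted by λ, we know that
\[
| \mathrm{Val}(v) | \le (1 - \lambda) W \sum_{i=0}^{\infty} \lambda^i = W ,
\]
which allows us to conclude. □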
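The operator F and the bound of Lemma 3 can be sketched in a few lines of Python. The game below is a hypothetical two-vertex example (vertex "p" belongs to player 1, who minimises, and vertex "q" to player 2, who maximises); its values 4/3 and 8/3 for λ = 1/2 were worked out by hand for this illustration and are not taken from the original note. The dictionary encoding and the name `apply_F` are likewise assumptions.

```python
from fractions import Fraction

def apply_F(X, edges, owner, lam):
    """One application of F: player 1 (owner 1) minimises, player 2 maximises."""
    new_X = {}
    for v, successors in edges.items():
        candidates = [(1 - lam) * w + lam * X[u] for u, w in successors]
        new_X[v] = min(candidates) if owner[v] == 1 else max(candidates)
    return new_X

# Hypothetical game: edges[v] lists (successor, weight) pairs.
edges = {"p": [("p", 2), ("q", 0)], "q": [("p", 4), ("q", 1)]}
owner = {"p": 1, "q": 2}
lam = Fraction(1, 2)
W = 4  # maximal absolute weight
val = {"p": Fraction(4, 3), "q": Fraction(8, 3)}  # hand-computed values

assert apply_F(val, edges, owner, lam) == val  # Val is a fixed point of F

X = {v: Fraction(0) for v in edges}  # Val_0
for i in range(10):
    error = max(abs(X[v] - val[v]) for v in edges)
    assert error <= lam**i * W  # bound of Lemma 3
    X = apply_F(X, edges, owner, lam)  # Val_{i+1} = F(Val_i)
```

Exact rationals are used so that the comparison with the bound is not affected by floating-point rounding.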
We are now ready to prove the crucial argument for the termination of the value iteration.
Lemma 4. For all $K \ge -\frac{1}{\log_2 \lambda} \Big( \frac{|V|(|V|+3)}{2} \log_2(b) + \log_2(W) + 2 \Big)$,
\[
\| \mathrm{Val}_K - \mathrm{Val} \| < \frac{1}{2D} .
\]
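Proof. First notice that:
\[
D = b^{|V|} \prod_{j=1}^{|V|} (b^j - a^j) \le b^{|V|} \prod_{j=1}^{|V|} b^j = b^{\frac{|V|(|V|+1)}{2} + |V|} = b^{\frac{|V|(|V|+3)}{2}} .
\]
Thus $\frac{|V|(|V|+3)}{2} \log_2 b \ge \log_2 D$. Then,
\[
K \ge \frac{1}{-\log_2 \lambda} \times \big( \log_2 D + \log_2 W + \log_2 4 \big) = \log_{1/\lambda}(4 D W) .
\]
Hence,
\[
\Big( \frac{1}{\lambda} \Big)^K \ge 4 D W
\]
so that
\[
\lambda^K \le \frac{1}{4 D W}
\]
and finally
\[
\lambda^K W \le \frac{1}{4D} < \frac{1}{2D} ,
\]
which allows us to conclude by the previous lemma. □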
As a consequence of the previous lemmas, the following algorithm computes the mapping Val.
1. Initialize X as X(v) = 0 for all v.
2. Repeat $\Big\lceil -\frac{1}{\log_2 \lambda} \Big( \frac{|V|(|V|+3)}{2} \log_2 b + \log_2 W + 2 \Big) \Big\rceil$ times X := F(X).
3. For all $v \in V$, output $\mathrm{Val}(v) = \frac{\lfloor D X(v) + \frac{1}{2} \rfloor}{D}$.
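The three steps can be put together as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the game encoding (dictionaries `edges` and `owner`), the name `value_iteration` and the example game are hypothetical, weights are assumed to be integers with $W \ge 1$, and the iteration count is evaluated with floating-point logarithms, so one extra iteration is added as a guard against rounding (extra iterations never hurt, by Lemma 4).

```python
from fractions import Fraction
from math import ceil, floor, log2, prod

def value_iteration(edges, owner, a, b):
    """Exact values of a discounted-payoff game with discount factor a / b.
    edges[v] lists (successor, weight) pairs; owner[v] is 1 (min) or 2 (max)."""
    lam = Fraction(a, b)
    n = len(edges)
    W = max(abs(w) for succs in edges.values() for _, w in succs)
    D = b**n * prod(b**j - a**j for j in range(1, n + 1))
    # Step 2 bound; the "+ 1" guards against floating-point rounding.
    K = ceil(-(n * (n + 3) / 2 * log2(b) + log2(W) + 2) / log2(a / b)) + 1

    X = {v: Fraction(0) for v in edges}  # step 1
    for _ in range(K):  # step 2: X := F(X)
        X = {
            v: (min if owner[v] == 1 else max)(
                (1 - lam) * w + lam * X[u] for u, w in succs
            )
            for v, succs in edges.items()
        }
    # Step 3: round each entry to the nearest multiple of 1/D.
    return {v: Fraction(floor(D * X[v] + Fraction(1, 2)), D) for v in edges}

# Hypothetical two-vertex game with lambda = 1/2.
edges = {"p": [("p", 2), ("q", 0)], "q": [("p", 4), ("q", 1)]}
owner = {"p": 1, "q": 2}
print(value_iteration(edges, owner, 1, 2))
```

Rational arithmetic keeps every iterate exact; the price is that numerators and denominators grow with the number of iterations.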
We conclude by commenting on the complexity of this algorithm. To that end, we suppose that
$\lambda = 1 - \frac{1}{n}$ with n an integer. In particular, the size of the input has a component in $\log_2 n$ to represent λ.
With that in mind, we claim that the procedure runs in pseudo-polynomial time, i.e., polynomial in
the size of the arena (in particular in $\log_2 W$), and polynomial in n (so exponential in $\log_2 n$). The only
source of pseudo-polynomiality (as opposed to polynomiality) is the discount factor λ, and not the arena (contrary to
the mean-payoff case, where the pseudo-polynomial complexity comes from the weights in the arena). To
show the claim, it suffices to study the number of iterations performed by the algorithm (since every
iteration takes polynomial time). Notice first that $\frac{|V|(|V|+3)}{2} \log_2 b + \log_2 W + 2$ is polynomial in the size of the input,
since, in particular, $\log_2 b = \log_2 n$. It remains to bound the factor $-1/\log_2 \lambda$. When
n is large, $-1/\log_2(1 - 1/n) \sim n \ln 2$, which ends the proof of the claim.
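The growth of the iteration count can be observed numerically. The sketch below assumes $\lambda = 1 - 1/n$ as in the paragraph above, a hypothetical arena with 2 vertices and $W = 4$, and floating-point logarithms; the function name `iteration_bound` is illustrative. It prints, for several bit sizes of n, the bound of the algorithm and the ratio between $-1/\log_2(1 - 1/n)$ and $n \ln 2$.

```python
import math

def iteration_bound(n, n_vertices, W):
    """Number of iterations of the algorithm for lambda = 1 - 1/n (so b = n)."""
    inner = n_vertices * (n_vertices + 3) / 2 * math.log2(n) + math.log2(W) + 2
    return math.ceil(-inner / math.log2(1 - 1 / n))

def asymptotic_ratio(n):
    """Ratio of -1/log2(1 - 1/n) to n * ln 2; it tends to 1 as n grows."""
    return -1 / math.log2(1 - 1 / n) / (n * math.log(2))

for bits in (4, 8, 12):
    n = 2**bits
    print(bits, iteration_bound(n, 2, 4), round(asymptotic_ratio(n), 4))
```

Each additional bit in the representation of n roughly doubles the number of iterations, which is the exponential dependence on $\log_2 n$ claimed above.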
Theorem 1. The value iteration algorithm runs in pseudo-polynomial time on discounted-payoff games:
more precisely, in time polynomial in the size of the game, but exponential in the size of the representation of the discount
factor λ.
References
1. Uri Zwick and Michael S. Paterson. The complexity of mean payoff games. Theoretical Computer Science,
158:343–359, 1996.