NPS55-82-024

NAVAL POSTGRADUATE SCHOOL
Monterey, California

COMPUTATIONAL COMPARISON OF VALUE ITERATION
ALGORITHMS FOR DISCOUNTED MARKOV DECISION PROCESSES

by

L. C. Thomas
R. Hartley
A. C. Lavercombe

December 1982

Approved for public release; distribution unlimited

Prepared for:
Naval Postgraduate School
Monterey, CA 93940
NAVAL POSTGRADUATE SCHOOL
Monterey, California

Rear Admiral J. J. Ekelund                    David A. Schrady
Superintendent                                Provost

Reproduction of all or part of this report is authorized.
UNCLASSIFIED

REPORT DOCUMENTATION PAGE

Report Number: NPS55-82-034
Title: Computational Comparison of Value Iteration Algorithms for Discounted
  Markov Decision Processes
Type of Report: Technical
Authors: L. C. Thomas, R. Hartley, A. C. Lavercombe
Performing Organization: Naval Postgraduate School, Monterey, CA 93940
Controlling Office: Naval Postgraduate School, Monterey, CA 93940
Program Element, Project, Task, Work Unit: 61152N; RR000-01-10; N0001483WR30104
Report Date: December 1982
Number of Pages: 10
Security Classification: Unclassified
Distribution Statement: Approved for public release; distribution unlimited.
Key Words: Computational comparison; value iteration; decision processes
Abstract: This paper describes a computational comparison of value iteration
  algorithms for discounted Markov decision processes.
COMPUTATIONAL COMPARISON OF VALUE ITERATION ALGORITHMS
FOR DISCOUNTED MARKOV DECISION PROCESSES

L. C. Thomas* and R. Hartley
Department of Decision Theory, University of Manchester, Manchester, U.K.

and

A. C. Lavercombe
Department of Mathematics, Bristol Polytechnic, Bristol, U.K.

* At present: Department of Operations Research, Naval Postgraduate School,
Monterey, CA 93940, USA. He is a National Research Council Research
Associateship Fellow.

Abstract:

This paper describes a computational comparison of value iteration
algorithms for discounted Markov decision processes.
1. INTRODUCTION

This note describes the results of a computational comparison of value
iteration algorithms suggested for solving finite state discounted Markov
decision processes. Such a process visits a set of states S = {1,2,...,M}.
When it is in state i, one can choose an action k from the finite action set
K_i, and then receive an immediate reward r_i^k; with probability p_ij^k the
process will be in state j at the next period. The object is to maximize
v(i), the maximum discounted reward over an infinite horizon starting in
state i, where β is the discount factor. It is well known [1] that v(i)
satisfies the optimality equation

    v(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{M} p_{ij}^k v(j) \Big\}                (1.1)
We record the time for value iteration algorithms to obtain ε-optimal
solutions v_n to (1.1) (i.e. ||v_n - v||_∞ < ε, where ||v||_∞ = max_i |v(i)|)
on randomly generated problems. We look at three classes of fifteen problems,
each with β = .9 and ε = .0001, where v(i) ≈ 2,000. Class 1 problems have
100 states and between 2 and 7 actions per state; class 2 problems have 40
states and between 2 and 70 actions per state, whereas class 3 problems have
10 states and up to 500 actions per state. Details of how the problems are
generated and the computing facilities used are given in [12].
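Purely as an illustration of the shape of these test problems, the following
sketch builds one random problem of the kind described above. It is a
stand-in, not the generator of [12]: the function name random_mdp, the
uniform sampling of transition rows and rewards, and the reward scale are all
assumptions made only for illustration.

```python
import numpy as np

def random_mdp(n_states, max_actions, seed=0):
    """Illustrative stand-in for the test problems of [12] (not the actual
    generator): random transition rows and random rewards of the shape
    described in the text."""
    rng = np.random.default_rng(seed)
    P, r = [], []                                 # P[i] holds one row p_i.^k per action
    for i in range(n_states):
        n_actions = rng.integers(2, max_actions + 1)
        rows = rng.random((n_actions, n_states))
        rows /= rows.sum(axis=1, keepdims=True)   # each row is a probability vector
        P.append(rows)
        # rewards scaled so v(i) is roughly of the order quoted above at beta = .9
        r.append(200.0 * rng.random(n_actions))
    return P, r

# e.g. a class 3 style problem: 10 states, up to 500 actions per state
P, r = random_mdp(10, 500)
```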
In Section two we describe the schemes examined and the various bounds that
can be used for stopping them. Section three concentrates on one scheme that
did well in the comparison, ordinary value iteration, and looks at various
methods for eliminating non-optimal actions, both permanently and
temporarily.
2. SCHEMES AND BOUNDS

The scheme usually described as value iteration is

    v_{n+1}(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{M} p_{ij}^k v_n(j) \Big\}                (2.1)

which was discussed in [1,3]. In analogy with the notation of linear
equations we call this Pre-Jacobi (PJ). This analogy leads us to think of the
following alternative schemes.

Jacobi (J):

    v_{n+1}(i) = \max_{k \in K_i} \Big\{ \Big( r_i^k + \beta \sum_{j \ne i} p_{ij}^k v_n(j) \Big) \Big/ \big(1 - \beta p_{ii}^k\big) \Big\}                (2.2)

Pre-Gauss-Seidel (PGS):

    v_{n+1}(i) = \max_{k \in K_i} \Big\{ r_i^k + \beta \sum_{j=1}^{i-1} p_{ij}^k v_{n+1}(j) + \beta \sum_{j=i}^{M} p_{ij}^k v_n(j) \Big\}                (2.3)

Gauss-Seidel (GS):

    v_{n+1}(i) = \max_{k \in K_i} \Big\{ \Big( r_i^k + \beta \sum_{j=1}^{i-1} p_{ij}^k v_{n+1}(j) + \beta \sum_{j=i+1}^{M} p_{ij}^k v_n(j) \Big) \Big/ \big(1 - \beta p_{ii}^k\big) \Big\}                (2.4)

Successive Over-Relaxation (SOR):

    v_{n+1}(i) = \omega (A_{GS} v_n)(i) + (1 - \omega) v_n(i)                (2.5)

where A_GS is the Gauss-Seidel operator. (PGS) was suggested by Kushner [4];
(GS) is found in [9], Porteus [8] and Reetz [10]; and SOR in [5]. Experiments
with SOR suggested a value of ω = 1.28 for robust and speedy convergence.
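As a concrete illustration, single sweeps of (2.1), (2.4) and (2.5) can be
sketched as below, assuming the list-of-arrays representation from the
illustrative sketch in Section 1 (P[i] holding the rows p_ij^k and r[i] the
rewards r_i^k). The relaxation in sor_sweep is applied to the output of a
full Gauss-Seidel sweep, exactly as formula (2.5) is written; a Jacobi or
Pre-Gauss-Seidel sweep follows the same pattern with the obvious changes.

```python
import numpy as np

def pre_jacobi_sweep(P, r, v, beta):
    """One Pre-Jacobi (PJ) sweep, scheme (2.1): every state is updated from
    the previous iterate v_n."""
    v_new = np.empty_like(v)
    for i in range(len(v)):
        v_new[i] = np.max(r[i] + beta * (P[i] @ v))
    return v_new

def gauss_seidel_sweep(P, r, v, beta):
    """One Gauss-Seidel (GS) sweep, scheme (2.4): states j < i already hold
    v_{n+1}(j), and the diagonal term beta*p_ii^k is divided out."""
    v = v.astype(float).copy()                       # updated in place, state by state
    for i in range(len(v)):
        p_ii = P[i][:, i]
        q = r[i] + beta * (P[i] @ v - p_ii * v[i])   # drop the j = i term
        v[i] = np.max(q / (1.0 - beta * p_ii))
    return v

def sor_sweep(P, r, v, beta, omega=1.28):
    """One SOR sweep, scheme (2.5): relax the Gauss-Seidel operator with
    weight omega (the experiments above favoured omega = 1.28)."""
    return omega * gauss_seidel_sweep(P, r, v, beta) + (1.0 - omega) * v
```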
We require bounds on the iterates of the scheme to ensure we stop when v_n is
within a specified value of the optimal v of (1.1). One can use the L_∞ norm
bound, which says that if ||Q^k||_∞ ≤ α < 1 for all possible transition
matrices Q^k in the scheme v_{n+1} = max_k {s^k + Q^k v_n}, then

    \|v_{n+1} - v\|_\infty \le \alpha \, \|v_{n+1} - v_n\|_\infty / (1 - \alpha)                (2.6)

For (PJ), (J), (PGS) and (GS) it is trivial to show the corresponding Q's
have L_∞ norm less than β. For S.O.R. we estimate α by the ratio
||v_{n+1} - v_n||_∞ / ||v_n - v_{n-1}||_∞ and substitute in (2.6) to get a
heuristic bound.
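A sketch of the resulting stopping rule, assuming the representation and the
sweep functions from the earlier illustrative blocks (the helper name
linf_stop and the loop below are ours, not part of the paper):

```python
import numpy as np

def linf_stop(v_new, v_old, alpha, eps=1e-4):
    """Stopping test from the norm bound (2.6):
    ||v_{n+1} - v||_inf <= alpha * ||v_{n+1} - v_n||_inf / (1 - alpha) < eps.
    For PJ, J, PGS and GS one may take alpha = beta; for SOR, alpha is
    estimated heuristically from successive differences as described above."""
    gap = np.max(np.abs(v_new - v_old))
    return alpha * gap / (1.0 - alpha) < eps

# e.g. plain PJ iteration stopped by (2.6) with alpha = beta = .9
beta, v = 0.9, np.zeros(len(P))
while True:
    v_new = pre_jacobi_sweep(P, r, v, beta)
    if linf_stop(v_new, v, alpha=beta):
        break
    v = v_new
```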
Porteus [7] described tighter bounds for these schemes, exploiting the
non-negativity of the elements q_ij^k of Q^k in (2.6). They require
calculation of α_i = Σ_{j=1}^{M} q_ij^k for the maximizing action k at each
iterate, and we call these the P.C. bounds (Porteus with calculation). In
[12] we describe how to estimate the α_i initially, which avoids the
calculation at each step but gives looser bounds, which we denote P.N.C.
(Porteus no calculation). For the (PJ) scheme we also use the second order
bounds (S.O.) described in [11], which use the last three iteration values to
get a tighter lower bound than Porteus's bound.
The results are given in the following table, where AV. is the average C.P.U.
time for solving the fifteen problems, S.D. the standard deviation of the
C.P.U. time, and N the number of problems that method was quickest at
solving.
                                    TABLE 1

                     CLASS 1 (100 STATE)   CLASS 2 (40 STATE)   CLASS 3 (10 STATE)
METHOD  BOUNDS        AV.    S.D.   N.      AV.    S.D.   N.     AV.    S.D.   N.

PJ      PC=PNC        1.66   .11    15       .79   .10    14      .54   .07    11
PJ      SO            1.69   .11     0       .80   .11     1      .54   .08     4
J       L∞           11.88   .59     0     14.38  2.27     0     7.66  1.14     0
J       PNC          10.49   .63     0     11.55  2.05     0     6.90  1.09     0
PGS     L∞            6.86   .33     0      8.62  1.34     0     5.18   .82     0
PGS     PC            6.59   .33     0      8.34  1.32     0     5.03   .81     0
PGS     PNC           6.60   .33     0      8.40  1.33     0     5.01   .81     0
GS      L∞            6.55   .32     0      8.13  1.41     0     3.99   .61     0
GS      PC            6.32   .34     0      7.77  1.38     0     3.90   .61     0
GS      PNC           6.25   .32     0      7.70  1.41     0     3.83   .61     0
SOR     L∞            3.30   .22     0      4.00   .55     0     1.86   .30     0
For Jacobi, the P.C. bound is the same as the P.N.C. bound, and so the latter
must be faster as it involves less calculation. It is obvious from Table 1
that P.J. with Porteus bounds performs very well, and in the next section we
concentrate on this scheme and apply elimination of non-optimal actions.
3. ACTION ELIMINATION

MacQueen [6] described how for any bounds one can construct a test to
identify actions that cannot optimize the right hand side of (1.1) and so can
be permanently eliminated from the calculation. Applying MacQueen's bounds
[6] and Porteus's bound [7] for the PJ algorithm leads to the following tests
to eliminate action k in K_i permanently.

    MacQueen:   v_n^k(i) < v_n(i) + \beta (a_n - b_n)/(1 - \beta)                (3.1)

    Porteus:    v_n^k(i) < v_n(i) + \beta a_n - \beta b_n/(1 - \beta)                (3.2)

where v_n^k(i) = r_i^k + β Σ_{j=1}^{M} p_ij^k v_{n-1}(j), so that
v_n(i) = max_k v_n^k(i), and a_n = min_i (v_n(i) - v_{n-1}(i)),
b_n = max_i (v_n(i) - v_{n-1}(i)).
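In code, the MacQueen-type permanent test for the PJ scheme can be sketched
as follows. This is only a sketch using test (3.1) in the form given above
and the illustrative problem representation from Section 1; the function name
and the argument v_prev (standing for v_{n-1}) are our own.

```python
import numpy as np

def permanent_elimination_pj(P, r, v_n, v_prev, beta):
    """Permanent elimination for PJ via test (3.1): drop action k from K_i
    when v_n^k(i) < v_n(i) + beta*(a_n - b_n)/(1 - beta), where v_n^k(i) is
    built from v_{n-1} and a_n, b_n are the extreme components of
    v_n - v_{n-1}.  Returns, for each state i, a boolean mask of the actions
    retained."""
    diff = v_n - v_prev
    a_n, b_n = diff.min(), diff.max()
    slack = beta * (a_n - b_n) / (1.0 - beta)    # non-positive tolerance
    keep = []
    for i in range(len(v_n)):
        q = r[i] + beta * (P[i] @ v_prev)        # v_n^k(i) for every k in K_i
        keep.append(q >= v_n[i] + slack)
    return keep
```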
We looked at four ways of implementing these tests.

M1. At the n-th iteration, calculate and store v_n^k(i) for each i. Then
calculate a_n and b_n. Recalculate v_n(i) and use (3.1) to test for
elimination.

M2. At the n-th stage, calculate and store v_n(i), a_n and b_n. At the
(n+1)-th stage, calculate v_{n+1}^k(i) and test for elimination using (3.1)
without recalculating v_n(i).

P1. At the n-th stage, calculate and store all the v_n^k(i). Then, using a_n
and b_n, apply (3.2).

P2. At the (n+1)-th stage, calculate each v_{n+1}^k(i), starting with the
action k that gave the maximum v_n^k(i) at the previous stage. Apply (3.2) as
soon as you calculate v_{n+1}^k(i), using as v_{n+1}(i) the maximum
v_{n+1}^k(i) so far calculated; see [7].
As Table 2 shows, M2 is far superior to M1, but P1 and P2 give similar
results. All three cut the average time by a half though.
Hastings and Van Nunen [2] pointed out that one could also eliminate actions
temporarily, i.e. actions that will not be the optimizing actions at the next
iteration of the PJ algorithm. This is based on the inequality

    v_{n+s}(i) - v_{n+s}^k(i) \ge v_n(i) - v_n^k(i) - \beta \sum_{u=1}^{s} \big( b_{n+u-1} - a_{n+u-1} \big)                (3.3)

If the R.H.S. of (3.3) is positive, k will not optimize the (n+s)-th
iteration, and in that case, at the (n+s+1)-th iteration we need only
subtract another β(b_{n+s} - a_{n+s}) from this positive number to test if k
could be optimal. If action k is not eliminated at the (n+s)-th iteration,
v_{n+s}^k(i) is stored for the test at the next iteration.
We looked at four ways of implementing these two elimination procedures.
Recall that the (n+1)-th iteration consists of the following sequence of
calculations: (I) the v_{n+1}^k(i); (II) v_{n+1}(i); (III) a_{n+1} and
b_{n+1}; with (IV) the first step, the v_{n+2}^k(i), of the following
iteration.
TEMP HVN. Hastings and Van Nunen [2] suggested that the temporary elimination
test be made at (I), and if action k was not temporarily eliminated then
v_{n+1}^k(i) was calculated. The permanent elimination test was made at (II)
using (3.2), with v_{n+1}(i) replaced by a lower bound v_n(i) + βa_n. If the
action is not permanently eliminated, v_{n+1}^k(i) was stored for the
temporary elimination test of the next iteration, which followed immediately.

TEMP + P1. Temporary elimination occurs at (I) and permanent elimination at
(II) using P1, and in this case v_{n+1}^k(i) - v_{n+1}(i) is stored.

TEMP + P2. Again this has temporary elimination at (I), and permanent
elimination at (III) using P2.

TEMP + M2. Temporary elimination occurs at (I); v_{n+1}^k(i) was stored until
(IV) and then the M2 technique used.
In this case, when permanent and temporary elimination are done at the same
stage, it is obvious that any action which is permanently eliminated would
also be temporarily eliminated. This leads us to ask whether it is worth
permanently eliminating at all, so:

TEMP ONLY. Perform the temporary elimination test at (I). If action k is not
eliminated, v_{n+1}^k(i) - v_{n+1}(i) is stored through stage (II) and
updated at stage (III), using a_{n+1} and b_{n+1}, for the next temporary
elimination test at (IV).
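To make the interplay concrete, here is a rough sketch in the spirit of TEMP
ONLY: a PJ iteration that skips an action while its stored slack from (3.3)
remains positive, stopped by the L_∞ bound (2.6). The placement of each
bookkeeping step relative to (I)-(IV), the function name and the variable
names are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def pj_temp_only(P, r, beta=0.9, eps=1e-4):
    """Pre-Jacobi iteration with temporary elimination only (a sketch, not the
    paper's implementation).  slack[i][k] is a running lower bound on
    v_n(i) - v_n^k(i); while it stays positive, inequality (3.3) guarantees
    that action k cannot be the maximizer, so its value is not computed."""
    M = len(P)
    v = np.zeros(M)
    slack = [np.full(len(r[i]), -np.inf) for i in range(M)]   # <= 0 means "evaluate"
    while True:
        v_new = np.empty(M)
        for i in range(M):
            active = slack[i] <= 0.0                          # not temporarily eliminated
            q = r[i][active] + beta * (P[i][active] @ v)      # v_{n+1}^k(i) for active k
            v_new[i] = q.max()                                # skipped actions cannot win
            slack[i][active] = v_new[i] - q                   # refresh slack where evaluated
        a, b = (v_new - v).min(), (v_new - v).max()
        for i in range(M):
            slack[i] -= beta * (b - a)                        # the subtraction from (3.3)
        if beta * max(abs(a), abs(b)) / (1.0 - beta) < eps:   # stop via bound (2.6)
            return v_new
        v = v_new
```

With the illustrative problem from Section 1, pj_temp_only(P, r) returns an
ε-optimal v while evaluating only the actions that survive the temporary
test at each sweep.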
Table 2 describes the results and shows that temporary elimination further
cuts the time by 25%, and that pure temporary elimination might be
particularly good on large scale problems.
                                    TABLE 2

                     CLASS 1 (100 STATE)   CLASS 2 (40 STATE)   CLASS 3 (10 STATE)
METHOD                AV.    S.D.           AV.    S.D.          AV.    S.D.

M1                    1.51   .06            0.67   .09           0.42   .08
M2                    0.81   .05            0.36   .04           0.25   .03
P1                    0.87   .05            0.43   .05           0.26   .03
P2                    0.88   .05            0.45   .05           0.28   .04
TEMP HVN              0.62   .03            0.27   .03           0.21   .03
TEMP + P1             0.60   .04            0.25   .02           0.20   .03
TEMP + P2             0.58   .03            0.26   .02           0.23   .03
TEMP + M2             0.59   .04            0.22   .04           0.21   .03
TEMP ONLY             0.55   .03            0.22   .04           0.21   .03
Our object has not been to obtain a best buy, but to give some idea of
the merits of the various schemes, bounds and improvements.
Obviously for
more structured problems, algorithms which exploit the structure will be
at an advantage.
ACKNOWLEDGEMENTS

The authors are grateful to the Social Science Research Council for their
financial support. We are also indebted to Professor D. J. White and
Dr. S. French for stimulating discussions.
REFERENCES

1.  BLACKWELL, D., "Discounted Dynamic Programming", Ann. Math. Stat. 36,
    226-235, (1965).

2.  HASTINGS, N. A. J., VAN NUNEN, J. A. E. E., "The action elimination
    algorithm for Markov decision processes", in Markov Decision Processes,
    ed. by H. C. Tijms, J. Wessels, Mathematical Centre Tract No. 93,
    Amsterdam, pp. 161-170, (1977).

3.  HOWARD, R., Dynamic Programming and Markov Processes, Wiley, New York,
    (1960).

4.  KUSHNER, H., Introduction to Stochastic Control, Holt, Rinehart and
    Winston, New York, (1971).

5.  KUSHNER, H., KLEINMAN, A. J., "Accelerated procedures for the solution of
    discrete Markov control problems", I.E.E.E. Trans. on Automatic Control
    16, 147-152, (1971).

6.  MACQUEEN, J., "A test for sub-optimal actions in Markovian decision
    problems", Opns. Res. 15, 559-561, (1967).

7.  PORTEUS, E., "Some bounds for sequential decision processes", Man. Sci.
    18, 7-11, (1971).

8.  PORTEUS, E., "Bounds and transformations for finite Markov decision
    chains", Opns. Res. 23, 761-784, (1975).

9.  PORTEUS, E., TOTTEN, J., "Accelerated computation of the expected
    discounted return in a Markov chain", Opns. Res. 26, 350-358, (1978).

10. REETZ, D., "Approximate solutions of a discounted Markovian decision
    process", Dynamische Optimierung, Bonner Math. Schriften 98, pp. 77-92,
    (1977).

11. THOMAS, L. C., "Second order bounds for Markov decision processes",
    J. Math. Anal. Appl. 80, 294-297, (1981).

12. THOMAS, L. C., HARTLEY, R., LAVERCOMBE, A., "Computational comparison of
    value iteration for discounted Markov decision processes", Notes in
    Decision Theory, No. 100, Dept. of Decision Theory, Univ. of Manchester,
    Manchester, (1982).
DISTRIBUTION LIST
NO. OF COPIES
Library, Code 0142
Naval Postgraduate School
Monterey, CA 93940
4
Dean of Research
Code 012A
Naval Postgraduate School
Monterey, CA 93940
1
Library, Code 55
Naval Postgraduate School
Monterey, CA 93940
2
Professor L. C. Thomas
Code 55
Naval Postgraduate School
Monterey, CA 93940
60