Stochastic Shortest Path Problems Under Weak Conditions

August 2013 (revised March 2015 and January 2016)
Report LIDS - 2909
Dimitri P. Bertsekas† and Huizhen Yu‡
Abstract
In this paper we consider finite-state stochastic shortest path problems, and we address some of the
complications due to the presence of transition cycles with nonpositive cost. In particular, we assume that
the optimal cost function is real-valued, that a standard compactness and continuity condition holds, and
that there exists at least one proper policy, i.e., a stationary policy that terminates with probability one.
Under these conditions, value and policy iteration may not converge to the optimal cost function, which in
turn may not satisfy Bellman’s equation. Moreover, an improper policy may be optimal and superior to
all proper policies. We propose and analyze forms of value and policy iteration for finding a policy that
is optimal within the class of proper policies. In the special case where all expected transition costs are
nonnegative we provide a transformation method and algorithms for finding a policy that is optimal over all
policies.
1. INTRODUCTION
Stochastic shortest path (SSP) problems are a major class of infinite horizon total cost Markov decision
processes (MDP) with a termination state. In this paper, we consider SSP problems with a state space
X = {t, 1, . . . , n}, where t is the termination state. The control space is denoted by U , and the set of feasible
controls at state x is denoted by U (x). From state x under control u ∈ U (x), a transition to state y occurs
with probability pxy (u) and incurs an expected one-stage cost g(x, u), which is assumed real-valued. At state
t we have ptt (u) = 1, g(t, u) = 0, for all u ∈ U (t), i.e., t is absorbing and cost-free. The goal is to reach t
with minimal total expected cost.
The principal issues here revolve around the solution of Bellman’s equation and the convergence of
the classical algorithms of value iteration (VI for short) and policy iteration (PI for short). An important
classification of stationary policies in SSP is between proper (those that guarantee eventual termination and
are of principal interest in shortest path applications) and improper (those that do not). It is well-known
that the most favorable results hold under the assumption that there exists at least one proper policy and
that each improper policy generates infinite cost starting from at least one initial state. Then, assuming also
a finiteness or compactness condition on U , and a continuity condition on pxy (·) and g(x, ·), the optimal cost
function J * is the unique solution of Bellman’s equation, and appropriate forms of VI and PI yield J * and
an optimal policy that is proper [BeT91].
† Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, and the Laboratory for Infor-
mation and Decision Systems, M.I.T., Cambridge, Mass., 02139.
‡ Huizhen Yu is with the Dept. of Computing Science, University of Alberta, Edmonton, Canada, and has been
supported by a grant from Alberta Innovates – Technology Futures.
In this paper, we will consider the SSP problem under weaker assumptions, where the preceding
favorable results cannot be expected to hold. In particular, we will replace the infinite cost assumption for
improper policies with the condition that J * is real-valued. Under our assumptions, J * need not be a solution
of Bellman’s equation, and even when it is, it may not be obtained by VI starting from any initial condition
other than J * itself, while the standard form of PI may not be convergent. We will show instead that Jˆ,
which is the optimal cost function over proper policies only, is the unique solution of Bellman’s equation
within the class of functions J ≥ Jˆ. Moreover VI converges to Jˆ from any initial condition J ≥ Jˆ, while
a form of PI yields a sequence of proper policies that asymptotically attains the optimal value Jˆ. Our line
of analysis relies on a perturbation argument, which induces a more effective discrimination between proper
and improper policies in terms of finiteness of their cost functions. This argument depends critically on the
assumption that J * is real-valued.
In what follows in this section we will introduce our notation, terminology, and assumptions, and we
will explain the nature of our analysis and its relation to the existing literature. In Section 2, we will prove
our main results. In Section 3, we will consider the important special case where g(x, u) ≥ 0 for all (x, u),
and show how we can effectively overcome the pathological behaviors we described. In particular, we will
show how we can transform the problem to an equivalent favorably structured problem, for which J * is the
unique solution of Bellman’s equation and is equal to Jˆ, while powerful VI and PI convergence results hold.
1.1. Notation and Terminology
We first introduce definitions of various types of policies and cost functions. By a nonstationary policy we
mean a sequence of the form π = {µ0 , µ1 , . . .}, where each function µk , k = 0, 1, . . ., maps each state x to a
control in U (x). We define the total cost of a nonstationary policy π for initial state x to be
Jπ(x) = lim sup_{N→∞} Jπ,N(x),    (1.1)

with Jπ,N(x) being the expected N-stage cost of π for state x:

Jπ,N(x) = E{ ∑_{k=0}^{N−1} g(xk, µk(xk)) | x0 = x },
where xk denotes the state at time k. The expectation above is with respect to the probability law of
{x0 , . . . , xN −1 } induced by π; this is the probability law for the (nonstationary) finite-state Markov chain
that has transition probabilities P(xk+1 = y | xk = x) = pxy(µk(x)). The use of lim sup in the definition
of Jπ is necessary because the limit of Jπ,N (x) as N → ∞ may not exist. However, the statements of our
results, our analysis, and our algorithms are also valid if lim sup is replaced by lim inf. The optimal cost at
state x, denoted J * (x), is the infimum of Jπ (x) over π. Note that in general there may exist states x such
that Jπ (x) = ∞ or Jπ (x) = −∞ for some policies π, as well as J * (x) = ∞ or J * (x) = −∞.
Regarding notation, we denote by < the set of real numbers, and by E the set of extended real numbers:
E = < ∪ {∞, −∞}. Vector inequalities of the form J ≤ J′ are to be interpreted componentwise, i.e.,
J(x) ≤ J′(x) for all x. Similarly, limits of function sequences are to be interpreted pointwise. Since
Jπ (t) = 0 for all π and J * (t) = 0, we will ignore in various expressions the component J(t) of a cost function.
Thus the various cost functions arising in our development are of the form J = (J(1), . . . , J(n)), and they
will be viewed as elements of the space E n, the set of n-dimensional vectors whose entries are either real
numbers or −∞ or +∞, or of the space <n if their components are real. In Section 3, where we will focus
on vectors with nonnegative components, J will belong to the nonnegative orthant of <n, denoted <n+, or
the nonnegative orthant of E n, denoted E n+.
If π = {µ0 , µ1 , . . .} is a stationary policy, i.e., all µk are equal to some µ, then π and Jπ are also denoted
by µ and Jµ , respectively. The set of all stationary policies is denoted by M. In this paper, we will aim to
find optimal policies exclusively within M, so in the absence of a statement to the contrary, by “policy” we
will mean a stationary policy.
We say that µ ∈ M is proper if when using µ, there is positive probability that the termination state
t will be reached after at most n stages, regardless of the initial state; i.e., if
ρµ = max_{x=1,...,n} P{xn ≠ t | x0 = x, µ} < 1.
Otherwise, we say that µ is improper . It can be seen that µ is proper if and only if in the Markov chain
corresponding to µ, each state x is connected to the termination state with a path of positive probability
transitions, i.e., the only recurrent state is t and all other states are transient.
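To make the properness test concrete, the following is a minimal computational sketch (ours, not part of the original development); it assumes the policy is given through an n × (n + 1) array of transition probabilities, with column 0 corresponding to the termination state t, and simply checks whether ρµ < 1.

import numpy as np

def is_proper(P):
    # P[x, :] are the transition probabilities from state x (x = 1, ..., n,
    # indexed here from 0), with column 0 corresponding to t and columns
    # 1, ..., n to the nontermination states.
    n = P.shape[0]
    Q = P[:, 1:]                        # substochastic matrix among states 1, ..., n
    Qn = np.linalg.matrix_power(Q, n)   # n-step probabilities of not having terminated
    rho = Qn.sum(axis=1).max()          # rho_mu = max_x P{x_n != t | x_0 = x, mu}
    return rho < 1.0

# Example (hypothetical data): state 1 self-transitions forever (improper),
# versus state 1 moving to t or to state 2 with probability 1/2 (proper).
P_improper = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 0.0]])
P_proper = np.array([[0.5, 0.0, 0.5],
                     [1.0, 0.0, 0.0]])
print(is_proper(P_improper), is_proper(P_proper))   # False True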
Let us introduce some notation relating to the mappings that arise in optimality conditions and
algorithms. We consider the mapping H : {1, . . . , n} × U × E n → E defined by

H(x, u, J) = g(x, u) + ∑_{y=1}^{n} pxy(u)J(y),    x = 1, . . . , n, u ∈ U(x), J ∈ E n.
For vectors J ∈ E n with components that take the values ∞ and −∞, we adopt the rule ∞ − ∞ = ∞ in
the above definition of H. However, the sum ∞ − ∞ never appears in our analysis. We also consider the
mappings Tµ , µ ∈ M, and T defined by
(Tµ J)(x) = H(x, µ(x), J),    (T J)(x) = inf_{u∈U(x)} H(x, u, J),    x = 1, . . . , n, J ∈ E n.
We will frequently use the monotonicity property of Tµ and T , i.e.,
J ≤ J′   ⇒   Tµ J ≤ Tµ J′,  T J ≤ T J′.
The fixed point equations J * = T J * and Jµ = Tµ Jµ are referred to as Bellman’s equations for the optimal
cost function and for the cost function of µ, respectively. They are generally expected to hold in MDP
models. We will see, however, in Section 2.3 that this is not the case if µ is improper and g(x, u) can take
both positive and negative values.
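For readers who prefer computational notation, the following minimal Python sketch (ours; the array layout is an assumption, with the termination state left implicit so that rows of p may sum to less than one) spells out H, Tµ, and T for a finite model.

import numpy as np

# g[x, u]    : expected one-stage cost at state x under control u
# p[x, u, y] : transition probability from x to nontermination state y under u;
#              the residual probability 1 - sum_y p[x, u, y] goes to t.

def H(x, u, J, g, p):
    return g[x, u] + p[x, u, :] @ J

def T_mu(J, mu, g, p):
    # (T_mu J)(x) = H(x, mu(x), J)
    return np.array([H(x, mu[x], J, g, p) for x in range(len(J))])

def T(J, g, p):
    # (T J)(x) = min over u of H(x, u, J); with finitely many controls the
    # infimum is attained, as under the compactness and continuity condition.
    return (g + p @ J).min(axis=1)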
1.2. Background and Motivation
The SSP problem has been discussed in many sources, including the books [Pal67], [Der70], [Whi82], [Ber87],
[BeT89], [Alt99], [HeL99], and [Ber12a], where it is sometimes referred to by other names such as “first
passage problem” and “transient programming problem.” We may also view the SSP problem as a special
case of undiscounted total cost MDP; see e.g., [Sch75], [van76], [Put94] (Section 7.2), [Fei02], [DuP13],
[Yu14] for analyses of these MDP under various model conditions. One difference, however, is that these
total cost MDP models typically require each policy to have finite expected total cost with respect to either
the positive or the negative part of the cost function, and this excludes certain shortest path problems with
zero-length cycles. Our SSP problem formulation does not require this condition on policy costs; indeed it
was motivated originally as an extension of the deterministic shortest path problem, where zero-length cycles
are often present.
We will now summarize the current SSP methodology, and the approaches and contributions of this
paper. We first note that for any policy µ, the matrix that has components pxy(µ(x)), x, y = 1, . . . , n, is
substochastic (some of its row sums may be less than 1) because there may be positive transition probability
from x to t. Consequently Tµ may be a contraction for some µ, but not necessarily for all µ ∈ M. For a
proper policy µ, Tµ is a contraction with respect to some weighted sup-norm ‖ · ‖v, defined for some vector
v ∈ <n with positive components v(i), i = 1, . . . , n, by

‖J‖v = max_{i=1,...,n} |J(i)|/v(i),    J ∈ <n,
(see e.g., [Ber12a], Section 3.3). It follows that Jµ = limk→∞ Tµk J for all J ∈ <n [this is because Jµ is
the unique fixed point of Tµ within <n , and by definition Jµ = limk→∞ Tµk J0 , where J0 is the zero vector,
J0 (x) ≡ 0]. For an improper policy µ, Tµ is not a contraction with respect to any norm. Moreover, T also
need not be a contraction with respect to any norm. However, T is a weighted sup-norm contraction in
the important special case where the control space is finite and all policies are proper. This was shown in
[BeT96], Prop. 2.2 (see also [Ber12a], Prop. 3.3.1).
To deal with the lack of contraction property of T , various assumptions have been formulated in the
literature, under which results similar to the case of discounted finite-state MDP have been shown for SSP
problems. These assumptions typically include the following condition, or the stronger version whereby U (x)
is a finite set for all x.

Compactness and Continuity Condition: The control space U is a metric space. Moreover, for
each state x, the set U (x) is a compact subset of U , the functions pxy (·), y = 1, . . . , n, are continuous
over U (x), and the function g(x, ·) is lower semicontinuous over U (x).

An important consequence of this condition is that the infimum of H(x, ·, J) over U (x) is attained for all
J ∈ <n, i.e., there exists µ ∈ M such that Tµ J = T J. However, the assumption does not guarantee the
existence of an optimal policy, as shown by Example 6.7 of the total cost MDP survey [Fei02], which is
attributed to [CFM00]. This will also be seen in a very similar example, given in Section 2.2 in the context
of a pathology relating to PI.
In this paper we will generally assume the compactness and continuity condition, since serious anomalies
may occur without it. An example is the classical blackmailer’s problem (see [Ber12a], Example 3.2.1), where
X = {t, 1},    U (1) = (0, 1],    g(1, u) = −u,    p11(u) = 1 − u²,    H(1, u, J) = −u + (1 − u²)J(1).
Here every stationary policy µ is proper and Tµ is a contraction, but T is not a contraction, there is no
optimal stationary policy, while the optimal cost J * (1) (which is −∞) is attained by a nonstationary policy.
A popular set of assumptions that we will aim to generalize is the following.
Classical SSP Conditions:
(a) The compactness and continuity condition holds.
(b) There exists at least one proper policy.
(c) For every improper policy there is an initial state that has infinite cost under this policy.
Under the classical SSP conditions, it has been shown that J * is the unique fixed point of T within
<n. Moreover, a policy µ∗ is optimal if and only if Tµ∗ J * = T J * , and an optimal proper policy exists (so in
particular J * , being the cost function of a proper policy, is real-valued). In addition, J * can be computed
by VI, J * = limk→∞ T k J, starting with any J ∈ <n. These results were given in [BeT91], following related
results in several other works [Pal67], [Der70], [Whi82], [Ber87] (see [BeT91] for detailed references). An
alternative assumption is that g(x, u) ≥ 0 for all (x, u), and that there exists a proper policy that is optimal.
Our results bear some similarity to those obtained under this assumption (see [BeT91]), but are considerably
stronger; see the comment preceding Prop. 2.3.

The standard form of PI generates a sequence {µk } according to

Tµk+1 Jµk = T Jµk ,

starting from some initial policy µ0 . It works in its strongest form when there are no improper policies.
When there are improper policies and the classical SSP conditions hold, the initial policy should be proper
(in which case subsequently generated policies are guaranteed to be proper). Otherwise, modifications to PI
are needed, which allow it to work with improper initial policies, and also in the presence of approximate
policy evaluation and asynchronism. Several types of modifications have been proposed, including a mixed
VI and PI algorithm where policy evaluation is done through the use of an optimal stopping problem [BeY12],
[YuB13a] (see the summary in [Ber12a], Section 3.5.3, for a discussion of this and other methods).

An important special case, with extensive literature and applications, is the deterministic shortest path
problem, where for all (x, u), pxy (u) is equal to 1 for a single state y. Part (b) of the classical SSP conditions
is equivalent to the existence of a proper policy, while part (c) is equivalent to all the cycles of the graph of
probability 1 transitions having positive length. Many algorithms, including value iteration (VI for short)
and policy iteration (PI for short), can be used to find a shortest path from every x to t. However, if there
are cycles of zero length, difficulties arise: the optimality equation can have multiple solutions, while VI and
PI may break down, even in very simple problems. These difficulties are illustrated by the following example,
and motivate the analysis of this paper.

Figure 1.1. A deterministic shortest path problem with a single node 1 and a termination node t. At 1 there
are two choices: a self-transition, which costs a, and a transition to t, which costs b.

Example 1.1 (Deterministic Shortest Path Problem)

Here there is a single state 1 in addition to the termination state t (cf. Fig. 1.1). At state 1 there are two
choices: a self-transition which costs a and a transition to t, which costs b. The mapping H has the form

H(1, u, J) = { a + J if u: self transition;  b if u: transition to t },    J ∈ <,

and the mapping T has the form

T J = min{a + J, b},    J ∈ <.
There are two policies: the policy that self-transitions at state 1, which is improper, and the policy that
transitions from 1 to t, which is proper. When a < 0 the improper policy is optimal and we have J ∗ (1) = −∞.
The optimal cost is finite if a > 0 or a = 0, in which case the cycle has positive or zero length, respectively.
Note the following:
(a) If a > 0, the classical SSP conditions are satisfied, and the optimal cost, J ∗ (1) = b, is the unique fixed
point of T .
(b) If a = 0 and b ≥ 0, the set of fixed points of T (which has the form T J = min{J, b}), is the interval
(−∞, b]. Here the improper policy is optimal for all b ≥ 0, and the proper policy is also optimal if b = 0.
(c) If a = 0 and b > 0, the proper policy is strictly suboptimal, yet its cost at state 1 (which is b) is a fixed
point of T . The optimal cost, J ∗ (1) = 0, lies in the interior of the set of fixed points of T , which is (−∞, b].
Thus the VI method that generates {T k J} starting with J ≠ J ∗ cannot find J ∗ ; in particular if J is a fixed
point of T , VI stops at J, while if J is not a fixed point of T (i.e., J > b), VI terminates in two iterations at
b ≠ J ∗ (1). Moreover, the standard PI method is unreliable in the sense that starting with the suboptimal
proper policy µ, it may stop with that policy because (Tµ Jµ)(1) = b = min{Jµ(1), b} = (T Jµ)(1) [the
other/optimal policy µ∗ also satisfies (Tµ∗ Jµ )(1) = (T Jµ )(1), so a rule for breaking the tie in favor of µ∗
is needed but such a rule may not be obvious in general].
(d) If a = 0 and b < 0, only the proper policy is optimal, and we have J ∗ (1) = b. Here it can be seen that
the VI sequence {T k J} converges to J ∗ (1) for all J ≥ b, but stops at J for all J < b, since the set of
fixed points of T is (−∞, b]. Moreover, starting with either the proper policy µ∗ or the improper policy
µ, the standard form of PI may oscillate, since (Tµ∗ Jµ )(1) = (T Jµ )(1) and (Tµ Jµ∗ )(1) = (T Jµ∗ )(1), as
can be easily verified [the optimal policy µ∗ also satisfies (Tµ∗ Jµ∗ )(1) = (T Jµ∗ )(1) but it is not clear how
to break the tie; compare also with case (c) above].
As we have seen in case (c), VI may fail starting from J ≠ J ∗ . Actually in cases (b)-(d) the one-stage
costs are either all nonnegative or all nonpositive, so they belong to the classes of negative and positive
DP models, respectively. From known results for such models, there is an initial condition, namely J =
0, starting from which VI converges to J ∗ . However, this is not necessarily the best initial condition; for
example in deterministic shortest path problems initial conditions J ≥ J ∗ are generally preferred and result
in polynomial complexity computation assuming that all cycles have positive length. By contrast VI has only
pseudopolynomial complexity when started from J = 0. We will also see in the next section that if there are
both positive and negative one-stage costs and the problem is nondeterministic, it may happen that J ∗ is not
a fixed point of T , so it cannot be obtained by VI or PI.
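The behavior in cases (b)-(d) is easy to reproduce numerically. The sketch below (our illustration, not part of the original text) runs VI for case (c), with a = 0 and b = 1 assumed, and shows that every J ≤ b is left unchanged by T , while VI started above b stops at b rather than at J ∗ (1) = 0.

def T(J, a=0.0, b=1.0):
    # Bellman operator of Example 1.1 at state 1: T J = min{a + J, b}
    return min(a + J, b)

def value_iteration(J, iters=50):
    for _ in range(iters):
        J = T(J)
    return J

for J0 in (-5.0, 0.5, 1.0, 3.0):
    print(J0, "->", value_iteration(J0))
# -5.0, 0.5, 1.0 are fixed points of T, so VI makes no progress there;
# starting at 3.0 > b, VI stops at b = 1.0, not at J*(1) = 0.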
To address the distinction between the optimal cost J * (x) and the optimal cost that may be achieved
starting from x and using a proper policy [cf. case (c) of the preceding example], we introduce the optimal
cost function over proper policies:
Jˆ(x) = inf_{µ: proper} Jµ(x),    x ∈ X.    (1.2)
Aside from its potential practical significance, the function Jˆ plays a key role in the analysis of this paper,
because when Jˆ ≠ J * , our VI and PI algorithms of Section 2 can only obtain Jˆ, and not J * .
In Section 2, we weaken part (c) of the classical SSP conditions, by assuming that J * is real-valued
instead of requiring that each improper policy has infinite cost from some initial states. We show that Jˆ is
the unique fixed point of T within the set {J | J ≥ Jˆ}, and can be computed by VI starting from any J
within that set. In the two special cases where either there exists a proper policy µ∗ that is optimal, i.e.,
J * = Jµ∗ , or the cost function g is nonpositive, we show that J * is the unique fixed point of T within the set
{J | J ≥ J * }, and the VI algorithm converges to J * when started within this set. We provide an example
showing that J * may not be a fixed point of T if g can take both positive and negative values, and the SSP
problem is nondeterministic.
The idea of the analysis is to introduce an additive perturbation δ > 0 to the cost of each transition.
Since J * is real-valued, the cost function of each improper policy becomes infinite for some states, thereby
bringing to bear the classical SSP conditions for the perturbed problem, while the cost function of each
proper policy changes by an O(δ) amount. We will also propose a valid version of PI that is based on this
perturbation idea and converges to Jˆ, as a replacement of the standard form of PI, which may oscillate as
we have seen in Example 1.1.
Since the VI and PI algorithms aim to converge to Jˆ, they cannot be used to compute J * and an optimal
policy in general. In Section 3 we resolve this issue for the case where, in addition to the compactness and
continuity condition, all transition costs are nonnegative: g(x, u) ≥ 0 for all x and u ∈ U (x) (the negative
DP model). Under these conditions the classical forms of VI and PI may again fail, as demonstrated by cases
(c) and (d) of Example 1.1. However, we will provide a transformation to an “equivalent” SSP problem for
which the classical SSP conditions hold. Following this transformation, we may use the corresponding VI,
which converges to J * starting from any J ∈ <n .
2. FINITE OPTIMAL COST CASE - A PERTURBATION APPROACH
In this section we allow the one-stage costs to be both positive and negative, but assume that J * is real-valued and that there exists at least one proper policy. As a result, by adding a positive perturbation δ to g,
we are guaranteed to drive to ∞ the cost Jµ (x) of each improper policy µ, for at least one state x, thereby
differentiating proper and improper policies.
We thus consider for each scalar δ > 0 an SSP problem, referred to as the δ-perturbed problem, which
is identical to the original problem, except that the cost per stage is
gδ(x, u) = g(x, u) + δ if x = 1, . . . , n,    gδ(x, u) = 0 if x = t,
and the corresponding mappings Tµ,δ and Tδ are given by
Tµ,δ J = Tµ J + δe,    Tδ J = T J + δe,    ∀ J ∈ <n,
where e is the unit function [e(x) ≡ 1]. This problem has the same proper and improper policies as the
original. The corresponding cost function of a policy π = {µ0 , µ1 , . . .} ∈ Π is given by
Jπ,δ(x) = lim sup_{N→∞} Jπ,δ,N(x),    with    Jπ,δ,N(x) = E{ ∑_{k=0}^{N−1} gδ(xk, µk(xk)) | x0 = x }.
We denote by Jµ,δ the cost function of a stationary policy µ for the δ-perturbed problem, and by Jδ* the
corresponding optimal cost function,
Jδ* = inf_{π∈Π} Jπ,δ.
Note that for every proper policy µ, the function Jµ,δ is real-valued, and that
lim_{δ↓0} Jµ,δ = Jµ.
This is because for a proper policy, the extra δ cost per stage will be incurred only a finite expected number
of times prior to termination, starting from any state. This is not so for improper policies, and in fact the
idea behind perturbations is that the addition of δ to the cost per stage, in conjunction with the assumption
that J * is real-valued, implies that if µ is improper,
Jµ,δ(x) = lim_{k→∞} ((Tµ,δ)^k J)(x) = ∞    for all x ≠ t that are recurrent under µ and all J ∈ <n.    (2.1)
Thus part (c) of the classical SSP conditions holds for the δ-perturbed problem, and the associated strong
results noted in Section 1 come into play. In particular, we have the following proposition.
Proposition 2.1: Assume that the compactness and continuity condition holds, that there exists
at least one proper policy, and that J * is real-valued. Then for each δ > 0:
(a) Jδ* is the unique solution of the equation
J(x) = (T J)(x) + δ,
x = 1, . . . , n.
(b) A policy µ is optimal for the δ-perturbed problem (Jµ,δ = Jδ* ) if and only if Tµ Jδ* = T Jδ* .
Moreover, all optimal policies for the δ-perturbed problem are proper and there exists at least
one such policy.
(c) The optimal cost function over proper policies Jˆ [cf. Eq. (1.2)] satisfies
Jˆ(x) = lim_{δ↓0} Jδ* (x),    x = 1, . . . , n.
(d) If the control constraint set U (i) is finite for all states i = 1, . . . , n, there exists a proper policy
µ̂ that attains the minimum over all proper policies, i.e., Jµ̂ = Jˆ.
Proof: (a), (b) These parts follow from the discussion preceding the proposition, and the results of [BeT91]
noted in the introduction, which hold under the classical SSP conditions [the equation of part (a) is Bellman’s
equation for the δ-perturbed problem].
(c) We note that for an optimal proper policy µ∗δ of the δ-perturbed problem [cf. part (b)], we have
Jˆ = inf_{µ: proper} Jµ ≤ Jµ∗δ ≤ Jµ∗δ,δ = Jδ* ≤ Jµ′,δ ,    ∀ µ′: proper.
Since for every proper policy µ′, we have limδ↓0 Jµ′,δ = Jµ′ , it follows that

Jˆ ≤ lim_{δ↓0} Jµ∗δ ≤ Jµ′ ,    ∀ µ′: proper.

By taking the infimum over all µ′ that are proper, the result follows.
(d) Let {δk } be a positive sequence with δk ↓ 0, and consider a corresponding sequence {µk } of optimal
proper policies for the δk -perturbed problems. Since the set of proper policies is finite, some policy µ̂ will be
repeated infinitely often within the sequence {µk }, and since {Jδ*k } is monotonically nonincreasing, we will
have
Jˆ ≤ Jµ̂ ≤ Jδ*k ,
for all k sufficiently large. Since by part (c), Jδ*k ↓ Jˆ, it follows that Jµ̂ = Jˆ.
Q.E.D.
Note that the finiteness of U (i) is an essential assumption for the existence result of part (d) of the
preceding proposition. In particular, the result may not be true under the more general compactness and
continuity condition, even if Jˆ = J * (see the subsequent Example 2.1).
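Computationally, Prop. 2.1 suggests approximating Jˆ by solving the δ-perturbed problem for a decreasing sequence of δ. Below is a minimal sketch (ours, reusing the arrays g and p of the earlier sketch and solving each perturbed problem by VI, which is only one of several possible choices).

import numpy as np

def T_delta(J, g, p, delta):
    # Bellman operator of the delta-perturbed problem: T_delta J = T J + delta e
    return (g + p @ J).min(axis=1) + delta

def solve_perturbed(g, p, delta, iters=10000):
    # Under the assumptions of Prop. 2.1, the perturbed problem satisfies the
    # classical SSP conditions, so VI converges to J*_delta from any start.
    J = np.zeros(g.shape[0])
    for _ in range(iters):
        J = T_delta(J, g, p, delta)
    return J

def approximate_J_hat(g, p, deltas=(1.0, 0.1, 0.01, 0.001)):
    # By Prop. 2.1(c), J*_delta decreases to J_hat as delta decreases to 0.
    return [solve_perturbed(g, p, d) for d in deltas]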
2.1. Convergence of Value Iteration
The preceding perturbation-based analysis can be used to investigate properties of Jˆ by using properties of
Jδ* and taking the limit as δ ↓ 0. In particular, we use the preceding proposition to show that Jˆ is a fixed point
of T , and can be obtained by VI starting from any J ≥ Jˆ.
Proposition 2.2: Assume that the compactness and continuity condition holds, that there exists
at least one proper policy, and that J * is real-valued. Then:
(a) The optimal cost function over proper policies Jˆ is the unique fixed point of T within the set
{J ∈ <n | J ≥ Jˆ}.
(b) We have T k J → Jˆ for every J ∈ <n with J ≥ Jˆ.
(c) Let µ be a proper policy. Then µ is optimal within the class of proper policies (i.e., Jµ = Jˆ) if
and only if Tµ Jˆ = T Jˆ.
Proof: (a), (b) For all proper µ, we have Jµ = Tµ Jµ ≥ Tµ Jˆ ≥ T Jˆ. Taking infimum over proper µ, we
obtain Jˆ ≥ T Jˆ. Conversely, for all δ > 0 and µ ∈ M, we have
Jδ* = T Jδ* + δe ≤ Tµ Jδ* + δe.
Taking limit as δ ↓ 0, and using Prop. 2.1(c), we obtain Jˆ ≤ Tµ Jˆ for all µ ∈ M. Taking infimum over
µ ∈ M, it follows that Jˆ ≤ T Jˆ. Thus Jˆ is a fixed point of T .
For all J ∈ <n with J ≥ Jˆ and proper policies µ, we have by using the relation Jˆ = T Jˆ just shown,
Jˆ = lim_{k→∞} T k Jˆ ≤ lim_{k→∞} T k J ≤ lim_{k→∞} Tµk J = Jµ.
Taking the infimum over all proper µ, we obtain
Jˆ ≤ lim_{k→∞} T k J ≤ Jˆ,    ∀ J ≥ Jˆ.
This proves part (b) and also the claimed uniqueness property of Jˆ in part (a).
(c) If µ is a proper policy with Jµ = Jˆ, we have Jˆ = Jµ = Tµ Jµ = Tµ Jˆ, so, using also the relation Jˆ = T Jˆ
[cf. part (a)], we obtain Tµ Jˆ = T Jˆ. Conversely, if µ satisfies Tµ Jˆ = T Jˆ, then from part (a), we have Tµ Jˆ = Jˆ
and hence limk→∞ Tµk Jˆ = Jˆ. Since µ is proper, we have Jµ = limk→∞ Tµk Jˆ, so Jµ = Jˆ.
Q.E.D.
Note that if there exists a proper policy but J * is not real-valued, the mapping T cannot have any real-valued
fixed point. To see this, let J˜ be such a fixed point, and let ε be a scalar such that J˜ ≤ J0 + εe, where J0
is the zero vector. Since T (J0 + εe) ≤ T J0 + εe, it follows that J˜ = T N J˜ ≤ T N (J0 + εe) ≤ T N J0 + εe ≤ Jπ,N + εe
for any N and policy π. Taking lim sup with respect to N and then infimum over π, it follows that J˜ ≤ J * + εe.
Since J * cannot take the value ∞ (by the existence of a proper policy), this shows that J * must be real-valued,
contradicting the hypothesis.
Note also that there may exist an improper policy µ that is strictly suboptimal and yet satisfies the
optimality condition Tµ J * = T J * [cf. case (d) of Example 1.1], so properness of µ is an essential assumption
in Prop. 2.2(c). It is also possible that Jˆ = J * , but the only optimal policy is improper, as we will show by
example in the next section (see Example 2.1).
The following proposition shows that starting from any J ≥ Jˆ, the convergence rate of VI to Jˆ is linear,
and provides a corresponding error bound. The proposition assumes that there exists a proper policy that
attains the optimal value Jˆ. By Prop. 2.1(d), this is true if the control constraint set U (i) is finite for all
states i = 1, . . . , n.
Proposition 2.3: (Convergence Rate of VI) Assume that the compactness and continuity
condition holds and that J * is real-valued. Assume further that there exists a proper policy µ̂ that is
optimal within the class of proper policies, i.e., Jµ̂ = Jˆ. Then

‖T J − Jˆ‖v ≤ β ‖J − Jˆ‖v ,    ∀ J ≥ Jˆ,    (2.2)

where ‖ · ‖v is a weighted sup-norm with respect to which Tµ̂ is a contraction and β is the corresponding
modulus of contraction. Moreover we have

‖J − Jˆ‖v ≤ (1/(1 − β)) max_{i=1,...,n} (J(i) − (T J)(i))/v(i),    ∀ J ≥ Jˆ.    (2.3)
Proof: By using the optimality of µ̂ and Prop. 2.2, we have Tµ̂ Jˆ = T Jˆ = Jˆ, so for all states i and J ≥ Jˆ,

((T J)(i) − Jˆ(i))/v(i) ≤ ((Tµ̂ J)(i) − (Tµ̂ Jˆ)(i))/v(i) ≤ β max_{i=1,...,n} (J(i) − Jˆ(i))/v(i).

By taking the maximum of the left-hand side over i, and by using the fact that the inequality J ≥ Jˆ implies
that T J ≥ T Jˆ = Jˆ, we obtain Eq. (2.2).
By using again the relation Tµ̂ Jˆ = Jˆ, we have for all states i and all J ≥ Jˆ,

(J(i) − Jˆ(i))/v(i) = (J(i) − (T J)(i))/v(i) + ((T J)(i) − Jˆ(i))/v(i)
                   ≤ (J(i) − (T J)(i))/v(i) + ((Tµ̂ J)(i) − (Tµ̂ Jˆ)(i))/v(i)
                   ≤ (J(i) − (T J)(i))/v(i) + β ‖J − Jˆ‖v .

By taking the maximum of both sides over i, we obtain Eq. (2.3).
Q.E.D.
If there exists an optimal proper policy, i.e., a proper policy µ∗ such that Jµ∗ = Jˆ = J * , we obtain
the following proposition, which is new. The closest earlier result assumes in addition that g(x, u) ≥ 0 for
all (x, u) (see [BeT91], Prop. 3). Existence of an optimal proper policy is guaranteed if somehow it can be
shown that Jˆ = J * and U (i) is finite for all i [cf. Prop. 2.1(d)].
Proposition 2.4: Assume that the compactness and continuity condition holds, and that there
exists an optimal proper policy. Then:
(a) The optimal cost function J * is the unique fixed point of T in the set {J ∈ <n | J ≥ J * }.
(b) We have T k J → J * for every J ∈ <n with J ≥ J * .
(c) Let µ be a proper policy. Then µ is optimal if and only if Tµ J * = T J * .
Proof: Existence of a proper policy that is optimal implies both that J * is real-valued and that J * = Jˆ.
The result then follows from Prop. 2.2. Q.E.D.
Another important special case where favorable results hold is when g(x, u) ≤ 0 for all (x, u). Then, it
is well-known that J * is the unique fixed point of T within the set {J | J * ≤ J ≤ 0}, and the VI sequence
{T k J} converges to J * starting from any J within that set (see e.g., [Ber12a], Ch. 4, or [Put94], Section 7.2).
In the following proposition, we will use Prop. 2.2 to obtain related results for SSP problems where g may
take both positive and negative values. An example is an optimal stopping problem, where at each state
x we have cost g(x, u) ≥ 0 for all u except one that leads to the termination state t with nonpositive cost.
Classical problems of this type include searching among several sites for a valuable object, with nonnegative
search costs and nonpositive stopping costs (stopping the search at every site is a proper policy guaranteeing
that Jˆ ≤ 0).
Proposition 2.5: Assume that the compactness and continuity condition holds, that Jˆ ≤ 0, and
that J * is real-valued. Then J * is equal to Jˆ and it is the unique fixed point of T within the set
{J ∈ <n | J ≥ J * }. Moreover, we have T k J → J * for every J ∈ <n with J ≥ J * .
Proof: We first observe that the hypothesis Jˆ ≤ 0 implies that there exists at least one proper policy, so
Prop. 2.2 applies, and shows that Jˆ is the unique fixed point of T within the set {J ∈ <n | J ≥ Jˆ} and that
T k J → Jˆ for all J ∈ <n with J ≥ Jˆ. We will prove the result by showing that Jˆ = J * . Since generically we
have Jˆ ≥ J * , it will suffice to show the reverse inequality.
Let J0 denote the zero function. Since Jˆ is a fixed point of T and Jˆ ≤ J0 , we have
Jˆ = lim_{k→∞} T k Jˆ ≤ lim sup_{k→∞} T k J0 .    (2.4)
Also, for each policy π = {µ0 , µ1 , . . .}, we have
Jπ = lim sup_{k→∞} Tµ0 · · · Tµk−1 J0 .
Since
T k J0 ≤ Tµ0 · · · Tµk−1 J0 ,
∀ k ≥ 0,
it follows that lim supk→∞ T k J0 ≤ Jπ , so by taking the infimum over π, we have
lim sup_{k→∞} T k J0 ≤ J * .    (2.5)
Combining Eqs. (2.4) and (2.5), it follows that Jˆ ≤ J * , so that Jˆ = J * .
Q.E.D.
The assumption Jˆ ≤ 0 of the preceding proposition will be satisfied if we can find a proper policy µ
such that Jµ ≤ 0. With this in mind, we note that the proposition applies to MDP where the compactness
and continuity condition holds, and J * is real-valued and satisfies J * ≤ 0, even if there is no termination
state. We can simply introduce an artificial termination state t, and for each x = 1, . . . , n, a control of cost
0 that leads from x to t, thereby creating a proper policy µ with Jµ = 0, without affecting J * . What is
happening here is that for the subset of states x for which J * (x) is maximum we must have J * (x) = 0, so
this subset plays the role of a termination state. Without the assumption Jˆ ≤ 0, we may have J * ≠ Jˆ even
if J * ≤ 0, as case (c) of Example 1.1 shows.
In all of the preceding results, there is the question of finding J ≥ Jˆ with which to start VI. One
possibility that may work is to use the cost function of a proper policy or an upper bound thereof. For
example in a stopping problem we may use the cost function of the policy that stops at every state. More
generally we may try to introduce an artificial high stopping cost, which is our approach in Section 3. If
it can be guaranteed that Jˆ = J * , this approach will also yield J * . For example, when g ≤ 0, we may use
J = 0 to start VI, as is well known. In the case where the classical SSP conditions hold, we may of course
start VI with any J ∈ <n , and it is generally recommended to use an upper bound to J * rather than a lower
bound, as noted earlier. In the case of the general convergence model, a method provided in [Fei02], p. 189,
may be used to find an upper bound to J * .
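As a small illustration of this initialization idea (ours; c_stop is a hypothetical scalar assumed to bound the cost of a known proper policy, e.g., the stop-at-every-state policy of a stopping problem), one may start VI from the constant vector c_stop · e, which then satisfies J ≥ Jˆ as required by Prop. 2.2(b).

import numpy as np

def vi_from_upper_bound(g, p, c_stop, iters=1000):
    # Start VI from J0 = c_stop * e >= J_hat; by Prop. 2.2(b), T^k J0 -> J_hat.
    J = np.full(g.shape[0], float(c_stop))
    for _ in range(iters):
        J = (g + p @ J).min(axis=1)
    return J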
We finally note that once the optimal cost function over proper policies Jˆ is found by VI, there may
still be an issue of finding a proper policy that attains Jˆ (see the discussion following Prop. 2.5). When
Jˆ = J * , a polynomial complexity algorithm for this is given in [FeY08]. The PI algorithm of the next section
can also be used for this purpose, even if Jˆ ≠ J * , although its complexity properties are unknown at present.
2.2. A Policy Iteration Algorithm with Perturbations
We will now use our perturbation framework to deal with the oscillatory behavior of PI, which is illustrated
in case (d) of Example 1.1. We will develop a perturbed version of the PI algorithm that generates a sequence
of proper policies {µk } such that Jµk → Jˆ, under the assumptions of Prop. 2.2, which include the existence
of a proper policy and that J * is real-valued. The algorithm generates the sequence {µk } as follows.
Let {δk } be a positive sequence with δk ↓ 0, and let µ0 be any proper policy. At iteration k, we have a
proper policy µk , and we generate µk+1 according to
Tµk+1 Jµk ,δk = T Jµk ,δk .    (2.6)
Note that since µk is proper, Jµk ,δk is the unique fixed point of the mapping Tµk ,δk given by
Tµk ,δk J = Tµk J + δk e.
The policy µk+1 of Eq. (2.6) exists by the compactness and continuity condition. We claim that µk+1 is
proper. To see this, note that
Tµk+1 ,δk Jµk ,δk = T Jµk ,δk + δk e ≤ Tµk Jµk ,δk + δk e = Jµk ,δk ,
so that
(Tµk+1 ,δk )^m Jµk ,δk ≤ Tµk+1 ,δk Jµk ,δk = T Jµk ,δk + δk e ≤ Jµk ,δk ,    ∀ m ≥ 1.    (2.7)
Since Jµk ,δk forms an upper bound to (Tµk+1 ,δk )^m Jµk ,δk , it follows that µk+1 is proper [if it were improper,
we would have ((Tµk+1 ,δk )^m Jµk ,δk )(x) → ∞ for some x; cf. Eq. (2.1)]. Thus the sequence {µk } generated
by the perturbed PI algorithm (2.6) is well-defined and consists of proper policies. We have the following
proposition.
Proposition 2.6: Assume that the compactness and continuity condition holds, that there exists
at least one proper policy, and that J * is real-valued. Then the sequence {Jµk } generated by the
perturbed PI algorithm (2.6) satisfies Jµk → Jˆ.
Proof: Using Eq. (2.7), we have
Jµk+1 ,δk+1 ≤ Jµk+1 ,δk = lim_{m→∞} (Tµk+1 ,δk )^m Jµk ,δk ≤ T Jµk ,δk + δk e ≤ Jµk ,δk ,
where the equality holds because µk+1 is proper, as shown earlier. Taking the limit as k → ∞, and noting
that Jµk+1 ,δk+1 ≥ Jˆ, we see that Jµk ,δk ↓ J + for some J + ≥ Jˆ, and we obtain

Jˆ ≤ J + = lim_{k→∞} T Jµk ,δk .    (2.8)
We also have

inf_{u∈U(x)} H(x, u, J +) ≤ lim_{k→∞} inf_{u∈U(x)} H(x, u, Jµk ,δk ) ≤ inf_{u∈U(x)} lim_{k→∞} H(x, u, Jµk ,δk )
                         = inf_{u∈U(x)} H(x, u, lim_{k→∞} Jµk ,δk ) = inf_{u∈U(x)} H(x, u, J +),
where the first inequality follows from the fact J + ≤ Jµk ,δk , which implies that H(x, u, J + ) ≤ H(x, u, Jµk ,δk ),
and the first equality follows from the continuity of H(x, u, ·). Thus equality holds throughout above, so
that
lim_{k→∞} T Jµk ,δk = T J + .    (2.9)
Combining Eqs. (2.8) and (2.9), we obtain Jˆ ≤ J + = T J + . Since by Prop. 2.2, Jˆ is the unique fixed point
of T within {J ∈ <n | J ≥ Jˆ}, it follows that J + = Jˆ. Thus Jµk ,δk ↓ Jˆ, and since Jµk ,δk ≥ Jµk ≥ Jˆ, we have
Jµk → Jˆ. Q.E.D.

Proposition 2.6 guarantees the monotonic convergence Jµk ,δk ↓ Jˆ (see the preceding proof) and the
(possibly nonmonotonic) convergence Jµk → Jˆ. When the control space U is finite, Prop. 2.6 also implies
that the generated policies µk will be optimal for all k sufficiently large. The reason is that the set of policies
is finite and there exists a sufficiently small ε > 0, such that for all nonoptimal µ there is some state x such
that Jµ (x) ≥ Jˆ(x) + ε. This convergence behavior should be contrasted with the behavior of PI without
perturbations, which may lead to difficulties, as noted earlier [cf. case (d) of Example 1.1].

However, when the control space U is infinite, the generated sequence {µk } may exhibit some serious
pathologies in the limit. If {µk }K is a subsequence of policies that converges to some µ̄, in the sense that

lim_{k→∞, k∈K} µk (x) = µ̄(x),    ∀ x = 1, . . . , n,

then from the preceding proof [cf. Eq. (2.9)], we have

Tµk Jµk−1 ,δk−1 = T Jµk−1 ,δk−1 → T Jˆ.

Taking the limit as k → ∞, k ∈ K, we obtain

Tµ̄ Jˆ ≤ lim_{k→∞, k∈K} T Jµk−1 ,δk−1 = T Jˆ,

where the inequality follows from the lower semicontinuity of g(x, ·) and the continuity of pxy (·). Since we
also have Tµ̄ Jˆ ≥ T Jˆ, we see that Tµ̄ Jˆ = T Jˆ. Thus µ̄ satisfies the optimality condition of Prop. 2.2(c), and µ̄
is optimal if it is proper. On the other hand, properness of the limit policy µ̄ may not be guaranteed, even
if Jˆ = J * . In fact the generated sequence of proper policies {µk } satisfies limk→∞ Jµk = Jˆ = J * , yet {µk }
may converge to an improper policy that is strictly suboptimal, as shown by the following example, which
is similar to Example 6.7 of [Fei02].

Figure 2.1. A stochastic shortest path problem with two states 1, 2, and a termination state t. Here we have
Jˆ = J ∗ , but there is no optimal policy. Any sequence of proper policies {µk } with µk (1) → 0 is asymptotically
optimal in the sense that Jµk → Jˆ = J ∗ , yet {µk } converges to the strictly suboptimal improper policy for
which u = 0 at state 1.

Example 2.1 (A Counterexample for the Perturbation-Based PI Algorithm)

Consider two states 1 and 2, in addition to the termination state t; see Fig. 2.1. At state 1 we must choose
u ∈ [0, 1], with expected cost equal to u. Then, we transition to state 2 with probability √u, and we
self-transition to state 1 with probability 1 − √u. From state 2 we transition to t with cost −1. Thus we have

H(1, u, J) = u + (1 − √u) J(1) + √u J(2),    ∀ J ∈ <2 , u ∈ [0, 1],

H(2, u, J) = −1,    ∀ J ∈ <2 , u ∈ U (2).
Here there is a unique improper policy µ: it chooses u = 0 at state 1, and has cost Jµ (1) = 0. Every policy µ
with µ(1) ∈ (0, 1] is proper, and Jµ can be obtained by solving the equation Jµ = Tµ Jµ . We have Jµ (2) = −1,
so that
Jµ (1) = µ(1) + (1 − √µ(1)) Jµ (1) − √µ(1),

and we obtain Jµ (1) = √µ(1) − 1. Thus, Jˆ(1) = J ∗ (1) = −1. The perturbation-based PI algorithm will
µ(1) − 1. Thus, J(1)
= J ∗ (1) = −1. The perturbation-based PI algorithm will
generate a sequence of proper policies {µk } with µk (1) → 0. Any such sequence is asymptotically optimal in
the sense that Jµk → Jˆ = J ∗ , yet it converges to the strictly suboptimal improper policy. Note that in this
example, the compactness and continuity condition is satisfied.
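When the control sets U (x) are finite, so that the pathology of Example 2.1 cannot occur and, by Prop. 2.1(d), a proper policy attaining Jˆ exists, the perturbed PI iteration (2.6) is straightforward to implement. The following sketch is ours (same array conventions as the earlier sketches, with a proper initial policy mu assumed); each policy is evaluated in the δk-perturbed problem by solving the linear system J = gµ + δk e + Pµ J.

import numpy as np

def evaluate(mu, g, p, delta):
    # Policy evaluation for a proper policy mu in the delta-perturbed problem:
    # solve J = g_mu + delta*e + P_mu J (I - P_mu is invertible since mu is proper).
    n = g.shape[0]
    g_mu = g[np.arange(n), mu] + delta
    P_mu = p[np.arange(n), mu, :]
    return np.linalg.solve(np.eye(n) - P_mu, g_mu)

def perturbed_pi(mu, g, p, deltas):
    # mu: a proper initial policy; deltas: positive scalars decreasing to 0.
    for delta in deltas:
        J = evaluate(mu, g, p, delta)        # J_{mu^k, delta_k}
        mu = (g + p @ J).argmin(axis=1)      # policy improvement, cf. Eq. (2.6)
    return mu, evaluate(mu, g, p, 0.0)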
2.3. Discussion: Fixed Points of T
While Jˆ is a fixed point of T under our assumptions, as shown in Prop. 2.2(a), an interesting question is
whether and under what conditions J * is also a fixed point of T . The following example shows that if g(x, u)
can take both positive and negative values, and the problem is stochastic, J * may not be a fixed point of T .
Moreover, Jµ need not be a fixed point of Tµ , when µ is improper.
Example 2.2 (A Problem Where J ∗ is not a Fixed Point of T )
Consider the SSP problem of Fig. 2.2, which involves a single improper policy µ (we will introduce a proper
policy later). All transitions under µ are deterministic as shown, except at state 1 where the successor state is
3 or 5 with equal probability 1/2. Under the definition (1.1) of Jµ in terms of lim sup, we have
Jµ (1) = 0,    Jµ (2) = Jµ (5) = 1,    Jµ (3) = Jµ (7) = 0,    Jµ (4) = Jµ (6) = 2,
so that the Bellman equation at state 1,
Jµ (1) = (1/2)(Jµ (2) + Jµ (5)),
is not satisfied. Thus Jµ is not a fixed point of Tµ . If for x = 1, . . . , 7, we introduce another control that leads
from x to t with cost c > 2, we create a proper policy that is strictly suboptimal, while not affecting J ∗ , which
again is not a fixed point of T .
Of course there are known cases where J * is a fixed point of T , including the SSP problem under the
classical SSP conditions, the case where g ≥ 0, the case where g ≤ 0, the positive bounded model discussed in
Section 7.2 of [Put94], and the general convergence models discussed in [Fei02] and [Yu14]. The assumptions
of all these models are violated by the preceding example.
It turns out that J * is a fixed point of T in the special case of a deterministic shortest path problem,
i.e., an SSP problem where for each x and u ∈ U (x), there is a unique successor state denoted f (x, u). For
such a problem, the mappings Tµ and T take the form
(Tµ J)(x) = g(x, µ(x)) + J(f (x, µ(x))),    (T J)(x) = inf µ∈M (Tµ J)(x).
Moreover, by using the definition (1.1) of Jµ in terms of lim sup, we have for all µ ∈ M (proper or improper),
Jµ (x) = g(x, µ(x)) + Jµ (f (x, µ(x))) = (Tµ Jµ )(x),    x = 1, . . . , n.
Figure 2.2. An example of an improper policy µ, where Jµ is not a fixed point of Tµ . All transitions under µ are deterministic as shown, except at state 1 where the next state is 2 or 5 with equal probability 1/2.
For any policy π = {µ0 , µ1 , . . .}, using the definition (1.1) of Jπ in terms of lim sup, we have for all x,
Jπ (x) = g(x, µ0 (x)) + Jπ1 (f (x, µ0 (x))),    (2.10)
where π1 = {µ1 , µ2 , . . .}. By taking the infimum of the left-hand side over π and the infimum of the right-hand
side over π1 and then µ0 , we obtain J ∗ = T J ∗ . Note that this argument does not require any assumptions
other than the deterministic character of the problem, and holds even if the state space is infinite. The
mathematical reason why Bellman's equation Jµ = Tµ Jµ may not hold for stochastic problems and improper µ (cf. Example 2.2) is that lim sup may not commute with the expected value that is inherent in Tµ , and the preceding proof breaks down.
The improper policy of Example 2.2 may be viewed as a randomized policy for a deterministic shortest
path problem, where from state 1 one can go to state 2 or state 5, and the randomized policy chooses either one with equal probability. For this problem, J ∗ takes the same values as in Example 2.2 for all x ≠ 1, but
it takes the value J ∗ (1) = 1 rather than J ∗ (1) = 0. Thus, surprisingly, once we allow randomized policies
into the problem in the manner of Example 2.2, the optimal cost function ceases to be a fixed point of T
and simultaneously its optimal cost at state 1 is improved.
On the other hand, under the assumptions of Prop. 2.2, if we allow randomization, the optimal cost
function Jˆr over randomized policies that are proper will not be improved over Jˆ. To formalize this assertion,
let us note that given the SSP problem of Section 1, where no randomized policies are explicitly allowed,
we can introduce another SSP problem where the control space is enlarged to include randomized controls.
This is the problem where the set of feasible controls at state x is taken to be the set of probability measures
over U (x), denoted by P(U (x)), and the cost function and transition probabilities, g(x, ·) and pxy (·), are
appropriately redefined over P(U (x)). Note that P(U (x)) is a compact metric space by Prop. 7.22 of [BeS78]
(with any metric that metrizes weak convergence of probability measures; cf. Chap. 11.3 of [Dud02]). Also,
Prop. 7.31 of [BeS78] can be used to show that the one-stage cost remains lower semicontinuous with respect
to randomized controls, while the compactness and continuity condition holds for the randomized controls
problem, assuming it holds for the nonrandomized controls problem. Moreover, the set of proper policies for
the randomized controls problem is nonempty as it contains the proper nonrandomized policies. Thus we
have Jˆ ≥ Jˆr , while the analysis of the preceding section applies to the randomized controls problem. Let Tr
be the DP mapping defining Bellman’s equation for this problem, and let Jr* be the corresponding optimal
cost function. Then it can be seen that Tr J = T J for all J ∈ <n . Since Jˆ is a real-valued fixed point of T
and hence of Tr , it follows that Jr* is real-valued (see the comment following the proof of Prop. 2.2). Thus
from Prop. 2.2(a), Jˆr is the unique fixed point of Tr , and hence also of T , over all J ≥ Jˆr . We have Jˆ ≥ Jˆr
while Jˆ is a fixed point of T , so we obtain that Jˆ = Jˆr .
3.
THE NONNEGATIVE ONE-STAGE COSTS CASE
In this section we aim to eliminate the difficulties of the VI and PI algorithms in the case where g(x, u) ≥ 0
for all x and u ∈ U (x). Under this assumption we can appeal to the results of nonnegative-cost MDP, which we summarize
in the following proposition.
Proposition 3.1:
Assume that g(x, u) ≥ 0 for all x and u ∈ U (x). Then:
(a) J ∗ = T J ∗ , and if J ∈ En+ satisfies J = T J, then J ≥ J ∗ .
(b) A policy µ∗ is optimal if and only if Tµ∗ J * = T J * .
(c) If the compactness and continuity condition holds, then there exists at least one optimal policy,
and we have T k J → J ∗ for all J ∈ En+ with J ≤ J ∗ .
For proofs of the various parts of the proposition and related textbook accounts, see [Put94], Ch. 7, and
[Ber12a], Ch. 4. The monograph [BeS78] contains extensions to infinite state space frameworks that address
the associated measurability issues, including the convergence of VI starting from J with 0 ≤ J ≤ J * . The
recent paper [YuB13c] provides PI algorithms that can deal with these measurability issues, and establishes
the convergence of VI for a broader range of initial conditions. The paper [Ber77], and the monographs
[BeS78], Ch. 5, and [Ber13], Ch. 4, provide an abstract DP formulation that points a way to extensions of
the results of this section.
It is well-known that for nonnegative-cost MDP, the standard PI and VI algorithms may encounter
difficulties. In particular, case (c) of Example 1.1 shows that there may exist a strictly suboptimal policy µ
satisfying Tµ Jµ = T Jµ , so PI will terminate with such a policy. Moreover, there may exist fixed points J of
T satisfying J ≥ J ∗ and J ≠ J ∗ , and VI will terminate starting with such a fixed point; this occurs in case (c) of Example 1.1, where Jˆ is a fixed point of T and Jˆ(1) > J ∗ (1). In the next subsection, we will address
these difficulties by introducing a suitable transformation.
3.1. Reformulation to an Equivalent Problem
We will define another SSP problem, which will be shown to be “equivalent” to the given problem. In the
new SSP problem, all the states x in the set
X 0 = { x ∈ X | J ∗ (x) = 0 },
including the termination state t, are merged into a new termination state t̄. This idea was inspired by
a result of our recent paper [YuB13c] on convergence of VI, which shows that for a class of Borel space
nonnegative-cost DP models we have T k J → J * for all J such that J * ≤ J ≤ cJ * for some c > 1 (or
0 ≤ J ≤ cJ * for some c > 1 if the compactness and continuity condition holds in addition); a related result
is given by [Whi79]. Similar ideas, based on eliminating cycles of zero cost transitions by merging them
to a single state have been mentioned in the context of deterministic shortest path algorithms. Moreover,
while our focus will be on VI and PI, we mention that an alternative approach based on linear programming
has been given in the paper [DHP12]. When J * is real-valued, this approach can construct an optimal
randomized stationary policy from a linear programming solution and the information about the optimal
controls for the states in the set X 0 .
Note that from the Bellman equation J * = T J * [cf. Prop. 3.1(a)], and assuming the compactness and
continuity condition (which implies that there exists µ such that Tµ J = T J for all J ∈ En+ ), we obtain the
following useful characterization of X 0 :
x ∈ X 0 if and only if there exists u ∈ U (x) such that g(x, u) = 0 and pxy (u) = 0 for all y ∉ X 0 .    (3.1)
Algorithms for constructing X 0 , to be given later in this section, will rely on this characterization.
It is possible that J * is the zero vector and X 0 = X [this is true in case (b) of Example 1.1]. The
algorithm for finding X 0 , to be given later in this section, can still be used to verify that this is the case
thereby solving the problem, but to facilitate the exposition, we assume without essential loss of generality
that the set X + given by
X + = { x ∈ X | J ∗ (x) > 0 },
is nonempty. We introduce a new SSP problem as follows.
Definition of Equivalent SSP Problem:
State space: X̄ = X + ∪ {t̄}, where t̄ is a cost-free and absorbing termination state.
Controls and one-stage costs: For x ∈ X + , we have Ū (x) = U (x) and ḡ(x, u) = g(x, u), for all u ∈ U (x).
Transition probabilities: For x ∈ X + and u ∈ U (x), we have
p̄xy (u) = pxy (u)  if y ∈ X + ,    p̄xy (u) = ∑z∈X 0 pxz (u)  if y = t̄.
The optimal cost vector for the equivalent SSP problem is denoted by J¯, and is the smallest nonnegative solution of the corresponding Bellman equation J¯ = T̄ J¯, where
(T̄ J)(x) = inf u∈U (x) [ ḡ(x, u) + ∑y∈X + p̄xy (u)J(y) ] = inf u∈U (x) [ g(x, u) + ∑y∈X + pxy (u)J(y) ],    x ∈ X + ,    (3.2)
[cf. Prop. 3.1(a)].
We will now clarify the relation of the equivalent SSP problem with the given SSP problem (also referred
to as the “original” SSP problem). The key fact for our purposes, given in the following proposition, is that
J¯ coincides with J * on the set X + . Moreover, if J * is real-valued (which can be guaranteed by the existence
of a proper policy for the original SSP problem), then the equivalent SSP problem satisfies the classical SSP
conditions given in the introduction. As a result we may transfer the available analytical results from the
equivalent SSP problem to the original SSP problem. We may also apply the VI and PI methods discussed
in Section 1 to the equivalent SSP problem, after first obtaining the set X 0 , in order to compute the solution
of the original problem.
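To make the construction concrete, here is a minimal computational sketch (ours, not from the paper): it assumes a finite model stored as NumPy arrays, with p[x, u, y] and g[x, u] describing the original problem on the non-termination states (indexed from 0 for convenience; the termination state t is left implicit, so rows of p may sum to less than 1), forms the equivalent problem by keeping only the states in X + , and then runs VI on it:

    import numpy as np

    def equivalent_ssp(p, g, X0):
        # Keep only the states in X+; probability mass going to X0 (or to t)
        # is simply dropped, which amounts to redirecting it to the new
        # termination state t-bar, as in the definition of p-bar above.
        n, m = g.shape
        Xplus = [x for x in range(n) if x not in X0]
        g_bar = g[Xplus, :]
        p_bar = p[np.ix_(Xplus, list(range(m)), Xplus)]
        return Xplus, p_bar, g_bar

    def vi_equivalent(p_bar, g_bar, iters=1000):
        # Value iteration J <- min_u [ g-bar(x,u) + sum_y p-bar(x,u,y) J(y) ];
        # by Prop. 3.2(c), when J* is real-valued the equivalent problem
        # satisfies the classical SSP conditions, so this converges to J-bar.
        J = np.zeros(g_bar.shape[0])
        for _ in range(iters):
            J = np.min(g_bar + p_bar @ J, axis=1)
        return J

By Prop. 3.2(a), the resulting vector gives J ∗ on X + , while J ∗ = 0 on X 0 .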
Proposition 3.2: Assume that g(x, u) ≥ 0 for all x and u ∈ U (x), and that the compactness and
continuity condition holds. Then:
(a) J ∗ (x) = J¯(x) for all x ∈ X + .
(b) A policy µ∗ is optimal for the original SSP problem if and only if
µ∗ (x) = µ̄(x), ∀ x ∈ X + ,    g(x, µ∗ (x)) = 0,  pxy (µ∗ (x)) = 0, ∀ x ∈ X 0 , y ∈ X + ,    (3.3)
where µ̄ is an optimal policy for the equivalent SSP problem.
(c) If J * is real-valued, then in the equivalent SSP problem every improper policy has infinite
cost starting from some initial state. Moreover, there exists at least one proper policy, so the
equivalent SSP problem satisfies the classical SSP conditions.
(d) If 0 < J * (x) < ∞ for all x = 1, . . . , n, then the original SSP problem satisfies the classical SSP
conditions.
Proof: (a) Let us extend J¯ to a vector Jˆ that has domain X:
Jˆ(x) = J¯(x)  if x ∈ X + ,    Jˆ(x) = 0  if x ∈ X 0 .
Then from the Bellman equation J¯ = T̄ J¯ [cf. Prop. 3.1(a)], and the definition (3.2) of T̄ , we have Jˆ(x) = (T Jˆ)(x) for all x ∈ X + , while from Eq. (3.1), we have (T Jˆ)(x) = 0 = Jˆ(x) for all x ∈ X 0 . Thus Jˆ is a fixed point of T , so that Jˆ ≥ J ∗ [since J ∗ is the smallest nonnegative fixed point of T , cf. Prop. 3.1(a)], and hence J¯(x) ≥ J ∗ (x) for all x ∈ X + . Conversely, the restriction of J ∗ to X + is a solution of the Bellman equation J = T̄ J, with T̄ given by Eq. (3.2), so we have J¯(x) ≤ J ∗ (x) for all x ∈ X + [since J¯ is the smallest nonnegative fixed point of T̄ , cf. Prop. 3.1(a)].
(b) A policy µ∗ is optimal for the original SSP problem if and only if J * = T J * = Tµ∗ J * [cf. Prop. 3.1(b)],
or
J ∗ (x) = inf u∈U (x) [ g(x, u) + ∑y=1,...,n pxy (u)J ∗ (y) ] = g(x, µ∗ (x)) + ∑y=1,...,n pxy (µ∗ (x)) J ∗ (y),    ∀ x = 1, . . . , n.
Equivalently, µ∗ is optimal if and only if
J ∗ (x) = inf u∈U (x) [ g(x, u) + ∑y∈X + pxy (u)J ∗ (y) ] = g(x, µ∗ (x)) + ∑y∈X + pxy (µ∗ (x)) J ∗ (y),    ∀ x ∈ X + ,    (3.4)
and Eq. (3.3) holds. Using part (a), the Bellman equation J¯ = T̄ J¯, and the definition (3.2) of T̄ , we see
that Eq. (3.4) is the necessary and sufficient condition for optimality of the restriction of µ∗ to X + in the
equivalent SSP problem, and the result follows.
(c) Let µ be an improper policy of the equivalent SSP problem. Then the Markov chain induced by µ contains
a recurrent class R that consists of states x with J * (x) > 0 [since we have J * (x) > 0, for all x ∈ X + ]. We
have g(x, µ(x)) > 0 for some x ∈ R [otherwise, g(x, µ(x)) = 0 for all x ∈ R, implying that J̄µ (x) = 0 and hence J ∗ (x) = 0 for all x ∈ R]. From this it follows that J̄µ (x) = ∞ for all x ∈ R, since under µ, all
states x ∈ R are visited infinitely often with probability 1 starting from within R. To prove the existence of
a proper policy, we note that by the compactness and continuity condition, the original SSP problem has an
optimal policy [cf. Prop. 3.1(c)], and since J * is real-valued this policy cannot be improper (as we have just
shown, improper policies have infinite cost starting from at least one initial state).
(d) Here the original and equivalent SSP problems coincide, so the result follows from part (c).
Q.E.D.
The following proposition provides analytical and computational results for the original SSP problem,
using the equivalent SSP problem.
Proposition 3.3:
Assume that g(x, u) ≥ 0 for all x and u ∈ U (x), that the compactness and
continuity condition holds, and that J * is real-valued. Consider the set
J = <n+ ∪ { J ∉ <n+ | J(x) = 0, ∀ x = 1, . . . , n, with J ∗ (x) = 0 } .
Then:
(a) J * is the unique fixed point of T within J .
(b) We have T k J → J * for any J ∈ J .
Proof: (a) Since the classical SSP conditions hold for the equivalent SSP problem by Prop. 3.2(c), J¯ is the
unique fixed point of T̄ . From Prop. 3.2(a) and the definition of the equivalent SSP problem, it follows that
J * is the unique fixed point of T within the set of J ∈ <n with J(x) = 0 for all x ∈ X 0 . From Prop. 3(a) of
[BeT91], we also have that J * is the unique fixed point of T within <n+ , thus completing the proof.
(b) Similar to the proof of part (a), the VI algorithm for the equivalent SSP problem is convergent to J¯
from any initial condition. Together with the convergence of VI starting from any J ∈ <n+ [cf. Prop. 3(b) of
[BeT91]], this implies the result. Q.E.D.
Note that Prop. 3.3(a) narrows down the range of possible fixed points of T relative to known results
under the nonnegativity conditions ([BeT91], Prop. 3, asserts uniqueness of the fixed point of T only within
<n+ ). In particular, if J * (x) > 0 for all x = 1, . . . , n, J * is the unique fixed point of T within <n . However,
case (b) of Example 1.1 shows that the set J cannot be replaced by <n in the statement of the proposition.
To make use of the proposition we should know the sets X 0 and X + , and also be able to deal with the case
where J * is not real-valued. We will provide an algorithm to determine X 0 next, and we will consider the
case where J * can take infinite values in the next subsection.
Algorithm for Constructing X 0 and X +
In practice, the sets X 0 and X + can often be determined by a simple analysis that relies on the special
structure of the given problem. When this is not so, we may compute these sets with a simple algorithm
that requires at most n iterations. Let
Û (x) = { u ∈ U (x) | g(x, u) = 0 },    x ∈ X.
Denote X1 = { x ∈ X | Û (x) ≠ ∅ }, and define for k ≥ 1,
Xk+1 = { x ∈ Xk | there exists u ∈ Û (x) such that y ∈ Xk for all y with pxy (u) > 0 }.
It can be seen with a straightforward induction that
Xk = { x ∈ X | (T k J0 )(x) = 0 },
where J0 is the zero vector. Clearly we have Xk+1 ⊂ Xk for all k, and since X is finite, the algorithm
terminates at some iteration k̄ with Xk̄+1 = Xk̄ . Moreover the set Xk̄ is equal to X 0 , since we have
T k J0 ↑ J * under the compactness and continuity condition [cf. Prop. 3.1(c)]. If the number of state-control
pairs is finite, say m, each iteration requires O(m) computation, so the complexity of the algorithm for
finding X 0 and X + is O(mn). Note that this algorithm finds X 0 even when X 0 = X and X + is empty.
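For concreteness, the iteration just described admits a direct implementation; the following sketch (ours) assumes the finite model is available through accessors U(x), g(x, u), and succ(x, u) = {y : pxy (u) > 0}, which are illustrative names rather than notation from the paper:

    def compute_X0(states, U, g, succ):
        # Uhat(x) = { u in U(x) : g(x,u) = 0 }, cf. the definition above.
        Uhat = {x: [u for u in U(x) if g(x, u) == 0] for x in states}
        Xk = {x for x in states if Uhat[x]}                    # X1
        while True:
            # X_{k+1}: states with a zero-cost control whose successors stay in Xk.
            Xnext = {x for x in Xk
                     if any(set(succ(x, u)) <= Xk for u in Uhat[x])}
            if Xnext == Xk:
                return Xnext          # this is X0; X+ is the complement
            Xk = Xnext

Each pass touches every state-control pair at most once, matching the O(mn) bound noted above.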
3.2. The Case Where J * is not Real-Valued
In order to use effectively the equivalent SSP problem, J * must be real-valued, so that Prop. 3.3 can apply.
It turns out that this restriction can be circumvented by introducing an artificial high-cost stopping action
at each state, thereby making J * real-valued.
In particular, let us introduce for each scalar c > 0, an SSP problem that is identical to the original,
except that an additional control is added to each U (x), under which the transition to the termination state
t occurs with probability 1 and a cost c is incurred. We refer to this problem as the c-SSP problem, and we
denote its optimal cost vector by Jˆc . Note that
Jˆc (x) ≤ c,
Jˆc (x) ≤ J * (x),
∀ x ∈ X, c > 0,
and that Jˆc is the unique fixed point of the corresponding mapping T̂c given by
(T̂c J)(x) = min[ c, inf u∈U (x) [ g(x, u) + ∑y=1,...,n pxy (u)J(y) ] ],    x = 1, . . . , n,    (3.5)
within the set of J ∈ <n with J(x) = 0 for all x ∈ X 0 [cf. Prop. 3.3(a)]. Let
X f = { x ∈ X | J ∗ (x) < ∞ },    X ∞ = { x ∈ X | J ∗ (x) = ∞ }.
We have the following proposition.
Proposition 3.4: Assume that g(x, u) ≥ 0 for all x and u ∈ U (x), and that the compactness and
continuity condition holds. Then:
(a) We have
limc→∞ Jˆc (x) = J ∗ (x),    ∀ x ∈ X.
(b) If the control space U is finite, there exists c > 0 such that for all c ≥ c, we have
Jˆc (x) = J * (x),
∀ x ∈ Xf ,
and if µ̂ is an optimal policy for the c-SSP problem, then any policy µ∗ such that µ∗ (x) = µ̂(x)
for x ∈ X f is optimal for the original SSP problem.
We will first state the following preliminary lemma, which can be easily proved by induction.
Lemma 3.1:
Let
ck = max x∈X (T k J0 )(x),    k = 0, 1, . . . ,    (3.6)
where J0 is the zero vector. Then for all c ≥ ck , the VI algorithm, starting from J0 , produces identical
results for the c-SSP and the original SSP problems up to iteration k:
T̂cm J0 = T m J0 ,
∀ m ≤ k.
Proof of Prop. 3.4: (a) Let J0 be the zero vector. We have
J ∗ ≥ limc→∞ Jˆc ≥ Jˆck ≥ (T̂ck )k J0 = T k J0 ,
where ck is given by Eq. (3.6), and the last equality follows from Lemma 3.1. Since T k J0 ↑ J * [cf. Prop.
3.1(c)], we obtain limc→∞ Jˆc = J * .
(b) The result is clearly true if J * is real-valued, since then for c ≥ maxx∈X J * (x), the VI algorithm starting
from J0 produces identical results for the c-SSP and original SSP problems, so for such c, Jˆc = J * . For the
case where X ∞ is nonempty, we will formulate “reduced” versions of these two problems, where the states
in X f do not communicate with the states in X ∞ , so that by restricting the reduced problems to X f , we
revert to the case where J * is real-valued.
Indeed, for both the c-SSP problem and the original problem, let us replace the constraint set U (x) by
the set
Û (x) = U (x)  if x ∈ X ∞ ,    Û (x) = { u ∈ U (x) | pxy (u) = 0, ∀ y ∈ X ∞ }  if x ∈ X f ,
so that the infinite cost states in X ∞ are unreachable from the finite cost states in X f . We refer to the
problems thus created as the “reduced” c-SSP problem and the “reduced” original SSP problem.
We now apply Prop. 3.1(b) to both the original and the “reduced” original SSP problems. In the
original SSP problem, for each x ∈ X f , the infimum in the expression
min u∈U (x) [ g(x, u) + ∑y=1,...,n pxy (u)J ∗ (y) ],
is attained for some u ∈ Û (x) [controls u ∉ Û (x) are inferior because they lead with positive probability to states y ∈ X ∞ ]. Thus an optimal policy for the original SSP problem is feasible for the reduced original SSP
problem, and hence also optimal since the optimal cost cannot become smaller at any state when passing
from the original to the reduced original problem. Similarly, for each x ∈ X f , the infimum in the expression
min[ c, min u∈U (x) [ g(x, u) + ∑y=1,...,n pxy (u)Jˆc (y) ] ],
[cf. Eq. (3.5)] is attained for some u ∈ Û (x) once c becomes sufficiently large. The reason is that for y ∈ X ∞ ,
Jˆc (y) ↑ J ∗ (y) = ∞, so for sufficiently large c, each control u ∉ Û (x) becomes inferior to the controls u ∈ Û (x),
for which pxy (u) = 0. Thus by taking c large enough, an optimal policy for the original c-SSP problem,
becomes feasible and hence optimal for the reduced c-SSP problem [here the size of the “large enough” c
depends on x and u, so finiteness of X and U (x) is important for this argument]. We have thus shown that
the optimal cost vector of the reduced original SSP problem is also J * , and the optimal cost vector of the
reduced c-SSP problem is also Jˆc for sufficiently large c.
Clearly, starting from any state in X f it is impossible to transition to a state x ∈ X ∞ in the reduced
original SSP problem and the reduced c-SSP problem. Thus if we restrict these problems to the set of states
in X f , we will not affect their optimal costs for these states. Since J * is real-valued in X f , it follows that
for sufficiently large c, these optimal cost vectors are equal (as noted in the beginning of the proof), i.e.,
Jˆc (x) = J * (x) for all x ∈ X f . Q.E.D.
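As a computational footnote (a sketch under our own assumptions about the data layout, not a prescription of the paper), the mapping T̂c of Eq. (3.5) is easy to iterate for a finite model, and Prop. 3.4 suggests solving a sequence of c-SSP problems with increasing c:

    import numpy as np

    def c_ssp_vi(p, g, c, iters=1000):
        # One c-SSP problem: the added stopping action caps the backup at c,
        # i.e. J <- min[ c, min_u ( g(x,u) + sum_y p(x,u,y) J(y) ) ], cf. Eq. (3.5).
        J = np.zeros(g.shape[0])
        for _ in range(iters):
            J = np.minimum(c, np.min(g + p @ J, axis=1))
        return J

    def solve_by_increasing_c(p, g, c=1.0, factor=10.0, rounds=8):
        # Illustrative stopping rule: stop when the greedy policy no longer
        # changes as c is increased (cf. Prop. 3.4(b) for finite U).
        prev_policy = None
        for _ in range(rounds):
            J = c_ssp_vi(p, g, c)
            policy = np.argmin(g + p @ J, axis=1)
            if prev_policy is not None and np.array_equal(policy, prev_policy):
                return J, policy, c
            prev_policy, c = policy, c * factor
        return J, prev_policy, c

Here p[x, u, y] and g[x, u] describe the original problem with the termination state left implicit (rows of p may sum to less than 1).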
We note that the compactness and continuity condition is needed for Prop. 3.4(a) to hold, while the
finiteness of U is needed for Prop. 3.4(b) to hold. We demonstrate this with examples.
Example 3.1 (Counterexamples)
Consider the SSP problem of Fig. 3.1, and two cases:
(a) U (2) = (0, 1], so the compactness and continuity condition is violated. Then we have J ∗ (1) = J ∗ (2) = ∞.
Let us now calculate Jˆc (1) and Jˆc (2) from the Bellman equation
Jˆc (1) = min[ c, 1 + Jˆc (1) ],    Jˆc (2) = min[ c, inf u∈(0,1] { 1 − √u + uJˆc (1) } ].
The equation on the left yields Jˆc (1) = c, and for c ≥ 1, the minimization in the equation on the right
takes the form
inf u∈(0,1] { 1 − √u + uc }.
By setting to 0 the derivative with respect to u, we see that the infimum is attained at u = 1/(2c)² , yielding
Jˆc (2) = 1 − 1/(4c),    for c ≥ 1.
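(A quick numerical check of this minimization, added by us for illustration: a coarse grid search reproduces the value 1 − 1/(4c) and the minimizer 1/(2c)².)

    import math
    c = 5.0
    grid = [k / 10**5 for k in range(1, 10**5 + 1)]
    best_u = min(grid, key=lambda u: 1 - math.sqrt(u) + u * c)
    print(best_u, 1 / (2 * c)**2)                              # both about 0.01
    print(1 - math.sqrt(best_u) + best_u * c, 1 - 1 / (4 * c)) # both 0.95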
Thus we have limc→∞ Jˆc (2) = 1, while J ∗ (2) = ∞, so the conclusion of Prop. 3.4(a) fails to hold.
Figure 3.1. The SSP problem of Example 3.1. There are states 1 and 2, in addition to the termination state t. At state 2, upon selecting control u, we incur cost 1 − √u, and move to state 1 with probability u and to t with probability 1 − u. At state 1 we incur cost 1 and stay at 1 with probability 1.
(b) U (2) = [0, 1], so the compactness and continuity condition is satisfied, but U is not finite. Then we have
J ∗ (1) = ∞, J ∗ (2) = 1. Similar to case (a), we calculate Jˆc (1) and Jˆc (2) from the Bellman equation. An
essentially identical calculation to the one of case (a) yields the same results for c ≥ 1:
Jˆc (1) = c,    Jˆc (2) = 1 − 1/(4c).
Thus we have limc→∞ Jˆc (1) = J ∗ (1) = ∞, and limc→∞ Jˆc (2) = J ∗ (2) = 1, consistently with Prop. 3.4(a). However, Jˆc (2) < J ∗ (2) for all c, so the conclusion of Prop. 3.4(b) fails to hold.
Proposition 3.4(b) suggests a procedure to solve a problem for which J ∗ is not real-valued, but the one-stage cost is nonnegative and the control space is finite:
(1) Use the algorithm of Section 3.1 to compute the sets X 0 and X + .
(2) Introduce for all x ∈ X + a stopping action with a cost c > 0.
(3) Solve the equivalent SSP problem and obtain a candidate optimal policy for the original SSP problem using Prop. 3.2(b). This step can be done with any one of the VI and PI algorithms noted in Section 1.
(4) Check that c is high enough, by testing to see if the candidate optimal policy changes as c is increased, or satisfies the optimality condition of Prop. 3.1(b). If it does, the current policy is optimal; if it does not, increase c by some fixed factor and repeat from Step (3).
By Prop. 3.4(b), this procedure terminates with an optimal policy in a finite number of steps.
Finally, let us note that the analysis and algorithms of this section are essentially about general (not necessarily SSP-type) nonnegative-cost finite-state MDP under the compactness and continuity condition. The reason is that such MDP can be trivially converted to SSP problems by adding an artificial termination state that is not reachable from any of the other states with a feasible transition. For these MDP, with the exception of the mixed VI and PI algorithm of our recent work [YuB13c] (Section 5.2), no valid exact or approximate PI method has been known. The transformation also brings to bear the available extensive methodology for SSP under the classical SSP conditions (VI, several versions of PI, including some that are asynchronous, and linear programming), as well as simulation-based algorithms, including Q-learning and approximate PI (see e.g., [Tsi94], [BeT96], [BeY10], [Ber12a], [YuB13a], [YuB13b]).
4.
CONCLUDING REMARKS
There are a few issues that we have not addressed and remain subjects for further research. Extensions to
infinite-state SSP problems are interesting, as well as the further investigation of the case where the one-stage
cost can take both positive and negative values. In particular, when Jˆ ≠ J ∗ , the characterization of the
set of fixed points of T , and algorithms for computing J * and an optimal (possibly improper) policy remain
open questions. The complications arising from the use of randomized policies are worth exploring. The
computational complexity properties of the PI algorithm of Section 2.2 also require investigation. A broader
issue relates to extensions of the notion of proper policy to DP models which are more general than SSP
problems. One such extension is the notion of a regular policy, which was introduced within the context of
semicontractive models in [Ber13]. This connection points the way to generalizations of the results of this
paper, among others, to affine monotonic models, including exponential cost and risk-sensitive MDP. In such
models, regular policies can be related to policies that stabilize an associated linear discrete-time system (see
[Ber13], Section 4.5), and an analysis that parallels the one of the present paper is possible.
We also have not fully discussed the case of the SSP problem where Jˆ ≤ 0 and its special case where
g(x, u) ≤ 0 for all (x, u). While we have shown in Prop. 2.5 that Jˆ = J * , and that VI converges to J * starting
from any J ∈ <n with J ≥ J * , the standard PI algorithm may encounter difficulties [see case (d) of Example
1.1]. However, the perturbation-based PI algorithm of Section 2.2 applies. Moreover, the optimistic (or
modified) form of PI given in [Put94], Section 7.2.6, for positive bounded MDP, also applies. Another valid
PI algorithm for the case g ≤ 0, which does not require that J * be real-valued or the existence and iterative
generation of proper policies, is the λ-PI algorithm introduced in [BeI96] and further studied in [ThS10],
[Ber12b], [Sch13], [YuB12] (see Section 4.3.3 of [Ber13]). This algorithm is not specific to the SSP problem,
and does not make use of the presence of a termination state. Still another possibility is a mixed VI and
PI algorithm, given in Section 4.2 of [YuB13c], which also does not rely on proper policies. This algorithm
converges from above to J * (which need not be real-valued), and applies even in the case of infinite (Borel)
state and control spaces.
5.
REFERENCES
[Alt99] Altman, E., 1999. Constrained Markov Decision Processes, CRC Press, Boca Raton, FL.
[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. “Temporal Differences-Based Policy Iteration and Applications
in Neuro-Dynamic Programming,” Lab. for Info. and Decision Systems Report LIDS-P-2349, MIT.
[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control: The Discrete Time Case,
Academic Press, N. Y. (republished by Athena Scientific, Belmont, MA, 1996); may be downloaded from
http://web.mit.edu/dimitrib/www/home.html.
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Distributed Computation: Numerical
Methods, Prentice-Hall, Englewood Cliffs, N. J.
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic Shortest Path Problems,”
Math. of OR, Vol. 16, pp. 580-595.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Programming, Athena Scientific, Belmont, MA.
[BeY10] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy Iteration in Dynamic Programming,” Proc. of Allerton Conf. on Com., Control and Comp., Allerton Park, Ill, pp. 1368-1374.
[BeY12] Bertsekas, D. P., and Yu, H., 2012. “Q-Learning and Enhanced Policy Iteration in Discounted
Dynamic Programming,” Math. of OR, Vol. 37, 2012, pp. 66-94.
[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dynamic Programming,” SIAM J.
on Control and Optimization, Vol. 15, pp. 438-464.
[Ber87] Bertsekas, D. P., 1987. Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall,
Englewood Cliffs, N. J.
[Ber12a] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Control, Vol. II, 4th Edition: Approximate Dynamic Programming, Athena Scientific, Belmont, MA.
[Ber12b] Bertsekas, D. P., 2012. “λ-Policy Iteration: A Review and a New Implementation,” in Reinforcement
Learning and Approximate Dynamic Programming for Feedback Control, by F. Lewis and D. Liu (eds.), IEEE
Press, 2012.
[Ber13] Bertsekas, D. P., 2013. Abstract Dynamic Programming, Athena Scientific, Belmont, MA.
[Bla65] Blackwell, D., 1965. “Positive Dynamic Programming,” Proc. Fifth Berkeley Symposium Math.
Statistics and Probability, pp. 415-418.
[CFM00] Cavazos-Cadena, R., Feinberg, E. A., and Montes-De-Oca, R., 2000. “A Note on the Existence of
Optimal Policies in Total Reward Dynamic Programs with Compact Action Sets,” Math. of OR, Vol. 25,
pp. 657-666.
[Der70] Derman, C., 1970. Finite State Markovian Decision Processes, Academic Press, N. Y.
[DHP12] Dufour, F., Horiguchi, M., and Piunovskiy, A. B., 2012. “The Expected Total Cost Criterion
for Markov Decision Processes Under Constraints: A Convex Analytic Approach,” Advances in Applied
Probability, Vol. 44, pp. 774-793.
[DuP13] Dufour, F., and Piunovskiy, A. B., 2013. “The Expected Total Cost Criterion for Markov Decision
Processes Under Constraints,” Advances in Applied Probability, Vol. 45, pp. 837-859.
[Dud02] Dudley, R. M., 2002. Real Analysis and Probability, Cambridge Univ. Press, Cambridge.
[FeY08] Feinberg, E. A., and Yang, F., 2008. “On Polynomial Cases of the Unichain Classification Problem
for Markov Decision Processes,” Operations Research Letters, Vol. 36, pp. 527-530.
[Fei02] Feinberg, E. A., 2002. “Total Reward Criteria,” in E. A. Feinberg and A. Shwartz, (Eds.), Handbook
of Markov Decision Processes, Springer, N. Y.
[HeL99] Hernandez-Lerma, O., and Lasserre, J. B., 1999. Further Topics on Discrete-Time Markov Control
Processes, Springer, N. Y.
[Pal67] Pallu de la Barriere, R., 1967. Optimal Control Theory, Saunders, Phila; Dover, N. Y., 1980.
[Put94] Puterman, M. L., 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming, J.
Wiley, N. Y.
[ThS10] Thiery, C., and Scherrer, B., 2010. “Least-Squares λ-Policy Iteration: Bias-Variance Trade-off in
Control Problems,” in ICML’10: Proc. of the 27th Annual International Conf. on Machine Learning.
[Sch13] Scherrer, B., 2013. “Performance Bounds for Lambda Policy Iteration and Application to the Game
of Tetris,” J. of Machine Learning Research, Vol. 14, pp. 1181-1227.
[Str66] Strauch, R., 1966. “Negative Dynamic Programming,” Ann. Math. Statist., Vol. 37, pp. 871-890.
[Tsi94] Tsitsiklis, J. N., 1994. “Asynchronous Stochastic Approximation and Q-Learning,” Machine Learning,
Vol. 16, pp. 185-202.
[Whi79] Whittle, P., 1979. “A Simple Condition for Regularity in Negative Programming,” J. Appl. Prob.,
Vol. 16, pp. 305-318.
[Whi82] Whittle, P., 1982. Optimization Over Time, Wiley, N. Y., Vol. 1, 1982, Vol. 2, 1983.
[YuB12] Yu, H., and Bertsekas, D. P., 2012. “Weighted Bellman Equations and their Applications in Dynamic
Programming,” Lab. for Info. and Decision Systems Report LIDS-P-2876, MIT.
[YuB13a] Yu, H., and Bertsekas, D. P., 2013. “Q-Learning and Policy Iteration Algorithms for Stochastic
Shortest Path Problems,” Annals of Operations Research, Vol. 208, pp. 95-132.
[YuB13b] Yu, H., and Bertsekas, D. P., 2013. “On Boundedness of Q-Learning Iterates for Stochastic Shortest
Path Problems,” Math. of OR, Vol. 38, pp. 209-227.
[YuB13c] Yu, H., and Bertsekas, D. P., 2013. “A Mixed Value and Policy Iteration Method for Stochastic
Control with Universally Measurable Policies,” Lab. for Info. and Decision Systems Report LIDS-P-2905,
MIT; to appear in Math. of OR.
[Yu14] Yu, H., 2014. “On Convergence of Value Iteration for a Class of Total Cost Markov Decision Processes,” Technical Report, University of Alberta; arXiv preprint arXiv:1411.1459; SIAM J. Control and
Optimization, Vol. 53, pp. 1982-2016.