
Using Markov decision processes to optimise a
non-linear functional of the final distribution,
with manufacturing applications.
E.J. Collins
Department of Mathematics,
University of Bristol,
University Walk, Bristol BS8 1TW, UK.
Abstract. We consider manufacturing problems which can be modelled as
finite horizon Markov decision processes for which the effective reward function is either a strictly concave or strictly convex functional of the distribution
of the final state. Reward structures such as these often arise when penalty
factors are incorporated into the usual expected reward objective function.
For convex problems there is a Markov deterministic policy which is optimal, but for concave problems we usually have to consider the larger class
of Markov randomised policies. In the natural formulation these problems
cannot be solved directly by dynamic programming. We outline alternative
iterative schemes for solution and show how they can be applied in a specific
manufacturing example.
Keywords. Markov Decision Processes, Penalty, Non-linear reward
1 Introduction
1.1 Concave/convex effective rewards in manufacturing
Consider a manufacturing process where a number of items are processed
independently. Each item can be classified into one of a finite number of states
at each of a finite number of stages, and at each stage it is necessary for the
manufacturer to choose some appropriate action which affects the progress
of the item over the next stage. Finite horizon Markov decision processes
(MDPs) are commonly used to model such stochastic optimisation problems
because they capture the comparison of the uncertain benefits arising from
different possible control strategies.
In the standard formulation, the objective is to optimise the expected
value of some function of the final state of the process (or equivalently some
linear function of the distribution of the final state) together with the total rewards or costs incurred along the way. Problems like these are often
formulated with linear objective functions precisely because they are then
easy to solve with simple standard techniques. However, there are situations
where a more realistic assessment would take into account nonlinearities in
the benefits which accrue to the manufacturer.
We will consider problems where the different actions available at each
stage are effectively neutral with respect to their immediate cost or reward
(for instance, different actions may represent different settings on a machine,
or using different blends of equally expensive raw materials) and where the
objective is to optimise some non-linear function of the proportion of items
in each of the different possible states at final time T . We will assume that
the number of items to be produced is sufficiently large that we can equate
the proportion of items in each state i at time T under a given policy π with
the probability that an individual item is in state i under π at time T .
One situation where such non-linearities might arise is when a manufacturer wishes to maximise an expected final reward subject to various other
considerations or constraints. If one incorporates the constraints as additional penalty terms in the usual expected reward objective function, this
may lead to effective reward functions which are either strictly concave or
strictly convex functionals of the distribution of the final state. For example,
a policy might be judged both by its expected reward and by some measure
of its associated risk. If we use the variance of the resulting rewards as a
measure of the risk to the manufacturer, then this leads us to seek a policy
which maximises the variance penalized reward, in which some given fixed
multiple θ of the variance is incorporated as a penalty into the objective function. Mean-variance tradeoffs for processes with an infinite stream of rewards
have been studied by several authors, including Filar et al (1989), Huang &
Kallenberg (1994) and White (1994), using either average or discounted reward criteria. To see that a finite horizon version gives rise to optimising a
strictly convex functional of the distribution of the final state, we proceed as
follows. Consider a Markov decision process with no continuation rewards,
so the only reward is a terminal reward R at time T , where R(i) = ri , i ∈ E.
Let T denote the horizon and let ST denote the state at time T. The variance
penalized reward vπ associated with a policy π is given by
vπ = Eπ [R(ST )] − θ Varπ [R(ST )]
   = Σi ri xπ (i) − θ[Σi ri² xπ (i) − (Σi ri xπ (i))²]
   = φ(xπ ),
where xπ (i) = Pπ (ST = i), i ∈ E; where xπ is the vector with components
xπ (i); and where
φ(x) = Σi ri x(i) − θ Σi ri² x(i) + θ(Σi ri x(i))².
Without loss of generality (White (1994)) we can assume ri > 0, i ∈ E, so
that φ is a strictly convex function.
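As a quick numerical check of this identity, the following Python/NumPy sketch (our own illustration, with arbitrary values of r, θ and x rather than data from any particular application) evaluates φ(x) and compares it with the expected reward minus θ times the variance computed directly:

import numpy as np

# Illustrative terminal rewards r_i, penalty multiplier theta and a candidate
# final distribution x; none of these values are taken from the paper.
r = np.array([1.0, 2.0, 4.0])
theta = 0.5
x = np.array([0.2, 0.5, 0.3])

mean_reward = r @ x                                        # E[R(S_T)] when S_T has distribution x
phi = mean_reward - theta * (r**2 @ x - mean_reward**2)    # the functional phi(x) above

# Cross-check: phi(x) should equal E[R(S_T)] - theta * Var[R(S_T)].
variance = ((r - mean_reward) ** 2) @ x
assert np.isclose(phi, mean_reward - theta * variance)
print(phi)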
Another example, this time with a concave effective reward, occurs when
a manufacturer wishes to maximise the usual expected final reward but is also
subject to some externally imposed penalty ci if the proportion of times the
process produces an item with final state ST = i differs from some prescribed
value di . This leads to maximising the strictly concave penalised reward
function
φ(x) = Σi ri x(i) − Σi ci (x(i) − di )².
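Since this penalised reward reappears in the computations of Section 6, it is worth recording the short calculation behind its concavity: the gradient has components ∂φ/∂x(i) = ri − 2ci (x(i) − di ), and the Hessian is the diagonal matrix with entries −2ci , so φ is strictly concave provided each penalty factor ci is strictly positive. The gradient expression is exactly the terminal reward used by the best response construction of Section 4 and in the examples of Section 6.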
The purpose of this paper is to alert those working in the manufacturing
area to the fact that these non-linear problems can be solved using methods
based on familiar techniques like dynamic programming (and simple nonlinear programming), to outline the methods and work through some simple
examples, and to encourage the identification of further manufacturing problems for which such methods are appropriate.
1.2 MDP model
Our basic model will be that of a discrete Markov decision process (MDP)
with finite horizon T, a finite set of states E = {0, 1, 2, . . . , N } and a finite
action set A. If the current state is i and action a is chosen, then the next
state is determined according to the transition probabilities pij (a), j ∈ E.
The time points t = 0, 1, . . . , T − 1 form decision epochs. Let St denote the
state at times t = 0, 1, . . . T and assume throughout that the process starts
with some fixed distribution q for the initial state S0 . The only reward is a
final reward which (taking the penalty into account) is a strictly concave or
convex function φ of the final distribution. We let x denote the distribution
of the final state ST and let xπ denote the final distribution under a policy π.
The focus of this paper is on how to characterise an optimal final distribution
x∗ which maximises φ(x) and how to characterise and compute a policy π ∗
which achieves this final distribution x∗ .
In manufacturing applications, one might want to classify items according to different characteristics at different stages, and the set of appropriate
actions would likewise differ. Although we present the theory in a time and
state homogeneous form, all the results apply with only notational changes
to the case where the set of states Et and transition probabilities ptij (a) may
also vary with time, and where the action set Ait may vary with state and
time. All that is required is the Markov property of the transition to the
next state, given the current time, state and action.
1.3 Non-standard solutions
For standard finite horizon Markov decision processes, dynamic programming
is the natural method of finding an optimal policy and computing the corresponding optimal reward. A linear final reward function φ can be expressed
as the expected value of some function of the final state alone, and the problem then reduces to a standard MDP. When φ is a non-linear function this is
not possible and the example in Section 2 below indicates that, in the natural
formulation, the finite horizon value functions fail to satisfy the principle of
optimality so that dynamic programming on its own is not directly applicable
as a method of solution.
Similarly, for standard finite horizon Markov decision processes an optimal
policy can always be found in the class of Markov deterministic policies. We
shall see that this remains true when φ is convex, but that it no longer holds
when φ is concave and we need to explore more carefully what kind of policies
are then optimal. In particular, we will find it useful to allow consideration
of randomised policies and mixtures of policies (where we say µ is a mixture
of the finite set of policies π1 , . . . , πK with mixing probabilities λ1 , . . . , λK if
an individual using µ makes a preliminary choice of a policy πk from the set
{π1 , . . . , πK } according to the respective probabilities λ1 , . . . , λK and then
uses that policy πk throughout at decision epochs t = 0, 1, . . . , T − 1).
1.4 Related work
This paper follows the approach taken in Collins (1995) and Collins and McNamara (1995). A computational example of the application of these ideas
can be found in McNamara et al (1995). The mathematical motivation and
terminology used have elements in common with work in the area of probabilistic constraints and variance criteria in Markov decision processes (see
White (1988) for a survey). Some of the results on the equivalence of different
spaces of outcomes in Section 3 were originally developed in the context of
probabilistic constraints and we will quote the relevant results from Derman
(1970). Such problems also give rise to examples where only randomised optimal policies exist (Kallenberg (1983)). Sobel (1982) gives an example in
the context of variance models for which the principle of optimality does not
apply but does not develop any general method of solution.
1.5 Outline of remaining sections
Section 2 uses a small example to show the difficulties caused by uncritical
application of dynamic programming to this type of problem. In Section 3
we describe the geometry of the space of outcomes (i.e. final distributions)
and outline how this geometric description can be used to make qualitative
statements about the optimal final distribution(s) and the corresponding optimal policies. Section 4 introduces the best response method for updating
our current guess at an optimal policy and final distribution. Solution algorithms for concave (and, briefly, for convex) effective rewards are discussed
in Section 5. Finally, in Section 6, we show how these ideas can be applied
to a simple manufacturing example.
2 Example
Purely for motivational purposes, we consider a problem first described in
Collins and McNamara (1995), with the following parameters:
E = {0, 1}, A = {a, b}, q = (1/2, 1/2),
pij (a) = [ 6/16 10/16 ; 12/16 4/16 ],   pij (b) = [ 1/16 15/16 ; 14/16 2/16 ],
φ(x) = 1 − x(0)² − x(1)² (concave case).
Let T = 1 and consider the choice of action at the single decision epoch t = 0
(so a policy is just a single decision rule).
2.1 Optimality principle
Taking φ as the final reward function and using the usual optimality equation, we obtain the following table which seems to indicate that the optimal
decision rule is (a, a) (i.e. action a in state i = 0 and action a in state i = 1).
Current state   Action   ‘Final distribution’ x   φ(x)
      0           a          (6/16, 10/16)        120/256 ∗
      0           b          (1/16, 15/16)         30/256
      1           a          (12/16, 4/16)         96/256 ∗
      1           b          (14/16, 2/16)         56/256

2.2 Best deterministic policy
However, the following table, showing the result of using the four possible
deterministic decision rules, indicates that (b, b) is the best deterministic
decision rule.
Decision rule   Final distribution x   φ(x)
   (a, a)          (9/16, 7/16)        252/512
   (a, b)          (10/16, 6/16)       240/512
   (b, a)          (13/32, 19/32)      247/512
   (b, b)          (15/32, 17/32)      255/512 ∗
The reason for the difference is as follows. The distributions in the first
table are not actually final distributions, but conditional distributions, conditioned on the current state. The overall final distribution is a linear combination of these conditional distributions, but the reward from the overall
final distribution is not the corresponding linear combination of the rewards
from the individual conditional distributions since φ is a non-linear function.
2.3 Randomised policies and mixtures
In contrast to the usual case for standard finite horizon MDP’s, there is no
deterministic decision rule that does as well as the randomised decision rule
((1/2, 1/2), (3/4, 1/4)) - under which, in state 0 action a is taken with probability
1/2 and action b with probability 1/2, and in state 1 action a is taken with
probability 3/4 and action b with probability 1/4. This gives the best possible
final distribution of (1/2, 1/2).
Alternatively, the same final distribution can be achieved with the mixture
that uses the deterministic policy corresponding to the decision rule (a, b)
with probability λ1 = 0.2 and uses the deterministic policy corresponding to
the decision rule (b, b) with probability λ2 = 0.8.
3 Optimal policies
In this section we describe the geometry of the space of outcomes (i.e. final distributions) and outline how this geometric description can be used to
make qualitative statements about the optimal final distribution(s) and the
corresponding optimal policies.
3.1 Geometry of the space of outcomes
Let X denote the space of outcomes, so X is the set of all final distributions x
achievable under some general (possibly history-dependent and randomised)
policy. We can think of each x ∈ X as a point in the N-dimensional simplex
ΣN ⊂ RN+1 , where ΣN = {x : 0 ≤ x(i) ≤ 1, i ∈ E; Σi x(i) = 1}. Important
subsets of X are XMR (the set of all final distributions achievable under
some Markov randomised policy) and the finite set XMD (containing all final
distributions achievable under some Markov deterministic policy).
The following theorem (Derman (1970), p91 Theorem 2) shows how X
can be represented in terms of these subsets. The first part of the theorem is
due to Derman and Strauch (1966) and the second part to Derman himself.
The proof given for part (ii) is for an average cost formulation, but is easily
adapted to the finite horizon case using the fact that, for standard finite
horizon problems, there is a Markov deterministic policy which is optimal.
Theorem
(i) X = XMR .
(ii) X = convex hull (XMD ).
From the theorem we see that, in looking for an optimal policy, there is
no loss in restricting ourselves to Markov randomised policies. We also see
that X is a convex polytope, each of whose vertices is a point in the finite set
XMD corresponding to the final distribution of some Markov deterministic
policy. In our description we will assume, for ease of presentation, that these
are the only points of XMD in the boundary of X.
3.2 Type/number of optimal policies
The following lemmas address the number of optimal policies and the type of
policy that is optimal. They follow immediately from the characterisation of
X above and standard results on the maximum of a strictly convex/concave
function over a closed bounded convex set (e.g. Luenberger (1973), Section
6.4).
Lemma
If φ is strictly concave then φ achieves its maximum over X at a unique
final distribution x∗ . However, x∗ may either be a vertex of X (in which
case it is achieved by a Markov deterministic policy) or x∗ may not be a
vertex (in which case we must usually resort to a Markov randomised policy
to achieve x∗ ).
Lemma
If φ is strictly convex then each maximising final distribution is a vertex
of X and hence corresponds to a Markov deterministic policy. However, φ
may achieve its maximum over X at more than one point.
4 The ‘best response’ method
It can be extremely difficult to determine X explicitly from the parameters of the problem. Moreover, even if we knew X and could find a maximising
point x∗ directly, it would still not be easy to determine a policy π that has x∗ as its final
distribution.
Our approach will therefore be to look for sensible and efficient ways of
searching through points in X and identifying corresponding policies - ways
that do not rely on an explicit representation of X but essentially start out
by treating X as unknown.
4.1 Best response
Our basic tool will be a method we call the best response method (see Collins
and McNamara (1995) for a fuller description). Given any point x0 and corresponding policy π0 this method allows us to easily identify a new updated
best response point x̂0 and corresponding policy π̂0 which are candidates for
improvements on x0 and π0 .
Definition
Given a point x0 ∈ X (and a corresponding policy π0 ) we say π̂0 , with
corresponding final distribution x̂0 ≡ xπ̂0 , is a best response to π0 if
∇φ(x0 )x ≤ ∇φ(x0 )x̂0
for all x ∈ X.
4.2 Computation
We cannot use dynamic programming directly to find a policy maximising φ.
However we can use it to find the policy π̂0 which provides the best response
to a given initial point x0 .
Define the real valued function Rx0 on E by taking Rx0 (i) = ∇φ(x0 )(i),
where ∇φ(x0 )(i) is the ith component of ∇φ(x0 ). Then
∇φ(x0 )x = Σi Rx0 (i)x(i) = Ex [Rx0 (ST )],
where ST is the state at time T and Ex denotes expectation conditioned on
ST having distribution x. Maximising ∇φ(x0 )x is then a standard MDP with
the same state space E, action space A and transition probabilities pij (a) as
before, but now with an expected reward criterion, where the terminal reward
function Rx0 is a function of the final state alone. Dynamic programming
back against this final reward results in a policy π̂0 corresponding to the best
response and we can then work forward using the known initial distribution q,
the known transition probabilities pij (a) and the known policy π̂0 to compute
the distribution x̂0 = xπ̂0 .
Note that the problem specification for finding the best response depends,
through Rx0 , on the point x0 , and different points will generate different
MDP’s.
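To make this concrete, the following Python/NumPy sketch (our own illustration, not code from the paper) carries out both passes for a problem with time-homogeneous transition matrices P[a]: a backward dynamic programming pass against the terminal reward Rx0 = ∇φ(x0 ), followed by a forward pass under the resulting Markov deterministic policy to recover x̂0 . The data at the end are those of the Section 2 example; all function and variable names are our own, and for time-dependent transition probabilities the matrices would simply be indexed by t as well.

import numpy as np

def best_response(x0, grad_phi, q, P, T):
    # Terminal reward R_x0(i) is the i-th component of grad phi(x0).
    R = np.asarray(grad_phi(x0), dtype=float)
    V = R.copy()
    n = len(q)
    policy = np.zeros((T, n), dtype=int)
    # Backward dynamic programming against the linearised final reward.
    for t in range(T - 1, -1, -1):
        Q = np.stack([P[a] @ V for a in range(len(P))])   # Q[a, i] = sum_j p_ij(a) V(j)
        policy[t] = Q.argmax(axis=0)                      # greedy Markov deterministic rule
        V = Q.max(axis=0)
    # Forward pass: propagate the initial distribution q under the greedy policy.
    x = np.asarray(q, dtype=float)
    for t in range(T):
        x = np.array([sum(x[i] * P[policy[t, i]][i, j] for i in range(n)) for j in range(n)])
    return policy, x    # the best response policy and its final distribution x_hat

# Section 2 example: actions a = 0, b = 1, horizon T = 1, q = (1/2, 1/2).
P = [np.array([[6/16, 10/16], [12/16, 4/16]]),
     np.array([[1/16, 15/16], [14/16, 2/16]])]
q = np.array([0.5, 0.5])
grad_phi = lambda x: -2 * x          # gradient of phi(x) = 1 - x(0)^2 - x(1)^2
x0 = np.array([9/16, 7/16])          # final distribution of the decision rule (a, a)
print(best_response(x0, grad_phi, q, P, T=1))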
4.3 Geometric interpretation
Let φ be the effective reward function and consider the surface z = φ(x)
defined on X (or on ΣN ). An optimal distribution x∗ corresponds to a
maximum value of z. Now consider the tangent hyperplane to the surface at
the point x0 . The equation of this new surface is z = ψx0 (x) = ∇φ(x0 )(x −
x0 ) + φ(x0 ). The function ψx0 (x) is a linear function of x which provides a
local approximation to φ(x) at the point x0 .
Points with ∇φ(x0 )x = constant form contours in ΣN of the surface
z = ψx0 (x), so x̂0 lies on the highest contour of ψx0 which intersects X.
Assuming ∇φ(x0 ) ≠ 0, the points at which this contour intersects X will
be boundary points of X. Moreover, the policy π̂0 selected by dynamic
programming will be a Markov deterministic policy, so the final distribution
associated with it will be a point in XMD . If the only point of intersection
of the highest contour and X is a single vertex, then this point in XMD will
be the point x̂0 selected as the best response. If the contour intersects X at
more than one vertex, then any of them may be selected.
The important properties of x̂0 are that it is a vertex of X, that it corresponds to the final distribution of some known Markov deterministic policy
π̂0 and that all points of X lie in the half-space containing x0 defined by the
hyperplane ∇φ(x0 )x = ∇φ(x0 )x̂0 .
5 Algorithms
5.1 Concave rewards
When φ is strictly concave on the closed bounded convex set X, then standard
results (e.g. Luenberger (1973)) show that x∗ maximises φ(x) over X if and
only if ∇φ(x∗ )x ≤ ∇φ(x∗ )x∗ for all x ∈ X.
This motivates the following basic version of a policy improvement type
algorithm for computing an optimal policy and the corresponding optimal
final distribution.
Algorithm
1. Choose some initial policy π0 and compute the corresponding point x0 .
2. Generate a sequence of policies and points π1 , π2 , π3 , . . . and x1 , x2 , x3 , . . .
by taking xn+1 = x̂n and πn+1 = π̂n , n = 0, 1, . . . .
3. Stop if xn+1 = xn . In this case ∇φ(xn )xn = ∇φ(xn )xn+1 ≥ ∇φ(xn )x
for all x ∈ X (by the construction of xn+1 ), so that xn is the optimal final
distribution and πn is a corresponding optimal policy.
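A minimal sketch of this basic scheme in Python is given below (our own illustration). It assumes a best_response callable of the kind sketched in Section 4.2, returning a policy together with its final distribution, and it merely detects, rather than resolves, the cycling behaviour discussed next.

import numpy as np

def iterate_best_response(x0, best_response, max_iter=100):
    # Basic scheme: x_{n+1} = x_hat_n, pi_{n+1} = pi_hat_n, stopping at a fixed point.
    visited = {tuple(np.round(x0, 10))}
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        policy, x_new = best_response(x)
        if np.allclose(x_new, x):
            return policy, x_new               # fixed point: optimal when phi is strictly concave
        key = tuple(np.round(x_new, 10))
        if key in visited:
            raise RuntimeError("best response cycles; switch to the modified algorithm")
        visited.add(key)
        x = x_new
    raise RuntimeError("no fixed point found within max_iter iterations")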
Although it is intuitively attractive, the above algorithm as stated can
run into problems with improvement and convergence (in particular, cycling
and identifiability of optimal randomised policies). One way of dealing with
these problems is motivated as follows. Let P be the convex hull of K linearly
independent vertices x1 , . . . , xK in X, let y be the point at which φ(x)
achieves its maximum over x ∈ P , and let ŷ denote the best response to
y. Then y = x∗ if and only if ŷ lies in the face of P generated by x1 , . . . , xK .
For example, if the best response cycles between two vertices x1 and x2 ,
then one can break out of the cycle by finding the point y = λx1 + (1 − λ)x2
maximising φ(x) along the line segment joining x1 and x2 and using this as
the starting point of the next iteration. If ŷ = x1 or x2 then, from above,
y = x∗ and a mixture of deterministic policies achieving x∗ is given by using
π1 with probability λ and π2 with probability 1 − λ.
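The line search itself is a one-dimensional optimisation and can be delegated to any standard routine. The sketch below (our own illustration, using scipy) is applied, purely as an example, to the final distributions of the decision rules (a, b) and (b, a) from the Section 2 example:

import numpy as np
from scipy.optimize import minimize_scalar

def break_two_cycle(x1, x2, phi):
    # Maximise phi(lambda * x1 + (1 - lambda) * x2) over lambda in [0, 1].
    res = minimize_scalar(lambda lam: -phi(lam * x1 + (1 - lam) * x2),
                          bounds=(0.0, 1.0), method='bounded')
    lam = res.x
    return lam, lam * x1 + (1 - lam) * x2

phi = lambda x: 1 - x[0]**2 - x[1]**2
x1 = np.array([10/16, 6/16])          # final distribution of the rule (a, b) in Section 2
x2 = np.array([13/32, 19/32])         # final distribution of the rule (b, a) in Section 2
print(break_two_cycle(x1, x2, phi))   # lambda is roughly 3/7 and y is roughly (1/2, 1/2)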
More generally, Collins and McNamara (1995) show that it is possible
to construct a modified algorithm incorporating the best response method,
which produces a strict improvement at each iteration, which converges to
optimality in a finite number of iterations and where the optimal policy can
be identified even when it is a Markov randomised policy. The algorithm is
computationally more demanding, but it does indicate at least one systematic
way to proceed in cases where heuristic modifications may fail, in particular
when x∗ is not a vertex so that the corresponding optimal policy is not
deterministic. A full description of the modified algorithm, with proofs of
convergence to the optimal solution, can be found in Collins and McNamara
(1995), but the basis of the algorithm is as follows.
Assume simple iteration of the best response algorithm has identified
N + 1 linearly independent vertices x0 , . . . , xN in X. Consider an (N + 1)-dimensional
vector λ = (λ0 , . . . , λN ) and for these fixed vertices set g(λ) =
φ(λ0 x0 + . . . + λN xN ). Find the values λ∗0 , . . . , λ∗N solving the (relatively small
dimensional) non-linear programming problem of maximising g(λ) subject to
the constraints λj ≥ 0, j = 0, . . . , N , and Σj λj = 1, using one of the standard
routines available or easily implemented methods such as those in Luenberger
(1973). Let P denote the convex hull of x0 , . . . , xN . Then the maximum of
φ over P is achieved at y = λ∗0 x0 + . . . + λ∗N xN .
If all the λ∗j are strictly positive then y = x∗ and one can stop. If one
or more of the λ∗j are zero then find the point ŷ which is the best response
to y. If ŷ is not in P then use it to replace one of the vertices with a zero
coefficient λ∗j and start again. If ŷ is in P then again y = x∗ and one can
stop.
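The non-linear programming step is low dimensional and can be handed to any standard constrained optimiser; the methods in Luenberger (1973) suffice. As an illustration only, the sketch below uses scipy's SLSQP routine to maximise g(λ) = φ(Σj λj xj ) over the simplex of mixing weights:

import numpy as np
from scipy.optimize import minimize

def maximise_over_hull(vertices, phi):
    # vertices: list of the final distributions x_0, ..., x_N identified so far.
    V = np.array(vertices, dtype=float)                 # shape (number of vertices, |E|)
    K = len(vertices)
    objective = lambda lam: -phi(lam @ V)               # minimise -g(lambda)
    constraints = [{'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0}]
    res = minimize(objective, np.full(K, 1.0 / K), method='SLSQP',
                   bounds=[(0.0, 1.0)] * K, constraints=constraints)
    lam_star = res.x
    return lam_star, lam_star @ V                       # optimal weights and y = sum_j lambda_j x_j

# e.g. maximise_over_hull([x0, x1, x2], phi) for vertices produced by the best response method.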
Once one has identified x∗ (along with the final set of vertices x0 , . . . , xN ,
the corresponding policies π0 , . . . , πN and the weights λ∗0 , . . . , λ∗N ) one can
construct an optimal mixture µ∗ of Markov deterministic policies by using
each πj with probability λ∗j .
Alternatively, if one wants the optimal policy in the form of a Markov
randomised policy, one can use the known transition probabilities under each
policy πj to calculate the quantities
αt (i, a) = Σj λ∗j Pπj (St = i, action at t = a) / Σj λ∗j Pπj (St = i).
Let π ∗ be the Markov randomised policy that takes action a with probability
αt (i, a) if in state i at time t. Then it follows from Derman (1970), p91
Theorem 1, that π ∗ results in exactly the same final distribution as µ∗ and
hence π ∗ gives a Markov randomised policy which is optimal.
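Computing the quantities αt (i, a) is mechanical once the weights and the deterministic policies are known. In the Python sketch below (our own illustration) policies[j][t, i] holds the action of πj in state i at time t, the matrices P[a] are again taken to be time-homogeneous, and the state distributions of the individual policies are propagated forward alongside the calculation:

import numpy as np

def randomised_from_mixture(policies, weights, q, P, T):
    n, n_actions = len(q), len(P)
    alpha = np.zeros((T, n, n_actions))
    # state_probs[j][i] holds P_{pi_j}(S_t = i); start from the initial distribution q.
    state_probs = [np.asarray(q, dtype=float) for _ in policies]
    for t in range(T):
        num = np.zeros((n, n_actions))
        den = np.zeros(n)
        for lam, f, xs in zip(weights, policies, state_probs):
            for i in range(n):
                num[i, f[t, i]] += lam * xs[i]     # lambda_j * P_{pi_j}(S_t = i, action f_j(i, t))
                den[i] += lam * xs[i]              # lambda_j * P_{pi_j}(S_t = i)
        alpha[t] = num / np.where(den > 0, den, 1.0)[:, None]    # avoid division by zero
        # Propagate each policy's state distribution forward one step.
        state_probs = [np.array([sum(xs[i] * P[f[t, i]][i, j] for i in range(n)) for j in range(n)])
                       for f, xs in zip(policies, state_probs)]
    return alpha      # alpha[t, i, a] approximates alpha_t(i, a)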
In Section 6 we will discuss the implementation of these two equivalent
methods of optimal control in the context of a manufacturing example.
5.2 Convex rewards
The problem is intrinsically more difficult when φ is convex, since any or all
of the vertices of X may provide an optimal final distribution. Collins (1995)
shows how the best response method can be used as part of an algorithm
which successively approximates X from within by a finite sequence of convex
polytopes X1 ⊂ . . . ⊂ XM , which have vertices in common with X. The
steps in the algorithm can be briefly outlined as follows.
Use the best response method to generate N + 1 linearly independent
vertices of X which then define an initial polytope X1 . Generate a sequence
of polytopes X1 , X2 , . . . by using the best response method at each stage to
find a new vertex of X and taking the next polytope to be the convex hull of
the vertices identified so far. Hence generate a sequence of points x1 , x2 , . . .
with φ(x1 ) ≤ φ(x2 ) ≤ . . . , by taking xm to be the vertex (with corresponding
known Markov deterministic policy πm ) which maximises φ over the known
vertices of Xm . Stop when there are no vertices of X exterior to the current
polytope (say XM ). Then XM = X and φ(xM ) = φ(x∗ ). Details can be
found in Collins (1995).
6 Manufacturing Examples
Consider a manufacturing process where a large number of items are processed individually. Each item can be classified into one of N + 1 states at
each of three stages (t = 1, 2, 3), together with an initial unprocessed stage
(t = 0) and a final finished product stage (t = 4 = T ). In general we will
speak of these as quality states (zero is lowest, N is highest), but they
might also represent cosmetic differences that did not affect the quality but
had unequal demand. The states of incoming unprocessed items are independent, but the overall proportion of incoming items in each state is known.
At each stage of the process, the operator in charge can either choose to let
the process run (action 1) or can choose to intervene to make an adjustment
to the process or to the item (action 2). If the process is allowed to run,
the item has some probability of deteriorating by one or more quality levels
during that stage and some (smaller) probability of improving. Otherwise
it stays at the same quality level. The probability of a jump by k levels
decreases with k. The probability of a transition to a different state also
decreases with time to reflect the fact that changes are less likely as the
process nears completion. If the operator intervenes, there is an appreciable
probability that the quality improves by one level during that stage, and an
appreciable probability that the intervention will spoil the process and the
item will revert to quality level zero. Otherwise the quality level stays the
same. Again the probability of change after intervention decreases over the
lifetime of the process.
The overall reward depends on the proportion of items in each quality
level at the finished product stage. There is a reward for each item, which
increases with quality and is substantially higher for items of level N . There
is also a quadratic penalty depending on the proportion of items at each level
reflecting perhaps long term future lost sales due to changing perception of
the quality of the product by current or potential buyers.
Note that for the modified algorithm the computational requirements for
each iteration split into two parts – a non-linear programming part which
depends only on the size of N (and is independent of T and A) and a dynamic
programming part which scales linearly with T for fixed N and A. Although
T is small in the example, the computations involved would be relatively
unaffected if T was much larger.
6.1 The model
We can summarise the model below, in the notation of the section on MDP
models in the introduction. The choice of parameter values and the specific
reward function is designed to illustrate the possibilities of the approach
rather than necessarily reflecting realistic values for a particular problem.
Time horizon: T = 4.
State space: E = {0, 1, . . . , N }.
Action space: A = {1, 2}.
Initial distribution: q, where q(i) = 2(i + 1)/[(N + 1)(N + 2)], i = 0, . . . , N.
Transition probabilities: For t = 0, . . . , T − 1,
pti,i−2 (1) = 0.1ρt ; pti,i−1 (1) = 0.4ρt ;
pti,i+1 (1) = 0.2ρt ; pti,i+2 (1) = 0.05ρt ;
ptij (1) = 0, j ≠ i − 2, i − 1, i, i + 1, i + 2;
ptii (1) = 1 − Σj≠i ptij (1);
pti0 (2) = 0.1ρt ; pti,i+1 (2) = 0.3ρt ;
ptij (2) = 0, j ≠ 0, i, i + 1;
ptii (2) = 1 − Σj≠i ptij (2);
where ρt = T /(t + T ) reflects the diminishing probability of change.
Final reward: φ(x) = Σi ri x(i) − Σi ci (x(i) − di )²,
where ri = 9(i + 1)/(N + 1), i = 0, . . . , N − 1; rN = 10,
where ci = 10 + (N − i), i = 0, . . . , N ,
and where di = 0, i = 0, . . . , N − 1; dN = 1, so state N is the preferred
final state.
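The model is straightforward to set up in code, and the Python sketch below (our own illustration) builds q, the time-dependent transition matrices and the penalised reward φ together with its gradient. How probability mass that would fall outside {0, . . . , N} is treated at the boundary states is not specified above; the choice made here (adding it to the probability of staying put) is our own assumption, so the numbers obtained from this sketch need not reproduce those quoted below exactly.

import numpy as np

def build_example_model(N, T=4):
    q = np.array([2 * (i + 1) / ((N + 1) * (N + 2)) for i in range(N + 1)])
    r = np.array([9 * (i + 1) / (N + 1) for i in range(N)] + [10.0])
    c = np.array([10.0 + (N - i) for i in range(N + 1)])
    d = np.array([0.0] * N + [1.0])
    phi = lambda x: r @ x - c @ (x - d) ** 2
    grad_phi = lambda x: r - 2 * c * (x - d)

    P = []   # P[t][a] is the matrix p^t_ij(a); actions: 0 = run (action 1 in the text), 1 = intervene (action 2)
    for t in range(T):
        rho = T / (t + T)
        run = np.zeros((N + 1, N + 1))
        adjust = np.zeros((N + 1, N + 1))
        for i in range(N + 1):
            for k, p in [(-2, 0.1 * rho), (-1, 0.4 * rho), (1, 0.2 * rho), (2, 0.05 * rho)]:
                if 0 <= i + k <= N:
                    run[i, i + k] = p        # jumps that would leave the state space are dropped (our assumption)
            run[i, i] = 1.0 - run[i].sum()
            adjust[i, 0] += 0.1 * rho        # intervention may spoil the item (back to level 0)
            if i < N:
                adjust[i, i + 1] += 0.3 * rho
            adjust[i, i] += 1.0 - adjust[i].sum()
        P.append([run, adjust])
    return q, P, phi, grad_phi

# Example 1 below uses N = 4:  q, P, phi, grad_phi = build_example_model(4)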
In the following examples x0 , x1 , . . . , will denote the sequence of final
distributions generated by the best response algorithm; fk (i, t) will denote
the action specified by a given Markov deterministic policy πk when an item
is in state i at time t; and αt (i, a) will denote the probability which a corresponding Markov randomised policy π assigns to taking action a when an
item is in state i at time t.
6.2 Example 1: N = 4
To apply the simple best response algorithm, start with some arbitrary policy,
say the policy π0 of never intervening (so f0 (i, t) = 1 for all i and t). Use the
known transition matrices under π0 to compute
x0 = (0.077, 0.137, 0.200, 0.263, 0.323)
where x0 (i) denotes the probability an item is in state i (i.e. the proportion of
items in state i) at time T under π0 . Define the real valued function Rx0 on
E by taking Rx0 (i) = ∇φ(x0 )(i), where here ∇φ(x0 )(i) = ri −2ci (x0 (i)−di ).
Use dynamic programming to compute a Markov deterministic policy π1 =
{f1 (i, t)} which is optimal for a standard MDP problem with terminal reward
function Rx0 . Now repeat the process starting with π1 , and so on.
For this example we find that x1 = x2 = (0.100, 0.133, 0.172, 0.233, 0.362),
so the algorithm has converged, x1 is the optimal final distribution and (the
Markov deterministic policy) π1 is an optimal policy, where π1 is given below.
f1 (i, t)   t = 0   1   2   3
  i = 0        1    1   1   1
      1        1    1   1   2
      2        2    1   1   1
      3        2    2   2   2
      4        2    2   2   2

6.3 Example 2: N = 5
Start with some arbitrary policy π0 , say again the policy of never intervening,
and proceed as in Example 1 above. This time the best response sequence
cycles between the two points x1 and x2 .
Using a simple line search, we find that the maximum value of φ(x) along
the line λx1 + (1 − λ)x2 occurs at y = λ∗ x1 + (1 − λ∗ )x2 where λ∗ = 0.177.
We define the real valued function Ry as before and look for a policy that
is optimal in a problem with terminal reward Ry . We find the policy π1 is
again optimal for this terminal reward, so there is no point in X with higher
reward than y.
The optimal final distribution is thus
y = (0.078, 0.091, 0.146, 0.163, 0.210, 0.312)
and this can be achieved using a mixture of the Markov deterministic policies
π1 and π2 given below. Under the mixture an initial choice of policy is made
for each item (π1 being chosen with probability λ∗ and π2 being chosen with
probability 1 − λ∗ ) and that policy is then used throughout the processing of
that item. Alternatively, the same optimal final distribution can be achieved
using the corresponding Markov randomised policy.
f1 (i, t)   t = 0   1   2   3
  i = 0        1    2   2   2
      1        1    1   2   2
      2        1    1   1   1
      3        2    1   1   1
      4        2    2   2   2
      5        2    2   2   2

f2 (i, t)   t = 0   1   2   3
  i = 0        1    1   2   2
      1        1    1   2   2
      2        1    1   1   2
      3        2    1   1   1
      4        2    2   2   2
      5        2    2   2   2

6.4 Example 3: N = 6
Proceeding as before, the best response algorithm starts cycling at x3 . Using
the modified algorithm over successive iterations, we find the maximum value
of φ(x) over the convex hull of x0 , . . . , x5 occurs at y = λ∗0 x0 + . . . + λ∗5 x5 , where
λ∗ = (0, 0, 0, 0.135, 0.759, 0.106).
Furthermore, the optimal policy against the terminal reward generated
by y is π3 , so ŷ = x3 is in the convex hull of x0 , . . . , x5 . Thus y =
(0.062, 0.074, 0.107, 0.138, 0.152, 0.192, 0.274) is the optimal final distribution,
and it can be achieved by the mixture which uses the Markov deterministic policies π3 , π4 and π5 given below with respective probabilities λ∗3 =
0.135, λ∗4 = 0.759 and λ∗5 = 0.106.
f3 (i, t)   t = 0   1   2   3
  i = 0        2    2   2   2
      1        2    2   2   2
      2        1    1   2   2
      3        1    1   1   2
      4        1    1   1   1
      5        2    2   2   2
      6        2    2   2   2

f4 (i, t)   t = 0   1   2   3
  i = 0        2    2   2   2
      1        2    2   2   2
      2        1    1   1   1
      3        1    1   1   1
      4        1    1   1   1
      5        2    2   2   2
      6        2    2   2   2

f5 (i, t)   t = 0   1   2   3
  i = 0        2    2   2   2
      1        2    2   2   2
      2        1    1   1   1
      3        1    1   1   2
      4        1    1   1   1
      5        2    2   2   2
      6        2    2   2   2

αt (i, 1)   t = 0       1       2       3
  i = 0     0.000   0.000   0.000   0.000
      1     0.000   0.000   0.000   0.000
      2     1.000   1.000   0.858   0.873
      3     1.000   1.000   1.000   0.746
      4     1.000   1.000   1.000   1.000
      5     0.000   0.000   0.000   0.000
      6     0.000   0.000   0.000   0.000
Alternatively, the same optimal final distribution can be achieved using
the corresponding Markov randomised policy π ∗ for which the quantities
αt (i, 1) are given above and αt (i, 2) = 1 − αt (i, 1).
6.5 Implementation of the optimal policy
In Examples 2 and 3 there are two equivalent optimal controls –
the optimal mixture µ∗ (of Markov deterministic policies π1 , . . . , πK ) and
the optimal Markov randomised policy π ∗ (which, in state i at time t, takes
action a with probability αt (i, a)). There are simple intuitive interpretations
of how these might be implemented in a manufacturing context.
If a single operator has responsibility for every stage (time) of the process
then it may be easiest to implement the optimal control in the form of µ∗
by having the operator choose, for each successive item, a policy
πj with probability λ∗j and then use that Markov deterministic
policy throughout the processing of the given item.
Alternatively, when a different operator has responsibility for the action
taken at each stage of the process and it is not convenient to attempt to
co-ordinate the actions of each operator for a given item, one can implement
the optimal Markov randomised policy π ∗ by allowing each operator at each
stage t to act independently and use the randomised decision rule which takes
action a with probability αt (i, a) if the current item is in state i.
References
Collins, E.J. (1995) Finite horizon variance penalized Markov decision process. Department of Mathematics, University of Bristol, Report no.
S-95-11. Submitted to OR Spektrum.
Collins, E.J. and McNamara, J.M. (1995) Finite-horizon dynamic optimisation when the terminal reward is a concave functional of the distribution
of the final state. Department of Mathematics, University of Bristol,
Report no. S-95-10. Submitted to Adv. Appl. Prob.
Derman, C. (1970) Finite State Markovian Decision Processes. Academic
Press, New York.
Derman, C. and Strauch, R. (1966) A note on memoryless rules for controlling sequential control processes. Ann. Math. Statist. 37, 276–278.
Filar, J.A., Kallenberg, L.C.M. and Lee, H.M. (1989) Variance-penalised
Markov decision processes. Math Oper Res 14, 147–161.
Huang, Y. and Kallenberg, L.C.M. (1994) On finding optimal policies for
Markov decision chains: a unifying framework for mean-variance tradeoffs. Math Oper Res 19, 434–448.
Kallenberg, L.C.M. (1983) Linear Programming and Finite Markov Control
Problems. Mathematical Centre, Amsterdam.
Luenberger, D.G. (1973) Introduction to Linear and Nonlinear Programming. Addison Wesley, Reading.
McNamara, J.M., Webb, J.N. and Collins, E.J. (1995) Dynamic optimisation in fluctuating environments. Proc Roy Soc, B 261, 279–284.
Sobel, M.J. (1982) The variance of discounted Markov decision processes. J
Appl Prob 19, 774–802.
White, D.J. (1988) Mean, variance and probabilistic criteria in finite Markov
decision processes: a review. J Optimization Theory and Applic 56, 1–
29.
White, D.J. (1994) A mathematical programming approach to a problem in
variance penalised Markov decision processes. OR Spektrum 15, 225–
230.