Optimal Control of Markovian Jump Processes with Different Information Structures
Dissertation
submitted in partial fulfillment of the requirements for the degree of Dr. rer. nat.
at the Faculty of Mathematics and Economics
of Universität Ulm

submitted by
Jens Thorsten Winter
2008
Acting Dean: Prof. Dr. Frank Stehling
First referee: Prof. Dr. Ulrich Rieder
Second referee: Prof. Dr. Dieter Kalin
Date of the doctoral examination: 15 October 2008
Contents
1 Introduction
1.1 Motivation
1.2 Contributions of this Thesis
1.3 Structure of this Thesis
2 The Incomplete Information Model
2.1 Construction of the Processes
2.2 Admissible Controls
2.3 The Optimization Problem
3 The Reduction to a Model with Complete Information
3.1 Filter Equation for the Unobservable Process
3.2 The Reduced Problem
4 Solving the Reduced Model
4.1 The Generalized HJB-Equation and Verification Technique
4.2 Solution via a Transformed MDP
5 Application to Parallel Queueing with Incomplete Information
5.1 The Model and the Complete Information Case
5.2 Unknown Service Rates: the Bayesian Case
5.2.1 The Estimator Process
5.2.2 The Reduced MDP
5.2.3 A Characterization of the Value Function and the Optimal Control
5.2.4 The Symmetric Case
5.2.5 Complete Information about One Service Rate
5.2.6 The Optimal Control in a Model with Reward-Function
5.3 Unknown Length of the Queues: the 0-1-Observation
5.3.1 Threshold-Strategy
5.3.2 Double-Threshold-Strategy
6 Conclusion
A Tools for Theorem 3.5
B Proof of Theorem 5.25
Bibliography
List of Tables and List of Figures
German Summary
1 Introduction
In this opening section we motivate the thesis and point out the impact of this area of research. Then we summarize our main results and compare this work with the existing literature. Finally, we give an outline of the thesis.
1.1 Motivation
Technological advances in recent years, especially in the information technology sector, have led to increasingly complex relationships between different systems. One may think of how the computer revolutionized the banking sector or how the internet brought together people from all over the world. At the moment one billion computers are installed worldwide, compared to nearly 600 million units in 2001, and by 2014 the number of installed PCs is expected to surpass two billion units. Understanding the dependencies between systems requires a large amount of resources and is hence expensive. Thus decisions are very often made under a lack of information. This is not only true for large and abstract systems; it arises in everyday life as well. For example, most discount stores do not distinguish at checkout between the different types of yoghurt one buys. The sales clerk only counts the number of yoghurts and then scans one of them. This procedure saves time and hence money, but the inventory system no longer knows how many yoghurts of each type are in store. Therefore the store manager has to reorder based on incomplete information. Due to this inaccuracy a retailer loses an estimated 10% of its current profit. On the other hand, this loss is at least compensated by the cost reduction due to the simplified scanning procedure (see Raman et al. (2001)).
Bensoussan et al. (2003) consider a one-product inventory. In their paper, the inventory manager does not observe the inventory level, except when the store is empty. Such models are called zero-balance-walk models and arise very often, since counting all unsold products is expensive compared to this strategy.
Another example is the following parallel queueing system:
[Figure 1: Parallel queueing model. Two queues with arrival rates λ1 and λ2 share one server, which serves queue i with intensity µi.]
There are two types of customers, one at each queue, which arrive at the queues randomly with rate $\lambda_i$, $i = 1,2$. For each waiting customer a cost at rate $c_i$ arises. The server has to decide how to split his service capacity between the queues. His goal is to minimize the expected
discounted total cost, where the random service time depends on an intensity $\mu_i$. If the server knows which type of customer is waiting in each queue, the optimal decision is always to serve the queue with the higher value $c_i\mu_i$. But if the server does not know which kind is waiting in which queue, the optimal decision is not clear anymore. We will come back to this example in section 5.2.4, where it is referred to as the symmetric case.
Such queueing models appear, for example, in the data flow of the internet, in machine production and in call centers. Of course, there are other applications with incomplete information in finance, physics and biology. Practitioners deal with this lack of information mostly by experience: they estimate the unknown parameters somehow and apply a reasonable strategy to minimize the costs. But then they do not know whether their suggested control is optimal. Very often they do not even know how well their policy performs compared to the optimal one.
In the last few years several problems with incomplete information have been studied mathematically, and for some an explicit solution has been obtained. There are usually two types of lack of information. In the first case, one is not able to observe a background process which influences the randomness of the system. In the second case, one is not able to observe every state completely. We combine both aspects in this work. To our knowledge, nobody has considered a model in which only groups of states are observable, that is, where the observation of the states is coarsened to an observation of groups of states.
The central questions of this thesis are:
• How to model such group observations?
• How does the optimal value of the optimization problem with incomplete information
depend on the available observations?
• How to transform control problems with incomplete information to problems with
complete information?
• How to solve them?
1.2 Contributions of this Thesis
In this dissertation we consider a three-component process: an environment process, a state process and an observation process. The environment process influences the state process in two ways: first, it influences its random behaviour; second, changes in the environment may lead to immediate changes in the state process. Thus common jumps of both processes are explicitly possible, a feature that has mostly been excluded in the literature. Notice that not every state of the state process is completely observable, but only groups of states are. For this purpose we introduce the notion of an information structure, which is a disjoint partition of the state space; this is done here for the first time. The observation process is modelled according to this information structure. The idea that groups of states are observable has only been used by Bensoussan et al. (2003), in a very special way. Usually, the
observation is given by a process whose intensities depend on the unobservable process (see e.g. Liptser and Shiryayev (2004a), Liptser and Shiryayev (2004b), Brémaud (1981), Borisov and Stefanovich (2005), Borisov (2007), Ceci and Gerardi (2000)). Note that our setup also contains the Bayesian model and the Hidden-Markov-Model (e.g. Elliott et al. (1997)).
We want to control our system such that the expected costs, depending on this three-component state process, are minimized. We first discuss the impact of information on the minimal cost. Then we derive the filter equation for the unobservable part of the state process. With this result we transform the optimization problem with incomplete information into one with complete information. This transformed model is a piecewise-deterministic control problem. The equivalence between these two models will be shown in the reduction theorem, a step that is often neglected in the literature.
Based on the reduced model we derive solution procedures for piecewise-deterministic models. First we extend the Hamilton-Jacobi-Bellman equation (HJB) to a generalized version involving the Clarke derivative. This idea is based on Clarke (1983) and Davis (1993). We state sufficient and necessary conditions for the optimality of a control; in particular we extend the classical verification technique. The advantage of this generalization is that the strong differentiability condition can be weakened to local Lipschitz continuity and regularity of the value function, which is fulfilled for our value function. As a second solution technique we define a time-discrete Markovian-Decision-Process (MDP) whose value function coincides with the one of the control problem. Additionally, one can construct from an optimal policy of the MDP an optimal control for the original model. Here we prove the existence of an optimal policy, and we extend and link the ideas of Davis (1993), Dempster (1989) and Forwick (1997) to the uniformization technique and to discounted problems, which they did not consider. The benefit of this reduction is the opportunity to use all results from classical MDP-theory.
Using all our developed results, we investigate a parallel queueing model under incomplete information (see the illustrating example in section 1.1). Under complete information it is well known that the cµ-rule is optimal. We prove that this strategy is also optimal if the information structure is fine enough. If the service rates are Bayesian, we show the separation property of the value function and prove the existence of an optimal control which serves one queue exclusively almost surely. Further, we prove in the symmetric case that the certainty equivalence principle holds, or in other words, the optimal strategy is a control limit rule with threshold 1/2. If only one service rate is unknown, the optimal control always serves one queue exclusively, and we additionally state sufficient conditions for the optimal control. As a by-product we extend results of the time-continuous bandit problem theory (e.g. El Karoui and Karatzas (1997) and Kaspi and Mandelbaum (1998)) to the case where one arm is completely observable. In contrast to Lin and Ross (2003), Honhon and Seshadri (2007) and Hordijk and Koole (1992), we do not only propose well-performing strategies or verify optimality with numerical methods (e.g. Altman et al. (2003), Altman et al. (2004)); all our results are proven rigorously. Numerical studies are carried out for the case where the number of waiting customers is not completely observable.
1.3 Structure of this Thesis
We start in section 2 by defining our state process consisting of three components: the environment process $(Z_t)$, the state process $(X_t)$ itself and the observation process $(Y_t)$. All three processes are pure (Markovian) jump processes and are strongly connected to each other. In particular, the environment process influences the intensities of the state process, and changes in the environment may lead to immediate changes in the state process. Only groups of states of the state process are observable. Thus the notion of an information structure is introduced in definition 2.1. We explicitly construct the intensity matrices and martingale representations for $(Z_t, X_t, Y_t)$ in (2.2), (2.5) and (2.8). As in Elliott et al. (1997) and Miller et al. (2005) our state process takes values in a finite set, in contrast to Ceci and Gerardi (2000), where the state process takes values in $\mathbb{R}^d$, which is more suitable for financial applications. Examples illustrate this construction and special cases. In the following section 2.2 we define the set of admissible controls for our optimization problem stated in section 2.3. Controls are only allowed to depend on the available information and not on the unobservable parts of the processes. The optimization problem (P) is based on this three-component process, minimizing the expected discounted cost over an infinite horizon. We clarify in this context how the optimal value of this problem depends on the available information and discuss the value of information. But since not all processes are observable, the problem is not directly solvable.
Thus in section 3 we formulate an optimization problem equivalent to (P) in the sense that optimal values and optimal strategies are the same, as stated in the reduction theorem 3.13. The benefit of these efforts is that the reduced problem is one with complete information, since its state process is measurable with respect to the available information, and so the reduced optimization problem $(P_{red})$ is directly solvable. The reduction is done by estimating the unobservable environment and state process with the help of conditional probabilities $p_t$. We compute in theorem 3.5 an explicit representation for these conditional probabilities and point out the connection to filter theory. We see that the conditional probabilities are piecewise-deterministic processes. Additionally, we discuss their behaviour between two jumps, resulting from a change in the observation, and at jump points. In section 3.2 we state the connection between the original problem (P) under incomplete information and the reduced problem with complete information, summarized in the above-mentioned reduction theorem. Finally we prove properties of the value function of the reduced problem, in particular its concavity in $p$.
After finding an equivalent, directly solvable optimization problem, we discuss two solution methods in section 4. We prove in theorem 4.3 that the value function is a solution of the generalized HJB-equation. Here the strict differentiability condition is weakened to differentiability in the sense of Clarke. Due to its concavity, the value function is differentiable and regular in the sense of Clarke. Then we generalize in theorem 4.4 the verification procedure to the context of the Clarke derivative. Whereas the verification technique does not make use of the piecewise-deterministic behaviour of the conditional probabilities, we do
this in section 4.2. There we formulate a (time-discrete) MDP whose value function coincides with the value function of $(P_{red})$; this is proven in theorem 4.7. Additionally, we show there that one can construct an optimal control of $(P_{red})$ from an optimal policy of the MDP. Finally we answer the question about the existence of optimal controls in theorem 4.14.
In section 5 we consider a parallel queueing model with one server as introduced in section 1.1. First we consider the complete information case in section 5.1, where the optimality of the cµ-rule is proven. This control is also optimal if the information structure is fine enough. We then apply the theory developed in sections 3 and 4 to solve this problem for different information structures. In section 5.2 we consider the case of Bayesian service rates, where each service rate is one of two values but unknown. We derive an explicit representation for the conditional probabilities and discuss the monotonicity behaviour and technical characteristics of the estimator process in section 5.2.1. Then we define the corresponding complete information MDP in section 5.2.2 and find a closed formula for the value function in section 5.2.3. This representation is quite similar to the one under complete information (see theorem 5.13). Additionally we prove that it is always optimal to serve one queue exclusively almost everywhere. These results are specialized to two models: the symmetric case in section 5.2.4 and the case where one service rate is known in section 5.2.5. In theorem 5.20 we prove that the optimal control in the symmetric case is a control limit rule with control limit $p^* = 1/2$. If only one service rate is unknown, the optimal control serves one queue exclusively all the time, as proven in theorem 5.22. In both cases the stay-on-a-winner property, famous in bandit models, is observed. In section 5.2.6 we consider a reward criterion; there the optimal (pure) control is an index-strategy. In section 5.3 the information structure is given as a 0-1-observation; in particular, the server can only distinguish at queue 1 whether or not more than two customers are waiting. We investigate this case numerically.
2 The Incomplete Information Model
In this section we introduce our state process and define our optimization problem (P). We do this in a constructive way. The state process $(Z_t, X_t, Y_t)$ consists of three components: the environment process $(Z_t)$, which influences the parameters and the random behaviour of the system, the state process $(X_t)$, which depends on the system parameters, and the observation process $(Y_t)$, which is connected to the former two. The last component is the only observable one; the other two are only observable via the observation process, with the help of the information structure defined below. After introducing this three-component process in section 2.1, we define the class of admissible controls in section 2.2. Apart from technical assumptions, the main requirement on admissible controls is that they are only functions of the observation process and hence do not depend on the unobservable parts of the system. In the last section 2.3 we introduce our optimization problem under partial information (P) and state some first properties of the optimal value in dependence on the information structure.
2.1 Construction of the Processes
Our state process consists of three components: the environment process, the state process and the observation process. All three processes exist on a given measurable space $(\Omega, \mathcal{F})$ and are strongly connected to each other. The environment process $Z = (Z_t)$ takes values in a finite set with $d$ different values, where for mathematical reasons we identify each state $z_\mu$ with a unit vector $g_\mu$ of $\mathbb{R}^d$, $\mu = 1,\dots,d$. Consequently the state space of $Z = (Z_t)$ with $Z_t = (Z_t^1,\dots,Z_t^d)^\top$ is given by
$$S_Z := \{g_1,\dots,g_d\}.$$
Let $N_t^Z(\mu,\nu)$ count the number of jumps of $Z$ in $[0,t]$ from $g_\mu$ to $g_\nu$ (with $\mu \neq \nu$), which occur with (predictable) intensity $q_{\mu\nu}^Z Z_{t-}^\mu \geq 0$. Then $Z_t$ (starting at time 0 in $z_0 \in S_Z$) is uniquely defined by
$$Z_t^\mu := z_0^\mu + \sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d} N_t^Z(\nu,\mu) - \sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d} N_t^Z(\mu,\nu), \qquad \mu = 1,\dots,d. \tag{2.1}$$
Analogously to Brémaud (1981), Elliott et al. (1997) and Rogers and Williams (2003) we can give the following martingale representation for the environment process $(Z_t)$:
$$dZ_t = Q^Z Z_t\,dt + dM_t^Z, \tag{2.2}$$
where $(Q^Z)^\top = (q_{\mu\nu}^Z)$ with $q_\mu^Z := -q_{\mu\mu}^Z := \sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d} q_{\mu\nu}^Z$ is called the generator or intensity matrix of $Z_t$. $M_t^Z$ is a $d$-dimensional martingale, consisting of the compensated counting processes $N_t^Z(\mu,\nu)$; in detail
$$M_t^Z(\mu) := \sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d}\Big( N_t^Z(\nu,\mu) - N_t^Z(\mu,\nu) - \int_0^t \big( q_{\nu\mu}^Z Z_s^\nu - q_{\mu\nu}^Z Z_s^\mu \big)\,ds \Big), \qquad \mu = 1,\dots,d.$$
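For intuition, the construction above can be simulated directly: the process holds its state for an exponential time whose rate is the sum of the current row of intensities, and the next state is drawn proportionally to the rates $q_{\mu\nu}^Z$. The following is a minimal sketch, assuming a fixed (uncontrolled) generator; the two-state rate matrix is a hypothetical placeholder.

```python
import numpy as np

def simulate_Z(qZ, z0, T, rng):
    """Simulate a Markovian jump process Z on {g_1, ..., g_d} up to time T.

    qZ[mu, nu] is the jump intensity q^Z_{mu nu} from g_mu to g_nu
    (off-diagonal entries only; the diagonal is ignored).
    Returns lists of jump times and visited state indices.
    """
    d = qZ.shape[0]
    t, state = 0.0, z0
    times, states = [0.0], [z0]
    while True:
        rates = qZ[state].copy()
        rates[state] = 0.0                      # no self-jumps
        total = rates.sum()
        if total == 0.0:                        # absorbing state
            break
        t += rng.exponential(1.0 / total)       # exponential holding time
        if t > T:
            break
        state = rng.choice(d, p=rates / total)  # next state ~ relative rates
        times.append(t)
        states.append(state)
    return times, states

rng = np.random.default_rng(0)
qZ = np.array([[0.0, 1.0],
               [0.5, 0.0]])                     # hypothetical 2-state rates
print(simulate_Z(qZ, 0, 10.0, rng))
```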
$Z_t$ characterizes the state of the environment at time $t$. One may think of $Z_t$ as the economic or political situation or as an external influence on the state process $X = (X_t)$, the second component of the whole state process, which takes values in the finite set
$$S_X := \{e_1,\dots,e_n\},$$
where $e_i$ is the $i$-th unit vector of $\mathbb{R}^n$ (as above, we model the finite set of values by the set of unit vectors for mathematical reasons).
The environment $Z_t$ influences the state process in two ways. First, changes in the environment may lead directly to changes in the state of the process $X_t$. For example, one may think of the environment as the economic situation and of the state process as the credit rating of a company, both rated on a scale of good, medium and bad. If the environment falls from good to medium or bad, the credit rating of the company also drops by one category (if possible). This connection is modelled as follows. Assume that $Z_{\tau-} = g_\mu$ and $X_{\tau-} = e_i$ and that the process $Z$ jumps at time $\tau$ to state $g_{\nu^*}$. If this jump leads to a jump of $X$ to state $e_{j^*} \neq e_i$, where $j^*$ is unique, in particular $Z_\tau = g_{\nu^*}$ and $X_\tau = e_{j^*}$, then we set
$$\delta_{ij^*}^{\mu\nu^*} := 1.$$
On the other hand, if $X$ remains in $e_i$ (thus the jump of $Z$ from $g_\mu$ to $g_{\nu^*}$ has no influence on $X$), then we set
$$\delta_{ij}^{\mu\nu^*} := 0 \quad \forall j \in \{1,\dots,n\}.$$
Before continuing the construction, let us state that for fixed $i$ and $\mu$, due to the uniqueness of $\nu^*$ and $j^*$,
$$\sum_{j=1}^{n}\sum_{\nu=1}^{d} \delta_{ij}^{\mu\nu} \in \{0,1\}.$$
It is 1 if and only if there exist $\nu^*$ and $j^*$ such that $\delta_{ij^*}^{\mu\nu^*} = 1$.
With this function we are now able to model the second component $X_t$ of our state process $(Z_t, X_t, Y_t)$. As above we characterize $X = (X_t)$ in terms of $N_t^X(i,j)$, the process counting the jumps of $X$ in $[0,t]$ from $e_i$ to $e_j$ for $i \neq j$. That means
$$X_t^i := x_0^i + \sum_{\substack{j=1\\ j\neq i}}^{n} N_t^X(j,i) - \sum_{\substack{j=1\\ j\neq i}}^{n} N_t^X(i,j), \tag{2.3}$$
where
$$N_t^X(i,j) := \tilde{N}_t^X(i,j) + \sum_{\mu=1}^{d}\sum_{\nu=1}^{d} \delta_{ij}^{\mu\nu}\, N_t^Z(\mu,\nu)\, X_{t-}^i, \qquad i \neq j. \tag{2.4}$$
Here we assume that $\tilde{N}_t^X(i,j)$ and $N_t^Z(\mu,\nu)$ do not jump at the same time for all $i,j,\mu,\nu$. With $\tilde{N}_t^X(i,j)$ we model the jumps of $X_t$ which happen independently of a jump of $Z_t$, although the intensity of $\tilde{N}_t^X(i,j)$ depends on $Z_t$. This is the second, indirect, influence of $Z_t$ on $X_t$. In particular, the intensity of $\tilde{N}_t^X(i,j)$ is defined by
$$\tilde{q}_{ij}^X(Z_t) := \sum_{r=1}^{d} \tilde{q}_{ij,r}^X\, Z_{t-}^r X_{t-}^i \geq 0.$$
That means, if $Z_{t-} = g_\mu$ the intensity of $\tilde{N}_t^X(i,j)$ is given by $\tilde{q}_{ij,\mu}^X X_{t-}^i$. Since $\tilde{N}_t^X(i,j)$ and $Z_t$ do not jump at the same time, it is immediately true that
$$\tilde{q}_{ij}^X(Z_t) = \sum_{r=1}^{d} \tilde{q}_{ij,r}^X\, Z_t^r X_{t-}^i.$$
The interpretation of $\sum_{\mu=1}^{d}\sum_{\nu=1}^{d} \delta_{ij}^{\mu\nu}\, N_t^Z(\mu,\nu)\, X_{t-}^i$ in (2.4) is the following: if $X_{t-} = e_i$ and $Z_{t-} = g_\mu$ and if $Z$ now jumps to $g_{\nu^*}$ and this jump has some direct influence on $X$, meaning that $X_t$ directly jumps to $e_{j^*}$, then we set $\delta_{ij^*}^{\mu\nu^*} = 1$. If the jump of $Z_t$ has no direct influence on $X$ (hence $X_t = X_{t-} = e_i$), then $\delta_{ij}^{\mu\nu} = 0$ for all $\nu \in \{1,\dots,d\}$ and the second term in (2.4) vanishes.
Summarizing, the intensity of $N_t^X(i,j)$ is given by
$$q_{ij}^X(Z_t)\, X_{t-}^i := \sum_{r=1}^{d}\Big( \tilde{q}_{ij,r}^X + \sum_{\nu=1}^{d} \delta_{ij}^{r\nu}\, q_{r\nu}^Z \Big)\, Z_{t-}^r X_{t-}^i.$$
Note that the intensity is predictable by construction. As in (2.2) we get the following martingale representation for $X_t$:
$$dX_t = Q^X(Z_t)\, X_t\,dt + dM_t^X, \tag{2.5}$$
where $(Q^X(Z))^\top = (q_{ij}^X(Z))$ with $q_i^X(Z) := -q_{ii}^X(Z) := \sum_{j\neq i} q_{ij}^X(Z)$ is the intensity matrix of $X_t$. With the following abbreviation
$$q_{ij,\mu}^X := \tilde{q}_{ij,\mu}^X + \sum_{\nu=1}^{d} \delta_{ij}^{\mu\nu}\, q_{\mu\nu}^Z =: \tilde{q}_{ij,\mu}^X + \tilde{q}_{ij,\mu}^Z$$
we can write
$$Q^X(Z) = \sum_{\mu=1}^{d} \big( \tilde{Q}^X_\mu + \tilde{Q}^Z_\mu \big)\, Z^\mu,$$
where $(\tilde{Q}^X_\mu)^\top = (\tilde{q}_{ij,\mu}^X)$ with $\tilde{q}_{i,\mu}^X := -\tilde{q}_{ii,\mu}^X := \sum_{j\neq i} \tilde{q}_{ij,\mu}^X$, and $(\tilde{Q}^Z_\mu)^\top := (\tilde{q}_{ij,\mu}^Z)$ with $\tilde{q}_{ij,\mu}^Z := \sum_{\nu=1}^{d} \delta_{ij}^{\mu\nu}\, q_{\mu\nu}^Z$ and $\tilde{q}_{i,\mu}^Z := -\tilde{q}_{ii,\mu}^Z := \sum_{j\neq i} \tilde{q}_{ij,\mu}^Z$. $M_t^X$ is an $n$-dimensional martingale defined by
$$M_t^X(i) := \sum_{\substack{j=1\\ j\neq i}}^{n}\Big( N_t^X(j,i) - N_t^X(i,j) - \int_0^t \big( q_{ji}^X(Z_s)\, X_s^j - q_{ij}^X(Z_s)\, X_s^i \big)\,ds \Big), \qquad i = 1,\dots,n.$$
From the construction we see that we formally have to write $M_t^{X,Z}(i)$, but for simplicity we drop this dependence in our notation.
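To make the two kinds of influence concrete, the following sketch simulates the coupled pair $(Z_t, X_t)$: X jumps either on its own, with rates $\tilde{q}_{ij,\mu}^X$ selected by the current environment, or together with Z whenever the common-jump indicator $\delta_{ij}^{\mu\nu}$ fires. All numeric rates and the choice of $\delta$ below are hypothetical placeholders, not values from the thesis.

```python
import numpy as np

d, n = 2, 3
qZ = np.array([[0.0, 0.4], [0.6, 0.0]])                 # q^Z_{mu nu}
qX = np.random.default_rng(1).uniform(0, 1, (n, n, d))  # q~^X_{ij,mu}
for i in range(n):
    qX[i, i, :] = 0.0                                   # no self-jumps of X
delta = np.zeros((n, n, d, d))                          # common-jump indicator
delta[2, 1, 0, 1] = 1.0        # e_3 -> e_2 whenever Z jumps g_1 -> g_2

def step(z, x, rng):
    """One jump of the joint process; returns (holding_time, z', x')."""
    z_rates = qZ[z]                                # environment jump rates
    x_rates = qX[x, :, z]                          # indirect X jump rates
    total = z_rates.sum() + x_rates.sum()
    tau = rng.exponential(1.0 / total)
    if rng.uniform() < z_rates.sum() / total:      # Z jumps ...
        nu = rng.choice(d, p=z_rates / z_rates.sum())
        j = np.argmax(delta[x, :, z, nu]) if delta[x, :, z, nu].any() else x
        return tau, nu, j                          # ... possibly dragging X along
    j = rng.choice(n, p=x_rates / x_rates.sum())   # X jumps alone
    return tau, z, j

rng = np.random.default_rng(2)
z, x, t = 0, 2, 0.0
for _ in range(5):
    h, z, x = step(z, x, rng)
    t += h
    print(f"t={t:.3f}  Z=g_{z+1}  X=e_{x+1}")
```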
As mentioned before, not every state of $X_t$ may be observable; only groups of states of $X_t$ are. This situation can be observed very often in the real world. For example, a server cannot count the customers waiting in the queue; he only knows whether fewer than 10, fewer than 20 or more than 20 customers are waiting. Or in inventory systems the storekeeper only knows whether the store is empty or not. This fact is modelled in the following definition.
Definition 2.1 Let $m \in \mathbb{N}$. We call $(I(k), k = 1,\dots,m)$ an information structure if
(i) $\emptyset \neq I(k) \subset \{1,\dots,n\}$ for all $k$,
(ii) $I(k) \cap I(l) = \emptyset$ for all $k \neq l$ and
(iii) $\bigcup_{k=1}^{m} I(k) = \{1,\dots,n\}$.
(i) means that each element of an information structure contains at least one state of $X_t$; from (ii) and (iii) we see that $(I(k), k = 1,\dots,m)$ is a disjoint partition of the state space $S_X$. An element $I(k)$ of an information structure is identified with $f_k$, and $f_k$ is called a representative of $e_i$ if $i \in I(k)$, where $f_k$ denotes the $k$-th unit vector of $\mathbb{R}^m$. By the definition of the information structure, each $e_i$ has exactly one representative, and $m \leq n$ is finite.
Remark 2.2
a) If one assumes additionally that some states of $(Z_t)$ are observable directly (and not only via $(X_t)$), then one has to define the extended state process $\tilde{X}_t := (X_t, Z_t) \in S_X \times S_Z$.
b) The information structure can also be modelled as a function
$$g : \{e_1,\dots,e_n\} \to \{f_1,\dots,f_m\},$$
where each state $e_i$ is assigned to its representative $f_k$. $g(x) = x$ corresponds to complete information; $g(x) \equiv f_1$ corresponds to no information.
c) If we skip the disjointness assumption (ii), we possibly have more than one representative $f_k$ of a state $e_i$. In this case a decision has to be made at which time state $e_i$ is represented by $f_k$ and at which time by $f_l$. This decision could be modelled by a random variable. In this case the following construction remains true, but becomes more complicated. In our opinion it is more realistic to assume the uniqueness of the representative.
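In view of remark 2.2 b), an information structure is just a map from states to representatives. The following minimal sketch (the partition is a hypothetical 0-1-observation on four states) shows how a path of X is coarsened into the observation path of Y, which jumps only when the representative changes; compare the construction of $N_t^Y$ in (2.6) below.

```python
from typing import List

# Hypothetical information structure on n = 4 states (0-1-observation):
# I(1) = {e_1}, I(2) = {e_2, e_3, e_4}, encoded with 0-based state indices.
I: List[List[int]] = [[0], [1, 2, 3]]

def g(state: int) -> int:
    """Representative f_k of state e_i: the unique k with i in I(k)."""
    for k, group in enumerate(I):
        if state in group:
            return k
    raise ValueError("not a partition: state has no representative")

# Coarsening an X-path into the observation path Y.
x_path = [0, 1, 2, 3, 0]          # hypothetical visited states of X
y_path = [g(x_path[0])]
for x in x_path[1:]:
    if g(x) != y_path[-1]:        # Y jumps only on a change of representative
        y_path.append(g(x))
print(y_path)                     # [0, 1, 0]
```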
The observation process $Y = (Y_t)$, the third component of $(Z_t, X_t, Y_t)$, is strongly connected to the information structure and is modelled as a pure jump process on the same measurable space $(\Omega, \mathcal{F})$. $Y$, taking values in $S_Y := \{f_1,\dots,f_m\}$, is characterized by the processes $N_t^Y(k,l)$ $(k \neq l)$ (compare with (2.1) and (2.3)), where $N_t^Y(k,l)$ counts the jumps of $Y$ in $[0,t]$ from $f_k$ to $f_l$; in detail
$$Y_t^k := y_0^k + \sum_{\substack{l=1\\ l\neq k}}^{m} N_t^Y(l,k) - \sum_{\substack{l=1\\ l\neq k}}^{m} N_t^Y(k,l)$$
and $N_t^Y(k,l)$ is defined by
$$N_t^Y(k,l) := \sum_{i\in I(k)}\sum_{j\in I(l)} N_t^X(i,j)\, Y_{t-}^k. \tag{2.6}$$
Thus the (predictable) intensity of $N_t^Y(k,l)$ is given by
$$q_{kl}^Y(Z_t, X_t)\, Y_{t-}^k := Y_{t-}^k \sum_{i\in I(k)}\sum_{j\in I(l)} q_{ij}^X(Z_{t-})\, X_{t-}^i = \sum_{i\in I(k)}\sum_{j\in I(l)} \sum_{r=1}^{d}\Big( \tilde{q}_{ij,r}^X + \sum_{\nu=1}^{d} \delta_{ij}^{r\nu}\, q_{r\nu}^Z \Big)\, Z_{t-}^r X_{t-}^i Y_{t-}^k. \tag{2.7}$$
The martingale representation for $Y_t$ is then given by
$$dY_t = Q^Y(Z_t, X_t)\, Y_t\,dt + dM_t^Y, \tag{2.8}$$
where $(Q^Y(Z,X))^\top = (q_{kl}^Y(Z,X))$ with $q_k^Y(Z,X) := -q_{kk}^Y(Z,X) := \sum_{\substack{l=1\\ l\neq k}}^{m} q_{kl}^Y(Z,X)$ and
$$M_t^Y(k) := \sum_{\substack{l=1\\ l\neq k}}^{m}\Big( N_t^Y(l,k) - N_t^Y(k,l) - \int_0^t \big( q_{lk}^Y(Z_s,X_s)\, Y_s^l - q_{kl}^Y(Z_s,X_s)\, Y_s^k \big)\,ds \Big), \qquad k = 1,\dots,m. \tag{2.9}$$
As for $M_t^X$, we skip the dependence of $M_t^Y$ on $X$ and $Z$ in the notation. By construction of the process $(Z_t, X_t, Y_t)$ it is obvious that this process is a Markov process (although this statement is a little imprecise, since we have not yet introduced a probability measure; we will do this for the controlled processes in section 2.2). After the construction of this three-component process we present five special models in the following.
Complete-Observation-Model
In the following three models we skip the environment process $(Z_t)$ for notational simplicity; the models can be developed similarly with the environment process. The best case is that $(X_t)$ is completely observable, so everything in the model is observable. This is attained by
$$I(i) = \{e_i\} \quad \forall i \in \{1,\dots,n\}$$
and hence $m = n$. By the construction of the processes we immediately get $Y_t \equiv X_t$.
Group-Observation-Model
In most cases only groups of states of $(X_t)$ are observable, which leads to a coarsening of the observation. In this case we have
$$|I(k)| \geq 2 \quad\text{for at least one } k$$
and hence $m < n$. A very special information structure is the one where one state of $X_t$ is observed completely and all others are represented by one common representative. This type of model arises for example in inventory systems; see e.g. Bensoussan et al. (2003), where the authors consider a zero-balance-walk inventory model. In such a model the storekeeper only knows whether the inventory system is empty or not, and in the second case the exact number of products in store is not known. We will call this model the 0-1-observation; it is attained by
$$I(1) = \{e_1\} \quad\text{and}\quad I(2) = \{e_2,\dots,e_n\}$$
and hence $m = 2$.
No-Observation-Model
Assume now that the observer cannot distinguish between any states of $(X_t)$; consequently his only information is that $(X_t)$ takes values in $S_X$ and starts in $x_0$. This extreme case is achieved by
$$I(1) = \{e_1,\dots,e_n\}$$
and hence $m = 1$.
Hidden-Markov-Model
In the Hidden-Markov-Model (HMM) the generator of the state process $(X_t)$ takes values in a finite set and changes over time according to the unobservable environment process $(Z_t)$. This model is considered in Elliott et al. (1997). In the classical setting, $(X_t)$ is completely observable and $(Z_t)$ and $(X_t)$ do not have common jumps. Setting in our model
$$I(k) = \{e_k\} \quad\text{and}\quad \delta_{ij}^{\mu\nu} \equiv 0,$$
we get $Y_t \equiv X_t$ and $dX_t = Q^X(Z_t)X_t\,dt + dM_t^X$, and we obtain this case.
Bayesian-Model
The setting in a Bayesian-Model is similar to the one in the Hidden-Markov-Model. Here the generator of $(X_t)$ is also unknown within a finite set; the only thing known is its initial distribution (sometimes called the a-priori distribution) $p_0$. In contrast to the HMM, the generator does not change over time. Again the state process $(X_t)$ is assumed to be observable. Hence in our model we have to set
$$I(i) = \{e_i\}, \qquad \delta_{ij}^{\mu\nu} \equiv 0 \quad\text{and}\quad Q^Z \equiv 0.$$
Remark 2.3
a) So far we have not included noise $\tilde{N}_t^Y(k,l)$ in our model. Such noise leads to incorrect observations: changes in the observation state that are not induced by changes in the unobserved process $(X_t)$. But all the following calculations are possible with slight modifications if we define
$$N_t^Y(k,l) := \tilde{N}_t^Y(k,l) + \sum_{i\in I(k)}\sum_{j\in I(l)} N_t^X(i,j)\, Y_{t-}^k \tag{2.10}$$
under the assumption that $\tilde{N}_t^Y(k,l)$ and $N_t^X(i,j)$ do not jump at the same time.
b) All previous and following calculations remain true if we allow time-dependent intensities.
c) By the construction of the stochastic differential equations (2.2), (2.5) and (2.8) it is clear that all these equations admit a unique solution.
d) The construction can be extended to a countable state space, e.g. $\mathbb{N}_0$, under some technical assumptions: the conservativeness of the generators and the finiteness of the diagonal elements of the intensity matrices.
Definition 2.4 We denote by $\mathcal{F}_t^{Z,X,Y} := \sigma(Z_s, X_s, Y_s,\ s \leq t)$ the (augmentation of the) filtration generated by the process $(Z_s, X_s, Y_s)_{s\in[0,t]}$. Similarly we define $\mathcal{F}_t^Y := \sigma(Y_s,\ s \leq t)$, the (augmentation of the) filtration generated by the observation process up to time $t$, and we call this filtration the information available at time $t$. Since a filtration generated by a point process is right-continuous (see Brémaud (1981) and Last and Brandt (1995)), all filtrations satisfy the usual conditions.
By construction of the process $(Z_t, X_t, Y_t)$ it is clear that there is a one-to-one relation between the processes themselves and the counting processes $N_t^Z(\cdot,\cdot)$, $N_t^X(\cdot,\cdot)$, $N_t^Y(\cdot,\cdot)$. Hence we have
$$\mathcal{F}_t^Y = \sigma\big(N_s^Y(k,l),\ s \leq t,\ k,l = 1,\dots,m\big). \tag{2.11}$$
Remark 2.5
a) Obviously $\mathcal{F}_t^Y \subset \mathcal{F}_t^{Z,X,Y}$ holds.
b) To keep things simple we identify the given $\sigma$-algebra $\mathcal{F}$ by $\mathcal{F} = \sigma(Z_t, X_t, Y_t,\ t \geq 0)$.
c) Modelling the available information as a filtration seems natural, since a filtration is a monotonically increasing family of $\sigma$-algebras: the longer the observation horizon, the more information is available.
In contrast to Miller et al. (2005) and Ceci and Gerardi (2000), our observation process is more than a process correlated with the unobservable process or the number of its jumps. We introduce here a much more general framework for systems with unobservable components, including unobservable parameters. With this modelling we also cover the Bayesian and the classical Hidden-Markov cases mentioned in the examples above.
Definition 2.6 We call an information structure $(I(k), k = 1,\dots,m)$ finer than another information structure $(I'(k), k = 1,\dots,m')$ if
(i) for all $k = 1,\dots,m$ there exists one $k' \in \{1,\dots,m'\}$ with $I(k) \subset I'(k')$ and
(ii) there exists at least one $k \in \{1,\dots,m\}$ with $I(k) \neq I'(k')$ for all $k' = 1,\dots,m'$.
As an immediate consequence we get m > m′ . The following theorem gives the connection
to the corresponding filtrations.
Theorem 2.7 Let $(I(k), k = 1,\dots,m)$ be a finer information structure than $(I'(k), k = 1,\dots,m')$ and denote by $(Y_t)$ and $(Y_t')$ the corresponding observation processes. Then
$$\mathcal{F}_t^{Y'} \subset \mathcal{F}_t^Y \quad \forall t \geq 0.$$
Proof: By definition 2.6, the fact that $m > m'$ and the previous construction of the observation process $Y_t$, there exists at least one additional basic process $N_t^Y(k,l)$ of $Y_t$ compared to $Y_t'$. Hence with (2.11) we get
$$\mathcal{F}_t^{Y'} = \sigma\big(N_s^{Y'}(k,l),\ s \leq t,\ k,l = 1,\dots,m'\big) \subset \sigma\big(N_s^Y(k,l),\ s \leq t,\ k,l = 1,\dots,m\big) = \mathcal{F}_t^Y.$$
2.2 Admissible Controls
In order to control the above constructed process $(Z_t, X_t, Y_t)$ we introduce a control parameter $u \in U \subset \mathbb{R}$ and let the intensities $q_{\mu\nu}^Z(u)$ and $q_{ij,\mu}^X(u)$ depend on this parameter. To guarantee, for a fixed control process $u = (u_t)$, the well-definedness and existence of a process $(Z_t, X_t, Y_t)$ satisfying the stochastic state differential equations (the controlled analogues of (2.2), (2.5) and (2.8))
$$\begin{aligned} dZ_t &= Q^Z(u_t)\, Z_t\,dt + dM_t^Z \\ dX_t &= Q^X(u_t, Z_t)\, X_t\,dt + dM_t^X \\ dY_t &= Q^Y(u_t, Z_t, X_t)\, Y_t\,dt + dM_t^Y \\ (Z_0, X_0, Y_0) &= (z_0, x_0, y_0) \end{aligned} \tag{2.12}$$
we have to make some assumptions on our control parameter and our control process.
Definition 2.8 Let $U \subset \mathbb{R}$ be a compact set such that for all $u \in U$ we have $q_{\mu\nu}^Z(u) \geq 0$ for all $\mu \neq \nu$ and $\tilde{q}_{ij,\mu}^X(u) \geq 0$ for all $i \neq j$ and all $\mu$. Additionally, let $u \mapsto q_{\mu\nu}^Z(u)$ and $u \mapsto \tilde{q}_{ij,\mu}^X(u)$ be continuous on $U$. A control (process) $u = (u_t) : [0,\infty) \to U$ satisfies assumption (A) if
(A): $(u_t)$ is a càdlàg process, $u_t$ is $\mathcal{F}_t^Y$-predictable for all $t \geq 0$, and $u_t \in U$ for all $t \geq 0$.
We define the set of admissible controls by
$$\mathcal{U} := \{ u = (u_t) \mid u \text{ satisfies (A)} \}$$
and accordingly call an element $u \in \mathcal{U}$ admissible.
While the first assumption of (A) is rather technical, the second one means that the control at time $t$ is allowed to depend only on the information coming from the observations via the process $(Y_s)$ up to time $t$. In particular, the control is not allowed to depend on the unobserved processes $(Z_s)$ and $(X_s)$ up to time $t$, nor on the future.
By the construction in section 2.1 and (2.12) we note the dependence of the process $(Z_t, X_t, Y_t)$ and the martingales $(M_t^Z, M_t^X, M_t^Y)$ on the control process $u = (u_t)$, but once more we neglect this fact in our notation. Keep in mind that the randomness in the system (which enters via the counting processes $N_t^\cdot(\cdot,\cdot)$) is influenced by the control process, as is often the case in technical applications. This is in contrast to many applications in finance, where the randomness is given by an uncontrolled Lévy process.
It is not difficult to extend the control set to the class of impulse controls (see for example Dempster and Ye (1995)), where we are able to move the process $X_t$ directly from one state to another. That means, by applying an impulse control $\Gamma \in S_X$ at time $\tau-$, the process $(Z_t, X_t, Y_t)$ is moved from $(Z_{\tau-}, X_{\tau-}, Y_{\tau-})$ to $(Z_{\tau-}, \Gamma, \Upsilon)$, where
$$\Upsilon := \sum_{k=1}^{m} f_k\, \mathbb{1}\{\Gamma^i \in I(k)\}$$
denotes the new state of the observation process $Y_t$ according to $\Gamma$ ($\Gamma^i$ denotes the projection of $\Gamma = e_i$ to its index $i$), assuming that if an impulse control is applied, a jump of the processes at the same time is impossible. The following computations remain true with slight modifications as long as the impulse control remains $\mathcal{F}_t^Y$-predictable.
Theorem 2.9 For each $u \in \mathcal{U}$ and given $(z_0, x_0, y_0) \in S_Z \times S_X \times S_Y$ there exists a probability measure $\mathbb{P}_u := \mathbb{P}_{u,(z_0,x_0,y_0)}$ on the given measurable space $(\Omega, \mathcal{F})$ such that there exists a process $(Z_t, X_t, Y_t)$ satisfying (2.12).
Proof: The assertion follows from Kolmogorov's theorem.
Remark 2.10
a) We can now state more precisely the notion of a martingale: $(M_t^Z)$, $(M_t^X)$ and $(M_t^Y)$ are martingales with respect to $\mathcal{F}_t^{Z,X,Y}$ on the given probability space $(\Omega, \mathcal{F}, \mathbb{P}_u)$.
b) $(Z_t)$, $(X_t)$ and $(Y_t)$ are Markovian jump processes with generators (or intensity matrices) $Q^Z(u)$, $Q^X(u,Z)$ and $Q^Y(u,Z,X)$. But they are no longer Markov processes (in contrast to the uncontrolled processes in section 2.1), since the controls need not be Markovian. Since the intensity at time $t$ for the next jump depends only on the current state of the system and the current control, the notion Markovian is justified, as in Rishel (1978). If the controls are Markovian, then the processes are Markovian in the usual sense.
The next theorem states that the more information is available, the larger the set of admissible controls. This is reasonable: the more information one has, the more opportunities one has to decide.
Theorem 2.11 Let $(I(k), k = 1,\dots,m)$ be a finer information structure than $(I'(k), k = 1,\dots,m')$, with associated observation processes $Y_t$ and $Y_t'$ respectively. Then
$$\mathcal{U}' \subset \mathcal{U},$$
where $\mathcal{U}$ and $\mathcal{U}'$ are the sets of admissible controls corresponding to $Y_t$ and $Y_t'$.
Proof: By theorem 2.7 it follows that $\mathcal{F}_t^{Y'} \subset \mathcal{F}_t^Y$, and the assertion is an immediate consequence of the definition of the set of admissible controls in definition 2.8.
2.3 The Optimization Problem
Denote by $\mathbb{E}_u$ the expectation with respect to $\mathbb{P}_u = \mathbb{P}_{u,(z_0,x_0,y_0)}$. Let $\beta > 0$ be a given discount factor and assume that for each $u \in \mathcal{U}$ the integrability condition
$$\mathbb{E}_u\left[ \int_0^\infty e^{-\beta t}\, g(Z_t, X_t, Y_t, u_t)\,dt \right] < \infty$$
holds for a given Borel-measurable cost function $g : S_Z \times S_X \times S_Y \times U \to \mathbb{R}_+$. Moreover, $g$ is assumed to be continuous in $u$ (this assumption is of a more technical nature, needed in sections 4.1 and 4.2).
Then our optimization problem (P) over an infinite horizon is given by
$$(P)\qquad \begin{cases} \mathbb{E}_u\left[ \int_0^\infty e^{-\beta t}\, g(Z_t, X_t, Y_t, u_t)\,dt \right] \to \min \\ dZ_t = Q^Z(u_t)\, Z_t\,dt + dM_t^Z \\ dX_t = Q^X(u_t, Z_t)\, X_t\,dt + dM_t^X \\ dY_t = Q^Y(u_t, Z_t, X_t)\, Y_t\,dt + dM_t^Y \\ (Z_0, X_0, Y_0) = (z_0, x_0, y_0) \\ u \in \mathcal{U} \end{cases}$$
For a fixed control process $u$ we denote the corresponding expected discounted cost by
$$J(z_0, x_0, y_0; u) := \mathbb{E}_u\left[ \int_0^\infty e^{-\beta t}\, g(Z_t, X_t, Y_t, u_t)\,dt \;\Big|\; Z_0 = z_0, X_0 = x_0, Y_0 = y_0 \right]$$
and the optimal value of (P) for fixed $(z_0, x_0, y_0)$ by
$$v(P) := J(z_0, x_0, y_0) := \inf_{u\in\mathcal{U}} J(z_0, x_0, y_0; u).$$
A control $u^* = (u_t^*)$ is called optimal if and only if $J(z_0, x_0, y_0; u^*) = J(z_0, x_0, y_0)$.
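Since $(Z_t, X_t, Y_t, u_t)$ is piecewise constant between jumps, the discounted cost of a fixed control can be evaluated pathwise in closed form on each constant segment. The following Monte Carlo sketch illustrates the functional $J$; `simulate_path` and `g` are hypothetical stand-ins for a concrete model, and the infinite horizon is truncated at $T$ (the neglected tail is of order $e^{-\beta T}$ for bounded $g$).

```python
import numpy as np

def estimate_J(simulate_path, g, beta, T, n_paths, rng):
    """Monte Carlo estimate of J(z0, x0, y0; u) for a fixed admissible control.

    simulate_path(T, rng) is assumed to return segments (t0, t1, z, x, y, u)
    on which the controlled process and the control are constant.
    """
    total = 0.0
    for _ in range(n_paths):
        for (t0, t1, z, x, y, u) in simulate_path(T, rng):
            # closed-form integral of e^{-beta t} over [t0, t1)
            total += g(z, x, y, u) * (np.exp(-beta * t0) - np.exp(-beta * t1)) / beta
    return total / n_paths
```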
Note once more that $Z_t$ and $X_t$ are not observable directly. Thus (P) is a problem with partial information. Therefore it is not directly solvable, since the control process is allowed to depend only on the present observation $\mathcal{F}_t^Y$. We will show in the next chapter how a solution of (P), if one exists, can be derived with the help of conditional probabilities and a reduced problem. Before that, we state the dependence between information and costs.
Theorem 2.12 Assume that $g$ is independent of $Y$ and let $(I(k), k = 1,\dots,m)$ be a finer information structure than $(I'(k), k = 1,\dots,m')$, with associated problems (P) and (P'). Then
$$v(P) \leq v(P').$$
Proof: The assertion is a direct consequence of theorem 2.11.
Comparing (P) with the complete information model $(P_{com})$, which means that all processes are directly observable and our control is allowed to be $\mathcal{F}_t^{Z,X,Y}$-predictable, results in the following corollary.
Corollary 2.13 It holds:
a) $v(P_{com}) \leq v(P)$.
b) If the optimal control of $(P_{com})$ is $\mathcal{F}_t^Y$-predictable, then it is optimal for (P) and $v(P_{com}) = v(P)$.
Proof: It is clear that the set of admissible controls for $(P_{com})$ is larger than the one for (P), and thus part a) follows as in theorem 2.12. If the optimal control for $(P_{com})$ is an element of the admissible controls for (P), the equality in part b) is an obvious consequence of a).
Part b) of the previous corollary holds, for example, for optimal control limit rules if the control limit and all parameters are completely observable. It can be weakened in the following sense: if the optimal control in the complete information model is the same for all states $e_i$ in an observation group $I(k)$, then this control is also optimal for the incomplete information model in state $f_k$. Sometimes this property is called structure maintaining. Thus in this case it is sufficient to know that the unobservable part of the process is in some group; the exact state does not matter for applying the optimal (complete information) control.
Corollary 2.14 Let $u^*$ be an optimal Markovian control of $(P_{com})$ which is independent of $(Z_t)$ and the same for all $i \in I(k)$, that is, $u^*(e_i) \equiv u$ for all $i \in I(k)$. Then an optimal control $\nu^*(f_k)$ of (P) in $f_k$ is given by $\nu^*(f_k) = u$.
Proof: We omit the proof here; it will be presented in section 4.1.
3 The Reduction to a Model with Complete Information
We introduced problem (P) in the last chapter, an optimization problem under partial information, which is not directly solvable. Since $Z_t$ and $X_t$ cannot be observed completely, we will use the conditional probability $\mathbb{P}(Z_t = g_\mu, X_t = e_i \mid \mathcal{F}_t^Y)$ as an estimator for both processes under the information available at time $t$, modelled by $\mathcal{F}_t^Y$. In section 3.1 we derive an explicit martingale representation for this conditional probability (theorem 3.5). Additionally, we discuss its behaviour and properties. Then, in section 3.2, we introduce the reduced optimization problem $(P_{red})$, which is strongly connected to (P), as we point out in the reduction theorem 3.13: optimal values and optimal controls are the same. This reduced optimization problem is one under complete information, where the unobservable processes $Z_t$ and $X_t$ are replaced by their common estimator. But it is no longer a pure jump model; however, the behaviour between two jumps is deterministic.
Before stating the results for our setting, we explain the idea behind the derivation of the filter equation (adapted from Brémaud (1981)). Assume that $(R_t)$ is a Markovian jump process with finite state space $S_R = \{e_1,\dots,e_n\}$ on a filtered probability space $(\Omega, (\mathcal{F}_t), \mathbb{P})$ and with the martingale representation
$$dR_t = Q^R R_t\,dt + dM_t^R,$$
where $Q^R$ is the generator and $M_t^R$ the corresponding $\mathcal{F}_t$-martingale. As in section 2.1, $R_t$ is defined via the counting processes $N_t^R(i,j)$ having intensity $q_{ij}^R R_{t-}^i$. Let now $\mathcal{G}_t \subset \mathcal{F}_t$ for all $t \geq 0$ with $\mathcal{G}_0 = \{\emptyset, \Omega\}$; then
$$d\widehat{R}_t := dE[R_t \mid \mathcal{G}_t] = Q^R \widehat{R}_t\,dt + d\widehat{M}_t,$$
where $\widehat{M}_t$ is a $\mathcal{G}_t$-martingale.
Since this introduction should be illustrative, we consider a one-dimensional observation process and define, for $\alpha_{ij} \in \{0,1\}$,
$$N_t := \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_{ij}\, N_t^R(i,j).$$
Hence $N_t$, counting some transitions of $(R_s)$ in $[0,t]$, is a Poisson process with respect to $\mathcal{F}_t$ having intensity
$$\lambda(R_t) := \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_{ij}\, q_{ij}^R R_{t-}^i,$$
and $M_t^N := N_t - \int_0^t \lambda(R_s)\,ds$ is the corresponding martingale. The quadratic covariation $[R,N]_t$ of $R_t$ and $N_t$ is consequently given by
$$\begin{aligned} d[R,N]_t &= \sum_{i=1}^{n}\sum_{j=1}^{n} (e_j - e_i)\, d[N^R(i,j), N]_t \\ &= \sum_{i=1}^{n}\sum_{j=1}^{n} (e_j - e_i) \sum_{k=1}^{n}\sum_{l=1}^{n} \alpha_{kl}\, d[N^R(i,j), N^R(k,l)]_t \\ &= \sum_{i=1}^{n}\sum_{j=1}^{n} (e_j - e_i)\,\alpha_{ij}\, dN_t^R(i,j). \end{aligned}$$
We define $\mathcal{G}_t = \mathcal{F}_t^N$; then
$$\widehat{M}_t^N = N_t - \int_0^t \widehat{\lambda}(R_s)\,ds,$$
where $\widehat{\lambda}(R_t) := E[\lambda(R_t) \mid \mathcal{F}_t^N] = \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_{ij}\, q_{ij}^R\, \widehat{R}_{t-}^i$, is an $\mathcal{F}_t^N$-martingale. Also, one knows from the martingale representation theorem that there exists a unique $\mathcal{F}_t^N$-predictable process $\phi_t$ such that
$$d\widehat{M}_t = \phi_t\, d\widehat{M}_t^N.$$
Summarizing,
$$d\widehat{R}_t = Q^R \widehat{R}_t\,dt + \phi_t\, d\widehat{M}_t^N.$$
To compute $\phi_t$ we consider, using Itô's formula,
$$\begin{aligned} R_t N_t &= \int_0^t R_{s-}\,dN_s + \int_0^t N_{s-}\,dR_s + [R,N]_t \\ &= \int_0^t R_{s-}\,dM_s^N + \int_0^t R_{s-}\lambda(R_s)\,ds + \int_0^t N_{s-}\big(Q^R R_s\,ds + dM_s^R\big) + \int_0^t \sum_{i=1}^{n}\sum_{j=1}^{n}(e_j - e_i)\,\alpha_{ij}\,dN_s^R(i,j) \\ &= \int_0^t R_{s-}\,dM_s^N + \int_0^t R_{s-}\lambda(R_s)\,ds + \int_0^t N_{s-}\big(Q^R R_s\,ds + dM_s^R\big) \\ &\qquad + \int_0^t \sum_{i=1}^{n}\sum_{j=1}^{n}(e_j - e_i)\,\alpha_{ij}\, q_{ij}^R R_{s-}^i\,ds + \int_0^t \sum_{i=1}^{n}\sum_{j=1}^{n}(e_j - e_i)\,\alpha_{ij}\,dM_s^R(i,j) \end{aligned}$$
$$\Rightarrow\quad E[R_t N_t] = E\left[\int_0^t N_{s-}\, Q^R R_s\,ds\right] + E\left[\int_0^t R_{s-}\lambda(R_s)\,ds + \int_0^t \sum_{i=1}^{n}\sum_{j=1}^{n}(e_j - e_i)\,\alpha_{ij}\, q_{ij}^R R_{s-}^i\,ds\right] \tag{3.1}$$
and similarly
$$\begin{aligned} \widehat{R}_t N_t &= \int_0^t \widehat{R}_{s-}\,dN_s + \int_0^t N_{s-}\,d\widehat{R}_s + [\widehat{R},N]_t \\ &= \int_0^t \widehat{R}_{s-}\,d\widehat{M}_s^N + \int_0^t \widehat{R}_{s-}\widehat{\lambda}(R_s)\,ds + \int_0^t N_{s-}\big(Q^R \widehat{R}_s\,ds + \phi_s\,d\widehat{M}_s^N\big) + \int_0^t \phi_s\,dN_s \\ &= \int_0^t \widehat{R}_{s-}\,d\widehat{M}_s^N + \int_0^t \widehat{R}_{s-}\widehat{\lambda}(R_s)\,ds + \int_0^t N_{s-}\big(Q^R \widehat{R}_s\,ds + \phi_s\,d\widehat{M}_s^N\big) + \int_0^t \phi_s\widehat{\lambda}(R_s)\,ds + \int_0^t \phi_s\,d\widehat{M}_s^N \end{aligned}$$
$$\Rightarrow\quad E[\widehat{R}_t N_t] = E\left[\int_0^t N_{s-}\, Q^R \widehat{R}_s\,ds\right] + E\left[\int_0^t \big(\widehat{R}_{s-} + \phi_s\big)\widehat{\lambda}(R_s)\,ds\right]. \tag{3.2}$$
Since (3.1) and (3.2) have to be equal, we see that
$$\phi_s := \phi(\widehat{R}_{s-}) = \frac{1}{\widehat{\lambda}(R_{s-})} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_{ij}\, e_j\, q_{ij}^R\, \widehat{R}_{s-}^i \;-\; \widehat{R}_{s-}$$
fulfils this condition. Consequently, the filter equation is given by
$$d\widehat{R}_t = Q^R \widehat{R}_t\,dt + \phi(\widehat{R}_{t-})\,d\widehat{M}_t^N. \tag{3.3}$$
If only the jump from $e_{i^*}$ to $e_{j^*}$ is observable, set $\alpha_{i^*j^*} = 1$ and all other $\alpha_{ij} = 0$; then after a jump at $\tau$ the new estimate is
$$\widehat{R}_\tau = \widehat{R}_{\tau-} + \phi(\widehat{R}_{\tau-}) = e_{j^*},$$
thus we know with probability one that $R_\tau$ is in state $e_{j^*}$. If on the other hand no jump is observable, thus $\alpha_{ij} \equiv 0$, then
$$d\widehat{R}_t = Q^R \widehat{R}_t\,dt,$$
which is Kolmogorov's backward differential equation. If $\alpha_{ij} \equiv 1$, then all jumps are counted and the result coincides with the one in Brémaud (1981).
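As a quick numeric illustration of the jump update in (3.3), the following sketch (with a hypothetical 3-state generator) checks that when only the transition $e_1 \to e_2$ is observable, the post-jump estimate $\widehat{R}_{\tau-} + \phi(\widehat{R}_{\tau-})$ is exactly $e_2$:

```python
import numpy as np

n = 3
QR = np.array([[-1.0, 0.6, 0.4],
               [ 0.3,-0.8, 0.5],
               [ 0.2, 0.7,-0.9]])   # hypothetical rates q^R_{ij} (rows i)
alpha = np.zeros((n, n))
alpha[0, 1] = 1.0                   # only e_1 -> e_2 is observable

def phi(R_hat):
    """Innovation gain of (3.3): (sum_ij alpha_ij e_j q_ij R_i)/lambda - R."""
    lam_hat = sum(alpha[i, j] * QR[i, j] * R_hat[i]
                  for i in range(n) for j in range(n))
    num = sum(alpha[i, j] * QR[i, j] * R_hat[i] * np.eye(n)[j]
              for i in range(n) for j in range(n))
    return num / lam_hat - R_hat

R_hat = np.array([0.5, 0.2, 0.3])   # current conditional distribution
print(R_hat + phi(R_hat))           # -> [0. 1. 0.], i.e. e_2
```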
3.1 Filter Equation for the Unobservable Process
To keep the notation manageable we drop the dependence on the control process in this section and compute all following formulas for the uncontrolled case; they remain true for admissible controls $u$, since controls have to be $\mathcal{F}_t^Y$-predictable, $u \mapsto q_{kl}^Y(u,Z,X)$ and $u \mapsto q_{ij}^X(u,Z)$ are continuous and $U$ is compact. First of all, define the conditional probability of $(Z_t, X_t)$ given the present information $\mathcal{F}_t^Y$,
$$p_t(i,\mu) := \mathbb{P}\big(X_t = e_i, Z_t = g_\mu \mid \mathcal{F}_t^Y\big) \tag{3.4}$$
and $p_t := \big(p_t(1,1),\dots,p_t(1,d),\, p_t(2,1),\dots,p_t(2,d),\,\dots,\, p_t(n,1),\dots,p_t(n,d)\big) \in \triangle^{nd}$, where
$$\triangle^\kappa := \Big\{ x \in [0,1]^\kappa \;\Big|\; \sum_{i=1}^{\kappa} x_i = 1 \Big\}$$
denotes the $\kappa$-dimensional probability simplex.
The following relation holds for the marginal distributions.
Lemma 3.1
a) $\mathbb{P}(X_t = e_i \mid \mathcal{F}_t^Y) = p_t(i,\cdot) = \sum_{r=1}^{d} p_t(i,r)$
b) $\mathbb{P}(Z_t = g_\mu \mid \mathcal{F}_t^Y) = p_t(\cdot,\mu) = \sum_{i=1}^{n} p_t(i,\mu)$
Proof: The claim is an immediate consequence of the properties of marginal distributions.
The following shows the connection between the conditional probabilities and the filter technique. We make the following convention: if we write $X_tZ_t$, we mean $X_tZ_t^\top$, and similarly $e_ig_\mu$ should be understood as $e_ig_\mu^\top$.
Lemma 3.2 It holds:
a) $p_t(i,\mu) = E[(X_tZ_t)_{i\mu} \mid \mathcal{F}_t^Y]$
b) $p_t(i,\cdot) = E[X_t^i \mid \mathcal{F}_t^Y]$
c) $p_t(\cdot,\mu) = E[Z_t^\mu \mid \mathcal{F}_t^Y]$
Proof: We only prove part a), since the others are immediate consequences of lemma 3.1:
$$E[(X_tZ_t)_{i\mu} \mid \mathcal{F}_t^Y] = E[\mathbb{1}\{X_t = e_i, Z_t = g_\mu\} \mid \mathcal{F}_t^Y] = \mathbb{P}(X_t = e_i, Z_t = g_\mu \mid \mathcal{F}_t^Y) = p_t(i,\mu).$$
We introduce the operator $S : \mathbb{R}^{n\times d} \to \mathbb{R}^{nd}$, which writes the rows of a matrix $A = (a_{ij}) \in \mathbb{R}^{n\times d}$ into one vector, by
$$SA := (a_{11},\dots,a_{1d},\, a_{21},\dots,a_{2d},\,\dots,\, a_{n1},\dots,a_{nd})^\top \in \mathbb{R}^{nd}.$$
As a consequence of the last lemma, this definition and $\widehat{X_tZ_t} := E[X_tZ_t \mid \mathcal{F}_t^Y]$, it is immediately true that
$$S\,\widehat{XZ} = p.$$
The next lemma is an easy algebraic transformation, but will be useful in the presentation
of the filter equation for pt .
22
3.1 Filter Equation for the Unobservable Process
Lemma 3.3 Let X = (xij ) ∈ [0, 1]n×d be a matrix, whose entries sum up to 1, that means
n P
d
P
xij = 1. Then for the matrix
i=1 j=1
a11 x11 + c11
a x + c21
A(X) := 21 21.
..
a12 x12 + c12 · · ·
..
.
···
..
.
an1 xn1 + cn1 an2 xn2 + cn2
a1d x1d + c1d
..
.
..
.
· · · and xnd + cnd
exists a matrix A ∈ Rnd×nd such that for λ ∈ R
S(λA(X)) = λASX.
Especially A is given by
a11 + c11
c11
c11 · · ·
c11
c
a
+
c
c
·
·
·
c12
12
12
12
12
A=
..
..
..
..
..
.
.
.
.
.
cnd
cnd
· · · · · · and + cnd
Additionally S(A1 (X) + A2 (X)) = A1 SX + A2 SX.
.
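In matrix software the operator $S$ is just a row-major flatten; a minimal sketch (with hypothetical entries):

```python
import numpy as np

# The operator S stacks the rows of an n x d matrix into one nd-vector;
# in NumPy (row-major order) this is a plain reshape.
n, d = 3, 2
A = np.arange(1, n * d + 1, dtype=float).reshape(n, d)
SA = A.reshape(-1)            # = (a_11, ..., a_1d, a_21, ..., a_nd)^T
print(SA)                     # [1. 2. 3. 4. 5. 6.]
```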
We have to derive a representation in the form of a stochastic differential equation for $p_t \in \triangle^{nd}$. If we can do this, we also have a representation for the marginal distributions by lemma 3.1. Because of (2.11) the filter technique can be applied to each $N_t^Y(k,l)$ instead of $Y_t$, and the conditional probability $p_t$ can be expressed in terms of the Poisson processes $N_t^Y(k,l)$. The derivation will be done as described in the opening of this section, where now
• $X_tZ_t$ is the unknown process $R_t$,
• $N_t^Y(k,l)$ acts as $N_t$,
• $\alpha_{ij}^{kl}(I) = \mathbb{1}\{i \in I(k),\ j \in I(l)\}$ replaces $\alpha_{ij}$ and depends on the information structure $(I(k), k = 1,\dots,m)$.
It is well known from Brémaud (1981) and Last and Brandt (1995) that the (predictable) $\mathcal{F}_t^Y$-intensity of $N_t^Y(k,l)$ is given by
$$q_{kl}^Y(p_{t-})\,Y_{t-}^k := E\big[\,q_{kl}^Y(Z_t,X_t)\,Y_{t-}^k \mid \mathcal{F}_{t-}^Y\,\big] = \sum_{i\in I(k)}\sum_{j\in I(l)}\sum_{r=1}^{d}\Big\{\tilde{q}_{ij,r}^X + \sum_{\nu=1}^{d}\delta_{ij}^{r\nu}\,q_{r\nu}^Z\Big\}\,p_{t-}(i,r)\;Y_{t-}^k.$$
Take care, despite the similar notation, about the difference between $q_{kl}^Y(Z,X)Y$, which is the $\mathcal{F}_t^{Z,X,Y}$-intensity, and the $\mathcal{F}_t^Y$-intensity $q_{kl}^Y(p)$. Note that $p \mapsto q_{kl}^Y(p)$ is linear.
With (2.8) in mind we get the $\mathcal{F}_t^Y$-representation of $Y_t$ as (see Brémaud (1981), Last and Brandt (1995))
$$dY_t = Q^Y(p_t)\,Y_t\,dt + d\widehat{M}_t^Y, \tag{3.5}$$
where $\widehat{M}_t^Y$ is an $\mathcal{F}_t^Y$-martingale defined by
$$\widehat{M}_t^Y(k) = \sum_{\substack{l=1\\ l\neq k}}^{m}\Big(N_t^Y(l,k) - N_t^Y(k,l) - \int_0^t\big(q_{lk}^Y(p_s)\,Y_s^l - q_{kl}^Y(p_s)\,Y_s^k\big)\,ds\Big), \qquad k = 1,\dots,m,$$
similar to (2.9), and $Q^Y(p) = (q_{kl}^Y(p))$ with $q_k^Y(p) := -q_{kk}^Y(p) := \sum_{\substack{l=1\\ l\neq k}}^{m} q_{kl}^Y(p)$.
As in the opening of this section, we derive the filter equation for $\widehat{X_tZ_t}$ in the following lemma. Note that $q_{kl}^Y(p) = q_{kl}^Y(\widehat{XZ})$.
Lemma 3.4 Define $\phi^{(k,l)}(t) := \phi^{(k,l)}\big(\widehat{X_{t-}Z_{t-}}\big)$ by
$$\phi^{(k,l)}(t) := \frac{1}{q_{kl}^Y\big(\widehat{X_{t-}Z_{t-}}\big)}\sum_{i\in I(k)}\sum_{j\in I(l)}\sum_{\mu=1}^{d} e_j\Big(g_\mu\,\tilde{q}_{ij,\mu}^X + \sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}\,g_\nu\,q_{\mu\nu}^Z\Big)\big(\widehat{X_{t-}Z_{t-}}\big)_{i\mu} \;-\; \widehat{X_{t-}Z_{t-}};$$
then the filter equation is given by
$$\begin{aligned} d\widehat{X_tZ_t} &= \Big\{ \sum_{i=1}^{n}\sum_{\mu=1}^{d}\big(e_i\,Q^Zg_\mu + (\tilde{Q}^X_\mu + \tilde{Q}^Z_\mu)\,e_ig_\mu\big)\big(\widehat{X_tZ_t}\big)_{i\mu} \\ &\qquad + \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\mu=1}^{d}\sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}(e_j - e_i)(g_\nu - g_\mu)\big(\widehat{X_tZ_t}\big)_{i\mu}\,q_{\mu\nu}^Z \Big\}\,dt \\ &\qquad + \sum_{k=1}^{m}\sum_{l=1}^{m}\phi^{(k,l)}(t)\,\big(dN_t^Y(k,l) - q_{kl}^Y(p_t)\,Y_t^k\,dt\big) \\ &=: A\big(\widehat{X_tZ_t}\big)\,dt + \sum_{k=1}^{m}\sum_{l=1}^{m}\phi^{(k,l)}(t)\,\big(dN_t^Y(k,l) - q_{kl}^Y(p_t)\,Y_t^k\,dt\big). \end{aligned} \tag{3.6}$$
Considering (3.6), we see that it is of the same structure as (3.3):
• The $dt$-term is the $\mathcal{F}_t^Y$-generator of $\widehat{X_tZ_t}$ and is independent of the information structure. In particular, if there are no common jumps of $X_t$ and $Z_t$, then all $\delta_{ij}^{\mu\nu} = 0$.
• $\phi^{(k,l)}(t)$ is of the form
$$\frac{1}{q_{kl}^Y(\widehat{XZ})}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{ij}^{kl}(I)\sum_{\mu=1}^{d} e_j\Big(g_\mu\,\tilde{q}_{ij,\mu}^X + \sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}\,g_\nu\,q_{\mu\nu}^Z\Big)\big(\widehat{XZ}\big)_{i\mu} \;-\; \widehat{XZ}.$$
The proof of the lemma is done in three steps; the proofs of the individual steps are deferred to the appendix. First we compute a martingale representation for $X_tZ_tN_t^Y(k,l)$, then one for $\widehat{X_tZ_t}\,N_t^Y(k,l)$. Since the expectations of these two expressions have to be the same, we determine the innovation gain function $\phi^{(k,l)}(t)$ by comparison of the coefficients.
Proof: Compute $X_tZ_tN_t^Y(k,l)$ (see lemma A.5):
$$\begin{aligned} X_tZ_t\, N_t^Y(k,l) &= \int_0^t X_{s-}Z_{s-}\,dN_s^Y(k,l) + \int_0^t N_{s-}^Y(k,l)\sum_{i=1}^{n}\sum_{\mu=1}^{d}\big(e_i\,Q^Zg_\mu + (\tilde{Q}^X_\mu + \tilde{Q}^Z_\mu)\,e_ig_\mu\big)X_s^iZ_s^\mu\,ds \\ &\quad + \int_0^t N_{s-}^Y(k,l)\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\mu=1}^{d}\sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}(e_j - e_i)(g_\nu - g_\mu)\,X_{s-}^i\,dN_s^Z(\mu,\nu) \\ &\quad + \int_0^t N_{s-}^Y(k,l)\,X_{s-}\,dM_s^Z + \int_0^t N_{s-}^Y(k,l)\,Z_{s-}\,dM_s^X \\ &\quad + \int_0^t \sum_{i\in I(k)}\sum_{j\in I(l)}(e_j - e_i)\,X_{s-}^iZ_sY_{s-}^k\,d\tilde{N}_s^X(i,j) \\ &\quad + \int_0^t \sum_{i\in I(k)}\sum_{j\in I(l)}\sum_{\mu=1}^{d}\sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}(e_jg_\nu - e_ig_\mu)\,X_{s-}^iZ_{s-}^\mu Y_{s-}^k\,dN_s^Z(\mu,\nu). \end{aligned} \tag{3.7}$$
Compute $\widehat{X_tZ_t}\,N_t^Y(k,l)$ (see lemma A.7):
$$\begin{aligned} \widehat{X_tZ_t}\, N_t^Y(k,l) &= \int_0^t N_{s-}^Y(k,l)\Big\{\sum_{i=1}^{n}\sum_{\mu=1}^{d}\big(e_i\,Q^Zg_\mu + (\tilde{Q}^X_\mu + \tilde{Q}^Z_\mu)\,e_ig_\mu\big)\big(\widehat{X_sZ_s}\big)_{i\mu}\,ds \\ &\qquad + \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\mu=1}^{d}\sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}(e_j - e_i)(g_\nu - g_\mu)\big(\widehat{X_sZ_s}\big)_{i\mu}\,q_{\mu\nu}^Z\,ds + d\widehat{M}_s\Big\} \\ &\quad + \int_0^t \widehat{X_{s-}Z_{s-}}\,dN_s^Y(k,l) + \int_0^t \phi^{(k,l)}(s)\,dN_s^Y(k,l). \end{aligned} \tag{3.8}$$
Since the expectations of (3.7) and (3.8) have to be the same, we are finally able to determine the innovation gain function $\phi^{(k,l)}(t) := \phi^{(k,l)}\big(\widehat{X_{t-}Z_{t-}}\big)$. By Fubini's theorem we see that the terms $\int_0^t N_{s-}^Y(k,l)\ldots$ from (3.7) and (3.8) are the same under the expectation, thus we have to concentrate only on the other summands. Hence choose
$$\phi^{(k,l)}(s) = \frac{1}{q_{kl}^Y\big(\widehat{X_{s-}Z_{s-}}\big)}\sum_{i\in I(k)}\sum_{j\in I(l)}\sum_{\mu=1}^{d} e_j\Big(g_\mu\,\tilde{q}_{ij,\mu}^X + \sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}\,g_\nu\,q_{\mu\nu}^Z\Big)\big(\widehat{X_{s-}Z_{s-}}\big)_{i\mu} \;-\; \widehat{X_{s-}Z_{s-}},$$
so that equality of the expectations is attained. Summarizing, we have found an $\mathcal{F}_t^Y$-representation of $\widehat{X_tZ_t}$, given by (see lemma A.6):
$$\begin{aligned} \widehat{X_tZ_t} &= \widehat{X_0Z_0} + \int_0^t \sum_{i=1}^{n}\sum_{\mu=1}^{d}\big(e_i\,Q^Zg_\mu + (\tilde{Q}^X_\mu + \tilde{Q}^Z_\mu)\,e_ig_\mu\big)\big(\widehat{X_sZ_s}\big)_{i\mu}\,ds \\ &\quad + \int_0^t \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{\mu=1}^{d}\sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}(e_j - e_i)(g_\nu - g_\mu)\big(\widehat{X_sZ_s}\big)_{i\mu}\,q_{\mu\nu}^Z\,ds \\ &\quad + \sum_{k=1}^{m}\sum_{l=1}^{m}\int_0^t \phi^{(k,l)}(s)\,\big(dN_s^Y(k,l) - q_{kl}^Y(\widehat{X_sZ_s})\,Y_s^k\,ds\big), \end{aligned} \tag{3.9}$$
which is exactly (3.6).
Based on lemma 3.2 and lemma 3.4, we transform the filter equation (3.6) with the help of the operator $S$ to the representation of $p_t$.
Theorem 3.5 Define
$$Bp := S A\big(\widehat{XZ}\big) = \mathbf{A}\,S\widehat{XZ}, \qquad \Phi_{kl}(p) := S\phi^{(k,l)}\big(\widehat{XZ}\big) =: \Big(\frac{1}{q_{kl}^Y(p)}\,\varphi^{(k,l)} - I\Big)p;$$
then $p_t$ is the unique solution of
$$\begin{aligned} dp_t &= Bp_t\,dt + \sum_{k=1}^{m}\sum_{l=1}^{m}\Phi_{kl}(p_{t-})\,\big(dN_t^Y(k,l) - q_{kl}^Y(p_t)\,Y_t^k\,dt\big) \\ &= \Big(B - \sum_{k=1}^{m}\sum_{l=1}^{m}\big(\varphi^{(k,l)} - q_{kl}^Y(p_t)\,I\big)\,Y_t^k\Big)p_t\,dt + \sum_{k=1}^{m}\sum_{l=1}^{m}\Phi_{kl}(p_{t-})\,dN_t^Y(k,l), \end{aligned} \tag{3.10}$$
where $I$ denotes the $nd \times nd$ unit matrix.
We note that $B$ is independent of $Y_t$ and consequently independent of the information structure. This is clear, since here only the $X$ and $Z$ terms are mixed. At the end of this section we discuss the behaviour of $p_t$, and for this we introduce the following abbreviation:
$$b(y,p) := \Big(B - \sum_{k=1}^{m}\sum_{l=1}^{m}\big(\varphi^{(k,l)} - q_{kl}^Y(p)\,I\big)\,y^k\Big)p. \tag{3.11}$$
$b(y,p)$ describes the deterministic flow between two jumps of $p_t$ and is bilinear-quadratic in $p$. $p_t$, as the (unique) solution of (3.10), is a piecewise-deterministic process taking values in the $nd$-dimensional probability simplex $\triangle^{nd}$. If a jump of $N_t^Y(k,l)$ occurs at time $\tau$ (resulting from a change in the observation state $Y_t$ from $f_k$ to $f_l$), then $p_t$ jumps with probability one to the new state
$$p_\tau = p_{\tau-} + \Phi_{kl}(p_{\tau-}). \tag{3.12}$$
If $Y_t$ jumps at time $\tau$ from $f_k$ to $f_l$ (with $k \neq l$ under the assumption $m \geq 2$), we have $p_{\tau-}(i,\cdot) = 0$ for all $i \in I(l)$, while $p_\tau(i,\cdot) \geq 0$ (with at least one $> 0$) for all $i \in I(l)$ and $p_\tau(i,\cdot) = 0$ for all $i \notin I(l)$. In particular, each jump of $Y_t$ leads to changes in $p_t$. Thus we have $\Phi_{kl}(p) \neq 0$ for $k \neq l$ and a one-to-one relation between the observation $Y_t$ and the estimator $p_t$. As an immediate consequence we have
$$\mathcal{F}_t^Y = \mathcal{F}_t^p,$$
without any further assumptions such as those needed in Miller et al. (2005).
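Numerically, this piecewise-deterministic structure suggests a simple propagation scheme: integrate the ODE $\dot p = b(y,p)$ between observed jumps of $Y$ and apply the update (3.12) at each jump. The following is a minimal sketch under the assumption that the flow $b$ and the jump map $\Phi_{kl}$ have already been assembled for a concrete model; both are passed in as hypothetical callables.

```python
import numpy as np

def filter_path(p0, y_jumps, T, b, Phi, dt=1e-3):
    """Propagate the estimator p_t of (3.10) along one observed Y-path.

    p0      -- initial distribution in the simplex (numpy array of length nd)
    y_jumps -- sorted list of (tau, k, l): Y jumps from f_k to f_l at time tau
    b       -- callable b(y_index, p): the deterministic flow (3.11)
    Phi     -- callable Phi(k, l, p): the jump update of (3.12)
    """
    p, t = p0.copy(), 0.0
    y = y_jumps[0][1] if y_jumps else 0          # current observation state
    for tau, k, l in y_jumps + [(T, None, None)]:
        while t < tau:                           # Euler steps for dp = b(y,p)dt
            h = min(dt, tau - t)
            p = p + h * b(y, p)
            p = np.clip(p, 0.0, None)
            p = p / p.sum()                      # project back onto the simplex
            t += h
        if k is not None:                        # observed jump: apply (3.12)
            p = p + Phi(k, l, p)
            y = l
    return p
```

The projection step only guards against the discretization error of the Euler scheme; the exact flow stays in the simplex.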
Let us specialize the representation formula (3.10) of $p_t$ to the five special cases introduced above and discuss its behaviour. Some results were presented in a previous work of the author (Winter (2007)).
Complete-Observation-Model
In the case of complete information (recall that we skip the environment process), we have $Y_t \equiv X_t$ and $p_t(i) := \mathbb{P}(X_t = e_i \mid \mathcal{F}_t^X)$ only takes values in $\{0,1\}$; this means $p_t$ is always at a corner of $\triangle^n$. Thus, if $X_t$ jumps from $e_i$ to $e_j$, we get
$$\Phi_{ij}(p) = e_j - e_i \qquad\text{and}\qquad b(y,p) \equiv 0.$$
Group-Observation-Model
Again without the environment process $(Z_t)$, the jump size of $p_t$ is given by
$$\Phi_{kl}(p) = \frac{1}{q_{kl}^Y(p)}\begin{pmatrix} \sum_{i\in I(k)} q_{i1}^X\,p_i\,\mathbb{1}\{1 \in I(l)\} \\ \vdots \\ \sum_{i\in I(k)} q_{in}^X\,p_i\,\mathbb{1}\{n \in I(l)\} \end{pmatrix} - p.$$
Hence the a-posteriori probability after a jump is distributed in relation to the intensities for a jump from $f_k$ to $f_l$. In particular, for the 0-1-observation model we get for jumps to $f_1$
$$\Phi_{21}(p) = e_1 - p.$$
Thus after a jump the probability of being in state $e_1$ is equal to 1, which has to be the case, since we have complete observation of this state. Additionally, $b(f_1, p) \equiv 0$ (hence the probability for $e_1$ remains 1 until we leave the observation state $f_1$), and for jumps from $f_1$ to $f_2$
$$\Phi_{12}(p) = \frac{1}{q_{12}^X + q_{13}^X + \dots + q_{1n}^X}\begin{pmatrix} 0 \\ q_{12}^X \\ \vdots \\ q_{1n}^X \end{pmatrix} - e_1.$$
No-Observation-Model
In the case of no information about the state process $(X_t)$, no jumps of $(Y_t)$ occur, since $q_{kl}^Y(p) \equiv 0$, and therefore no jumps of $(p_t)$ occur. Accordingly we get
$$b(y,p) = Q^X p,$$
thus the filter equation is equal to Kolmogorov's backward differential equation.
Hidden-Markov-Model
In the (classical) Hidden-Markov-Model the jump size is determined by
$$\Phi_{ij}(p) = \frac{1}{\sum_{r=1}^{d}\tilde{q}_{ij,r}^X\,p(r)}\begin{pmatrix} \tilde{q}_{ij,1}^X\,p(1) \\ \vdots \\ \tilde{q}_{ij,d}^X\,p(d) \end{pmatrix} - p,$$
since the jumps of $(Z_t)$ have no influence on $(X_t)$. If the jumps of $(Z_t)$ interact with the jumps of $(X_t)$, then the form of $\Phi_{ij}(p)$ is similar, namely
$$\Phi_{ij}(p) = \frac{1}{\sum_{r=1}^{d} q_{ij,r}^X\,p(r)}\begin{pmatrix} q_{ij,1}^X\,p(1) \\ \vdots \\ q_{ij,d}^X\,p(d) \end{pmatrix} - p,$$
but now $\tilde{q}_{ij,\mu}^X$ has been replaced by $q_{ij,\mu}^X := \tilde{q}_{ij,\mu}^X + \sum_{\nu=1}^{d}\delta_{ij}^{\mu\nu}\,q_{\mu\nu}^Z$, which pays attention to the jumps of $(Z_t)$. In both cases the post-jump state is relative to the estimated jump intensities.
Bayesian-Model
With Q^Z ≡ 0, where from Z_0 only P(Z_0 = z) = p_0 is known (z ∈ S_Z), and with I(k) = {k} (complete observation of the process X_t), the jump behaviour is described by
Φ_{ij}(p) = (1 / Σ_{r=1}^d q^X_{ij,r} p(r)) (q^X_{ij,1} p(1), …, q^X_{ij,d} p(d))^T − p,
thus the post-jump state in each component is relative to the estimated intensities.
The finer the information structure, the "better" the estimator process, as the following result demonstrates.
Theorem 3.6 Let (I(k), k = 1, …, m) be a finer information structure than (I′(k′), k′ = 1, …, m′) and denote by (Y_t) and (Y′_t) the corresponding observation processes; then P(X_t = e_i, Z_t = g_µ | F_t^{Y′}) is measurable with respect to F_t^Y.

Proof: This statement is a direct consequence of F_t^{Y′} ⊂ F_t^Y as stated in theorem 2.7.
Remark 3.7 The extension to the case of countable state spaces S_Z and S_X is straightforward under the assumption that the intensity matrices are all conservative and the diagonal elements are finite (remember remark 2.3).
Considering the stochastic differential equation (3.10) for the estimator pt we see that it is
nonlinear in p. But the existence of a solution is always guaranteed by the strong connection
between the estimator p and the conditional expectation E[Xt Zt | FtY ] (which exists)
given in lemma 3.2. Additionally, pt is strongly connected to the non-normalized estimator
process qt as defined in Elliott et al. (1997) in the following way. There they define an
equivalent probability measure Q such that NtY (k, l) are standard Poisson processes under
Q. For this purpose define the Girsanov-density
L_t := Π_{0<s≤t} ( 1 − Σ_{l=1}^m (1 − q^Y_{Y_{s−}l}(Z_s, X_s)) △N_s^Y(Y_{s−}, l) ) · exp{ ∫_0^t Σ_{l=1}^m (1 − q^Y_{Y_{s−}l}(Z_s, X_s)) ds }

under the assumption that L_t is a P-martingale. Notice that L_t is the stochastic exponential of ∫_0^t Σ_{l=1}^m (1 − q^Y_{Y_{s−}l}(Z_s, X_s)) dM_s^Y. Then the relation between the measures P and Q is given by

E_Q[ dP/dQ | F_t ] = L_t.
Define then

q_t(i, µ) := E_Q[ L_t (X_t Z_t)_{iµ} | F_t^Y ]

and we state the analogue of theorem 3.5 for q_t.
Theorem 3.8
a) qt is (the unique) solution of the following linear stochastic differential equation under
Q (so-called Zakai-equation), where the jump processes NtY (k, l) are standard Poisson
processes
dq_t = B q_t dt + Σ_{k=1}^m Σ_{l=1}^m (ϕ(k,l) − I) q_t (dN_t^Y(k,l) − dt).
b) It holds:

p_t(i, µ) = q_t(i, µ) / Σ_{j=1}^n Σ_{ν=1}^d q_t(j, ν),

and due to this relation q_t(i, µ) is called unnormalized estimator process.
Denote by τn the jump times of the observation process (Yt ). These are the same jump
times as of (pt ), see (3.10). Assume Yτn = y, pτn = p, then up to the next jump τn+1 of Yt
the process pt evolves according to
ṗ = b(y, p),   p_0 = p,   (3.13)

remember (3.11). Denote by φ_t(p) the solution of this deterministic ordinary differential equation. Notice that every y results in another b(y, p) and hence in another φ_t(p), but we suppress this in the notation. Then after a jump of Y_t at time τ the estimator process p_t is equal to φ_{t−τ}(p_τ) up to the next jump, as stated next.
Theorem 3.9

a) For t ∈ [τ_n, τ_{n+1}) it holds under Y_t = y that p_t = φ_{t−τ_n}(p_{τ_n}).

b) p ↦ b(y, p) is Lipschitz continuous.

c) t ↦ φ_t(p) is Lipschitz continuous.
Proof:

a) The statement follows directly from the representation theorem 3.5.

b) △_{nd} is a convex and compact subset of R^{nd} and p ↦ b(y, p) is a bilinear-quadratic function and in particular continuously differentiable. As a consequence p ↦ b(y, p) is Lipschitz continuous on △_{nd}.

c) Since (∂/∂t) φ_t(p) = b(y, φ_t(p)) and b(y, p) is continuous by b) on the compact set △_{nd}, the Lipschitz continuity is an immediate consequence.
By the definition of the information structure and the jump behaviour of p_t described in (3.12), it holds for the conditional probabilities between two jumps, that is for t ∈ [τ_n, τ_{n+1}) with the observation process Y_t in f_k, that

p_t(i, µ) = φ_{t−τ_n}(p_{τ_n})(i, µ),   which is > 0 for i ∈ I(k) and = 0 for i ∉ I(k).
Thus we see that between two jumps the probability measure of the marginal probability p_t(i, ·) is concentrated on states i ∈ I(k). This is reasonable, since the observations tell that X_t has to be in a state i ∈ I(k). Between two jumps the conditional probability p_t = φ_{t−τ_n}(p_{τ_n}) moves towards an equilibrium solution of ṗ = b(y, p). Under some mild conditions this is the stationary distribution of (X_t, Z_t) restricted to I(k). The following figure 2 demonstrates this behaviour.
On the x-axis the time is marked, on the y-axis the interval [0, 1]. The different lines are the conditional probabilities p_t(i, µ). We do not want to go into more detail, since in the following sections the estimator process p_t (in particular the intensities) depends on u; hence it is quite hard to talk about a stationary distribution. More details can be found in Davis (1993) and Braun (1993).
[Figure 2: d = 2, n = 4 and I(1) = {1, 2}, I(2) = {3, 4}; the two panels show the components of p on I(1) and on I(2).]
We have seen above that the random behaviour of p_t is completely described by the Poisson processes N_t^Y(k, l). The time between two jumps of N_t^Y(k, l) is exp(q^Y_{kl}(p) Y^k)-distributed. If a jump occurs, the jump size of N_t^Y(k, l) is equal to 1 and by construction of p_t its (deterministic) jump size is Φ_{kl}(p_{t−}). Our next goal is to find an equivalent process construction, where the distribution function of the sojourn times is independent of the states of Y_t. This technique is called uniformization technique and will be helpful in sections 4.2 and 5.
Define for fixed (k, l) with τ_0^{kl} := 0 a sequence (τ_{n+1}^{kl} − τ_n^{kl}) of iid random variables which are exp(α)-distributed, where the uniformization parameter α is greater than or equal to all diagonal elements of Q^Y(Z, X); hence set

α := sup_{k,µ,i} q_k^Y(g_µ, e_i).
Furthermore define a sequence of Bernoulli((1/α) q^Y_{kl}(φ_{t−τ_n^{kl}}(p_{τ_n^{kl}})) Y^k_{τ_n^{kl}})-distributed random variables (W_n^{kl}). Then we are able to express N_t^Y(k, l) in terms of (W_n^{kl}) and (τ_n^{kl}).
Lemma 3.10 It holds:

N_t^Y(k, l) =^d Σ_{n=0: τ_n^{kl} ≤ t}^∞ W_n^{kl} = Σ_{n=0}^{τ^{kl}(t)} W_n^{kl},

where τ^{kl}(t) := sup{n ≥ 0 | τ_n^{kl} ≤ t}.
Proof: This result is well-known from the Markov-chain theory, see Massey and Whitt
(1998).
From the last lemma we can rewrite our stochastic differential equation (3.10) for p_t as

dp_t = b(Y_t, p_t) dt + Σ_{k=1}^m Σ_{l=1}^m Φ_{kl}(p_{t−}) d( Σ_{n=0}^{τ^{kl}(t)} W_n^{kl} ).   (3.14)
The interpretation is the following: the distribution of the sojourn times is now independent of the state process Y_t and the estimator p_t. But the underlying stochastic system has more jumps, with exp(α)-distributed sojourn times, compared to the former one. To compensate for these additional jumps, we have to skip some jumps of W, in the sense that these jumps do not lead to changes in p_t. This is done with the help of the Bernoulli distribution. Characterization (3.14) is advantageous for simulations.
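To make this concrete, the following minimal sketch (not from the original text) simulates (p_t, Y_t) along (3.14) for m = 2 observation states; the flow b, the jump map Φ and the intensities q^Y are user-supplied callables, and all concrete model data would be hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_estimator(p0, y0, b, Phi, qY, alpha, T, dt=1e-3):
    """Uniformized simulation of (p_t, Y_t): proposal epochs arrive at rate
    alpha; a proposed jump from y to the other observation state l is kept
    with probability q^Y_{yl}(p)/alpha (the Bernoulli variables W_n^{kl} of
    lemma 3.10). Between epochs p follows dp = b(y, p) dt (Euler steps)."""
    t, p, y = 0.0, np.asarray(p0, dtype=float), y0
    while t < T:
        tau = rng.exponential(1.0 / alpha)        # next proposal epoch
        s = 0.0
        while s < tau and t + s < T:              # deterministic flow
            h = min(dt, tau - s, T - t - s)
            p = p + h * b(y, p)
            s += h
        t += s
        if t >= T:
            break
        l = 1 - y                                 # the only other state (m = 2)
        if rng.random() < qY(y, l, p) / alpha:    # thinning: accept the jump?
            p = p + Phi(y, l, p)                  # jump size Phi_{yl}(p-)
            y = l
        # rejected proposals are fictitious jumps: p and y stay unchanged
    return p, y
```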
3.2 The Reduced Problem
After calculating a representation for our estimator (pt ) we express our cost criterion as a
function of pt , such that it is FtY -measurable, in particular it depends only on the available
information. Note that (pt ) depends now on the control process u, since all intensities
(and therefore NtY (k, l)) may be controllable, see the construction in section 2.2. Hence
the estimator process is influenced by the control process. This feature raises the level of
difficulty significantly. In many other applications as for example in finance models the
randomness in the system, given by a Brownian motion for example, is independent of
the control. Consequently the estimators (for example for unknown parameters) are also
independent of the control. Similar transformations for a time-discrete setting are pointed
out in Bhulai (2002), Koole (1998) and Hernandez-Lerma (1989). The next theorem states
that the expected discounted costs as a function of (Zt , Xt , Yt ) are the same as of (Yt , pt ).
Theorem 3.11 It holds for all u ∈ U:

E_u[ ∫_0^∞ e^{−βs} g(Z_s, X_s, Y_s, u_s) ds ] = E_u[ ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds ],   (3.15)

where

g(p_t, y, u) := E_u[ g(Z_t, X_t, y, u) | F_t^Y ] = Σ_{i=1}^n Σ_{µ=1}^d g(g_µ, e_i, y, u) p_t(i, µ).
Proof: Since g ≥ 0 we have by Fubini

E_u[ ∫_0^∞ e^{−βs} g(Z_s, X_s, Y_s, u_s) ds ] = lim_{T→∞} E_u[ ∫_0^T e^{−βs} g(Z_s, X_s, Y_s, u_s) ds ].

With Wong and Hajek (1985) we conclude

E_u[ ∫_0^T e^{−βs} g(Z_s, X_s, Y_s, u_s) ds ] = E_u[ E_u{ ∫_0^T e^{−βs} g(Z_s, X_s, Y_s, u_s) ds | F_T^Y } ]
 = E_u[ ∫_0^T E_u{ e^{−βs} g(Z_s, X_s, Y_s, u_s) | F_s^Y } ds ]
 = E_u[ ∫_0^T e^{−βs} g(p_s, Y_s, u_s) ds ]

and the assertion follows.
After transforming the objective function and the unobservable state process into estimated
counterparts, we define the separated (transformed/reduced) problem based on (P ). As
mentioned earlier the separated control problem is with complete information, since all
functions therein are FtY -adapted. That means they depend only on information available
through (Ys )0≤s≤t up to the current time t. Define
(P_red)   E_u[ ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds ] → min
          dp_t = b(u_t, Y_t, p_t) dt + Σ_{k=1}^m Σ_{l=1}^m Φ_{kl}(u_t, p_{t−}) dN_t^Y(k, l)
          dY_t = Q^Y(u_t, p_t) Y_t dt + dM̂_t^Y
          (p_0, Y_0) = (p, y_0)
          u ∈ U,
where b(u, y, p) is defined as in (3.13) in the controlled sense. Furthermore define p :=
P(Z0 = z0 , X0 = x0 | F0Y ). We note that the set of admissible controls U is the same as
for problem (P ), since controls are only allowed to depend on the observation FtY . We
introduce the following abbreviation:
J(y, p; u) := E_u[ ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds | Y_0 = y, p_0 = p ]

J(y, p) := inf_{u∈U} J(y, p; u)

and we call a control process u* ∈ U optimal if and only if J(y, p; u*) = J(y, p), where J(y, p) is the minimal expected discounted cost over an infinite horizon starting in (y, p) (called the optimal value of (P_red)).
In order to complete the reduction we have to prove that the objective functions in the original (incomplete information) model and in the transformed (complete information) model are the same. This step is very often omitted in the literature, although it is not hard to prove.
Theorem 3.12 Understand U as U[t, ∞); then it holds for all t ≥ 0:

a) E_u[ ∫_t^∞ e^{−βs} g(Z_s, X_s, Y_s, u_s) ds | F_t^Y ] = E_u[ ∫_t^∞ e^{−βs} g(p_s, Y_s, u_s) ds | F_t^Y ]   ∀u ∈ U

b) inf_{u∈U} E_u[ ∫_t^∞ e^{−βs} g(Z_s, X_s, Y_s, u_s) ds | F_t^Y ] = inf_{u∈U} E_u[ ∫_t^∞ e^{−βs} g(p_s, Y_s, u_s) ds | F_t^Y ].
Proof: Recall that the set of admissible controls U for (P ) and (Pred ) are the same, then
part a) follows as in theorem 3.11. Part b) is a direct consequence of part a).
After deriving the connection between the objective functions we state the connection
between the optimal controls, which is given by: the optimal control of the reduced model
is an optimal control for the incomplete information model (and vice versa).
Theorem 3.13 The following assertions are immediate consequences of theorem 3.12:
a) u = (ut ) is optimal for (Pred ) ⇐⇒ u = (ut ) is optimal for (P )
b) The optimal values of (Pred ) and (P ) are the same.
Theorem 3.12 simplifies in the case of Markovian controls to the following:
Corollary 3.14 If the (optimal) control (u∗t ) is Markovian and (p∗t , Yt∗ ) is the corresponding
state process, then:
Z ∞
Z ∞
Y∗
−βs
∗
∗
∗
∗
−βs
∗
∗
∗
∗
Eu
= Eu
e g(ps , Ys , us ) | Ft
e g(ps , Ys , us ) | pt , Yt .
t
t
We have seen that the two problems (P ) and (Pred ) are strongly connected to each other.
If we can solve the reduced problem we have a solution for the original problem (P ). In
particular we have proven that J(z, x, y; u) = J(y, p; u) and J(z, x, y) = J(y, p). In general
the existence of an optimal solution for one of these two problems is not guaranteed, but
we will come back to this question in section 4.2. Beforehand we state some properties of the value functions J(y, p; u) and J(y, p).
Theorem 3.15

a) For all u ∈ U it holds

J(y, p; u) = Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y; u) p(i, r)

b) For the optimal value function it holds

J(y, p) ≥ Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y) p(i, r)

c) p ↦ J(y, p) is concave.
Proof:

a) The equation is obvious by making use of the definition of J(z, x, y; u) and J(y, p; u) with the help of the conditional expectation.

b) With a) we conclude:

J(y, p) = inf_{u∈U} J(y, p; u) = inf_{u∈U} Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y; u) p(i, r)
 ≥ Σ_{i=1}^n Σ_{r=1}^d inf_{u∈U} { J(g_r, e_i, y; u) } p(i, r) = Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y) p(i, r).

c) Again from a) we get for p ∈ △_{nd}, q ∈ △_{nd} and ρ ∈ [0, 1]:

J(y, ρp + (1 − ρ)q) = inf_{u∈U} J(y, ρp + (1 − ρ)q; u)
 = inf_{u∈U} { Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y; u) (ρ p(i, r) + (1 − ρ) q(i, r)) }
 ≥ inf_{u∈U} { Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y; u) ρ p(i, r) } + inf_{u∈U} { Σ_{i=1}^n Σ_{r=1}^d J(g_r, e_i, y; u) (1 − ρ) q(i, r) }
 = ρ J(y, p) + (1 − ρ) J(y, q).
Corollary 3.16 From part c) of the last theorem it follows that p ↦ J(y, p) is locally Lipschitz continuous.

Proof: Since J(y, p) is concave in p, the assertion is an immediate consequence from convex analysis as stated for example in Rockafellar (1996).
4 Solving the Reduced Model
In this chapter we introduce two different solution techniques for the reduced model, which
was introduced in section 3.2 as
(P_red)   E_u[ ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds ] → min
          dp_t = b(u_t, Y_t, p_t) dt + Σ_{k=1}^m Σ_{l=1}^m Φ_{kl}(u_t, p_{t−}) dN_t^Y(k, l)
          dY_t = Q^Y(u_t, p_t) Y_t dt + dM̂_t^Y
          (p_0, Y_0) = (p, y_0)
          u ∈ U
We show that these procedures are connected in the end but have completely different foundations. The first solution technique is a generalization of the classical verification technique using the Hamilton-Jacobi-Bellman equation (HJB). It is well known that if the value function is sufficiently differentiable it satisfies the HJB-equation, and an optimal control can be computed with the help of this HJB-equation. But the differentiability condition is a very strong one, and various authors tried to overcome this difficulty, for example by viscosity solutions (see Fleming and Soner (1993)) or numerical approaches (see Kushner and Dupuis (2001)). We use a weaker form of differentiability, introduced by Clarke (1983), and extend the HJB-equation and the verification technique with the help of the Clarke derivative in section 4.1. We give necessary and sufficient conditions for an optimal control.

In section 4.2 we use the piecewise-deterministic behaviour of our transformed state process (Y_t, p_t) and utilize an idea of Davis (1993). We define a time-discrete Markovian Decision Process (MDP), where every action is a function of the state after the last jump and of the
time elapsed since the last jump. Here we state an existence theorem for an optimal control.
We generalize present results to discounted and uniformized models. These extensions are
in our opinion much more practicable for the computation of optimal strategies and for
proving properties of optimal strategies. This procedure together with the generalized
HJB-approach will help us in section 5 to characterize optimal controls.
Notice that there are more than these two solution techniques. One is the maximum principle, which is connected to the verification technique. It is the extension of Pontryagin's maximum principle to the field of stochastic optimization problems. It does not use the special structure of the piecewise-deterministic process; therefore it can be applied to various optimization problems. Since this technique makes use of the Lagrangian function and one has to solve the so-called adjoint equation, which is a deterministic partial differential equation, computational problems arise very often. For details we refer to Øksendal and Sulem (2005), Framstad et al. (2004), Rishel (1978) and Haussmann (1986).

Another approach, which is more technical, is the martingale optimality principle. It states that the value function for a fixed control is always a submartingale, but it is a martingale if and only if the control is optimal. This result is very general and is applied often in the context of financial mathematics (see e.g. Karatzas and Shreve (2001)).
4.1 The Generalized HJB-Equation and Verification Technique
The classical verification technique with the help of the Hamilton-Jacobi-Bellman-equation
goes back to Bellman (see e.g. Bellman (1977)) and is well understood, see for example
Fleming and Rishel (1975), Yong and Zhou (1999) or Øksendal and Sulem (2005). The
HJB-equation can be defined for every control problem and gives under some assumptions
a characterization of the optimal value function (remember section 3.2)

J(y, p) = inf_{u∈U} E_u[ ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds | Y_0 = y, p_0 = p ].
In other words: J(y, p) is the minimal expected discounted cost over an infinite horizon,
when the process (Yt , pt ) starts at time t = 0 in state (y, p) ∈ SY × △nd .
The following theorem is well-known as Bellman’s principle. The proof is standard and
can be found for example in Gihman and Skorohod (1979) or Hanson (2007).
Theorem 4.1 For all F_t^Y-stopping times τ ≥ t it holds

e^{−βt} J(y, p) = inf_{u∈U[t,τ)} E_u[ ∫_t^τ e^{−βs} g(p_s, Y_s, u_s) ds + e^{−βτ} J(Y_τ, p_τ) | Y_t = y, p_t = p ],

where U[t, τ) denotes the set of admissible controls in the interval [t, τ).
This result will be used in the proof of theorem 4.3, where we claim that the value function is a solution of the HJB-equation. Additionally, it is useful for the proof of corollary 2.14, which we give next.
Proof of corollary 2.14: Assume Y_t = f_k and define τ := inf{s > t | Y_s ≠ f_k} as the first jump time point after time t. Consider (u*_s)_{s∈[t,τ)} under the condition Y_t = f_k. Then we know that u*_s is F_s^Y-predictable for s ∈ [t, τ), since u*_s is the same for all i ∈ I(k). Furthermore we know that if v* is an optimal control of

∫_0^T g(x, v_s) ds → min,   x ∈ A,

which is independent of x, then v* is also optimal for

∫_0^T ∫_A g(x, v_s) µ(dx) ds → min,   µ ∈ P(A),

where P(A) is the set of all probability measures over A. We conclude with Bellman's principle
of theorem 4.1 with y = f_k that

e^{−βt} J(y, p)
 = inf_{u∈U[t,τ)} E_u[ ∫_t^τ e^{−βs} g(p_s, u_s) ds + e^{−βτ} J(Y_τ, p_τ) | Y_t = y, p_t = p ]
 = inf_{u∈U[t,τ)} E_u[ ∫_t^τ e^{−βs} Σ_{i=1}^n g(·, e_i, u_s) p_s(i, ·) ds + e^{−βτ} J(Y_τ, p_τ) | Y_t = y, p_t = p ]
 = inf_{u∈U[t,τ)} E_u[ ∫_t^τ e^{−βs} Σ_{i∈I(k)} g(·, e_i, u_s) p_s(i, ·) ds + e^{−βτ} J(Y_τ, p_τ) | Y_t = y, p_t = p ]
 = E_{u*}[ ∫_t^τ e^{−βs} Σ_{i∈I(k)} g(·, e_i, u*_s) p_s(i, ·) ds + e^{−βτ} J(Y_τ, p_τ) | Y_t = y, p_t = p ]
 = E_{u*}[ ∫_t^τ e^{−βs} g(p_s, u*_s) ds + e^{−βτ} J(Y_τ, p_τ) | Y_t = y, p_t = p ]

and therefore (u*_s)_{s∈[t,τ)} is optimal under Y_t = f_k.
Instead of requiring strict differentiability of the value function as in the classical HJB-theory, it can be shown that it is sufficient to assume that the value function is locally Lipschitz continuous. This generalized technique was first introduced by Clarke (1983) and then extended by Davis (1993). We first define the Clarke derivative, which is an extension of the classical theory of differentiability.
Let f : R^b → R be a locally Lipschitz continuous function. Then for x, y ∈ R^b the upper generalized directional derivative of f at x in direction y is defined by

f^0(x; y) := limsup_{z→x, ε↓0} (f(z + εy) − f(z)) / ε.

The lower generalized directional derivative of f at x in direction y is defined analogously by

f_0(x; y) := liminf_{z→x, ε↓0} (f(z + εy) − f(z)) / ε.

If f_0(x; y) = f^0(x; y), then lim_{z→x, ε↓0} (f(z + εy) − f(z))/ε exists, f is differentiable in x in direction y, and everything breaks down to the well-known directional derivative.
The Clarke generalized gradient of f at x is defined by

∂f(x) := { ξ ∈ R^b | f^0(x; y) ≥ ξy for all y ∈ R^b },

which is a nonempty, convex and compact subset of R^b. We want to understand ξ as a row vector. If f is differentiable in x with derivative f′(x) then ∂f(x) = {f′(x)}. Due to this the classical HJB-approach is included here if the value function is (piecewise) differentiable. It holds further

f^0(x; y) = max_{ξ∈∂f(x)} ξy   and   f_0(x; y) = min_{ξ∈∂f(x)} ξy.
By the local Lipschitz continuity we conclude that f is differentiable almost everywhere, and we can find for every x ∈ R^b a sequence (x_n) with x_n ∈ R^b such that x_n converges to x and f is differentiable at x_n for all n ∈ N. Hence ∂f(x) can be written as the closed convex hull of existing limits of sequences of the derivatives ∇f(x_n), that means

∂f(x) = co{ lim_{n→∞} ∇f(x_n) | lim_{n→∞} x_n = x }.
A locally Lipschitz continuous function f is called regular at x if the ordinary directional derivative

f′(x; y) := lim_{ε↓0} (f(x + εy) − f(x)) / ε

exists for all y and f^0(x; y) = f′(x; y). By Clarke (1983) every concave function f (which is even locally Lipschitz by Rockafellar (1996)) is regular.
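As a simple illustration (not from the original text), take b = 1 and f(x) = −|x|. Away from 0 the function is differentiable; at x = 0 the generalized gradient collects the limits of the one-sided derivatives:

∂f(0) = co{−1, +1} = [−1, 1],   f^0(0; y) = max_{ξ∈[−1,1]} ξy = |y|,   f_0(0; y) = min_{ξ∈[−1,1]} ξy = −|y|,

and the ordinary directional derivative f′(0; y) = −|y| coincides with the lower generalized derivative, as one expects for a concave function.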
Finally we mention one important case of the chain rule, the computation procedure for composed functions. General formulas can be found in Clarke (1983).

Lemma 4.2 Let g : R^n → R and h : R^m → R^n be locally Lipschitz continuous. Assume g is regular and h is strictly differentiable; then for f(x) := g(h(x)) it holds

∂f(x) = ∂g(h(x)) ◦ h′(x),   (4.1)

where we want to understand ∂g(h(x)) = ∂g(z)|_{z=h(x)}. The meaning of (4.1) is that every element ξ ∈ ∂f(x) can be represented as a composition of a map ψ ∈ ∂g(h(x)) and h′(x).
In the one-dimensional case b = 1 only two directions are possible: to the right and to the
left. Hence we will sometimes speak of the upper generalized right hand side derivative in
direction y (if y > 0) when we consider f 0 (x; y) (similar for the other cases). If f 0 (x; 1) =
f0 (x; 1) then f is differentiable on the right hand side in the usual way. If additionally
f 0 (x; −1) = f0 (x; −1) then f is differentiable in x.
Thus we state the generalized HJB-equation for a locally Lipschitz continuous function W as

βW(y, p) = inf_{ξ∈∂_p W(y,p), u∈U} { g(p, y, u) + ξ b(u, y, p) + Σ_{l=1}^m (W(f_l, p + Φ_{yl}(u, p)) − W(y, p)) q^Y_{yl}(u, p) },   (4.2)
where ∂_p W(y, p) is the Clarke generalized gradient with respect to p. The next theorem is the justification of the HJB-equation, since it states that the value function J(y, p) is a solution of it. Additionally, it states a necessary condition for an optimal control.
Theorem 4.3 It holds:

a) The value function J(y, p) satisfies the generalized HJB-equation for all (y, p) ∈ S_Y × △_{nd}.

b) If there exists an optimal control (u*_t) with corresponding state process (Y*_t, p*_t), then

βJ(Y*_t, p*_t) = inf_{ξ∈∂_p J(Y*_t, p*_t)} { g(p*_t, Y*_t, u*_t) + ξ b(u*_t, Y*_t, p*_t) + Σ_{l=1}^m (J(f_l, p*_t + Φ_{Y*_t l}(u*_t, p*_t)) − J(Y*_t, p*_t)) q^Y_{Y*_t l}(u*_t, p*_t) }

for almost all t ≥ 0.
Proof:

a) Denote by τ_n the jump times of Y_t; these are the jump times of p_t too. Since t ↦ e^{−βt} J(Y_t, φ_t^u(p)) is locally Lipschitz continuous (p ↦ J(y, p) is locally Lipschitz by corollary 3.16, t ↦ φ_t^u(p) by theorem 3.9 and t ↦ e^{−βt} is Lipschitz continuous for t ∈ [0, ∞)), there exists for all 0 =: τ_0 < τ_1 < τ_2 < … a function D(e^{−βs} J(Y_s, p_s)) such that

e^{−βτ_i−} J(Y_{τ_i−}, p_{τ_i−}) − e^{−βτ_{i−1}} J(Y_{τ_{i−1}}, p_{τ_{i−1}}) = ∫_{τ_{i−1}}^{τ_i−} D(e^{−βs} J(Y_s, p_s)) ds.

Due to the local Lipschitz continuity of e^{−βs} J(Y_s, p_s), the function D(e^{−βs} J(Y_s, p_s)) may be chosen as its derivative with respect to s, which exists almost everywhere on [0, ∞). Hence with theorem 3.9

D(e^{−βs} J(Y_s, p_s)) = −β e^{−βs} J(Y_s, p_s) + e^{−βs} J_p(Y_s, p_s) φ̇_s^u(p_s) = e^{−βs} ( −β J(Y_s, p_s) + J_p(Y_s, p_s) b(u_s, Y_s, p_s) ).

Extending this consideration over the jump time points τ_n we can write

e^{−βt} J(Y_t, p_t) = J(Y_0, p_0) + ∫_0^t D(e^{−βs} J(Y_s, p_s)) ds + Σ_{0<s≤t} ( e^{−βs} J(Y_s, p_s) − e^{−βs−} J(Y_{s−}, p_{s−}) )
 = J(Y_0, p_0) + ∫_0^t D(e^{−βs} J(Y_s, p_s)) ds + ∫_0^t e^{−βs} Σ_{l=1}^m ( J(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J(Y_{s−}, p_{s−}) ) dN_s^Y(Y_{s−}, f_l).
Denote by τ the first jump time point after time t; then we obtain from Bellman's principle in theorem 4.1, and by the considerations above with a time-shift from [0, t] to [t, t′ ∧ τ], that for all u ∈ U[t, τ) and t′ > t

e^{−βt} J(y, p)
 ≤ E_u[ ∫_t^{τ∧t′} e^{−βs} g(p_s, Y_s, u_s) ds + e^{−β(τ∧t′)} J(Y_{τ∧t′}, p_{τ∧t′}) | Y_t = y, p_t = p ]
 = E_u[ ∫_t^{τ∧t′} e^{−βs} g(p_s, Y_s, u_s) ds + e^{−βt} J(Y_t, p_t) + ∫_t^{τ∧t′} D(e^{−βs} J(Y_s, p_s)) ds
   + ∫_t^{τ∧t′} e^{−βs} Σ_{l=1}^m ( J(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J(Y_{s−}, p_{s−}) ) dN_s^Y(Y_{s−}, f_l) | Y_t = y, p_t = p ].

Since D(e^{−βs} J(Y_s, p_s)) = e^{−βs} ( −β J(Y_s, p_s) + DJ(Y_s, p_s) ), Y_t = y, p_t = p and the intensity of N_t^Y(Y_{s−}, Y_s) is given by q^Y_{Y_{s−} Y_s}(u_s, p_{s−}), we proceed to

0 ≤ E_u[ ∫_t^{τ∧t′} e^{−βs} ( g(p_s, Y_s, u_s) − β J(Y_s, p_s) + DJ(Y_s, p_s)
   + Σ_{l=1}^m ( J(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J(Y_{s−}, p_{s−}) ) q^Y_{Y_{s−} l}(u_s, p_{s−}) ) ds | Y_t = y, p_t = p ]
 =: E_u[ ∫_t^{τ∧t′} e^{−βs} HJ(Y_s, p_s, u_s) ds | Y_t = y, p_t = p ].

Choose now a fixed strategy ũ_s ≡ u for s ∈ [t, t + ε), ε > 0; then

0 ≤ lim_{t′↓t} (1/(t′ − t)) E_u[ ∫_t^{τ∧t′} e^{−βs} HJ(Y_s, p_s, ũ_s) ds | Y_t = y, p_t = p ]
 = lim_{t′↓t} (1/(t′ − t)) E_u[ ∫_t^{t′} e^{−βs} HJ(Y_s, p_s, ũ_s) ds | Y_t = y, p_t = p ] · P_u(t′ < τ)
 + lim_{t′↓t} (1/(t′ − t)) E_u[ ∫_t^τ e^{−βs} HJ(Y_s, p_s, ũ_s) ds | Y_t = y, p_t = p ] · P_u(t′ ≥ τ).
With α ≥ sup_{k,µ,i} sup_{u∈U} q_k^Y(u, g_µ, e_i) we conclude

P_u(t′ ≥ τ) ≤ 1 − e^{−α(t′−t)} → 0   for t′ ↓ t.

Thus we obtain at points p where J(y, p) is differentiable with respect to p that

0 ≤ e^{−βt} HJ(Y_t, p_t, ũ_t) = e^{−βt} HJ(y, p, u)   ⇒   0 ≤ inf_{u∈U} HJ(y, p, u).

At points p where J(y, p) is not differentiable in p, the generalized gradient is given by

∂_p J(y, p) = co{ lim_{n→∞} ∇J(y, p_{t_n}) | p_{t_n} → p_t = p } = co{ lim_{n→∞} ∇J(y, p_{t_n}) | t_n → t },

this means: every ξ ∈ ∂_p J(y, p) is a convex combination of ξ^m = lim_{n→∞} ∇J(y, p_{t_n^m}) for sequences t_n^m → t along which J(y, p) is differentiable. Since p ↦ J(y, p) is locally Lipschitz continuous (see corollary 3.16), we obtain with the chain rule for the Clarke derivative for all m that

0 ≤ e^{−βt} ( g(p, y, u) − β J(y, p) + ξ^m b(u, y, p) + Σ_{l=1}^m ( J(f_l, p + Φ_{yl}(u, p)) − J(y, p) ) q^Y_{yl}(u, p) ).

Dividing by e^{−βt} and remembering that ξ is a convex combination of the ξ^m we conclude

0 ≤ g(p, y, u) − β J(y, p) + ξ b(u, y, p) + Σ_{l=1}^m ( J(f_l, p + Φ_{yl}(u, p)) − J(y, p) ) q^Y_{yl}(u, p).

Since u ∈ U and ξ ∈ ∂_p J(y, p) were chosen arbitrarily we conclude

0 ≤ inf_{ξ∈∂_p J(y,p), u∈U} { g(p, y, u) − β J(y, p) + ξ b(u, y, p) + Σ_{l=1}^m ( J(f_l, p + Φ_{yl}(u, p)) − J(y, p) ) q^Y_{yl}(u, p) }.

On the other hand, for ε > 0 and 0 < t < t′ < ∞ with t′ − t > 0 small enough there exists a strategy u^ε with corresponding state process (Y_t, p_t) such that we conclude from theorem 4.1

e^{−βt} J(y, p) + ε(t′ − t) ≥ E_u[ ∫_t^{τ∧t′} e^{−βs} g(p_s, Y_s, u^ε_s) ds + e^{−β(τ∧t′)} J(Y_{τ∧t′}, p_{τ∧t′}) | Y_t = y, p_t = p ].
The same computations as before lead to

ε ≥ E_{u^ε}[ (1/(t′ − t)) ∫_t^{τ∧t′} e^{−βs} HJ(Y_s, p_s, u^ε_s) ds | Y_t = y, p_t = p ]
 ≥ E_u[ (1/(t′ − t)) ∫_t^{τ∧t′} e^{−βs} inf_{u_s∈U} HJ(Y_s, p_s, u_s) ds | Y_t = y, p_t = p ]
 = E_{u*}[ (1/(t′ − t)) ∫_t^{τ∧t′} e^{−βs} HJ(Y_s, p_s, u*_s) ds | Y_t = y, p_t = p ],

where the existence of u*_s is guaranteed since u ↦ HJ(y, p, u) is continuous and U compact. If J(y, p) is differentiable in a point p we continue as before with

ε ≥ lim_{t′↓t} E_{u*}[ (1/(t′ − t)) ∫_t^{τ∧t′} e^{−βs} HJ(Y_s, p_s, u*_s) ds | Y_t = y, p_t = p ] = e^{−βt} HJ(y, p, u*_t) = e^{−βt} inf_{u∈U} HJ(y, p, u),

from which we conclude, since e^{−βt} > 0 and ε was arbitrary, that

0 ≥ inf_{u∈U} HJ(y, p, u).

If J(y, p) is not differentiable in a point p we get that every ξ ∈ ∂_p J(y, p) is a convex combination of ξ^m. With the same computations and with the help of the approximating sequences t_n^m as before we can show that

0 ≥ inf_{ξ∈∂_p J(y,p), u∈U} { g(p, y, u) − β J(y, p) + ξ b(u, y, p) + Σ_{l=1}^m ( J(f_l, p + Φ_{yl}(u, p)) − J(y, p) ) q^Y_{yl}(u, p) }.

Altogether we obtain

βJ(y, p) = inf_{ξ∈∂_p J(y,p), u∈U} { g(p, y, u) + ξ b(u, y, p) + Σ_{l=1}^m ( J(f_l, p + Φ_{yl}(u, p)) − J(y, p) ) q^Y_{yl}(u, p) },

which proves part a).
b) The necessary condition for an optimal control is proven by choosing the optimal
control u∗t instead of an arbitrary control in the beginning of the proof of a) and by
choosing u∗t instead of uε in the second part of the proof of a). Then overall equality
holds.
Now we are in a position to formulate our generalized verification procedure for computing an optimal control.
Theorem 4.4

a) If p ↦ J̃(y, p) is locally Lipschitz continuous and satisfies the generalized HJB-equation (4.2), then J̃(y, p) ≤ J(y, p).

b) If J̃ : S_Y × △_{nd} → R is a locally Lipschitz continuous, regular (in p) solution of the generalized HJB-equation (4.2), and if there exists a u*_t := u*(Y*_{t−}, p*_{t−}) ∈ U with corresponding state process (Y*_t, p*_t) such that for almost all t ≥ 0 a ξ*_t ∈ ∂_p J̃(Y*_t, p*_t) exists with

βJ̃(Y*_t, p*_t) = g(p*_t, Y*_t, u*_t) + ξ*_t b(u*_t, Y*_t, p*_t) + Σ_{l=1}^m ( J̃(f_l, p*_t + Φ_{Y*_t l}(u*_t, p*_t)) − J̃(Y*_t, p*_t) ) q^Y_{Y*_t l}(u*_t, p*_t),

then

(i) (u*_t) is an optimal control,

(ii) J(y, p) = J̃(y, p) for all (y, p) ∈ S_Y × △_{nd}.
Proof:

a) Replace J(y, p) by J̃(y, p), which is locally Lipschitz continuous by assumption, and conclude as in the proof of theorem 4.3 that

e^{−βt} J̃(Y_t, p_t) = J̃(y, p) + ∫_0^t D(e^{−βs} J̃(Y_s, p_s)) ds + Σ_{0<s≤t} ( e^{−βs} J̃(Y_s, p_s) − e^{−βs−} J̃(Y_{s−}, p_{s−}) )
 = J̃(y, p) + ∫_0^t D(e^{−βs} J̃(Y_s, p_s)) ds + ∫_0^t e^{−βs} Σ_{l=1}^m ( J̃(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J̃(Y_{s−}, p_{s−}) ) dN_s^Y(Y_{s−}, f_l).
At those points p where J̃(y, p) is differentiable (in particular p_s = p_{s−}) we know from the HJB-equation that

D(e^{−βs} J̃(Y_s, p_s)) ≥ e^{−βs} ( −g(p_s, Y_s, u_s) − Σ_{l=1}^m ( J̃(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J̃(Y_{s−}, p_{s−}) ) q^Y_{Y_{s−} l}(u_s, p_{s−}) ),

and since J̃(y, p) is differentiable almost everywhere we conclude by integration from 0 to t

e^{−βt} J̃(Y_t, p_t) ≥ J̃(y, p) − ∫_0^t e^{−βs} g(p_s, Y_s, u_s) ds
 + ∫_0^t e^{−βs} Σ_{l=1}^m ( J̃(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J̃(Y_{s−}, p_{s−}) ) ( dN_s^Y(Y_{s−}, f_l) − q^Y_{Y_{s−} l}(u_s, p_{s−}) ds ).

Let now t tend to infinity and note that e^{−βt} J̃(Y_t, p_t) then tends to 0; we come to

0 ≥ J̃(y, p) − ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds
 + ∫_0^∞ e^{−βs} Σ_{l=1}^m ( J̃(f_l, p_{s−} + Φ_{Y_{s−} l}(u_s, p_{s−})) − J̃(Y_{s−}, p_{s−}) ) ( dN_s^Y(Y_{s−}, f_l) − q^Y_{Y_{s−} l}(u_s, p_{s−}) ds ).

Taking expectation and remembering that the second integral is a martingale we finally get

J̃(y, p) ≤ E_u[ ∫_0^∞ e^{−βs} g(p_s, Y_s, u_s) ds ],

which proves part a), since u was arbitrary; consequently J̃(y, p) ≤ J(y, p).
b) Let t be such that (∂/∂t) J̃(Y*_t, p*_t) exists (which is the case almost everywhere due to the local Lipschitz continuity); then we get:

(∂/∂t) J̃(Y*_t, p*_t) = lim_{ε→0} ( J̃(Y*_t, p*_{t−ε}) − J̃(Y*_t, p*_t) ) / (−ε)
 = lim_{ε→0} ( J̃(Y*_t, p*_t − ε · b(u*_t, Y*_t, p*_t)) − J̃(Y*_t, p*_t) ) / (−ε)
 = −J̃^0(Y*_t, p*_t; −b(u*_t, Y*_t, p*_t))
 ≤ −ξ_t(−b(u*_t, Y*_t, p*_t))   ∀ξ_t ∈ ∂_p J̃(Y_t, p_t)
 = ξ_t b(u*_t, Y*_t, p*_t),
where the second equality is true due to the local Lipschitz continuity, the third due to the regularity, and the inequality due to the properties of the Clarke derivative. In particular for ξ*_t it holds (since p*_t = p*_{t−}, because we consider t where (∂/∂t) J̃(Y*_t, p*_t) exists)

(∂/∂t) J̃(Y*_t, p*_t) ≤ ξ*_t b(u*_t, Y*_t, p*_t)
 = β J̃(Y*_t, p*_t) − g(p*_t, Y*_t, u*_t) − Σ_{l=1}^m ( J̃(f_l, p*_{t−} + Φ_{Y*_{t−} l}(u*_t, p*_{t−})) − J̃(Y*_{t−}, p*_{t−}) ) q^Y_{Y*_{t−} l}(u*_t, p*_{t−}),

from which we conclude with

(∂/∂t) ( e^{−βt} J̃(Y*_t, p*_t) ) = e^{−βt} ( −β J̃(Y*_t, p*_t) + ξ*_t b(u*_t, Y*_t, p*_t) )

and by integration from 0 to t that

e^{−βt} J̃(Y*_t, p*_t) ≤ J̃(y, p) − ∫_0^t e^{−βs} g(p*_s, Y*_s, u*_s) ds
 + ∫_0^t e^{−βs} Σ_{l=1}^m ( J̃(f_l, p*_{s−} + Φ_{Y*_{s−} l}(u*_s, p*_{s−})) − J̃(Y*_{s−}, p*_{s−}) ) ( dN_s^{Y*}(Y*_{s−}, f_l) − q^Y_{Y*_{s−} l}(u*_s, p*_{s−}) ds ).

As in part a), taking expectation and letting t → ∞ we obtain

J̃(y, p) ≥ E_{u*}[ ∫_0^∞ e^{−βt} g(p*_t, Y*_t, u*_t) dt ].

Hence we conclude with a)

J(y, p) ≥ J̃(y, p) ≥ E_{u*}[ ∫_0^∞ e^{−βt} g(p*_t, Y*_t, u*_t) dt ] ≥ J(y, p),

which proves the results of b).
Since the value function J(y, p) is concave in p (see theorem 3.15) it is regular in the sense of the Clarke derivative, and it is locally Lipschitz continuous, see corollary 3.16. Thus J(y, p) is the unique solution of the generalized HJB-equation. This property usually holds true in partially observable models as we consider here; for general piecewise-deterministic optimization problems it usually does not. We will apply this verification theorem in section 5.2, where we prove sufficient conditions and the existence of optimal controls.
Remark 4.5 If the value function J(y, p) is differentiable in p, then theorem 4.3 and 4.4
reduce to the classical ones. In this case the Clarke generalized gradient is the usual
gradient, consequently ∂p J(y, p) = {Jp (y, p)}.
4.2 Solution via a Transformed MDP
Whereas the verification technique makes no use of the piecewise-deterministic behaviour of the state process (Y_t, p_t) in our reduced problem (P_red), the following solution procedure does. It goes back to the reduction technique for Semi-Markov-Decision-Processes (SMDP), where a time-continuous Markov chain is reduced to its embedded time-discrete Markov chain. Thus the decision time points can be reduced to the jump points of the embedded Markov chain. Various authors (Davis (1993), Dempster (1989), Forwick (1997)) extended this idea to the case of piecewise-deterministic decision processes, but they did not include discounting and did not make use of the uniformization technique (remember lemma 3.10). We will define a time-discrete Markovian Decision Process (MDP), which is strongly connected to the reduced control problem. We are then able to use all the tools of MDP-theory to solve our optimization problem (P_red) and hence (P).
The state process (Yt , pt ) of our reduced problem (Pred ) is essentially described by the state
after a jump-time point and the elapsed time since the last jump. Between two jumps the
behaviour is deterministic as seen in section 3.1. The MDP defined below will be extended
later on to a uniformized model as mentioned earlier. If the intensities or the cost rate g
depend on time, one has to extend the state process by the time component in a similar
way as pointed out later in remark 4.11. Remember first that the distribution function of
the holding times τn+1 − τn is given by:
P_u(τ_{n+1} − τ_n ≤ t | Y_0, p_0, τ_1, Y_{τ_1}, p_{τ_1}, …, Y_{τ_n}, p_{τ_n}) = P_u(τ_{n+1} − τ_n ≤ t | τ_n, Y_{τ_n}, p_{τ_n})
 = 1 − exp( −∫_0^t q^Y_{Y_{τ_n}}(u_s, φ_s^u(p_{τ_n})) ds ) =: F(Y_{τ_n}, p_{τ_n}, u; t),

independent of the post-jump state. Then define a time-discrete MDP by
• state space: S = SY × △nd ∋ (y, p)
• action space: A = {γ | γ : R+ → U measurable}
• set of admissible controls: D(y, p) = A ∀(y, p) ∈ S
• transition probability: for y ≠ ω, B ⊂ △_{nd},

q((y, p), γ, (ω, B)) = ∫_0^∞ ( q^Y_{yω}(γ_t, φ_t^γ(p)) / q_y^Y(γ_t, φ_t^γ(p)) ) 1{ φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p)) ∈ B } F(y, p, γ; dt)
 = ∫_0^∞ q^Y_{yω}(γ_t, φ_t^γ(p)) 1{ φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p)) ∈ B } exp( −∫_0^t q_y^Y(γ_s, φ_s^γ(p)) ds ) dt

and

q((y, p), γ, (y, B)) = 0
• cost function:

r((y, p), γ) = ∫_0^∞ ( ∫_0^t e^{−βs} g(φ_s^γ(p), y, γ_s) ds ) F(y, p, γ; dt)
 = ∫_0^∞ ( ∫_0^t e^{−βs} g(φ_s^γ(p), y, γ_s) ds ) q_y^Y(γ_t, φ_t^γ(p)) exp( −∫_0^t q_y^Y(γ_s, φ_s^γ(p)) ds ) dt

• discount factor:

δ((y, p), γ) = ∫_0^∞ e^{−βt} F(y, p, γ; dt)
 = ∫_0^∞ e^{−βt} q_y^Y(γ_t, φ_t^γ(p)) exp( −∫_0^t q_y^Y(γ_s, φ_s^γ(p)) ds ) dt.
A sequence π = (f_n) ∈ F^∞, where f_n ∈ F := {f : S → A measurable}, is called (Markov) strategy. The expected cost of such a strategy π = (f_0, f_1, …) is defined by

V_{∞,π}(y, p) := E_π[ Σ_{n=0}^∞ ( Π_{k=0}^{n−1} δ(Y_{τ_k}, p_{τ_k}, f_k(Y_{τ_k}, p_{τ_k})) ) r(Y_{τ_n}, p_{τ_n}, f_n(Y_{τ_n}, p_{τ_n})) | Y_0 = y, p_0 = p ],

where the expectation is taken with respect to P_π, which is defined by the transition probabilities q (see e.g. Hernandez-Lerma and Lasserre (1996)). Since r is positive the expectation is well-defined, and we define the set of positive functions on S by

B := {v : S → R | v ≥ 0}.

The value function is defined by

V_∞(y, p) := inf_{π∈F^∞} V_{∞,π}(y, p),

the minimal expected discounted costs over an infinite horizon. It holds V_{∞,π}, V_∞ ∈ B. Define for v ∈ B the operators L : B → B, T_f : B → B and T : B → B by

(Lv)(y, p, γ) := r((y, p), γ) + δ((y, p), γ) Σ_{ω∈S_Y} ∫_{△_{nd}} v(ω, ρ) q((y, p), γ, (ω, dρ))   for γ ∈ A,
(T_f v)(y, p) := (Lv)(y, p, f(y, p))   for f ∈ F,
(Tv)(y, p) := inf_{γ∈A} { (Lv)(y, p, γ) }.

All operators are obviously isotone. We call f ∈ F a minimizer of v if T_f v = Tv.
Remark 4.6 If the jump distribution of the holding times τ_{n+1} − τ_n depends additionally on the post-jump state, that means it is of the form F((y, p), u, (ω, B); t), the MDP can be generalized to

q((y, p), γ, (ω, B)) = ∫_0^∞ ( q^Y_{yω}(γ_t, φ_t^γ(p)) / q_y^Y(γ_t, φ_t^γ(p)) ) 1{ φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p)) ∈ B } F((y, p), γ, (ω, B); dt)

r((y, p), γ) = ∫_0^∞ ∫_{△_{nd}} ( ∫_0^t e^{−βs} g(φ_s^γ(p), y, γ_s) ds ) Σ_{ω≠y} ( q^Y_{yω}(γ_t, φ_t^γ(p)) / q_y^Y(γ_t, φ_t^γ(p)) ) · 1{ φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p)) ∈ {dρ} } F((y, p), γ, (ω, dρ); dt)

δ((y, p), γ, (ω, B)) = ∫_0^∞ e^{−βt} F((y, p), γ, (ω, B); dt)
The next theorem is the justification for the transformation of (Pred ) into the just defined
MDP. It shows that the above MDP is indeed strongly connected to the control problem
(Pred ), since the value function V∞ (y, p) and J(y, p) are equal and if we find an optimal
strategy of the MDP we have an optimal control of (Pred ).
Theorem 4.7 It holds:
a) J(y, p) = V∞ (y, p)
b) If π* = (f_n) ∈ F^∞ is optimal for the MDP, then u* = (u*_t) ∈ U with

u*_t := f_n(Y*_{τ_n}, p*_{τ_n})(t − τ_n)   for t ∈ [τ_n, τ_{n+1})
is optimal for (Pred ).
Proof:

a) Denote by τ_k the last jump before t and introduce the state space of the observed past at time t as H_k := (R_+ × S_Y × △_{nd})^k. Since τ_0 := 0 we define h_k ∈ H_k by

h_k := {Y_0, p_0, τ_1, Y_{τ_1}, p_{τ_1}, …, τ_k, Y_{τ_k}, p_{τ_k}}.

u = (u_t) ∈ U is allowed to depend on the whole (observable) past F_t^Y. Hence the control at time t can be written as (see e.g. Elliott et al. (1997))

u_t = u(Y_0, p_0, τ_1, Y_{τ_1}, p_{τ_1}, …, τ_k, Y_{τ_k}, p_{τ_k})(t − τ_k)   for t ∈ [τ_k, τ_{k+1}).

Thus we have to introduce a generalized policy π̃ = (f̃_n), which is allowed to depend on the whole past. That means f̃_n ∈ F̃_n with F̃_n := {f : H_n → A measurable}. Then it is easy to see that for every u = (u_t) ∈ U a corresponding π̃ = (f̃_n) ∈ ×_{k=0}^∞ F̃_k can be found with

u_t = f̃_n(h_n)(t − τ_n)   for t ∈ [τ_n, τ_{n+1}).   (4.3)
Due to this we will not distinguish between E_u and E_π̃ in notation, and we continue:

J(y, p; u) = E_u[ ∫_0^∞ e^{−βt} g(p_t, Y_t, u_t) dt ]
 = E_u[ Σ_{n=0}^∞ ∫_{τ_n}^{τ_{n+1}} e^{−βt} g(p_t, Y_t, u_t) dt ] = Σ_{n=0}^∞ E_u[ ∫_{τ_n}^{τ_{n+1}} e^{−βt} g(p_t, Y_t, u_t) dt ]
 = Σ_{n=0}^∞ E_u[ e^{−βτ_n} ∫_0^{τ_{n+1}−τ_n} e^{−βt} g(p_{t+τ_n}, Y_{t+τ_n}, u_{t+τ_n}) dt ]
 = Σ_{n=0}^∞ E_u[ e^{−βτ_n} E_u[ ∫_0^{τ_{n+1}−τ_n} e^{−βt} g(p_{t+τ_n}, Y_{t+τ_n}, u_{t+τ_n}) dt | F_{τ_n}^Y ] ]
 = Σ_{n=0}^∞ E_u[ e^{−βτ_n} ∫_0^∞ ( ∫_0^s e^{−βt} g(φ_t^{f̃_n}(p_{τ_n}), Y_{τ_n}, f̃_n(h_n)(t)) dt ) dP_u(τ_{n+1} − τ_n ≥ s | F_{τ_n}^Y) ]
 = Σ_{n=0}^∞ E_u[ e^{−βτ_n} ∫_0^∞ ( ∫_0^s e^{−βt} g(φ_t^{f̃_n}(p_{τ_n}), Y_{τ_n}, f̃_n(h_n)(t)) dt ) dP_u(τ_{n+1} − τ_n ≥ s | τ_n, Y_{τ_n}, p_{τ_n}) ]
 = Σ_{n=0}^∞ E_u[ e^{−βτ_n} ∫_0^∞ ( ∫_0^s e^{−βt} g(φ_t^{f̃_n}(p_{τ_n}), Y_{τ_n}, f̃_n(h_n)(t)) dt ) q^Y_{Y_{τ_n}}(f̃_n(h_n)(s), φ_s^{f̃_n}(p_{τ_n})) · exp( −∫_0^s q^Y_{Y_{τ_n}}(f̃_n(h_n)(t), φ_t^{f̃_n}(p_{τ_n})) dt ) ds ]
 = Σ_{n=0}^∞ E_u[ e^{−βτ_n} r(Y_{τ_n}, p_{τ_n}, f̃_n(h_n)) ]
 = Σ_{n=0}^∞ E_u[ Π_{k=1}^n e^{−β(τ_k − τ_{k−1})} r(Y_{τ_n}, p_{τ_n}, f̃_n(h_n)) ],

where

r(Y_{τ_n}, p_{τ_n}, f̃_n(h_n)) = ∫_0^∞ ( ∫_0^s e^{−βt} g(φ_t^{f̃_n}(p_{τ_n}), Y_{τ_n}, f̃_n(h_n)(t)) dt ) q^Y_{Y_{τ_n}}(f̃_n(h_n)(s), φ_s^{f̃_n}(p_{τ_n})) · exp( −∫_0^s q^Y_{Y_{τ_n}}(f̃_n(h_n)(t), φ_t^{f̃_n}(p_{τ_n})) dt ) ds.
It holds further with C_k := {Y_{τ_l} = y_l, p_{τ_l} = p_l, τ_l − τ_{l−1} > t_l, l = 1, …, k − 1}:

P_u(τ_k − τ_{k−1} ≤ t_k, Y_{τ_k} = y_k, p_{τ_k} ∈ B_k, k = 1, …, n − 1)
 = Π_{k=1}^n P_u(Y_{τ_k} = y_k, p_{τ_k} ∈ B_k | C_k) P_u(τ_k − τ_{k−1} ≤ t_k | Y_{τ_k} = y_k, p_{τ_k} = p_k, C_k)

and we conclude

E_u[ Π_{k=1}^n e^{−β(τ_k − τ_{k−1})} r(Y_{τ_n}, p_{τ_n}, f̃_n(h_n)) ]
 = Σ_{y_0,…,y_n∈S_Y} ∫_{△_{nd}} … ∫_{△_{nd}} r(y_n, p_n, f̃_n(h_n)) · Π_{k=0}^{n−1} ∫_0^∞ e^{−βs_k} dP_u(τ_k − τ_{k−1} ≤ s_k, Y_{τ_k} = y_k, dp_{τ_k})
 = E_u[ r(Y_{τ_n}, p_{τ_n}, f̃_n(h_n)) Π_{k=0}^{n−1} ∫_0^∞ e^{−βs_k} F(y_k, p_k, f̃_k(h_k); ds_k) ]
 = E_u[ r(Y_{τ_n}, p_{τ_n}, f̃_n(h_n)) Π_{k=0}^{n−1} δ(y_k, p_k, f̃_k(h_k)) ],

where

δ(y, p, γ) = ∫_0^∞ e^{−βt} F(y, p, γ; dt).

Taking now the infimum over all u ∈ U or, by (4.3), equivalently over all π̃ ∈ ×_{k=0}^∞ F̃_k, we conclude

J(y, p) = inf_{π̃} V_{∞,π̃}(y, p).

From Bertsekas and Shreve (1978) it is well-known that it is sufficient to consider on the right hand side the infimum over all Markov strategies π ∈ F^∞, thus

J(y, p) = inf_{π̃} V_{∞,π̃}(y, p) = inf_{π∈F^∞} V_{∞,π}(y, p) = V_∞(y, p).
b) The statement follows from the last line of the proof of a) and (4.3) for Markov
strategies π = (fn ) which reads as
ut = fn (Yτn , pτn )(t − τn ) for t ∈ [τn , τn+1 ).
Corollary 4.8 It holds:

a) V_∞(y, p) = T V_∞(y, p) and, due to theorem 4.7, J(y, p) = T J(y, p).

b) If T_f V_∞(y, p) = T V_∞(y, p) then f^∞ is optimal.

Proof: Both statements are well-known from the MDP-theory, see e.g. Bertsekas and Shreve (1978).
After defining this MDP we discuss the question how to compute optimal controls. One way is to use Howard's policy improvement, which works under certain conditions. Another way is to find minimizers f_n of V_{n−1} and then take the policy consisting of accumulation points of this sequence (f_n); this is an optimal policy for the infinite horizon MDP under compactness and continuity conditions (see e.g. Hernandez-Lerma and Lasserre (1996) and corollary 4.16). The third approach was pointed out in corollary 4.8: find a minimizer of V_∞. To close the circle to section 4.1 we discuss the latter one. Consider therefore the
Bellman equation:

V_∞(y, p) = T V_∞(y, p)
 = inf_{γ∈A} E_u[ ∫_0^τ e^{−βt} g(φ_t^γ(p), y, γ_t) dt + e^{−βτ} Σ_{ω≠y} ( q^Y_{yω}(γ_τ, φ_{τ−}^γ(p)) / q_y^Y(γ_τ, φ_{τ−}^γ(p)) ) V_∞(ω, φ_{τ−}^γ(p) + Φ_{yω}(γ_τ, φ_{τ−}^γ(p))) ]   (4.4)
 = inf_{γ∈A} { ∫_0^∞ e^{−∫_0^t (q_y^Y(γ_s, φ_s^γ(p)) + β) ds} ( g(φ_t^γ(p), y, γ_t) + Σ_{ω≠y} V_∞(ω, φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p))) q^Y_{yω}(γ_t, φ_t^γ(p)) ) dt },
where we used in the second line the proof of theorem 4.7 and then we computed the
expectation. The right hand side is a (deterministic) control problem which can be solved
with the help of Pontryagin’s maximum principle or with the help of the (generalized)
verification technique for deterministic models (remember the stochastic case was described
in section 4.1). Denote by V̄ (y, p) the value function of this deterministic control problem
for fixed V∞ (y, p). Then it is well-known from the verification theorem 4.4, that if W (y, p)
is locally Lipschitz continuous and regular in p and if it is for fixed V∞ (y, p) a solution of
the corresponding HJB-equation

inf_{ξ∈∂_p W(y,p), u∈U} { g(p, y, u) + ξ b(u, y, p) + Σ_{ω≠y} V(ω, p + Φ_{yω}(u, p)) q^Y_{yω}(u, p) − (q_y^Y(u, p) + β) W(y, p) }
 = inf_{ξ∈∂_p W(y,p), u∈U} { g(p, y, u) + ξ b(u, y, p) − β W(y, p) + Σ_{ω≠y} ( V(ω, p + Φ_{yω}(u, p)) − W(y, p) ) q^Y_{yω}(u, p) } = 0,

that W(y, p) = V̄(y, p) = V_∞(y, p). Hence the HJB-equation for the deterministic problem on the right hand side of the Bellman equation of the MDP simplifies to

inf_{ξ∈∂_p W(y,p), u∈U} { g(p, y, u) + ξ b(u, y, p) + Σ_{ω∈S_Y} ( W(ω, p + Φ_{yω}(u, p)) − W(y, p) ) q^Y_{yω}(u, p) } = β W(y, p),

which is exactly the HJB-equation from theorem 4.4 for the reduced problem (P_red).
Theorem 4.9

a) The HJB-equation of (P_red) given in (4.2) and the HJB-equation of the deterministic control problem in the operator T arising in the MDP-approach are the same.

b) Since V_∞(y, p) = J(y, p) by theorem 4.7, we see that (4.4) in J(y, p) = T J(y, p) is the analogue of theorem 4.1 for t = 0.
Therefore we see that both approaches, the verification technique of section 4.1 and the MDP-approach, lead to the same HJB-equation in the end. That means the computation of optimal controls with the verification technique or as minimizers encounters the same difficulties, such as differentiability or the right guess for the value function. The advantage of the MDP-approach is the opportunity to make use of the whole MDP-theory to state existence results for an optimal control and to find a representation of the value function V_∞(y, p) with the value iteration.
The next two remarks complete the definition of the equivalent time-discrete MDP. The first remark introduces the uniformized MDP, where the distribution function F of the sojourn times is independent of the current state and control. The second extension in remark 4.11 contains the transformation for a finite horizon control problem.
Remark 4.10 Using the uniformization technique introduced at the end of section 3.1 and
in lemma 3.10, we define the following MDP which is equivalent to the last one. For this
purpose set
α := sup_{k,µ,i} sup_{u∈U} q_k^Y(u, g_µ, e_i)

as in section 3.1, but notice that we also take the supremum with respect to the control parameter. Thus define the distribution function of the holding times τ_{n+1} − τ_n as

P_u(τ_{n+1} − τ_n ≤ t | Y_0, p_0, τ_1, Y_{τ_1}, p_{τ_1}, …, Y_{τ_n}, p_{τ_n}) = (1 − e^{−αt}) 1{t ≥ 0} =: F(t),

independent of the state and control processes. Then the MDP has to be defined by:

• state space: S = S_Y × △_{nd} ∋ (y, p)

• action space: A = {γ | γ : R_+ → U measurable}

• set of admissible controls: D(y, p) = A ∀(y, p) ∈ S

• transition probability: for y ≠ ω, ω ∈ S_Y, B ⊂ △_{nd},

q((y, p), γ, (ω, B)) = ∫_0^∞ e^{−αt} q^Y_{yω}(γ_t, φ_t^γ(p)) 1{ φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p)) ∈ B } dt,
q((y, p), γ, (y, B)) = 1 − Σ_{ω≠y} q((y, p), γ, (ω, B))

• cost function:

r((y, p), γ) = ∫_0^∞ α e^{−αt} ( ∫_0^t e^{−βs} g(φ_s^γ(p), y, γ_s) ds ) dt

• discount factor:

δ = ∫_0^∞ e^{−βt} α e^{−αt} dt = α / (α + β) < 1,

independent of state and control process.

If the cost function g(p, y, u) does not depend on p and u, that means it is of the form g(y), then the cost rate simplifies to:

r(y, p, γ) = g(y) ∫_0^∞ ( ∫_0^t e^{−βs} ds ) α e^{−αt} dt = −(g(y)/β) ∫_0^∞ (e^{−βt} − 1) α e^{−αt} dt
 = −(g(y)α/β) ( 1/(α+β) − 1/α ) = g(y)/(α+β) =: r(y),
which coincides with the formula in the case of a pure jump SMDP. Note that in general g(p, y, u) depends on the control via the controlled state process. Observe also that, for the cost rate to be independent of the estimator process, g(z, x, y, u) has to be independent of z and x.
The Bellman equation for the general uniformized model is given by:
v(y, p) = Tv(y, p)
 = inf_{γ∈A} { ∫_0^∞ e^{−(α+β)t} ( g(φ_t^γ(p), y, γ_t) + Σ_{ω≠y} v(ω, φ_t^γ(p) + Φ_{yω}(γ_t, φ_t^γ(p))) q^Y_{yω}(γ_t, φ_t^γ(p)) + v(y, φ_t^γ(p)) (α − Σ_{ω≠y} q^Y_{yω}(γ_t, φ_t^γ(p))) ) dt },

which can be computed as in the non-uniformized case.
If g(p, y, u) is bounded, then the value function V_∞(y, p) is bounded too, and we conclude that T is a contracting operator on the set of bounded functions with contraction parameter δ < 1, since

||Tv − Tw|| ≤ δ ||v − w||,

where ||v|| := sup_{y,p} |v(y, p)|. Consequently one can use Banach's fixed point theorem to prove, without continuity assumptions, that lim_{n→∞} V_n(y, p) = V_∞(y, p), and for the computation of optimal strategies procedures such as Howard's policy improvement algorithm may be applied.
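The following minimal sketch (not from the original text) illustrates this fixed-point iteration for a generic finite uniformized MDP; a finite grid stands in for S_Y × △_{nd}, and all model data (r, P) are hypothetical placeholders:

```python
import numpy as np

def value_iteration(r, P, alpha, beta, tol=1e-8, max_iter=10_000):
    """r[s, a]: one-stage cost; P[a, s, s']: uniformized transition matrix.
    Iterates Tv = min_a (r[., a] + delta * P[a] v); since
    delta = alpha / (alpha + beta) < 1, T is a sup-norm contraction and
    the iterates V_n converge geometrically to V_infty (Banach)."""
    delta = alpha / (alpha + beta)
    n_states, _ = r.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        q = r + delta * np.einsum('ast,t->sa', P, v)  # q[s, a] = (L v)(s, a)
        v_new = q.min(axis=1)                         # apply the operator T
        if np.max(np.abs(v_new - v)) < tol:           # stop near the fixed point
            v = v_new
            break
        v = v_new
    return v, q.argmin(axis=1)                        # value and a minimizer f
```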
Remark 4.11 Considering a control problem with finite horizon T and terminal cost h(y, p), we have to extend the state process by the time component t. Therefore the state process (t, Y_t, p_t) evolves between two jump time points, that means for t ∈ [τ_n, τ_{n+1}), as (t, Y_{τ_n}, φ_{t−τ_n}^u(p_{τ_n})), where φ_t^u(p) is the unique solution of

ṗ = b(u, y, p),   p_0 = p.

Notice that φ_t^u(p) is defined as in theorem 3.9. Again the jump distribution function is uniformized, hence

P_u(τ_{n+1} − τ_n ≤ t | Y_0, p_0, τ_1, Y_{τ_1}, p_{τ_1}, …, Y_{τ_n}, p_{τ_n}) = (1 − e^{−αt}) 1{t ≥ 0} 1{τ_n + t ≤ T},
• state space: S = [0, T ] × SY × △nd ∋ (t, y, p)
• action space: A = D(t, y, p) = {γ | γ : R+ → U}
• transition probability:

q((t, y, p), γ, ([t_1, t_2), ω, B)) = ∫_0^{T−t} e^{−αs} q^Y_{yω}(γ_s, φ_s^γ(p)) 1{ (s, φ_s^γ(p) + Φ_{yω}(γ_s, φ_s^γ(p))) ∈ [t_1, t_2) × B } ds

with [t_1, t_2) ⊂ [0, T], ω ∈ S_Y and B ⊂ △_{nd}.

• cost function:

r(t, y, p, γ) = ∫_0^{T−t} α e^{−αs} ( ∫_0^s e^{−βr} g(φ_r^γ(p), y, γ_r) dr ) ds + e^{−(α+β)(T−t)} h(y, φ_{T−t}^γ(p))

• discount factor: δ(t) = ∫_0^{T−t} α e^{−(α+β)s} ds = (α/(α+β)) (1 − e^{−(α+β)(T−t)})
The Bellman equation is given by

v(t, y, p) = Tv(t, y, p)
 = inf_{γ∈A} { r(t, y, p, γ) + Σ_{ω∈S_Y} ∫_{△_{nd}} v(s, ω, ρ) q((t, y, p), γ, (ds, ω, dρ)) }
 = inf_{γ∈A} { ∫_0^{T−t} e^{−(α+β)s} ( g(φ_s^γ(p), y, γ_s) + Σ_{ω≠y} v(s, ω, φ_s^γ(p) + Φ_{yω}(γ_s, φ_s^γ(p))) q^Y_{yω}(γ_s, φ_s^γ(p)) + v(s, y, φ_s^γ(p)) (α − Σ_{ω≠y} q^Y_{yω}(γ_s, φ_s^γ(p))) ) ds + e^{−(α+β)(T−t)} h(y, φ_{T−t}^γ(p)) }
If g and h are bounded then the model is obviously substochastic, and for the operator T it holds

||Tv − Tw|| ≤ (1 − e^{−(α+β)T}) ||v − w||;

in particular T is a contracting operator. If the cost function g in the original (infinite horizon) model depends on time t, then we have to extend the state space in the transformed MDP by the time component.
In a pure jump SMDP it is well-known that an optimal control exists if A is finite, since in these models it is sufficient to consider actions γ ∈ A which are constant between two jumps. In general the existence of an optimal policy is guaranteed under continuity and compactness assumptions, which are satisfied for the wider class of relaxed controls in a suitable topology. Then additionally the convergence of the n-stage value functions V_n(y, p) to V_∞(y, p) holds. Assume for this that U is convex.

Definition 4.12 A measurable function r : R_+ → △(U) is called a relaxed control, where △(U) is the probability simplex over U. The set of all relaxed controls will be denoted by R. We say in this context that γ ∈ A is deterministic. If a relaxed control takes only values in {0, 1}, that means r is always a corner of △(U), then r is called pure.
Define for a relaxed control

r̄ := ∫_U u r(du) ∈ U.
Instead of choosing at each time point t a fixed control u ∈ U, we now randomize over the set of possible values in U. The case of deterministic controls is always included by choosing the Dirac measure δ_u on U. Because of the relaxation we have to consider our state process (Y_t, p_t) now with respect to relaxed controls. As in section 2.1 we define for a relaxed control r = (r_t) the corresponding state process (Y_t^r, p_t^r) in the first component by its (relaxed) intensities

q^Y_{kl}(r, g_µ, e_i) = ∫_U q^Y_{kl}(u, g_µ, e_i) r(du).

In the second component define the state process p_t^r = φ_{t−τ}^r(p_τ) under Y_t = y between two jumps as in theorem 3.9 as the solution of

ṗ^r = b(r̄, y, p^r).
Note that the drift component is not relaxed. Hence we conclude

φ_t^r(p) = φ_t^{r̄}(p).   (4.5)

The jump times are again exp(α)-distributed, and the jump size of p_t under a relaxed control is given by

Φ(r, p) = ∫_U Φ(u, p) r(du).
The process (Y_t^r, p_t^r) as a solution of the above introduced characteristics is well-defined (see Davis (1993)). Finally define the cost rate for a relaxed control r as

g(p, y, r) = ∫_U g(p, y, u) r(du).
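To see how relaxation acts on the model data, here is a toy sketch (not from the original text) with a hypothetical discretized control set U = [0, 1] and made-up intensity and cost functions:

```python
import numpy as np

U = np.linspace(0.0, 1.0, 11)     # hypothetical control grid for U = [0, 1]
w = np.full(11, 1 / 11)           # a relaxed control: uniform weights over U

qY = lambda u: 0.5 + u            # hypothetical controlled intensity, linear in u
g  = lambda u: (u - 0.3) ** 2     # hypothetical cost rate, convex in u

r_bar   = U @ w                   # \bar r = \int u r(du)
q_relax = qY(U) @ w               # relaxed intensity \int q^Y(u) r(du)
g_relax = g(U) @ w                # relaxed cost \int g(u) r(du)

# For an intensity linear in u, relaxing changes nothing: q(r) = q(\bar r).
assert np.isclose(q_relax, qY(r_bar))
# For a convex cost, Jensen gives g(\bar r) <= \int g(u) r(du) -- the mechanism
# behind the deterministic strategy in theorem 4.14 b) below.
assert g(r_bar) <= g_relax + 1e-12
```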
The following construction is adopted from Davis (1993); a similar consideration can be found in Presman and Sonin (1990). Endow R with the Young topology, which is in the end a suitable topology guaranteeing compactness and continuity. For this purpose denote first by L_1(R_+; C(U)) the space of functions h(t, u) which are integrable over (R_+; C(U)). Hence every function h(t, u) ∈ L_1(R_+; C(U)) is measurable in t, continuous in u and satisfies ||h|| := ∫_0^∞ max_{u∈U} |h(t, u)| dt < ∞. Then one can conclude that L_1(R_+; C(U)) is a Banach space.

The dual space is accordingly given by L_∞(R_+; C*(U)), where C*(U) is the dual space of C(U) (consisting of the set of signed measures on U under the total variation norm); it consists of measurable functions v : R_+ → C*(U) such that ||v||* = esssup_{t∈R_+} ||v_t||_{C*} < ∞.

Introduce the unit ball B_1 := {v ∈ L_∞(R_+; C*(U)) | ||v||* ≤ 1}, which is compact in the weak* topology on L_∞(R_+; C*(U)). Then the Young topology on R is the relative weak* topology of R considered as a subset of B_1.
In the next lemma we prove that all continuity and compactness conditions for the existence
of an optimal (relaxed) policy are fulfilled. We refer for them to Hernandez-Lerma and
Lasserre (1996) and Bertsekas and Shreve (1978).
Lemma 4.13

a) A_R := {ν : R_+ → △(U) measurable} is compact.

b) (p, ν) ↦ φ_t^ν(p) is continuous for ν ∈ A_R and p ∈ △_{nd}.
Proof:

a) Davis (1993).

b) Let ν^n = (ν_t^n) → (ν_t) = ν and p^n → p, and denote by φ_t^{ν^n} and φ_t^ν the solutions of

ṗ = b(ν̄^n, y, p) with p_0 = p^n   and   ṗ = b(ν̄, y, p) with p_0 = p;

then:

|φ_t^{ν^n}(p^n) − φ_t^ν(p)| = | p^n + ∫_0^t b(ν̄_s^n, y, φ_s^{ν^n}) ds − p − ∫_0^t b(ν̄_s, y, φ_s^ν) ds |
 ≤ |p^n − p| + ∫_0^t |b(ν̄_s^n, y, φ_s^{ν^n}) − b(ν̄_s^n, y, φ_s^ν)| ds + ∫_0^t |b(ν̄_s^n, y, φ_s^ν(p)) − b(ν̄_s, y, φ_s^ν)| ds
 ≤ |p^n − p| + ∫_0^t L |φ_s^{ν^n} − φ_s^ν| ds + ∫_0^t |b(ν̄_s^n, y, φ_s^ν(p)) − b(ν̄_s, y, φ_s^ν)| ds,
where the last inequality is true due to the Lipschitz continuity of p ↦ b(u, y, p). Since ν^n → ν and p^n → p, the statement follows if the second integral tends to 0. Remember that b(ν, y, p) is continuous in ν. Additionally p ↦ b(ν, y, p) − b(ν̃, y, p) is Lipschitz continuous on the compact set △_{nd} and bilinear-quadratic. Therefore the maximum point p*(ν) is a continuous function of ν, and we conclude that

|b(ν, y, p) − b(ν̃, y, p)| ≤ |h(ν, y) − h(ν̃, y)|

for a function h(ν, y) continuous in ν. Hence we are able to apply the Grönwall inequality and attain

|φ_s^{ν^n} − φ_s^ν| ≤ e^{Ls} |p^n − p| + e^{Ls} ∫_0^s e^{−Lt} |h(ν̄_t^n, y) − h(ν̄_t, y)| dt.

Let p^n → p and ν^n → ν; then we see that |φ_s^{ν^n} − φ_s^ν| → 0 and finally

|φ_t^{ν^n}(p^n) − φ_t^ν(p)| → 0.
With this lemma we are able to state the existence of optimal relaxed strategies. Under an additional assumption we conclude the existence of an optimal deterministic strategy. Sufficient conditions for this assumption are stated in the following remark.

Theorem 4.14

a) There exists an optimal relaxed strategy π* = (f*, f*, …); in particular f*(y, p) ∈ A_R and V_{∞,π*}(y, p) = V_∞(y, p).

b) If u ↦ F(p, y, u) is convex, where

F(p, y, u) := g(p, y, u) + Σ_{ω≠y} V_∞(ω, p + Φ_{yω}(u, p)) q^Y_{yω}(u, p) + V_∞(y, p) (α − Σ_{ω≠y} q^Y_{yω}(u, p)),

then π̄* = (f̄*, f̄*, …) is an optimal deterministic strategy; especially f̄*(y, p) ∈ A and V_{∞,π̄*}(y, p) = V_∞(y, p).
Proof:

a) By corollary 4.8 it is sufficient to prove that there exists f* with

T_{f*} V_∞(y, p) = inf_{γ∈A_R} (L V_∞)(y, p, γ).

We know from lemma 4.13 that A_R is compact and additionally that ν ↦ φ_t^ν(p) is continuous. Since u ↦ q^Y_{kl}(u, p) and u ↦ g(p, y, u) are continuous and p ↦ V_∞(y, p) is concave, we conclude that

ν ↦ (L V_∞)(y, p, ν)

is continuous. Hence there exists a minimizer f*(y, p) of (L V_∞)(y, p, ν). Since u ↦ (L V_∞)(y, p, u) is continuous and (L V_∞)(y, p, u) ≥ 0, we conclude that f*(y, p) is measurable.

b) Since U is convex, f̄* ∈ U. On the one hand we have

T V_∞(y, p) = inf_{γ∈A} (L V_∞)(y, p, γ) ≥ inf_{ν∈A_R} (L V_∞)(y, p, ν).

On the other hand we get:

(L V_∞)(y, p, ν) = ∫_0^∞ ∫_U e^{−(α+β)t} F(φ_t^ν(p), y, u) ν(du) dt
 = ∫_0^∞ ∫_U e^{−(α+β)t} F(φ_t^{ν̄}(p), y, u) ν(du) dt   (by (4.5))
 ≥ ∫_0^∞ e^{−(α+β)t} F(φ_t^{ν̄}(p), y, ν̄) dt = (L V_∞)(y, p, ν̄).

Hence

inf_{ν∈A_R} (L V_∞)(y, p, ν) ≥ inf_{ν∈A_R} (L V_∞)(y, p, ν̄) = inf_{γ∈A} (L V_∞)(y, p, γ) = T V_∞(y, p).

Summarizing:

T V_∞(y, p) = inf_{ν∈A_R} (L V_∞)(y, p, ν).

In particular we get:

T V_∞(y, p) = (L V_∞)(y, p, f*) ≥ (L V_∞)(y, p, f̄*) = T_{f̄*} V_∞(y, p),

thus f̄* is a minimizer of V_∞(y, p). The measurability of f̄* follows as in a).
Remark 4.15

a) Due to theorem 4.7, the existence of an optimal (relaxed/deterministic) strategy for the MDP in theorem 4.14 yields the existence of an optimal (relaxed/deterministic) control for the reduced problem (P_red) and hence for (P).

b) u ↦ F(p, y, u) is convex if

• g(p, y, u) is convex in u,

• q^Y_{kl}(u, p) is linear in u, and

• Φ_{kl}(u, p) is independent of u.
Denote by V_{n,π}(y, p) the expected discounted cost over n periods under a fixed policy π = (f_0, f_1, …, f_{n−1}) ∈ F^n. This means

V_{n,π}(y, p) := E_π[ Σ_{k=0}^{n−1} δ^k r(Y_{τ_k}, p_{τ_k}, f_k(Y_{τ_k}, p_{τ_k})) | Y_0 = y, p_0 = p ].

Define by V_n(y, p) := inf_{π∈F^n} V_{n,π}(y, p) the n-period value function.
Corollary 4.16 Under the assumptions of theorem 4.14 b) it holds:

a) lim_{n→∞} V_n(y, p) = V_∞(y, p)

b) If f_n is a minimizer of V_{n−1}, then (liminf_{n→∞} f_n)^∞ is an optimal policy.

Proof: Both statements are well-known MDP-results which are true under the continuity and compactness assumptions (Bertsekas and Shreve (1978)).
In section 5.2.3 we demonstrate the power of the value iteration.
5 Application to Parallel Queueing with Incomplete Information
Queueing models are very popular, since they appear in various real-world applications, for example in telecommunication, in the internet or in supply chain management. Additionally, the basic concepts of queueing models are treated in various publications and are well understood. We skip them here and refer to the works of Asmussen (2003), Kitaev and Rykov (1995) and Brémaud (1981). Most queueing applications have in common that they can be modelled as a SMDP, but that statements about the optimal control (or even properties of it) are quite hard to compute and to prove. In this section we recall our introductory example of section 1.1 and consider a parallel queueing model with two queues and a single server, which is illustrated by the following picture:
with two servers, which is illustrated by the following picture:
λ1
λ2
-
-
queue 1
queue 2
Q
k
Q
µ1
Q
Q
Q Q
server
µ2
+
Figure 3: Parallel Queueing Model
There are two queues with infinite buffer, where customers arrive at queue i according to a Poisson process with arrival rate λ_i, i = 1, 2. Additionally, one server is available, who has to decide at each time point t which of the two queues (to be more precise: which of the first customers waiting in each queue) is served. The service time of a customer in queue i is exp(µ_i)-distributed, i = 1, 2. We assume that arrivals and service times are independent. For each waiting customer a cost at rate c_i, i = 1, 2, occurs, and we want to find a service strategy for the server minimizing the expected discounted waiting costs over an infinite horizon.
In the next section we model these circumstances as a MDP. We then prove the optimality of a cµ-rule in the complete information case as in Asmussen (2003) and extend this result to sufficiently fine information structures. In the following sections we discuss various models under incomplete information, especially models where the information structure is no longer sufficiently fine. First, in section 5.2 we discuss the case of unknown (Bayesian) service rates. In this part we make use of the solution techniques derived in sections 4.1 and 4.2. The main results in this incomplete information case are the explicit representation of the estimator process, a closed formula for the value function, the existence of an optimal (almost surely pure) control and sufficient conditions for optimal controls in several models. The same model with a reward criterion is solved completely. In section 5.3
the underlying information structure is given by a 0-1-observation about the number of
waiting customers in queue 1. Here we investigate numerically the performance of several
reasonable strategies.
5.1 The Model and the Complete Information Case
Denote by α := λ1 + λ2 + µ1 + µ2 the uniformization parameter. Then the uniformized MDP for the parallel queueing model described above is defined under the uniformized distribution function of the holding times τ_{n+1} − τ_n,

F(t) = (1 − e^{−αt}) 1{t ≥ 0},

by:
• state space: S = N_0^2, i = (i1, i2) ∈ S with
  i1 = number of customers in queue 1
  i2 = number of customers in queue 2
• action space: A = {a | a ∈ {1, 2}} with
  a = 1 = serve queue 1
  a = 2 = serve queue 2
• set of admissible actions: D(i) = A for all i ∈ S
• transition probabilities:
  q(i, a, j) = λ_k/α if j = i + e_k, k = 1, 2,
  q(i, a, j) = (µ_a/α) δ(i_a) if j = i − e_a,
  q(i, a, j) = 1 − Σ_{k=1}^2 λ_k/α − (µ_a/α) δ(i_a) if j = i,
  q(i, a, j) = 0 else,
  where δ(i) := 1{i > 0}
• cost function: r(i, a) = (c1 i1 + c2 i2)/(α + β)
• discount factor: δ = α/(α + β).
Since all processes are constant between two jumps it is sufficient to consider controls
a ∈ A which are constant between jumps (see e.g. Bertsekas and Shreve (1978)). Since A
is finite all continuity and compactness conditions required for value and policy iteration
are fulfilled.
The optimality equation is given by

v(i1, i2) = T v(i1, i2)
= 1/(α + β) [ c1 i1 + c2 i2 + λ1 v(i1 + 1, i2) + λ2 v(i1, i2 + 1)
  + min{ µ1 v((i1 − 1)^+, i2) + µ2 v(i1, i2) ; µ1 v(i1, i2) + µ2 v(i1, (i2 − 1)^+) } ].
The next theorem uses an interchange argument for the proof of the optimal policy. It states that if the ''expected costs'' of queue 1 are greater than those of queue 2, that means if c1 µ1 ≥ c2 µ2, the optimal policy always prefers to serve queue 1 whenever a customer is waiting there. Hence this strategy is called cµ-policy. It arises in many queueing models with a linear cost structure and can easily be extended to the case of M parallel queues. We invest some effort in the proof of this well-known result (see for example Asmussen (2003)), since the idea and the technique of this proof will be helpful later on.
Theorem 5.1 Assume c1 µ1 ≥ c2 µ2. Then π = f^∞ is optimal, where

f(i1, i2) = 1 if i1 > 0, and f(i1, i2) = 2 if i1 = 0.
Proof: We show that for each horizon N ∈ N the cµ-policy is optimal, this means π* = (f_N, . . . , f_1) with

f_n(i1, i2) = f(i1, i2) = 1 if i1 > 0, and f_n(i1, i2) = 2 if i1 = 0,

is optimal. This will be done by proving that f_n is a minimizer of V_{n−1}; consequently (f_N, . . . , f_1) has to be optimal for the N-period model. Taking the limit, the statement follows. Start with N = 1 and let V_0 ≡ 0; then

T V_0(i1, i2) = (c1 i1 + c2 i2)/(α + β),

thus every decision rule f is a minimizer of V_0 and hence optimal for N = 1. Assume f is a minimizer of V_0, . . . , V_{N−1} (in particular (f, . . . , f) ∈ F^N is optimal for the N-period model); then it remains to prove that f is a minimizer of V_N (see corollary 4.16). By induction and the monotonicity of T we conclude that V_n(i1, i2) is monotone increasing in i1 and i2 respectively, since

V_n(i1, i2) = T V_{n−1}(i1, i2) ≤ T V_{n−1}(i1 + 1, i2) = V_n(i1 + 1, i2).

Analogously: V_n(i1, i2) ≤ V_n(i1, i2 + 1).
From the optimality equation we conclude:

i1 = 0, i2 > 0: µ1 V_N(0, i2) + µ2 V_N(0, i2) ≥ µ1 V_N(0, i2) + µ2 V_N(0, i2 − 1)
⇒ the minimizer is f_{N+1}(0, i2) = 2 = f(0, i2);
i1 > 0, i2 = 0: µ1 V_N(i1 − 1, 0) + µ2 V_N(i1, 0) ≤ µ1 V_N(i1, 0) + µ2 V_N(i1, 0)
⇒ the minimizer is f_{N+1}(i1, 0) = 1 = f(i1, 0);
i1 = 0, i2 = 0: the minimizer f_{N+1}(0, 0) is arbitrary.
Now let i1, i2 > 0 and g ≡ 2. Compute then

V_{N+1,(g,f,...,f)}(i1, i2) − V_{N+1,(f,g,f,...,f)}(i1, i2)
= T_g T_f V_{N−1,(f,...,f)}(i1, i2) − T_f T_g V_{N−1,(f,...,f)}(i1, i2)
= (c1 µ1 − c2 µ2)/(α + β)² ≥ 0.   (5.1)
Let h be an arbitrary decision rule with h(i1, i2) = 2. Using the just proven inequality we conclude

T_h V_N(i1, i2) = T_h V_{N,(f,...,f)}(i1, i2) = V_{N+1,(h,f,...,f)}(i1, i2) = V_{N+1,(g,f,...,f)}(i1, i2)
≥ V_{N+1,(f,g,f,...,f)}(i1, i2) = T_f V_{N,(g,f,...,f)}(i1, i2) ≥ T_f V_N(i1, i2),

where the last inequality is true due to V_{N,(g,f,...,f)}(i1, i2) ≥ V_N(i1, i2) and the monotonicity of T_f. Hence f(i1, i2) is a minimizer of V_N(i1, i2), which finishes the proof.
The next theorem offers an explicit representation of the n-period value function V_n(i1, i2) and hence of V_∞(i1, i2) = lim_{n→∞} V_n(i1, i2). We omit the proof, since it is a special case of a similar theorem in the incomplete information model (see theorem 5.13). It is done with the help of the value iteration.
Theorem 5.2 For i1 > 0 and i2 > 0 the n-period value function V_n(i1, i2) is given for n ≥ 1 by

V_n(i1, i2) = (c1 i1 + c2 i2) K_n + (c1 λ1 + c2 λ2) L_{n−1} − c1 µ1 L_{n−1},   (5.2)

where

K_n = 1/(α + β) Σ_{k=0}^{n−1} δ^k = 1/(α + β) Σ_{k=0}^{n−1} (α/(α + β))^k,
L_n = 1/(α + β) (K_n + α L_{n−1}) with L_0 := 0.

Thus the value function for the infinite-horizon problem is given by

V_∞(i1, i2) = lim_{n→∞} V_n(i1, i2) = (c1 i1 + c2 i2) K + (c1 λ1 + c2 λ2) L − c1 µ1 L,

where K := lim_{n→∞} K_n = 1/β and L := lim_{n→∞} L_n = 1/β² = K/β are well-defined (see theorem 5.14).
We see that the value function consists of three terms, which have a nice interpretation: the first one, (c1 i1 + c2 i2)K, are the waiting costs for the customers in the queues; the second, (c1 λ1 + c2 λ2)L, are ''expected'' waiting costs due to arrivals; and the third, c1 µ1 L, reduces the costs due to ''expected'' served customers.
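The recursions for K_n and L_n and their limits are easy to check numerically, using the equivalent recursion K_n = (1 + αK_{n−1})/(α + β) for the geometric sum. A minimal sketch (the values of α and β are arbitrary illustrative choices):

alpha, beta = 1.8, 0.9          # illustrative values
K, L = 0.0, 0.0                 # K_0 = L_0 = 0
for _ in range(2000):
    K = (1 + alpha * K) / (alpha + beta)   # K_n = (1 + alpha K_{n-1}) / (alpha + beta)
    L = (K + alpha * L) / (alpha + beta)   # L_n = (K_n + alpha L_{n-1}) / (alpha + beta)
print(K, 1 / beta)         # both approx. 1.1111
print(L, 1 / beta ** 2)    # both approx. 1.2346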
Theorem 5.1 can be formulated more precisely in view of corollary 2.14. If one of the queues is empty, then it is always optimal to serve the other queue (independent of the service rates). If in both queues at least one customer is waiting (the exact number of waiting customers does not matter) and if one has complete information about the service rates (we assume here without loss of generality that µ_i can only take two values, that means µ_i ∈ {µ_i^A, µ_i^B}), then it is optimal to serve queue 1 if the product of cost rate c1 and the current service rate µ1 is greater than the product of cost rate c2 and the service rate µ2 at queue 2, that means if c1 µ1 ≥ c2 µ2. If the inequality holds in the other direction, then it is optimal to serve queue 2. Let us state this in a corollary.
Corollary 5.3 Let (I(k), k = 1, . . . , m) be an information structure with corresponding observation process Y_t such that

p_t^{i1}(0) := P_u(queue 1 empty | F_t^Y) ∈ {0, 1},
p_t(µ_1^A) := P_u(µ1 = µ_1^A | F_t^Y) ∈ {0, 1},
p_t^{i2}(0) := P_u(queue 2 empty | F_t^Y) ∈ {0, 1},
p_t(µ_2^A) := P_u(µ2 = µ_2^A | F_t^Y) ∈ {0, 1}.

Define µ̂_j := µ_j(p_{µ_j}) := µ_j^A p(µ_j^A) + µ_j^B (1 − p(µ_j^A)) for j = 1, 2. Then the ĉµ-rule π = f^∞ is optimal, where

f(p^{i1}, p^{i2}, p(µ_1^A), p(µ_2^A)) := 1 if p^{i1}(0) = 0 and c1 µ1(p_{µ1}) > c2 µ2(p_{µ2}),
f(p^{i1}, p^{i2}, p(µ_1^A), p(µ_2^A)) := 2 if p^{i2}(0) = 0 and c2 µ2(p_{µ2}) > c1 µ1(p_{µ1}),
f(p^{i1}, p^{i2}, p(µ_1^A), p(µ_2^A)) := 1 if p^{i2}(0) = 1,
f(p^{i1}, p^{i2}, p(µ_1^A), p(µ_2^A)) := 2 if p^{i1}(0) = 1.

If c1 µ1(p_{µ1}) = c2 µ2(p_{µ2}), then the optimal service allocation can be chosen arbitrarily.
Remark 5.4
a) Extend the state space to S_X × S_Z (remember remark 2.2). Then a sufficiently fine information structure has to have complete information about the service rates and a 0-1-group observation about the queues. In mathematical terms it has to be at least as fine as

I(1) = {(0, i2, µ1, µ2) : i2 ∈ N_0, µ1 ∈ {µ_1^A, µ_1^B}, µ2 ∈ {µ_2^A, µ_2^B}},
I(2) = {(i1, 0, µ1, µ2) : i1 ∈ N, µ1 ∈ {µ_1^A, µ_1^B}, µ2 ∈ {µ_2^A, µ_2^B}},
I(3) = {(i1, i2, µ_1^A, µ_2^A) : i1, i2 ∈ N},  I(4) = {(i1, i2, µ_1^A, µ_2^B) : i1, i2 ∈ N},
I(5) = {(i1, i2, µ_1^B, µ_2^A) : i1, i2 ∈ N},  I(6) = {(i1, i2, µ_1^B, µ_2^B) : i1, i2 ∈ N}.

b) Notice that if for example c1 min{µ_1^A, µ_1^B} ≥ c2 max{µ_2^A, µ_2^B}, then it is not even necessary to observe the service rates. In this case queue 1 is always better than queue 2. Therefore it is sufficient to observe only whether a customer is waiting at queue 1 or not (a 0-1-observation).
But what happens if the service parameters µ1 and µ2 are unknown? This may be due to the fact that two different kinds of customers can arrive at queue j, requiring service rates µ_j^A and µ_j^B respectively, and the server is not able to observe which kind of customer arrives. We will investigate this setting in the next section.
5.2 Unknown Service Rates: the Bayesian Case
In this section we consider the case when the service rates are Bayesian, this means µ_j ∈ {µ_j^A, µ_j^B} is unknown, but constant. The server can now split his service capacity simultaneously between both queues. Therefore the control set U is defined by U = [0, 1], in contrast to the pure service restriction in the complete information model of the last section. There we have seen that it is never optimal to split service, thus this was no real restriction. We interpret u ∈ U as the service capacity spent on queue 1; hence 1 − u is spent on queue 2. Sometimes we will write u(1) and u(2) for the service rates given to queue 1 and 2, respectively. With a slight abuse of notation we will also write u = 2 for serving queue 2 exclusively.
We will see in the following, that the conditions of remark 4.15 are satisfied. Hence it is
not necessary to distinguish relaxed and deterministic controls as pointed out in theorem
4.14 and the existence of an optimal deterministic control is guaranteed. Sometimes we
will even state existence and properties of an optimal pure control, which serves one queue
exclusively.
First we derive in section 5.2.1 an explicit representation of the estimator processes for the unknown service rates. This will be done with the help of the general representation theorem 3.5. In this context we are able to solve the partial differential equation, which makes it easier to prove monotonicity and continuity properties of the estimator process between two jumps. After introducing the associated MDP in the sense of section 4.2, in analogy to the one with complete information in section 5.1, we develop in section 5.2.3 via the value iteration a characterization of the value function in the model with unknown service rates, see theorem 5.13. We will see that it is quite similar to the one under complete information given in theorem 5.2. Furthermore we prove in theorem 5.18, with the help of the generalized HJB-equation introduced in section 4.1, that the optimal control serves one queue exclusively most of the time.
In the consecutive two sections we specialize our model to the symmetric case and to the case in which one service rate is known. In the first one the optimal control is a control limit rule with control limit p* = 1/2 (in particular the certainty equivalence principle holds here). This will be shown in theorem 5.20 with the help of the verification technique of section 4.1. In the latter setup we characterize the optimal strategy as a control limit rule with implicitly defined control limit and state sufficient conditions for the optimality of a control. Here we demonstrate an interchange argument as in the proof of theorem 5.1. In both settings the stay-on-a-winner property is obtained. In section 5.2.6 we change the objective function: we no longer consider a model with waiting costs, but a model with rewards for each served customer. In this case the model is completely solved, see theorem 5.29.
Assume now that µ1 ∈ {µ_1^A, µ_1^B} and µ2 ∈ {µ_2^A, µ_2^B} are unknown, but constant over time. Only the a-priori-distribution of µ1 and µ2 is known, that means p_0 = P(µ1 = µ_1^A) and q_0 = P(µ2 = µ_2^A). All other state and parameter processes are observable, hence we are in the Bayesian case. In the context of section 2.1 we use the following notations:
• S_Z = {g1, g2} × {g1, g2}, Q_Z ≡ 0
• S_X = {e11, e12, . . .} × {e21, e22, . . .} (in particular S_X = N_0^2), δ_{ij}^{µν} ≡ 0 for all i, j, µ, ν
• I(k) = {k}, hence Y_t ≡ X_t and F_t^Y ≡ F_t^X.
Due to the optimality of the cµ-rule in corollary 5.3 only the following three cases are relevant:

A) c1 µ_1^A < c2 µ_2^A,  c1 µ_1^B > c2 µ_2^B,  c1(µ_1^A − µ_1^B) < c2(µ_2^A − µ_2^B) < 0
B) c1 µ_1^A < c2 µ_2^A,  c1 µ_1^B > c2 µ_2^B,  c1(µ_1^A − µ_1^B) < c2(µ_2^A − µ_2^B) = 0
C) c1 µ_1^A < c2 µ_2^A,  c1 µ_1^B > c2 µ_2^B,  0 < c2(µ_2^A − µ_2^B) ≤ −c1(µ_1^A − µ_1^B)   (5.3)

Case A) can be simplified to c1 µ_1^B > c2 µ_2^B > c2 µ_2^A > c1 µ_1^A. Case B) implies that µ_2^A = µ_2^B, in particular the service parameter µ2 is known. We will treat this case later on in 5.2.5 in more detail. Case C) with equality in the last condition will be the topic of section 5.2.4 and is referred to as the symmetric case. All other cases are completely covered by the cµ-rule; for example

c1 µ_1^B > c1 µ_1^A > c2 µ_2^A > c2 µ_2^B

implies that the optimal controller always prefers queue 1.
5.2.1 The Estimator Process
In this section we analyze the estimator processes p_t := P_u(µ1 = µ_1^A | F_t^X) and q_t := P_u(µ2 = µ_2^A | F_t^X).
The F_t^X-generator of (X_t) is given as in section 3.1 by Q^X(u, p, q) = (q_{ij}^X(u, p, q)) with

q_{ij}^X(u, p, q) = µ1(p) u if j = i − e_1,
q_{ij}^X(u, p, q) = µ2(q)(1 − u) if j = i − e_2,
q_{ij}^X(u, p, q) = λ1 if j = i + e_1,
q_{ij}^X(u, p, q) = λ2 if j = i + e_2,
q_{ij}^X(u, p, q) = −λ1 − λ2 − µ1(p) u − µ2(q)(1 − u) if j = i,

corresponding to section 5.1, where µ1(p) := µ_1^A p + µ_1^B (1 − p) is the estimated service rate (the conditional mean) at queue 1 at time t, and similarly µ2(q) := µ_2^A q + µ_2^B (1 − q).
Theorem 5.5 The estimator process p_t is the unique solution of

dp_t = u_t(1) (µ_1^B − µ_1^A) p_t (1 − p_t) 1{X_t^1 > 0} dt + Φ_1(p_{t−}) dN_t^1(X_{t−}^1, X_{t−}^1 − 1),   (5.4)

where the control-independent jump-size Φ_1(p) is given by

Φ_1(p) = µ_1^A p / µ1(p) − p.

Similarly, q_t is described as the unique solution of

dq_t = u_t(2) (µ_2^B − µ_2^A) q_t (1 − q_t) 1{X_t^2 > 0} dt + Φ_2(q_{t−}) dN_t^2(X_{t−}^2, X_{t−}^2 − 1),

where the control-independent jump-size Φ_2(q) is given by

Φ_2(q) = µ_2^A q / µ2(q) − q.
Proof: The assertions are a direct consequence of theorem 3.5.
The following results hold in a similar fashion for q_t, but we will omit them here. We see that p_t jumps if and only if new information about the unknown service rate is available. This is the case if and only if the service of a customer in queue 1 is finished. In this case the new estimate p_t is proportional to the possible intensities µ_1^A and µ_1^B with respect to the pre-jump-estimate p_{t−}, that means

p_t = p_{t−} + Φ_1(p_{t−}) = µ_1^A p_{t−} / µ1(p_{t−}).

We omit in the following the indicator 1{X_t^1 > 0}, since it is obvious that new information is only available through service, which is reasonable only if customers are waiting in queue 1. Between two jumps p_t is described by the deterministic part of (5.4),

ṗ = u(1) (µ_1^B − µ_1^A) p (1 − p),   (5.5)
and can be calculated explicitly.
Theorem 5.6 Denote by τ_n the n-th jump time of (p_t, q_t). For t ∈ [τ_n, τ_{n+1}) it holds that p_t = φ_{t−τ_n}^u(p_{τ_n}), where

φ_t^u(p) = exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds} / ( exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds} − (p − 1)/p ).   (5.6)

In particular: φ_0^u(p) = p.
Proof: The first statement is clear by theorem 3.9. The second one follows by differentiating (5.6) with M(t) := (µ_1^B − µ_1^A) ∫_0^t u_s(1) ds:

∂/∂t φ_t^u(p) = [ exp(M(t)) (µ_1^B − µ_1^A) u_t(1) ( exp(M(t)) − (p − 1)/p ) − exp(M(t)) · exp(M(t)) (µ_1^B − µ_1^A) u_t(1) ] / ( exp(M(t)) − (p − 1)/p )²
= u_t(1) (µ_1^B − µ_1^A) φ_t^u(p) (1 − φ_t^u(p)),

which is exactly (5.5), and obviously φ_0^u(p) = p.
Remark 5.7
1) If p = 1 then φ_t^u(p) ≡ 1. On the other hand, if p = 0 then φ_t^u(p) ≡ 0. Summarizing: if we have complete information about the real value of µ1, then the estimator does not change between two jumps. As shown later on, it also remains constant at jumps. Thus it is constant over time and the complete information is not destroyed.
2) If u_s(2) ≡ 1 (meaning only queue 2 is served) for s ∈ [t, t + ε], then φ_s^u(p) ≡ φ_t^u(p) for all s ∈ [t, t + ε]. Thus the estimator p_t is updated only if queue 1 is served.
3) p_t is independent of the lengths of the queues i1, i2.
4) T_t(1) := ∫_0^t u_s(1) ds denotes the time spent serving queue 1 in [0, t].
We assume now without loss of generality

µ_1^B > µ_1^A   (5.7)

and investigate the behaviour of φ_t^u(p) in dependence of t.
Lemma 5.8
a) t ↦ φ_t^u(p) is monotone increasing for all p ∈ [0, 1].
b) t ↦ φ_t^u(p) is Lipschitz continuous for all p ∈ [0, 1].
Proof:
a) The derivative ∂/∂t φ_t^u(p) is by theorem 5.6 given by u_t(1)(µ_1^B − µ_1^A) φ_t^u(p)(1 − φ_t^u(p)), which is greater than or equal to 0 due to assumption (5.7).
b) The statement is a direct consequence of theorem 3.9.
By part a), if u_s ≡ 1 for s ∈ [t, t + ε] we have strict monotonicity for p ∈ (0, 1). But if u_s ≡ 2 then φ_s^u(p) is constant on [t, t + ε]. In other words: if one serves queue 1 (that means u_s = 1), then the parameter µ_1^A becomes more likely. This is reasonable since µ_1^B > µ_1^A, which means that a service under rate µ1 = µ_1^B tends to result in an earlier completion than under rate µ1 = µ_1^A.
After considering the estimator process in dependence of time, we analyze it in dependence of the a-priori-estimate p. The greater the a-priori-probability p, the greater the estimate φ_t^u(p) up to the next jump, as part a) of the following lemma claims.
Lemma 5.9
a) p ↦ φ_t^u(p) is monotone increasing for all t ≥ 0.
b) p ↦ φ_t^u(p) is Lipschitz continuous, uniformly in u, for all t ≥ 0 with Lipschitz parameter exp{(µ_1^B − µ_1^A)t}.
c) p ↦ φ_t^u(p) is concave for all t ≥ 0.
Proof:
a) By differentiation we get:

∂/∂p φ_t^u(p) = exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds} · (1/p²) / ( exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds} − (p − 1)/p )² ≥ 0.

b) We first note that for the denominator of φ_t^u(p) in (5.6) it holds that

p exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds} − p + 1 ≥ 1,

and we conclude with p, q ∈ [0, 1], writing E := exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds}:

|φ_t^u(p) − φ_t^u(q)| = | pE/(pE − p + 1) − qE/(qE − q + 1) |
= |p − q| E / ( (pE − p + 1)(qE − q + 1) )
≤ |p − q| E
≤ exp{(µ_1^B − µ_1^A)t} |p − q|.

c) Define K := exp{(µ_1^B − µ_1^A) ∫_0^t u_s(1) ds} ≥ 1; then

∂²/∂p² φ_t^u(p) = ( −2K(1/p³)(K − (p − 1)/p)² + 2K(1/p⁴)(K − (p − 1)/p) ) / ( K − (p − 1)/p )⁴
= 2K(1 − K) / ( p³ (K − (p − 1)/p)³ ) ≤ 0,

since K ≥ 1 and, for p ∈ [0, 1], (p − 1)/p ≤ 0 implies K − (p − 1)/p > 0.
As an immediate consequence of the last statements we get under µ_1^B > µ_1^A (see (5.7)):
• t ↦ µ1(φ_t^u(p)) is monotone decreasing,
• p ↦ µ1(φ_t^u(p)) is monotone decreasing.
After discussing the behaviour of p_t between two jumps in the last lemmas, we now analyze the jump behaviour of p_t, where the jump-size is independent of the current control. It can be proven that a jump reduces the probability that µ_1^A is the true parameter: since µ_1^B > µ_1^A by assumption (5.7), jumps are more likely under the hypothesis µ1 = µ_1^B.
Lemma 5.10
a) p + Φ_1(p) = µ_1^A p / µ1(p) < p if p ∈ (0, 1); in particular Φ_1(p) < 0.
b) p ↦ p + Φ_1(p) is monotone increasing on (0, 1).
c) p ↦ φ_t^u(p) + Φ_1(φ_t^u(p)) is Lipschitz continuous.
Proof:
a) If p ∈ (0, 1) (the case p ∈ {0, 1} is discussed after this proof), then due to (5.7)

µ1(p) = µ_1^A p + µ_1^B (1 − p) > µ_1^A

and hence µ_1^A / µ1(p) < 1.
b) First note that for p = 1 and p = 0 the jump-size is equal to 0 and p_t is constant. p + Φ_1(p) is the new state of the conditional probability after a jump, and it holds:

∂/∂p (p + Φ_1(p)) = ( ((µ_1^A − µ_1^B)p + µ_1^B) µ_1^A − µ_1^A p (µ_1^A − µ_1^B) ) / ( (µ_1^A − µ_1^B)p + µ_1^B )²
= µ_1^B µ_1^A / ( (µ_1^A − µ_1^B)p + µ_1^B )² ≥ 0.

c) Since µ1(p) ≥ µ_1^A and

µ1(φ_t^u(q)) φ_t^u(p) − µ1(φ_t^u(p)) φ_t^u(q) = µ_1^B ( φ_t^u(p) − φ_t^u(q) ),

we conclude

|φ_t^u(p) + Φ_1(φ_t^u(p)) − φ_t^u(q) − Φ_1(φ_t^u(q))|
= | µ_1^A φ_t^u(p)/µ1(φ_t^u(p)) − µ_1^A φ_t^u(q)/µ1(φ_t^u(q)) |
= | µ1(φ_t^u(q)) µ_1^A φ_t^u(p) − µ1(φ_t^u(p)) µ_1^A φ_t^u(q) | / ( µ1(φ_t^u(p)) µ1(φ_t^u(q)) )
≤ (µ_1^B / µ_1^A) |φ_t^u(p) − φ_t^u(q)|.
We obtain from part a): if p = 0 then 0 + Φ_1(0) = 0, and if p = 1 then 1 + Φ_1(1) = 1 (remember remark 5.7). Thus if we have complete information before a jump, we have complete information after a jump; in other words, new information (due to jumps) gives no update. But if p ∈ (0, 1), the estimator is updated due to a finished service. If µ_1^A = 0 we have p_t = p_{t−} + Φ_1(p_{t−}) = µ_1^A p_{t−}/µ1(p_{t−}) = 0. Hence after a jump (which is impossible under the hypothesis µ1 = µ_1^A) the conditional probability is p_t ≡ 0 for all upcoming time points t.
Part b) reads as: the greater the a-priori-probability p_{t−}, the greater the a-posteriori-probability p_t after a jump. In particular we saw in the proof that under the hypothesis µ1 = µ_1^A = 0 the function p ↦ p + Φ_1(p) is constant, which is not astonishing: if µ_1^A = 0, then after the first jump the conditional probability for µ_1^A has to be zero (because under the hypothesis µ1 = 0 a jump never occurs).
Part c) is the analogue of lemma 5.9, where we proved the Lipschitz continuity in p. Similarly to that proof we conclude the Lipschitz continuity of p ↦ p + Φ_1(p).
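The qualitative statements of lemma 5.10 are easily confirmed numerically (an illustrative sketch with arbitrary rates):

muA, muB = 0.1, 0.3
jump = lambda p: muA * p / (muA * p + muB * (1 - p))        # p + Phi_1(p)

ps = [i / 100 for i in range(101)]
assert all(jump(p) < p for p in ps if 0 < p < 1)            # lemma 5.10 a)
assert all(jump(a) <= jump(b) for a, b in zip(ps, ps[1:]))  # lemma 5.10 b)
assert jump(0.0) == 0.0 and jump(1.0) == 1.0                # complete information is kept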
5.2.2 The Reduced MDP
We introduce next the reduced (uniformized) MDP as in section 4.2. For this, denote by

α := λ1 + λ2 + µ_1^A + µ_1^B + µ_2^A + µ_2^B

the uniformization parameter. Hence the distribution function of the sojourn times τ_{n+1} − τ_n is given by

F(t) = (1 − e^{−αt}) 1{t ≥ 0}.
Then define the MDP as follows (see remark 4.10):
• state space: S = N_0^2 × [0, 1]², (i1, i2, p, q) ∈ S with
  i1 = number of waiting customers in queue 1
  i2 = number of waiting customers in queue 2
  p = conditional probability for µ1 = µ_1^A, discussed in section 5.2.1
  q = conditional probability for µ2 = µ_2^A
• action space: A = {γ | γ : R_+ → [0, 1]}, where γ ∈ A denotes the fraction of service capacity spent on queue 1 (sometimes we will write γ(1) and γ(2) := 1 − γ(1), and with a slight abuse of notation γ = 2 for serving queue 2 exclusively)
• set of admissible actions: D(i1, i2, p, q) = A for all (i1, i2, p, q) ∈ S.
Between two jumps, φ_t^γ(p) is described as the unique solution of (5.5),

ṗ = γ(1) (µ_1^B − µ_1^A) p (1 − p),

with initial condition p_0 = p, given by

φ_t^γ(p) = exp{(µ_1^B − µ_1^A) ∫_0^t γ_s(1) ds} / ( exp{(µ_1^B − µ_1^A) ∫_0^t γ_s(1) ds} − (p − 1)/p )

as in theorem 5.6. ϕ_t^γ(q), corresponding to q_t, is the unique solution of

q̇ = γ(2) (µ_2^B − µ_2^A) q (1 − q).
We continue with the definition of the MDP:
• transition probabilities: for ω = (i1, i2, p, q) and B ⊂ [0, 1]² set

q(ω, γ, (i1 + 1, i2, B)) = ∫_0^∞ e^{−αt} λ1 1{(φ_t^γ(p), ϕ_t^γ(q)) ∈ B} dt
q(ω, γ, (i1, i2 + 1, B)) = ∫_0^∞ e^{−αt} λ2 1{(φ_t^γ(p), ϕ_t^γ(q)) ∈ B} dt
q(ω, γ, (i1 − 1, i2, B)) = ∫_0^∞ e^{−αt} µ1(φ_t^γ(p)) 1{i1 > 0} γ_t(1) · 1{(φ_t^γ(p) + Φ_1(φ_t^γ(p)), ϕ_t^γ(q)) ∈ B} dt
q(ω, γ, (i1, i2 − 1, B)) = ∫_0^∞ e^{−αt} µ2(ϕ_t^γ(q)) 1{i2 > 0} γ_t(2) · 1{(φ_t^γ(p), ϕ_t^γ(q) + Φ_2(ϕ_t^γ(q))) ∈ B} dt
q(ω, γ, (i1, i2, B)) = 1 − Σ_{k ∈ {(i1+1,i2), (i1,i2+1), (i1−1,i2), (i1,i2−1)}} q(ω, γ, (k, B)).

All other transitions have probability 0.
• cost function: r((i1, i2, p, q), γ) = r(i1, i2) = (c1 i1 + c2 i2)/(α + β)
• discount factor: δ = α/(α + β) ∈ (0, 1).
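The kernel above can be sampled directly: draw an Exp(α)-distributed sojourn time, move (p, q) along the deterministic flows up to the jump time, and select the event type with the uniformized probabilities. A minimal Python sketch (all rates are illustrative choices; the control γ is held constant between jumps):

import math, random

lam1, lam2 = 0.2, 0.2
mu1A, mu1B, mu2A, mu2B = 0.1, 0.3, 0.15, 0.25
alpha = lam1 + lam2 + mu1A + mu1B + mu2A + mu2B     # uniformization parameter

mu1 = lambda p: mu1A * p + mu1B * (1 - p)
mu2 = lambda q: mu2A * q + mu2B * (1 - q)

def flow(p, spread, effective_time):
    # phi_t(p) for a constant control: closed form of (5.5)
    E = math.exp(spread * effective_time)
    return p * E / (p * E - p + 1.0)

def step(i1, i2, p, q, gamma, rng):
    # one uniformized transition from (i1, i2, p, q) under constant gamma in [0, 1]
    T = rng.expovariate(alpha)                      # sojourn time ~ Exp(alpha)
    p = flow(p, mu1B - mu1A, gamma * T)             # estimators drift up to the jump
    q = flow(q, mu2B - mu2A, (1 - gamma) * T)
    U = rng.random() * alpha                        # choose the event type
    s1 = mu1(p) * gamma * (i1 > 0)                  # service intensity at queue 1
    s2 = mu2(q) * (1 - gamma) * (i2 > 0)            # service intensity at queue 2
    if U < lam1:
        i1 += 1
    elif U < lam1 + lam2:
        i2 += 1
    elif U < lam1 + lam2 + s1:
        i1, p = i1 - 1, mu1A * p / mu1(p)           # completion + Bayes jump of p
    elif U < lam1 + lam2 + s1 + s2:
        i2, q = i2 - 1, mu2A * q / mu2(q)           # completion + Bayes jump of q
    # otherwise: fictitious self-transition of the uniformized chain
    return i1, i2, p, q

rng = random.Random(1)
print(step(3, 5, 0.5, 0.5, gamma=1.0, rng=rng))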
Define as in section 4.2

V_{∞,π}(i1, i2, p, q) := E_π[ Σ_{k=0}^∞ δ^k r(X_{τ_k}) | X_0 = (i1, i2), p_0 = p, q_0 = q ]

and

V_∞(i1, i2, p, q) := inf_{π∈F^∞} V_{∞,π}(i1, i2, p, q).
Remark 5.11 It is obvious that it is never optimal to serve an empty queue while customers are waiting in the other queue. Note that by serving an empty queue there is no new information available about the service parameter of this queue. This means the (pure) optimal decision f* = f*(i1, i2, p, q) in the case i1 = 0 or i2 = 0 is given by

f*(i1, i2, p, q) = 2 if i1 = 0, i2 > 0,
f*(i1, i2, p, q) = 1 if i2 = 0, i1 > 0,
f*(i1, i2, p, q) arbitrary if i1 = i2 = 0.
Hence we concentrate in the following on the case i1 > 0, i2 > 0. The optimality equation is accordingly given by:

v(i1, i2, p, q) = T v(i1, i2, p, q)
= inf_{γ∈A} { ∫_0^∞ e^{−(α+β)t} [ c1 i1 + c2 i2
  + v(i1 + 1, i2, φ_t^γ(p), ϕ_t^γ(q)) λ1 + v(i1, i2 + 1, φ_t^γ(p), ϕ_t^γ(q)) λ2
  + v(i1, i2, φ_t^γ(p), ϕ_t^γ(q)) (α − λ1 − λ2 − µ1(φ_t^γ(p)) γ_t(1) − µ2(ϕ_t^γ(q)) γ_t(2))
  + v(i1 − 1, i2, φ_t^γ(p) + Φ_1(φ_t^γ(p)), ϕ_t^γ(q)) µ1(φ_t^γ(p)) γ_t(1)
  + v(i1, i2 − 1, φ_t^γ(p), ϕ_t^γ(q) + Φ_2(ϕ_t^γ(q))) µ2(ϕ_t^γ(q)) γ_t(2) ] dt }.
Remark 5.12 Since u ↦ q_{ij}^X(u, p, q) is linear, the cost rate is control-independent, Φ_1(p) and Φ_2(q) are independent of the control too, and U = [0, 1] ⊂ R is convex, there exists an optimal deterministic strategy π as proven in theorem 4.14 and remark 4.15. Furthermore, due to corollary 4.16 we have lim_{n→∞} V_n(i1, i2, p, q) = V_∞(i1, i2, p, q).
5.2.3 A Characterization of the Value Function and the Optimal Control
The following theorem offers an explicit characterization of the value function. Assume here again that i1 > 0 and i2 > 0. Otherwise the optimal control is clear, since serving an empty queue is never advantageous, see remark 5.11.
Theorem 5.13 The n-stage value function V_n(i1, i2, p, q) is given for n ≥ 1 by

V_n(i1, i2, p, q) = (c1 i1 + c2 i2) K_n + (c1 λ1 + c2 λ2) L_{n−1} + G_{n−1}(p, q),

where

K_n = 1/(α + β) Σ_{k=0}^{n−1} δ^k = 1/(α + β) Σ_{k=0}^{n−1} (α/(α + β))^k,
L_n = 1/(α + β) (K_n + α L_{n−1}) with L_0 := 0,
G_n(p, q) = inf_{γ∈A} { ∫_0^∞ e^{−(α+β)t} [ ( G_{n−1}(φ_t^γ(p) + Φ_1(φ_t^γ(p)), ϕ_t^γ(q)) − G_{n−1}(φ_t^γ(p), ϕ_t^γ(q)) − c1 K_{n−1} ) µ1(φ_t^γ(p)) γ_t(1)
  + ( G_{n−1}(φ_t^γ(p), ϕ_t^γ(q) + Φ_2(ϕ_t^γ(q))) − G_{n−1}(φ_t^γ(p), ϕ_t^γ(q)) − c2 K_{n−1} ) µ2(ϕ_t^γ(q)) γ_t(2)
  + G_{n−1}(φ_t^γ(p), ϕ_t^γ(q)) α ] dt }

with G_0 :≡ 0.
The same result holds for pure controls by replacing A = {γ : R_+ → [0, 1]} by A = {a | a ∈ {1, 2}}. But the convergence of V_n(i1, i2, p, q) to V_∞(i1, i2, p, q), which we consider in theorem 5.14, may not be true anymore for the pure control model in general.
Proof: For n = 1 and V_0 ≡ 0 we get:

V_1(i1, i2, p, q) = T V_0(i1, i2, p, q) = inf_{γ∈A} ∫_0^∞ e^{−(α+β)t} (c1 i1 + c2 i2) dt = (c1 i1 + c2 i2) K_1.

Assume now V_{n−1}(i1, i2, p, q) = (c1 i1 + c2 i2) K_{n−1} + (c1 λ1 + c2 λ2) L_{n−2} + G_{n−2}(p, q) for some n − 1 ∈ N. Then the statement follows by induction, since

V_n(i1, i2, p, q) = T V_{n−1}(i1, i2, p, q)
= (c1 i1 + c2 i2) (1 + α K_{n−1})/(α + β) + (c1 λ1 + c2 λ2) (K_{n−1} + α L_{n−2})/(α + β)
+ inf_{γ∈A} { ∫_0^∞ e^{−(α+β)t} [ ( G_{n−2}(φ_t^γ(p) + Φ_1(φ_t^γ(p)), ϕ_t^γ(q)) − G_{n−2}(φ_t^γ(p), ϕ_t^γ(q)) − c1 K_{n−1} ) µ1(φ_t^γ(p)) γ_t(1)
  + ( G_{n−2}(φ_t^γ(p), ϕ_t^γ(q) + Φ_2(ϕ_t^γ(q))) − G_{n−2}(φ_t^γ(p), ϕ_t^γ(q)) − c2 K_{n−1} ) µ2(ϕ_t^γ(q)) γ_t(2)
  + G_{n−2}(φ_t^γ(p), ϕ_t^γ(q)) α ] dt }

and

K_n := (1 + α K_{n−1})/(α + β) = 1/(α + β) Σ_{k=0}^{n−1} (α/(α + β))^k,
L_n := (K_n + α L_{n−1})/(α + β) = 1/(α + β) Σ_{k=0}^{n−1} (α/(α + β))^k K_{n−k}.
For the next lemma, remember that the continuity and compactness assumptions under the use of deterministic controls are fulfilled, see remark 5.12.
Theorem 5.14 It holds:
a) K_n is monotone increasing in n and K := lim_{n→∞} K_n = 1/β.
b) L_n is bounded for all n, monotone increasing in n, and L := lim_{n→∞} L_n = 1/β².
c) V_∞(i1, i2, p, q) = lim_{n→∞} V_n(i1, i2, p, q) = (c1 i1 + c2 i2) K + (c1 λ1 + c2 λ2) L + G(p, q), where G(p, q) := lim_{n→∞} G_n(p, q).
d) V_∞(i1, i2, p, q) is monotone increasing in i1 and i2.
Proof:
a) The monotonicity of K_n in n is obvious. Furthermore we get

K = 1/(α + β) Σ_{k=0}^∞ (α/(α + β))^k = 1/(α + β) · 1/(1 − α/(α + β)) = 1/(α + β) · (α + β)/β = 1/β.

b) First, L_n = Σ_{k=0}^{n−1} (α/(α + β))^k K_{n−k} ≤ (1/β) Σ_{k=0}^∞ (α/(α + β))^k = (1/β)(α + β)/β < ∞. Additionally L_0 ≤ L_1, and by induction L_n = (K_n + α L_{n−1})/(α + β) ≤ (K_{n+1} + α L_n)/(α + β) = L_{n+1}, by the monotonicity of K_n and the induction hypothesis. L = 1/β² follows from the recursion formula of L_n given in the proof of theorem 5.13.
c) We know that V_∞(i1, i2, p, q) = lim_{n→∞} V_n(i1, i2, p, q) holds due to corollary 4.16 for deterministic strategies, which satisfy the compactness and continuity assumptions. By theorem 5.13, the characterization of V_n(i1, i2, p, q) given there holds for each n. Taking the limit and using parts a) and b), the assertion in c) follows; in particular G(p, q) := lim_{n→∞} G_n(p, q) exists.
d) The statement follows from c) and theorem 5.13.
Furthermore we know from part c) that

G(p, q) = inf_{γ∈A} { ∫_0^∞ e^{−(α+β)t} [ ( G(φ_t^γ(p) + Φ_1(φ_t^γ(p)), ϕ_t^γ(q)) − G(φ_t^γ(p), ϕ_t^γ(q)) − c1 K ) µ1(φ_t^γ(p)) γ_t(1)
  + ( G(φ_t^γ(p), ϕ_t^γ(q) + Φ_2(ϕ_t^γ(q))) − G(φ_t^γ(p), ϕ_t^γ(q)) − c2 K ) µ2(ϕ_t^γ(q)) γ_t(2)
  + G(φ_t^γ(p), ϕ_t^γ(q)) α ] dt } =: T G(p, q).
Hence G(p, q) is a fixed point of the operator T. Whereas a direct computation of G_n(p, q) and G(p, q) is quite hard, we can give some bounds for G_n(p, q) and G(p, q), respectively. Define for this purpose

M_min := max{ min{c1 µ_1^A, c1 µ_1^B}, min{c2 µ_2^A, c2 µ_2^B} },
M^max := max{ c1 µ_1^A, c1 µ_1^B, c2 µ_2^A, c2 µ_2^B }.

M_min stands for the second worst case and M^max for the best case of parameters. Then

G_n(p, q) ∈ [−M^max L_{n−1}, −M_min L_{n−1}]

and

G(p, q) ∈ [−M^max L, −M_min L].

The bounds are attained if M_min (respectively M^max) is estimated with probability 1; that means if M^max = c1 µ_1^A, then the a-priori-probability p_0 has to be 1. In particular G(p, q) is negative, which is clear, since G(p, q) represents expected reductions of the costs due to finished services (compare the interpretation after theorem 5.2). Additionally we see that the length of the interval for G_n(p, q), which is given by (M^max − M_min) L_{n−1}, is increasing in n, since L_{n−1} is.
To value the information (or the cost of incomplete information), we compare in the next theorem the value functions of both models, derived in theorems 5.2, 5.13 and 5.14.

Theorem 5.15 Denote by V^C(i1, i2) and V^{IC}(i1, i2, p, q) the value functions of the complete and the incomplete information models from the just mentioned theorems. Then it holds:
a) 0 ≥ V_n^C(i1, i2) − V_n^{IC}(i1, i2, p, q) = −c1 µ1 L_{n−1} − G_{n−1}(p, q)
b) 0 ≥ V_∞^C(i1, i2) − V_∞^{IC}(i1, i2, p, q) = −c1 µ1 L − G(p, q)
Proof: The inequalities follow from theorem 2.12, whereas the equalities are true due to theorems 5.2, 5.13 and 5.14.
We obtain that the value of information does not depend on the lengths of the queues i1 and i2. In the special case n = 1 we have V_1^C(i1, i2) − V_1^{IC}(i1, i2, p, q) = 0. This is not surprising, since we have seen that for n = 1 every service allocation is optimal and the completion of a service of a customer has no influence on the total waiting costs, since after the next jump the system terminates.
For completeness we mention here that the proof of theorem 5.2 follows directly from theorem 5.14 c), since in the complete information case the value function does not depend on p, q and therefore the optimization problem in the definition of G(p, q) is easy to solve.
Lemma 5.16
a) p 7→ Gn (p, q) and q 7→ Gn (p, q) are concave and therefore locally Lipschitz continuous
for all n ∈ N0 .
b) p 7→ G(p, q) and q 7→ G(p, q) are concave and therefore locally Lipschitz continuous.
Proof: Both statements are immediate consequences of theorem 3.15 and the characterization of the value function in theorem 5.13 and 5.14, where we have to adapt theorem 3.15
in a similar fashion to the finite horizon case in a).
Remark 5.17 We note from the separation property of the value function V∞ (i1 , i2 , p, q)
that the optimal control depends for i1 > 0 and i2 > 0 only on p and q. Furthermore, to
find an optimal control one has to solve the deterministic optimization problem T G(p, q).
This was already pointed out in theorem 4.9.
So far we discussed the value function of our parallel queueing model with unknown service
rates. Also important is the optimal control. For this we first remember the existence of
an optimal (deterministic) policy π, see remark 5.12. Due to theorem 4.7 there exists an
optimal control u∗ = (u∗t ). After that we look for sufficient conditions to characterize the
optimal control.
From theorem 3.16 we know that our value function J(i1 , i2 , p, q) = V∞ (i1 , i2 , p, q) is a
solution of the generalized HJB-equation, which is given by
βW(i1, i2, p, q) = inf_{ξ_p ∈ ∂_p W(i1,i2,p,q), ξ_q ∈ ∂_q W(i1,i2,p,q), u ∈ [0,1]} HW(i1, i2, p, q, u),   (5.8)

where the generalized Hamiltonian is defined as

HW(i1, i2, p, q, u) := c1 i1 + c2 i2
+ ξ_p (µ_1^B − µ_1^A) p(1 − p) u + ξ_q (µ_2^B − µ_2^A) q(1 − q)(1 − u)
+ (W(i1 + 1, i2, p, q) − W(i1, i2, p, q)) λ1
+ (W(i1, i2 + 1, p, q) − W(i1, i2, p, q)) λ2
+ (W(i1 − 1, i2, p + Φ_1(p), q) − W(i1, i2, p, q)) µ1(p) u
+ (W(i1, i2 − 1, p, q + Φ_2(q)) − W(i1, i2, p, q)) µ2(q)(1 − u).
5 Application to Parallel Queueing with Incomplete Information
79
By theorem 4.4 it is sufficient to compute a minimizing triple (u*, ξ_p*, ξ_q*) of the generalized Hamiltonian such that the generalized HJB-equation is fulfilled in order to find an optimal control. We note first that HW(i1, i2, p, q, u) is linear in u. Consequently the minimum point in u will (if it is unique) be equal to 0 or 1. Hence the optimal control will serve one queue exclusively. If the minimum point in u is not unique, that means in cases where HW(i1, i2, p, q, u) does not depend on u or HW(i1, i2, p, q, 0) = HW(i1, i2, p, q, 1) with corresponding ξ_p* and ξ_q*, we do not choose a minimum point arbitrarily. We choose u* in such a way that p and q evolve such that the coefficient of u in the Hamiltonian remains 0. This is fulfilled, for example, if u* is chosen such that p and q stay constant. Let us state the existence theorem first.
Theorem 5.18 There exists an optimal control u∗ = (u∗t ) ∈ U with the above stated
properties. In particular if the minimum point of the HJB-equation is unique one queue is
served exclusively.
Proof: We apply the verification theorem 4.4. If we can prove that there always exists a triple (u*, ξ_p*, ξ_q*) such that the generalized HJB-equation is fulfilled, the statement follows. The generalized HJB-equation for the queueing model is given by

βW(i1, i2, p, q) = inf_{ξ_p ∈ ∂_p W, ξ_q ∈ ∂_q W, u ∈ [0,1]} HW(i1, i2, p, q, u),

which is fulfilled for the value function J(i1, i2, p, q) = V_∞(i1, i2, p, q), since it is locally Lipschitz continuous and regular. Due to the linearity of HJ(i1, i2, p, q, u) in u, we conclude with

F(i1, i2, p, q) := c1 i1 + c2 i2 + (J(i1 + 1, i2, p, q) − J(i1, i2, p, q)) λ1 + (J(i1, i2 + 1, p, q) − J(i1, i2, p, q)) λ2

that the HJB-equation can be written as

βJ(i1, i2, p, q) − F(i1, i2, p, q)
= min{ inf_{ξ_p ∈ ∂_p J(i1,i2,p,q)} { ξ_p (µ_1^B − µ_1^A) p(1 − p) + (J(i1 − 1, i2, p + Φ_1(p), q) − J(i1, i2, p, q)) µ1(p) },
  inf_{ξ_q ∈ ∂_q J(i1,i2,p,q)} { ξ_q (µ_2^B − µ_2^A) q(1 − q) + (J(i1, i2 − 1, p, q + Φ_2(q)) − J(i1, i2, p, q)) µ2(q) } }
= min{ J_{0,p}(i1, i2, p, q; 1) (µ_1^B − µ_1^A) p(1 − p) + (J(i1 − 1, i2, p + Φ_1(p), q) − J(i1, i2, p, q)) µ1(p),
  J_{0,q}(i1, i2, p, q; 1) (µ_2^B − µ_2^A) q(1 − q) + (J(i1, i2 − 1, p, q + Φ_2(q)) − J(i1, i2, p, q)) µ2(q) },

where we used

inf_{ξ_p ∈ ∂_p J(i1,i2,p,q)} ξ_p (µ_1^B − µ_1^A) p(1 − p) = J_{0,p}(i1, i2, p, q; 1) (µ_1^B − µ_1^A) p(1 − p)

and the definition of the lower generalized Clarke derivative J_{0,p}(i1, i2, p, q; 1) with respect to p. Since J(i1, i2, p, q) is regular in p, we conclude that J_{0,p}(i1, i2, p, q; 1) exists and is the right-hand side derivative. Similar considerations hold true for J_{0,q}(i1, i2, p, q; 1).
5.2.4 The Symmetric Case
We consider now case C) in (5.3) with equality in the last condition. This means that both possible parameters at each queue are the same, but it is not known which parameter is true at which queue. Assume furthermore c1 = c2 = 1, and for the service rates make the convention µ_1^A = µ_2^B =: µ^A and µ_1^B = µ_2^A =: µ^B. We will refer to this as the symmetric case. In particular we get µ^B > µ^A, and hence we see that if the true value is µ1 = µ^A, then queue 1 is the ''bad'' queue and an optimal controller, according to the cµ-rule, always prefers queue 2. If on the other hand µ1 = µ^B, then the optimal decision is vice versa. A similar model was considered in the context of bandit problems by Donchev in his works (Donchev and Yushkevich (1996), Donchev (1998) and Donchev (1999)).
Since we are in the symmetric case, it is sufficient to consider only one estimator process

p_t := P_u(µ1 = µ^A | F_t) = P_u(µ2 = µ^B | F_t),

which is the solution of

dp_t = [ u_t(1)(µ^B − µ^A) p_t(1 − p_t) + u_t(2)(µ^A − µ^B) p_t(1 − p_t) ] dt
  + Φ_1(p_{t−}) dN_t^1(X_{t−}^1, X_{t−}^1 − 1) + Φ_2(p_{t−}) dN_t^2(X_{t−}^2, X_{t−}^2 − 1)
= (µ^B − µ^A)(2u_t(1) − 1) p_t(1 − p_t) dt
  + Φ_1(p_{t−}) dN_t^1(X_{t−}^1, X_{t−}^1 − 1) + Φ_2(p_{t−}) dN_t^2(X_{t−}^2, X_{t−}^2 − 1),

where

Φ_1(p) = µ^A p / (µ^A p + µ^B(1 − p)) − p = µ^A p / µ(p) − p,
Φ_2(p) = µ^B p / (µ^B p + µ^A(1 − p)) − p = µ^B p / µ(1 − p) − p.
Here we set again µ(p) := µ^A p + µ^B(1 − p). From this stochastic differential equation we see that between two jump times τ_n and τ_{n+1},

p_t = φ_{t−τ_n}^u(p_{τ_n}) is monotone increasing if u_t(1) > 1/2, constant if u_t(1) = 1/2, and monotone decreasing if u_t(1) < 1/2.
The interpretation of this result is as in section 5.2.1: if we serve queue 1 predominantly, then the estimator is monotone increasing, since µ^A < µ^B and hence jumps are more unlikely. We see that now the estimator process is not tied to one queue alone, since the hypotheses about the true values of the service rates are connected. Thus the completion of a service in either queue leads to jumps and therefore to updates of the estimator process. Completing the service of a customer at queue 1 makes µ^A being the true parameter at queue 1 more unlikely; completing a customer at queue 2 makes it more probable, since µ^B > µ^A and hence

p + Φ_1(p) = µ^A p / µ(p) ≤ p  and  p + Φ_2(p) = µ^B p / µ(1 − p) ≥ p.   (5.9)
From theorem 5.14 we know that the value function is

J(i1, i2, p) = V_∞(i1, i2, p) = (i1 + i2) K + (λ1 + λ2) L + G(p),

where G(p) is given by

G(p) = inf_{(u)} { ∫_0^∞ e^{−(α+β)t} [ ( G(φ_t^u(p) + Φ_1(φ_t^u(p))) − G(φ_t^u(p)) − K ) µ(φ_t^u(p)) u_t(1)
  + ( G(φ_t^u(p) + Φ_2(φ_t^u(p))) − G(φ_t^u(p)) − K ) µ(1 − φ_t^u(p)) (1 − u_t(1))
  + G(φ_t^u(p)) α ] dt }.
Since p ↦ G(p) is concave, it is locally Lipschitz continuous and regular. Because we are in the symmetric case with the same cost rate at each queue, it is clear that

G(p) = G(1 − p).

The function p ↦ G(p) is symmetric and concave, hence it is monotone increasing for p < 1/2 and decreasing for p > 1/2. Therefore we get for an element ξ of the generalized Clarke gradient

∂_p G(p) = co{ lim sup_{n→∞} ∇G(p_n) | lim_{n→∞} p_n = p }

that ξ ≥ 0 if p < 1/2 and ξ ≤ 0 for p > 1/2, and, since G(p) attains a maximum in p = 1/2, that 0 ∈ ∂_p G(1/2). Additionally we have

∂_p G(p) = −∂_p G(1 − p),
where we understand −F as the set {−a | a ∈ F}. Especially we get that ∂_p G(1/2) is a symmetric interval with respect to 0. Since we are in the symmetric case, it would be desirable that the symmetry also holds for the optimal control. This is the case, as the following theorem shows (as long as i1 > 0, i2 > 0).
Theorem 5.19 Let i1 > 0 and i2 > 0 and denote by u* = (u_t*) the optimal control. Then

u*(p) = 1 − u*(1 − p).
Proof: From theorem 4.3 we know that the value function J(i1, i2, p) and the optimal control (u_t*) with corresponding state process (p_t*) fulfil the generalized HJB-equation for almost all t ≥ 0. By the separation property of J(i1, i2, p) in theorem 5.14, this means

inf_{ξ ∈ ∂_p G(p_t*)} { ( G(p_t* + Φ_1(p_t*)) − G(p_t*) − K ) µ(p_t*) u_t*
+ ( G(p_t* + Φ_2(p_t*)) − G(p_t*) − K ) µ(1 − p_t*) (1 − u_t*)
+ ξ (µ^B − µ^A) p_t*(1 − p_t*)(2u_t* − 1) − βG(p_t*) } = 0.

Consider the generalized Hamiltonian in this symmetric case, given by

HG(p, u) := ( G(p + Φ_1(p)) − G(p) − K ) µ(p) u + ( G(p + Φ_2(p)) − G(p) − K ) µ(1 − p)(1 − u)
+ ξ(p) (µ^B − µ^A) p(1 − p)(2u − 1) − βG(p)
=: M(p, G) u + R(p, G).

Due to p − Φ_1(1 − p) = p + Φ_2(p) and the symmetry of G(p) and ∂_p G(p), we conclude that

HG(1 − p, u) = ( G(1 − p + Φ_1(1 − p)) − G(1 − p) − K ) µ(1 − p) u + ( G(1 − p + Φ_2(1 − p)) − G(1 − p) − K ) µ(p)(1 − u)
+ ξ(1 − p) (µ^B − µ^A)(1 − p) p (2u − 1) − βG(1 − p)
= ( G(p + Φ_2(p)) − G(p) − K ) µ(1 − p) u + ( G(p + Φ_1(p)) − G(p) − K ) µ(p)(1 − u)
− ξ(p) (µ^B − µ^A) p(1 − p)(2u − 1) − βG(p)
=: −M(p, G) u + R̃(p, G).

By the linearity of HG(p, u) in u and the symmetry of the coefficients M(p, G) of u in HG(p, u) and HG(1 − p, u), it follows that if u*(p) = 1 is a minimum point of u ↦ HG(p, u), then u*(1 − p) = 0 is a minimum point of u ↦ HG(1 − p, u). The same conclusion holds true if u*(p) = 0. If p = 1/2, we will see in the following that 0 and 1 are both minimum points in the generalized HJB-equation. Thus u*(1/2) = 1/2 can be chosen as the optimal decision. Hence the statement is proven.
The next theorem states that it is always optimal to serve the queue where the better service rate is presumed. This means if p < 1/2 it is more likely that the better rate µ^B is the true value at queue 1; thus it is optimal to serve queue 1. If both hypotheses are equiprobable, the optimal control divides service fifty-fifty between both queues (although every allocation would be optimal). This was explained from a technical point of view before the existence theorem 5.18: it keeps the estimator process constant. It is motivated from an intuitive point of view in the proof of the next theorem, which proves the optimality of a threshold-strategy with threshold p* = 1/2. In other words, the certainty equivalence principle with respect to the cµ-rule holds true, since with c1 = c2 = 1

µ1(p) = µ(p) ≥ µ(1 − p) = µ2(p)  ⟺  p ≤ 1/2.
Theorem 5.20 Assume i1 > 0 and i2 > 0 and set

u*(p) = 1 if p < 1/2,  u*(p) = 1/2 if p = 1/2,  u*(p) = 0 if p > 1/2.

Then u*(p_{t−}*) is optimal (as long as customers are waiting in the queue which is served). In particular, the optimal control u* is a threshold control with threshold p* = 1/2.
Proof: Choose ξ* = G_{0,p}(p; 1) for p ≤ 1/2, since we know from the proof of the existence theorem 5.18 that according to the generalized verification theorem 4.4 it is sufficient to compute minimum points in u of the right-hand side of the generalized HJB-equation. Due to the separation property of the value function J(i1, i2, p), we only have to compute the minimum point in u of the Hamiltonian, which is independent of i1, i2 (as long as i1 > 0, i2 > 0) and which is given by

HG(p, u) = { ( G(µ^A p/µ(p)) − G(p) − K ) µ(p) − ( G(µ^B p/µ(1 − p)) − G(p) − K ) µ(1 − p)
  + 2ξ* (µ^B − µ^A) p(1 − p) } u
− ξ* (µ^B − µ^A) p(1 − p) + ( G(µ^B p/µ(1 − p)) − G(p) − K ) µ(1 − p) − βG(p).

It is sufficient to prove that

M(p) := ( G(µ^A p/µ(p)) − G(p) − K ) µ(p) − ( G(µ^B p/µ(1 − p)) − G(p) − K ) µ(1 − p) + 2ξ* (µ^B − µ^A) p(1 − p) < 0

for p < 1/2 and equal to 0 for p = 1/2, by theorem 5.19. For this purpose define

h(p) := 2ξ* (µ^B − µ^A) p(1 − p) − Kµ(p) + Kµ(1 − p).
Then

M(p) = ( G(µ^A p/µ(p)) − G(p) ) µ(p) − ( G(µ^B p/µ(1 − p)) − G(p) ) µ(1 − p) + h(p).

Analyzing h(p) we get that h(p) < 0 for p < 1/2, which can be seen as follows:
• p ↦ h(p) is continuous;
• h(0) = −K(µ^B − µ^A) < 0;
• h(1/2) = (µ^B − µ^A) · (1/2) ξ*(1/2) ≤ 0, since ξ*(1/2) = G_{0,p}(1/2; 1) ∈ ∂_p G(1/2) is ≤ 0, due to the fact that ∂_p G(1/2) is a symmetric interval around 0 and G_{0,p}(1/2; 1) is equal to its left bound;
• the maximum point of h(p) is p̂ = (K + ξ*)/(2ξ*) > 1/2, since K > 0 and ξ* ≥ 0 for all p ∈ [0, 1/2).
It remains to prove that

( G(µ^A p/µ(p)) − G(p) ) µ(p) − ( G(µ^B p/µ(1 − p)) − G(p) ) µ(1 − p) ≤ 0

for p < 1/2 (note that for p = 1/2 this expression is equal to 0). Since p ↦ G(p) is concave, we know that

G(p) ≥ (p − p1)/(p2 − p1) G(p2) + (p2 − p)/(p2 − p1) G(p1)

for all 0 ≤ p1 ≤ p ≤ p2 ≤ 1 such that p1 ≠ p2. Choose now p1 = µ^A p/µ(p) and p2 = µ^B p/µ(1 − p) and conclude

(p2 − p1) G(p) µ(p) µ(1 − p) = ( µ^B p µ(p) − µ^A p µ(1 − p) ) G(p)
≥ ( µ^B p − µ(1 − p) p ) µ(p) G(µ^A p/µ(p)) + ( µ(p) p − µ^A p ) µ(1 − p) G(µ^B p/µ(1 − p)).

Dividing by p we continue with

0 ≥ ( G(µ^A p/µ(p)) − G(p) ) µ^B µ(p) − G(µ^A p/µ(p)) µ(p) µ(1 − p)
− ( G(µ^B p/µ(1 − p)) − G(p) ) µ^A µ(1 − p) + G(µ^B p/µ(1 − p)) µ(p) µ(1 − p)
= { ( G(µ^A p/µ(p)) − G(p) ) µ(p) − ( G(µ^B p/µ(1 − p)) − G(p) ) µ(1 − p) } (µ^B − µ^A)(1 − p),

and since (µ^B − µ^A)(1 − p) > 0, we conclude that M(p) < 0 for p < 1/2.
If p = 1/2, then (u, ξ) = (1, G_{0,p}(1/2; 1)) and (u, ξ) = (0, G^{0,p}(1/2; 1)) both fulfil the HJB-equation; hence both allocations would be optimal. We choose u*(1/2) = 1/2 as a linear combination of 0 and 1, which is optimal with corresponding ξ* = (1/2)( G_{0,p}(1/2; 1) + G^{0,p}(1/2; 1) ) = 0 ∈ ∂_p G(1/2), since ∂_p G(1/2) is symmetric. The particular choice of ξ* is actually immaterial, since its coefficient is equal to 0 for u* = 1/2. Additionally, the estimator φ_t^{u*}(p) remains constant for u* = 1/2 up to the next jump. Thus the service rate is split fifty-fifty, and the next served customer determines which queue is served exclusively next, since directly after a jump the estimator p_t is not equal to 1/2 anymore.
Remark 5.21 From the proof we conclude the stay-on-a-winner property: If the server
finishes the service of a customer in queue i, then it will continue serving queue i (assuming
queue i is not empty). This is true since the completion of the service of a customer in
queue i makes queue i more probable to be the ”better” queue, see (5.9).
The following table illustrates the results above in a numerical context. We simulate a parallel queueing model with unknown, symmetric service rates, consider different strategies and compare them to each other.

strategy                           simulated cost
cµ rule (complete information)     8.4795
prefer queue 2                     8.4975
prefer queue 1                     8.7538
allocate constant 50 : 50          8.6410
uniform distribution on {0, 1}     8.6061
control limit with p = 1/2         8.5224

Table 1: Symmetric case with µ^A = 0.1, µ^B = 0.3, λ1 = λ2 = 0, i1 = 3, i2 = 5, p_0 = 0.9, β = 0.9 and true parameter µ1 = µ^A.
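A simulation in the spirit of Table 1 fits into a few lines. The sketch below is illustrative and simplified (pure controls only, the table's parameters with λ1 = λ2 = 0, and a modest number of runs, so the estimates will deviate somewhat from the table); other rows are obtained by swapping the policy function.

import math, random

muA, muB, beta = 0.1, 0.3, 0.9
mu = lambda p: muA * p + muB * (1 - p)

def flow(p, t):                  # drift of p while queue 1 is served
    E = math.exp((muB - muA) * t)
    return p * E / (p * E - p + 1.0)

def simulate(policy, i1=3, i2=5, p0=0.9, true_mu1=muA, runs=2000, seed=0):
    rng, total = random.Random(seed), 0.0
    rates = {1: true_mu1, 2: muA + muB - true_mu1}     # symmetric case
    for _ in range(runs):
        x1, x2, p, t, cost = i1, i2, p0, 0.0, 0.0
        while x1 + x2 > 0:
            a = policy(x1, x2, p)
            if x1 == 0: a = 2
            if x2 == 0: a = 1
            dt = rng.expovariate(rates[a])             # next completion
            # discounted waiting costs over [t, t + dt) with c1 = c2 = 1
            cost += (x1 + x2) * (math.exp(-beta * t) - math.exp(-beta * (t + dt))) / beta
            p = flow(p, dt) if a == 1 else 1 - flow(1 - p, dt)       # drift of p_t
            p = muA * p / mu(p) if a == 1 else muB * p / mu(1 - p)   # Bayes jump
            if a == 1: x1 -= 1
            else: x2 -= 1
            t += dt
        total += cost
    return total / runs

control_limit = lambda x1, x2, p: 1 if p < 0.5 else 2   # threshold p* = 1/2
print(simulate(control_limit))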
Donchev proved in his symmetric bandit models, considered in Donchev and Yushkevich (1996), Donchev (1998) and Donchev (1999), with the help of Presman and Sonin (1990) and Presman (1990), that the control limit strategy with control limit p = 1/2 is optimal. The proofs use a logarithmic scale, which works due to the special structure of the cost rate there. In our model this approach does not seem to work.
5.2.5 Complete Information about One Service Rate
Consider now case B) in (5.3). Hence the parameter µ2 is known, and the three inequalities can be rewritten as

c1 µ_1^B > c2 µ2 > c1 µ_1^A.   (5.10)
Since one parameter is completely observable, the estimator process q_t is dispensable and p_t is the solution of

dp_t = u_t(1) (µ_1^B − µ_1^A) p_t(1 − p_t) 1{X_t^1 > 0} dt + Φ(p_{t−}) dN_t^1(X_{t−}^1, X_{t−}^1 − 1),   (5.11)

where Φ(p) = µ_1^A p / µ1(p) − p, see also section 5.2.1. The results of lemmas 5.8 and 5.9 carry over to this section. Thus φ_t^u(p) is
• monotone increasing in t and p,
• Lipschitz continuous in t and p.
Again the separation property of the value function J(i1, i2, p) = V_∞(i1, i2, p) holds true. Hence the generalized HJB-equation given in (5.8) simplifies to

βG(p) = inf_{ξ ∈ ∂_p G(p), u ∈ U} { ξ (µ_1^B − µ_1^A) p(1 − p) u + ( G(p + Φ(p)) − G(p) − c1 K ) µ1(p) u − c2 µ2 K (1 − u) }.   (5.12)
We prove next that in this model there even exists a pure optimal strategy; that means one queue is always served exclusively. This result is very special to the case of one unknown service rate (here µ1). It will not hold true for the case of two unknown service rates in general, as pointed out in the last sections.
Theorem 5.22 There exists an optimal pure control u∗ = (u∗t ), in particular u∗t ∈ {0, 1}.
Proof: The existence of an optimal deterministic control u* is a consequence of the existence theorem 5.18. There we have seen that the optimal control serves one queue exclusively, except when the minimum point of the Hamiltonian is not unique. In that case it is sufficient to choose the minimum point such that the estimator process remains constant (see the comments after theorem 5.18). This is the case here if we choose u* = 0. The existence of ξ* = G_{0,p}(p; 1) is guaranteed as in theorem 5.18 by the regularity of the value function in p. Summarizing, we have found a pure control via the verification theorem 4.4.
From the proof we conclude that if the optimal server switches from a non-empty queue 1 to queue 2, it will remain there until queue 2 is empty, since the completion of a service in queue 2 has no influence on the estimator process p. From the HJB-equation (5.12) and the last proof it is clear that the optimal control is a control limit strategy. It is defined as u* = u*(X_{t−}, p_{t−}) with

u*(i1, i2, p) = 1 if i2 = 0, i1 > 0,
u*(i1, i2, p) = 2 if i1 = 0, i2 > 0,
u*(i1, i2, p) = 1 if i1 > 0, i2 > 0, H(p) < c2 µ2 K,
u*(i1, i2, p) = 2 if i1 > 0, i2 > 0, H(p) ≥ c2 µ2 K,
u*(i1, i2, p) arbitrary if i1 = i2 = 0,

where the control limit is given by

H(p) := ( G(p + Φ(p)) − G(p) − c1 K ) µ1(p) + G_{0,p}(p; 1) (µ_1^B − µ_1^A) p(1 − p).
Unfortunately, this control limit is not directly usable, since it depends on the function G(p), which is quite hard to compute. On the other hand, it is clear that there exists a critical time τ*, defined as the first time point where the control limit H(p_t*) is greater than or equal to c2 µ2 K, such that it is optimal to serve queue 1 for t < τ* and queue 2 for t ≥ τ*, as long as customers are waiting in both queues. This is the well-known stay-on-a-winner property. The next theorem states that if the estimate c1 µ1(p) is greater than c2 µ2, then it is always optimal to serve queue 1.
Theorem 5.23 If c1 µ1 (p) ≥ c2 µ2 , then it is optimal to serve queue 1.
Proof: The statement is an immediate consequence of the properties of the optimal control
in the complete information case, discussed in theorem 5.1.
This is some kind of one-step-look-ahead rule: a myopic strategy, as in the case of complete information. Due to the monotonicity of p ↦ µ1(p), we are able to characterize the optimality condition of the last theorem in dependence on p. For this define
optimality condition of the last theorem in dependence on p. For this define
p≤ := sup{p ∈ [0, 1] | c1 µ1 (p) ≥ c2 µ2 }
and if p ≤ p≤ serving queue 1 (means u∗ (p) = 1) is optimal. Note that p≤ is well-defined,
≤
since for p = 0 we get c1 µ1 (0) = c1 µB
1 > c2 µ2 by (5.10). p is independent of the length of
the queues i1 , i2 and can be computed explicitly as
p≤ =
c1 µ B
1 − c2 µ 2
.
A
c1 (µB
1 − µ1 )
A
From this equation we see that p≤ is monotone increasing in µB
1 and decreasing in µ1 and
µ2 . This makes sense, since the higher µ2 the lower p has to be such that c1 µ1 (p) ≥ c2 µ2
is satisfied. Similar interpretations hold for the other parameters.
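For concreteness (illustrative numbers satisfying (5.10)):

mu1A, mu1B, mu2, c1, c2 = 0.1, 0.5, 0.3, 1.0, 1.0    # c1 mu1B > c2 mu2 > c1 mu1A
p_leq = (c1 * mu1B - c2 * mu2) / (c1 * (mu1B - mu1A))
mu1 = lambda p: mu1A * p + mu1B * (1 - p)
assert abs(c1 * mu1(p_leq) - c2 * mu2) < 1e-12       # c1 mu1(p) = c2 mu2 exactly at p_leq
print(p_leq)                                          # 0.5 for these numbers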
Notice that c2 µ2 > c1 µ1(p) does not imply the optimality of serving queue 2, as we will see in the following. If no new customers arrive at either queue, we are able to give a sufficient condition for an optimal pure control in the spirit of the cµ-rule, as in the complete information case, see theorem 5.1. Hence assume

λ1 = λ2 = 0.

By this assumption we immediately get a finite state space S_X, that means

S_X = {0, . . . , i1(0)} × {0, . . . , i2(0)},
where i_j(0) is the number of customers waiting at time 0 in queue j. We first mention that, as a consequence of the finite queue lengths, the cost function in the MDP, given by r(i1, i2) = (c1 i1 + c2 i2)/(α + β), is bounded. Since δ < 1, we can apply Banach's fixed point theorem due to the boundedness of V_∞(i1, i2, p), and we obtain without any continuity and compactness assumptions

lim_{n→∞} V_n(i1, i2, p) = V_∞(i1, i2, p),

V_∞ is the unique fixed point of T, and for every bounded V_0

||V_∞ − T^n V_0|| ≤ δ^n/(1 − δ) ||T V_0 − V_0||.

Notice that Howard's policy improvement algorithm is also applicable now.
We denote the state where both queues are empty by G_0 := (0, 0). G_0 is an absorbing set, since if both queues are empty the optimization problem is terminated and the state process (X_t, p_t) will never leave G_0 × [0, 1], as there are no arrivals to the queues. It is clear that it is never optimal to serve an empty queue (see remark 5.11). Therefore we restrict the set of admissible actions for i ∉ G_0 to

D(i1, i2, p) = A if i1 > 0, i2 > 0,
D(i1, i2, p) = {γ ∈ A | γ_t ≡ 1} if i1 > 0, i2 = 0,
D(i1, i2, p) = {γ ∈ A | γ_t ≡ 2} if i1 = 0, i2 > 0.

As a consequence we know that under each admissible strategy the set G_0 is reached almost surely, if all possible service rates are strictly positive.
Lemma 5.24 Assume µ_1^A > 0. Then for all (i1, i2, p) and all policies π ∈ F^∞ there exists a random variable τ := τ(i1, i2, p) with P_π(τ < ∞) = 1 such that

P_π( (X_τ, p_τ) ∈ G_0 × [0, 1] | (X_0, p_0) = (i1, i2, p) ) = 1,

and hence the MDP is terminating.
Proof: Avoiding service of an empty queue, one only has to wait for the time point at which i1 customers in queue 1 and i2 customers in queue 2 have been served. The service behaviour is described by Poisson processes, and we know that i1 and i2 jumps happen in finite time almost surely as long as the intensities are positive, see for example Brémaud (1981).
Since there are no arrivals, we are able to use a very special solution technique, called recursion in the state space, which proceeds as follows. Define G_1 := {(1, 0), (0, 1)}, G_2 := {(2, 0), (1, 1), (0, 2)} and so on. Then we gain a disjoint partition S_X = {0, . . . , i1(0)} × {0, . . . , i2(0)} = ∪_k G_k. Define N := N(i1, i2, p) as the number of jumps of p_t until X_τ ∈ G_0 and V(i1, i2, p) := V_N(i1, i2, p). By the terminating property we know

V(i1, i2, p) = 0 for all (i1, i2) ∈ G_0.

Then compute for (i1, i2) ∈ G_k, k = 1, 2, . . ., the value function V(i1, i2, p) with the help of the Bellman equation, which is possible since for the computation of V(i1, i2, p) for (i1, i2) ∈ G_k only the knowledge of V(j1, j2, p) for (j1, j2) ∈ G_κ, κ = 0, . . . , k − 1, is necessary. We will demonstrate this in the proof of theorem 5.25.
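In code, the recursion visits the levels G_k in increasing order; each value on G_k only requires values on G_{k−1}. For clarity, the following sketch (illustrative, not from the thesis) drops the estimator p and demonstrates the pure level-by-level recursion in the complete-information case without arrivals, where serving queue a until the next completion costs (c1 i1 + c2 i2 + µ_a V(i − e_a))/(β + µ_a):

mu1_, mu2_, c1, c2, beta = 0.5, 0.3, 1.0, 1.0, 0.9    # illustrative parameters
N1, N2 = 3, 5                                          # initial queue lengths

V = {(0, 0): 0.0}                       # G_0 is absorbing: no further costs
for k in range(1, N1 + N2 + 1):         # levels G_k = {(i1, i2) : i1 + i2 = k}
    for i1 in range(max(0, k - N2), min(k, N1) + 1):
        i2 = k - i1
        cands = []
        if i1 > 0:                      # serve queue 1 -> (i1 - 1, i2) in G_{k-1}
            cands.append((c1 * i1 + c2 * i2 + mu1_ * V[(i1 - 1, i2)]) / (beta + mu1_))
        if i2 > 0:                      # serve queue 2 -> (i1, i2 - 1) in G_{k-1}
            cands.append((c1 * i1 + c2 * i2 + mu2_ * V[(i1, i2 - 1)]) / (beta + mu2_))
        V[(i1, i2)] = min(cands)

print(V[(N1, N2)])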
The next theorem gives a sufficient condition for an optimal control. It is the counterpart of theorem 5.23, where we gave a condition under which serving queue 1 is optimal. Now: if the highest estimated value of c1 µ1 is less than c2 µ2 (remember the monotonicity result in lemma 5.10), then the optimal strategy is to serve queue 2 until it is empty. Hence the stay-on-a-winner property is obtained again. The idea behind this result is that the estimate for c1 µ1 remains less than c2 µ2 all the time. Consequently there is no expected benefit in accepting higher costs now for new information in the hope of lower future costs. Introduce the operator M defined by

Mp := p + Φ(p).
Theorem 5.25 If i1 > 0, i2 > 0 and

    c2µ2 > c1µ1(M^{i1−1} p)   (5.13)

then it is optimal to serve queue 2 until it is empty.
Before we prove this theorem we discuss its preconditions in more detail and derive some additional results for the proof.
Lemma 5.26 The set of all p which fulfil (5.13) is an interval, that means

    {p ∈ [0, 1] | c2µ2 > c1µ1(M^{i1−1} p)} = (p^>(i1), 1],

where p^>(i1) := inf{p ∈ [0, 1] | c2µ2 > c1µ1(M^{i1−1} p)}. p^>(i1) is well-defined since for p = 1 we have c1µ1(1) = c1µ1^A < c2µ2 by assumption (5.10).
Proof: By lemma 5.10 we know that p ↦ Mp is monotone increasing and by assumption (5.10) we have µ1^A < µ1^B. Let p > p^>(i1) =: p^>; then

    Mp ≥ Mp^>   ⟹   M^{i1−1} p ≥ M^{i1−1} p^>   ⟹   c1µ1(M^{i1−1} p) ≤ c1µ1(M^{i1−1} p^>) < c2µ2,

where the last inequality holds by the definition of p^>(i1). In particular p^>(i1) is well-defined and depends only on i1 and not on i2.
Observe that p^>(i1) can be computed explicitly for every i1, for example

    p^>(2) = [ µ1^B (µ2 − µ1^B) / ( µ1^A (µ1^A − µ1^B) ) ] / [ 1 − (µ2 − µ1^B)/µ1^A ].
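Numerically, p^>(i1) can also be found by bisection directly from its definition, since by lemma 5.26 the condition holds exactly on the interval (p^>(i1), 1]. The sketch below assumes the affine estimator µ1(p) = p µ1^A + (1 − p) µ1^B (consistent with µ1(1) = µ1^A above) and takes the jump map Φ of the filter as a user-supplied function; all names are illustrative.

    # Bisection for p^>(i1) = inf{p in [0,1] : c2*mu2 > c1*mu1(M^(i1-1) p)},
    # where Mp = p + Phi(p).  Phi is a placeholder for the filter's jump map.
    def p_threshold(i1, c1, c2, mu2, muA, muB, Phi, iters=60):
        mu1 = lambda p: p * muA + (1.0 - p) * muB   # estimate of mu1 given p
        def M_pow(p, k):
            for _ in range(k):
                p = p + Phi(p)                      # apply M k times
            return p
        lo, hi = 0.0, 1.0                           # condition holds at p = 1
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if c2 * mu2 > c1 * mu1(M_pow(mid, i1 - 1)):
                hi = mid                            # condition holds above p^>
            else:
                lo = mid
        return hi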
As an immediate consequence of the definition of p^>(i1) one can show:

    p^>(i1) = inf{p ∈ [0, 1] | c2µ2 > c1µ1(M^{i1−1} p)}
            = inf{p + Φ(p) ∈ [0, 1] | c2µ2 > c1µ1(M^{i1−2}(p + Φ(p)))}
            = (p + Φ(p))^>(i1 − 1).

Additionally we get the monotonicity of i1 ↦ p^>(i1). That means: the more customers are waiting in queue 1, the higher p^>(i1), in particular the higher the threshold on the estimate of the (bad) service rate µ1^A for serving queue 2.
Lemma 5.27 i1 ↦ p^>(i1) is monotone increasing.

Proof: By lemma 5.10 we know that Φ(p) ≤ 0, hence M^{i1−1} p ≤ M^{i1−2} p and thus µ1(M^{i1−1} p) ≥ µ1(M^{i1−2} p) by (5.10). Hence

    p^>(i1) = inf{p ∈ [0, 1] | c2µ2 > c1µ1(M^{i1−1} p)}
            ≥ inf{p ∈ [0, 1] | c2µ2 > c1µ1(M^{i1−2} p)} = p^>(i1 − 1).
Important for the proof of optimality in theorem 5.25 is the following lemma, which guarantees that the recursion in the state space is possible. That means: if condition (5.13) is satisfied in state (i1, p), then it remains fulfilled if one customer is served at queue 1 and the estimator has been updated from p to p + Φ(p); in particular it is fulfilled in (i1 − 1, p + Φ(p)).
Lemma 5.28 If p > p^>(i1) then Mφ^γ_t(p) > p^>(i1 − 1) for all i1 ≥ 2 and for all t ≥ 0.

Proof: By lemma 5.8 we know φ^γ_t(p) > p and we conclude for i1 = 2

    c1µ1(φ^γ_t(p)) ≤ c1µ1(p) ≤ c1µ1(Mp) < c2µ2,

where the second inequality holds due to lemma 5.10. Assume the statement is true for some i1 ≥ 2; then with lemmas 5.8 and 5.10

    c1µ1(Mφ^γ_t(p)) ≤ c1µ1(Mp) ≤ c1µ1(M(p + Φ(p))) < c2µ2,

where the last inequality follows as in the case i1 = 2.
We are now in the position to give the proof of theorem 5.25. As in the complete information
case in theorem 5.1 we use an interchange argument and additionally the recursion in the
state space.
Proof of theorem 5.25:
The details of the proof can be found in appendix B. We illustrate here the main steps. Since the existence of an optimal pure control is guaranteed by theorem 5.22, define a decision rule f = (ft) = (f(X^1_t, X^2_t, pt)) where

    f(i1, i2, p) := { 2   if i1 > 0, i2 > 0,
                    { 1   if i2 = 0,
                    { 2   if i1 = 0.
Denote the slightly modified decision rule g by

    g(i1, i2, p) := { 1   if i1 > 0, i2 > 0, t ∈ B,
                    { 2   if i1 > 0, i2 > 0, t ∉ B,
                    { 1   if i2 = 0,
                    { 2   if i1 = 0,

where B ⊂ [0, ∞). Without loss of generality assume B = [0, ε]. Consider then the two policies

    π = (f, g, f, . . . , f) ∈ F^n   and   π̃ = (g, f, . . . , f) ∈ F^n

and compute

    V_{n,π̃}(i1, i2, p) − V_{n,π}(i1, i2, p) = ∫_0^ε e^{−(α+β)t} ( c2µ2 − c1µ1(φ^g_t(p)) )/(α+β) dt ≥ 0.   (5.14)
This inequality is true since c1µ1(φ^g_t(p)) ≤ c1µ1(M^{i1−1} p) < c2µ2. For this statement we have to use φ^g_t(p) ≥ p ≥ Mp (see lemma 5.10) and µ1^A < µ1^B by (5.10). Finally we have to show that (ft)_{t≥0} is a minimizer of Vn(i1, i2, p) for p > p^>(i1) and all n ∈ N. Consider

    Tg Vn(i1, i2, p) = Tg V_{n,(f,...,f)}(i1, i2, p) = V_{n+1,(g,f,...,f)}(i1, i2, p)
                     ≥ V_{n+1,(f,g,f,...,f)}(i1, i2, p) = Tf Vn(i1, i2, p),   [by (5.14)]

and hence f = (ft) is a minimizer of Vn(i1, i2, p). In the detailed computations we used

    Vn(i1, i2, φ^g_s(p)) = V_{n,(f,...,f)}(i1, i2, φ^g_s(p)),

which is true by the induction hypothesis, since φ^g_s(p) ≥ p, and

    Vn(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) = V_{n,(f,...,f)}(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))),
    Vn(i1, i2 − 1, φ^g_s(p)) = V_{n,(f,...,f)}(i1, i2 − 1, φ^g_s(p))

by the recursion in the state space, since the preconditions of this theorem are fulfilled for n in the states (i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) and (i1, i2 − 1, φ^g_s(p)) by lemma 5.28.
In the special case of (1, i2) the condition of theorem 5.25 simplifies to

    c2µ2 > c1µ1(p),

which is the certainty equivalence principle of the cµ-rule, where the unknown parameter µ1 is replaced by its estimator µ1(p). In general the preconditions of the theorem guarantee that the highest estimated value for µ1 is such that c1µ1(p) remains less than c2µ2, which is essential for the proof of the optimality of the cµ-rule. That means that for finding the optimal decision, knowledge of the conditional probability p is no longer necessary.
With theorem 5.23 in mind we have found an optimal (pure) control for the model without arrivals on the intervals [0, p^≤] and (p^>(i1), 1], where p^≤ = p^>(1). The interval (p^>(i1), 1] gets smaller for increasing i1, see lemma 5.27. This fact is illustrated in figures 4 and 5. Remember that outside these two intervals the optimal control also serves one queue exclusively, as proven in theorem 5.22.
[Figure 4: Optimal Control in a Waiting-Cost Model without Arrivals for fixed i1 — the optimal action u*(1) plotted against p, with the thresholds p^≤(i1) and p^>(i1).]

[Figure 5: Optimal Control in a Waiting-Cost Model without Arrivals — the thresholds p^>(1) = p^≤ ≤ p^>(2) ≤ p^>(3) plotted against i1 = 1, 2, 3; ut(1) = 1 below, ut(1) = 0 above.]
Knowing that the optimal strategy is a control limit strategy, we propose and discuss some reasonable control limit strategies in the next table. Notice that the true control limit is not suitable in practice, since it depends on G(p) as seen on page 86, which is quite hard to compute. Unfortunately our numerical investigations do not uniquely indicate a "best" strategy, but the certainty equivalence control mostly performs well (see table 2).
    strategy                                                 simulated cost
    cµ-rule (complete information)                           8.50442
    split service constant 50 : 50                           8.6506
    allocate according to a uniform distribution on {0, 1}   8.6244
    control limit with mean of p^>(i1) and p^≤               8.6308
    certainty equivalence                                    8.6139

Table 2: One Service Rate known with c1 = c2 = 1, µ1^A = 0.1, µ1^B = 0.3, µ2 = 0.2, λ1 = λ2 = 0, i1 = 3, i2 = 5, p0 = 0.9, β = 0.9 and true parameter µ1 = µ1^A
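Simulated costs of this kind can be produced with a simple Monte-Carlo harness; the sketch below is only a rough illustration under simplifying assumptions (no arrivals, the server may switch only at departure epochs, the policy never serves an empty queue, and the filter update over a sojourn is passed in as a placeholder `update`).

    import math, random

    # Monte-Carlo estimate of the expected discounted waiting cost of a
    # strategy policy(i1, i2, p) -> 1 or 2 in the no-arrival model; `mu`
    # maps the queue number to its true service rate.
    def simulated_cost(policy, update, mu, c1, c2, beta, i1, i2, p, runs=10000):
        total = 0.0
        for _ in range(runs):
            t, cost, n1, n2, q = 0.0, 0.0, i1, i2, p
            while n1 + n2 > 0:
                serve = policy(n1, n2, q)            # admissible: queue nonempty
                dt = random.expovariate(mu[serve])   # time to next departure
                # waiting cost c1*n1 + c2*n2, discounted over [t, t+dt]
                cost += (c1 * n1 + c2 * n2) * (
                    math.exp(-beta * t) - math.exp(-beta * (t + dt))) / beta
                q = update(q, serve, dt)             # filter over the sojourn
                if serve == 1:
                    n1 -= 1
                else:
                    n2 -= 1
                t += dt
            total += cost
        return total / runs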
5.2.6 The Optimal Control in a Model with Reward-Function

We consider now a parallel queueing setup as in section 5.1, but with a different cost structure. Instead of waiting costs for each customer in queue, we earn for each served customer
of queue i a positive reward ri ≥ 0. Hence the reward criterion is

    E^u[ ∫_0^∞ e^{−βt} ( r1 dN^{X^1}_t(X^1_{t−}, X^1_{t−} − 1) + r2 dN^{X^2}_t(X^2_{t−}, X^2_{t−} − 1) ) ] → max.
N^{X^j}_t(ij, ij − 1) denotes the departure process of queue j, modelled by a Poisson process with intensity

    µj 1{u = j} 1{ij > 0},

where ij is the number of waiting customers in queue j. Using the definition of the intensities the objective function can be written as

    E^u[ ∫_0^∞ e^{−βt} ( r1µ1 ut(1) 1{X^1_t > 0} + r2µ2 ut(2) 1{X^2_t > 0} ) dt ].
Here the reward function is bounded. Hence, by Banach's fixed point theorem, we do not require the continuity and compactness conditions for the convergence of Vn to V∞. As in remark 5.11 it is never advantageous to serve an empty queue. One can prove that the rµ-rule is optimal in the complete information case, that means

    r1µ1 ≥ r2µ2   ⟹   u*_t := (u*(X_{t−})) with u*(i1, i2) = { 1 if i1 > 0,
                                                              { 2 if i1 = 0
is optimal.

The proof of this optimality statement works completely analogously to theorem 5.1, where in the spirit of (5.1) we get the inequality

    V_{N+1,(g,f*,...,f*)} − V_{N+1,(f*,g,f*,...,f*)} = 1/(α+β) ( 1 − α/(α+β) ) (r2µ2 − r1µ1) ≤ 0.

Hence we see that this myopic control is optimal in the case of complete information.
As in the last sections we assume that µ1 ∈ {µ1^A, µ1^B} is not observable and, similar to (5.10),

    r1µ1^B > r2µ2 > r1µ1^A.   (5.15)
One can derive the conditional probabilities for µ1 as in section 5.2.1 and define the reduced MDP as in section 5.2.2 with the slight modification in the reward function given by

    r(i1, i2, p, γ) = ∫_0^∞ e^{−(α+β)t} ( r1µ1(φ^γ_t(p)) γt(1) 1{i1 > 0} + r2µ2 γt(2) 1{i2 > 0} ) dt.
Again an optimality equation can be formulated, but for this reward criterion a more explicit solution can be given by another method: the optimal control is a pure one which is completely characterized by an index.
Remember first remark 5.7, where we introduced the notation Tt(1) for the service time at queue 1 in [0, t], in particular

    Tt(1) := ∫_0^t us(1) ds   ⟹   dTt(1) = ut(1) dt,
    Tt(2) := ∫_0^t us(2) ds   ⟹   dTt(2) = ut(2) dt.
Due to ut(1) + ut(2) = 1 we get Tt(1) + Tt(2) = t. Then we define (p̃t, X̃t) as the solution of the control independent analogons to (5.11) and (2.12)

    dp̃t = (µ1^B − µ1^A) p̃t (1 − p̃t) dt + Φ(p̃_{t−}) dN^{X^1}_t(X^1_{t−}, X^1_{t−} − 1)
    dX̃t = Q^X(p̃t) dt + dM̃^X_t
    (p̃0, X̃0) = (p, x0),

where Q^X(p) = (q^X_{ij}(p)) with

    q^X_{ij}(p) = { µ1(p)                     if j = i − e1,
                  { µ2                        if j = i − e2,
                  { λ1                        if j = i + e1,
                  { λ2                        if j = i + e2,
                  { −λ1 − λ2 − µ1(p) − µ2     if j = i.
These two equations are the control independent stochastic differential equations for the state process, denoting the number of waiting customers in each queue, and for the estimator process for the unknown service rate. Since the jump sizes of Xt and pt are independent of the control, it is clear that for a fixed control u = (ut) the following relations hold:

    p^u_t = p̃_{Tt(1)}   and   (X^{1,u}_t, X^{2,u}_t) = (X̃^1_{Tt(1)}, X̃^2_{Tt(2)}).

Here we stress the dependence of the state process X^{i,u}_t on the control process u. Thus we see that there is a one-to-one relation between the process controlled through u at time t and the uncontrolled process considered at the time point Tt(i), i.e. after serving Tt(i) units of time.
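For intuition, this time change can be mimicked numerically: the sketch below advances the estimator only while queue 1 is served (i.e. on the clock Tt(1)) by an Euler step for the drift above, and applies the jump p ↦ p + Φ(p) at an observed departure from queue 1; Φ is again a placeholder for the filter's jump map.

    # Piecewise-deterministic estimator: Euler integration of
    # dp = (muB - muA) p (1 - p) dt on the service clock of queue 1,
    # followed by the jump p -> p + Phi(p) at a departure.
    def evolve_estimate(p, service_time, departed, muA, muB, Phi, h=1e-3):
        for _ in range(int(service_time / h)):       # deterministic flow
            p += (muB - muA) * p * (1.0 - p) * h
        if departed:                                 # jump at a departure
            p = p + Phi(p)
        return min(max(p, 0.0), 1.0)                 # keep p a probability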
Additionally, we transform the objective function into a notation depending on Tt(i) by

    E^u[ ∫_0^∞ e^{−βt} ( r1µ1(p^u_t) 1{X^{1,u}_t > 0} ut(1) + r2µ2 1{X^{2,u}_t > 0} ut(2) ) dt ]
    = E[ ∫_0^∞ e^{−βt} ( r1µ1(p̃_{Tt(1)}) 1{X̃^1_{Tt(1)} > 0} dTt(1) + r2µ2 1{X̃^2_{Tt(2)} > 0} dTt(2) ) ],

where we change the notation from E^u to E due to the previous comments. In the next theorem we prove the optimality of a pure index strategy.
Theorem 5.29 Define by

    Γ(i1, p) := esssup_{σ>0} E[ ∫_0^σ e^{−βt} r1µ1(p̃t) 1{X̃^1_t > 0} dt | p̃0 = p, X̃^1_0 = i1 ]
                             / E[ ∫_0^σ e^{−βt} dt | p̃0 = p, X̃^1_0 = i1 ]
the Gittins index for queue 1. Then (u*_t) with

    u*_t := u*(X̃^1_{T*_t(1)}, X̃^2_{T*_t(2)}, p̃_{T*_t(1)})
          = { 1   if X̃^1_{T*_t(1)} > 0 and Γ(X̃^1_{T*_t(1)}, p̃_{T*_t(1)}) ≥ r2µ2,
            { 2   if X̃^2_{T*_t(2)} > 0 and Γ(X̃^1_{T*_t(1)}, p̃_{T*_t(1)}) < r2µ2,
            { 1   if X̃^2_{T*_t(2)} = 0,
            { 2   if X̃^1_{T*_t(1)} = 0

is an optimal control, where T*_t(i) corresponds to u*_t via T*_t(i) = ∫_0^t u*_s(i) ds.
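The allocation rule of the theorem translates directly into a decision function; in the sketch below the Gittins index Γ is assumed to be given (for example precomputed numerically), and all names are illustrative.

    # Index policy of theorem 5.29: serve queue 1 exactly when its Gittins
    # index (at the current state and estimate) is at least r2*mu2.
    def index_policy(i1, i2, p, Gamma, r2, mu2):
        if i2 == 0:
            return 1                 # queue 2 empty: serve queue 1
        if i1 == 0:
            return 2                 # queue 1 empty: serve queue 2
        return 1 if Gamma(i1, p) >= r2 * mu2 else 2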
Proof: Consider first, with the product rule,

    ∫_0^∞ [ ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds ] d e^{−β(t−Tt(1))}
    = [ ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds · e^{−β(t−Tt(1))} ]_{t=0}^∞
      − ∫_0^∞ e^{−β(t−Tt(1))} d( ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds )
    = − ∫_0^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds
      − ∫_0^∞ e^{−β(t−Tt(1))} d( ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds ).

Thus we get

    ∫_0^∞ e^{−βt} µ1(p̃_{Tt(1)}) 1{X̃^1_{Tt(1)} > 0} dTt(1)
    = − ∫_0^∞ e^{−β(t−Tt(1))} d( ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds )
    = ∫_0^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds + ∫_0^∞ [ ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds ] d e^{−β(t−Tt(1))}.

Taking expectation and using the Markov property of our model we continue

    E[ ∫_0^∞ e^{−βt} µ1(p̃_{Tt(1)}) 1{X̃^1_{Tt(1)} > 0} dTt(1) ]
    = E[ ∫_0^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds ]
      + E[ ∫_0^∞ E( ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds | F^X_{Tt(1)} ) d e^{−β(t−Tt(1))} ]
    = E[ ∫_0^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds ]
      + E[ ∫_0^∞ E( ∫_{Tt(1)}^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds | X̃_{Tt(1)}, p̃_{Tt(1)} ) d e^{−β(t−Tt(1))} ].

Applying the representation theorem of Bank and ElKaroui (2004) (with f(t, l) := βe^{−βt} l and Xt := ∫_t^∞ e^{−βs} µ1(p̃s) 1{X̃^1_s > 0} ds) results in

    E[ ∫_0^∞ e^{−βs} r1µ1(p̃s) 1{X̃^1_s > 0} ds ]
      + E[ ∫_0^∞ E( ∫_{Tt(1)}^∞ e^{−βs} r1µ1(p̃s) 1{X̃^1_s > 0} ds | X̃_{Tt(1)}, p̃_{Tt(1)} ) d e^{−β(t−Tt(1))} ]
    = E[ ∫_0^∞ e^{−βs} inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν) ds ]
      + E[ ∫_0^∞ E( ∫_{Tt(1)}^∞ e^{−βs} inf_{ν∈[Tt(1),s]} Γ(X̃^1_ν, p̃ν) ds | X̃_{Tt(1)}, p̃_{Tt(1)} ) d e^{−β(t−Tt(1))} ]
    ≤ E[ ∫_0^∞ e^{−βs} inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν) ds ]
      + E[ ∫_0^∞ ( ∫_{Tt(1)}^∞ e^{−βs} inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν) ds ) d e^{−β(t−Tt(1))} ]   (5.16)
    = E[ ∫_0^∞ e^{−βs} inf_{ν∈[0,Ts(1)]} Γ(X̃^1_ν, p̃ν) dTs(1) ],

where the inequality holds true due to d e^{−β(t−Tt(1))} ≤ 0, since e^{−β(t−Tt(1))} is monotone decreasing in t. The last equality follows analogously to the beginning of the proof by applying the product rule.

Thus we have found the following bounds for the objective function:

    E[ ∫_0^∞ e^{−βs} ( r1µ1(p̃_{Ts(1)}) 1{X̃^1_{Ts(1)} > 0} dTs(1) + r2µ2 1{X̃^2_{Ts(2)} > 0} dTs(2) ) ]
    ≤ E[ ∫_0^∞ e^{−βs} ( inf_{ν∈[0,Ts(1)]} Γ(X̃^1_ν, p̃ν) dTs(1) + r2µ2 1{X̃^2_{Ts(2)} > 0} dTs(2) ) ]   (5.17)
    ≤ sup_{(T̃)} E[ ∫_0^∞ e^{−βs} ( inf_{ν∈[0,T̃s(1)]} Γ(X̃^1_ν, p̃ν) dT̃s(1) + r2µ2 1{X̃^2_{T̃s(2)} > 0} dT̃s(2) ) ].   (5.18)

We only have to prove that for our strategy T* equality holds in (5.17) and (5.18). Consider from (5.16) the expression

    ∫_0^∞ E( ∫_{Tt(1)}^∞ e^{−βs} [ inf_{ν∈[Tt(1),s]} Γ(X̃^1_ν, p̃ν) − inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν) ] ds | X̃_{Tt(1)}, p̃_{Tt(1)} ) d e^{−β(t−Tt(1))}.

If this expression is equal to 0, then we have equality in (5.16) and consequently in (5.17). Since

    inf_{ν∈[Tt(1),s]} Γ(X̃^1_ν, p̃ν) − inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν)   (5.19)

is lower-semi-right-continuous and greater than or equal to 0, we have equality in (5.16) if and only if, whenever

    dTt(1) = ut(1) dt < 1,   then   inf_{ν∈[Tt(1),s]} Γ(X̃^1_ν, p̃ν) = inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν).

This is the case for T*, since if Γ(X̃_{T*_t(1)}, p̃_{T*_t(1)}) < r2µ2 then we have by the definition of T* that Γ(X̃_{T*_t(1)}, p̃_{T*_t(1)}) = inf_{ν∈[0,T*_t(1)]} Γ(X̃^1_ν, p̃ν) and hence (5.19) is zero.

If we consider (5.18) we see that s ↦ inf_{ν∈[0,s]} Γ(X̃^1_ν, p̃ν) is monotone decreasing. Hence the myopic strategy is optimal and for this control equality holds in (5.18), that means

    ut(1) = 1   ⟺   inf_{ν∈[0,Tt(1)]} Γ(X̃^1_ν, p̃ν) ≥ r2µ2.

But this condition is, due to the definition of T*_t(1), equivalent to Γ(X̃^1_{T*_t(1)}, p̃_{T*_t(1)}) ≥ r2µ2.
As an immediate consequence of theorem 5.29 we get the following characterization of the value function:

Corollary 5.30 With (T*_t(1), T*_t(2)) from theorem 5.29 it holds:

    J(i1, i2, p) = E[ ∫_0^∞ e^{−βs} ( inf_{ν∈[0,T*_s(1)]} Γ(X̃^1_ν, p̃ν) dT*_s(1) + r2µ2 1{X̃^2_{T*_s(2)} > 0} dT*_s(2) ) ].
Remark 5.31
a) We have seen that if the optimal strategy has switched from queue 1 to queue 2, it will remain there unless queue 2 is empty or a new arrival occurs at queue 1. New arrivals at queue 2 have no influence on the optimal action at this moment, since Γ(i1, p) is independent of queue 2. On the other hand an arrival at queue 1 makes queue 1 more attractive to serve; in particular the optimal server will remain at queue 1 if he was at queue 1 before the arrival, since

    Γ(i1 + 1, p) = esssup_{σ>0} E[ ∫_0^σ e^{−βt} r1µ1(p̃t) 1{X̃^1_t > 0} dt | p̃0 = p, X̃^1_0 = i1 + 1 ]
                                / E[ ∫_0^σ e^{−βt} dt | p̃0 = p, X̃^1_0 = i1 + 1 ]
                 ≥ esssup_{σ>0} E[ ∫_0^σ e^{−βt} r1µ1(p̃t) 1{X̃^1_t > 0} dt | p̃0 = p, X̃^1_0 = i1 ]
                                / E[ ∫_0^σ e^{−βt} dt | p̃0 = p, X̃^1_0 = i1 ]
                 = Γ(i1, p),

since 1{X̃^1_t(i1 + 1) > 0} ≥ 1{X̃^1_t(i1) > 0}. Hence i1 ↦ Γ(i1, p) is monotone increasing.
b) The optimality of this index strategy can be extended from the Bayesian setting to the Hidden-Markov-Model, where µ1 changes over time according to an unobservable environment process (Zt). The proof works completely analogously, since we did not make use of the particular behaviour of pt.

c) If both µ1 and µ2 are unknown, the optimal strategy is again an index strategy, but the existence of a pure index strategy is not guaranteed anymore.

d) Our numerical studies indicate that if the service of a customer in queue 1 is finished, it is never optimal to change to queue 2, except if it was the last waiting customer in queue 1. In particular this indicates Γ(i1, p) ≥ r2µ2 ⇒ Γ(i1 − 1, p + Φ(p)) ≥ r2µ2, from which the stay-on-a-winner property follows.
We have seen in this proof that the model with the reward criterion is a classical bandit problem. This is due to the special structure of the rewards, whereas the model with waiting costs discussed in section 5.2.2 is not a bandit problem. This is explained by the fact that the waiting costs at the two queues are not independent.
Remark 5.32 The proof of theorem 5.29 is adapted from Bank and Küchler (2007). In this work the authors consider a bandit problem in continuous time and prove that the set of optimal allocation strategies is equal to the set of so-called index strategies. This Gittins theorem is well-known and has been proven in various ways, see for example ElKaroui and Karatzas (1997), Kaspi and Mandelbaum (1995) and Kaspi and Mandelbaum (1998). Our statement is slightly different, since we do not claim that every optimal control has to be of the structure of u*_t. In particular we drop here the special assumption of the synchronisation property. But with the same ideas we are able to show that in a classical two-armed-bandit model, where one service rate is Bayesian and the other one is known, index strategies are pure strategies. This can be obtained by proving, in the notation of Bank and Küchler (2007), that t ∈ D implies σ2(N(t)−) = 0.
If there are no arrivals to the system we can prove, under the same conditions as in theorem 5.25, that it is optimal to serve queue 2 until it is empty. But with this reward structure the proof simplifies in various ways and we do not need the recursion in the state space technique. If there are no arrivals and if µ1^A > 0 then, similar to lemma 5.24, there exists a random variable τ(i1, i2, p) such that for all t ≥ τ(i1, i2, p) both queues are empty and the program terminates. It is easy to prove that the definition of the Gittins index can be extended to stopping times (see e.g. Bank and ElKaroui (2004)) as

    Γ(i1, p) = esssup_{σ∈(0,τ(i1,i2,p))} E[ ∫_0^σ e^{−βt} r1µ1(p̃t) 1{X̃^1_t > 0} dt | p̃0 = p, X̃^1_0 = i1 ]
                                        / E[ ∫_0^σ e^{−βt} dt | p̃0 = p, X̃^1_0 = i1 ].
Then, in the spirit of theorem 5.25, we are able to state the following sufficient condition for an optimal control.
Theorem 5.33 If r1µ1(M^{i1−1} p) < r2µ2 then serving queue 2 until it is empty is optimal.

Proof: Due to theorem 5.29 and remark 5.31 it is sufficient to prove that Γ(i1, p) < r2µ2. This is true because of

    Γ(i1, p) = esssup_{σ∈(0,τ(i1,i2,p))} E[ ∫_0^σ e^{−βt} r1µ1(p̃t) 1{X̃^1_t > 0} dt | p̃0 = p, X̃^1_0 = i1 ]
                                        / E[ ∫_0^σ e^{−βt} dt | p̃0 = p, X̃^1_0 = i1 ]
             ≤ r1µ1(M^{i1−1} p) < r2µ2.
The proof can be simplified once more by making use of the objective function

    E[ ∫_0^{τ(i1,i2,p)} e^{−βt} ( r1µ1(p̃t) 1{X̃^1_t > 0} 1{ut = 1} + r2µ2 1{X̃^2_t > 0} 1{ut = 2} ) dt ].

By assumption we know r1µ1(p̃t) < r2µ2 for all t ∈ [0, τ(i1, i2, p)]. Thus it is obvious that queue 2 has a higher priority than queue 1.
This model with reward criterion is quite similar to the model completely solved in Donchev and Yushkevich (1996), Donchev (1998) and Donchev (1999). Especially the model in Donchev (1998) is closely related to our model, with two differences. First, we do not divide a constant flow between two servers; in contrast we have stochastic arrivals at each queue and the queues are served. As a consequence the Gittins index in our model depends on the current length of the queue. Second, Donchev (1998) assumed that both service parameters are unknown in a symmetric way as in section 5.2.4. This means µ1, µ2 ∈ {µ^A, µ^B} with µ1 = µ^A if and only if µ2 = µ^B, where the parameters change at random times. He solved the model with the help of variational inequalities, but using the theory of bandit problems yields the same results, as indicated in remark 5.31.
5.3 Unknown Length of the Queues: the 0-1-Observation

Assume now that the server is not able to observe queue 1 completely. The server can only distinguish whether at least two customers are waiting or not. For simplicity, assume that queue 2 and all parameters are completely known; in particular we are in the case of a 0-1-observation, see page 11. Thus the information structure for i1 is given by

    I(1) = {0, 1}   and   I(2) = {2, 3, . . .}.

Assume furthermore that c1µ1 > c2µ2. Hence it is always optimal to apply the cµ-rule and serve queue 1 if the server observes i1 ∈ I(2). But what is the optimal (or at least a well-performing) service allocation if the server only knows i1 ∈ I(1), that means there may be a customer waiting or there may not?
As in section 5.2 we derive a differential equation for the estimator process pt := P^u(X^1_t = 0 | F^Y_t). Then we state the HJB-equation and suggest some reasonable strategies, which we compare numerically, since the optimal control cannot be computed analytically. If the observation changes at time τ from f2 (i.e. i1 ∈ I(2)) to f1 (i.e. i1 ∈ I(1)), then the server knows that at this moment there is exactly one customer waiting in queue 1, hence pτ = 0. Starting from this a-priori probability the estimator process evolves as pt = φ_{t−τ}(0), where φt(0) is the unique solution of

    ṗ = −λ1 p² + µ1 u (1 − p),   p0 = 0.
If queue 1 is not served, that means u = 0, the estimator is decreasing. This is reasonable, since no customer can leave queue 1 whereas new customers may join queue 1. The generalized HJB-equation under the observation Yt = f1, in particular under i1 ∈ I(1), can be stated as

    βW(p, i2, f1) = inf_{ξ∈∂pW(p,i2,f1), u∈U} { c1(1 − p) + c2 i2 + ξ( µ1(1 − p)u − λ1 p² )
        + ( W(0, i2, f2) − W(p, i2, f1) ) λ1(1 − p)
        + ( W(p, i2 + 1, f1) − W(p, i2, f1) ) λ2
        + ( W(p, i2 − 1, f1) − W(p, i2, f1) ) µ2(1 − u) }.
As mentioned above we discuss two reasonable strategies for this model, where the state i1 = 0 is not completely observable.

5.3.1 Threshold-Strategy
The threshold-strategy serves queue 1 if the estimated probability that no customer is waiting is below a given threshold p*. Otherwise it serves queue 2. It is clear that if the observation process Yt changes from f2 to f1, the server continues serving queue 1, since at the moment of the change at time τ the estimator starts in pτ = 0. Define σ1 as the first time the threshold is reached after τ, and σ2 as the first time after τ at which the observation changes from f1 to f2. Then with σ := min{σ1, σ2} the estimator is given for t ∈ [τ, σ) by pt = φ_{t−τ}(0), where

    φt(0) = ( −µ1 + ρ tanh( ½ tρ + ½ ln((ρ + µ1)/(ρ − µ1)) ) ) / (2λ1)

with ρ := √(4µ1λ1 + µ1²).
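The closed form can be checked against a direct Euler integration of the Riccati equation above; the following Python sketch is only a numerical sanity check with illustrative parameters.

    import math

    # phi_t(0): closed-form flow of p' = -lam1*p^2 + mu1*(1 - p), p(0) = 0.
    def phi_closed(t, lam1, mu1):
        rho = math.sqrt(4.0 * mu1 * lam1 + mu1 ** 2)
        arg = 0.5 * t * rho + 0.5 * math.log((rho + mu1) / (rho - mu1))
        return (-mu1 + rho * math.tanh(arg)) / (2.0 * lam1)

    def phi_euler(t, lam1, mu1, h=1e-5):
        p = 0.0
        for _ in range(int(t / h)):
            p += (-lam1 * p ** 2 + mu1 * (1.0 - p)) * h
        return p

    # e.g. with lam1 = 0.1, mu1 = 0.3 (the values used for figure 6),
    # phi_closed(2.0, 0.1, 0.3) and phi_euler(2.0, 0.1, 0.3) agree closely.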
A special threshold is the certainty equivalence threshold pCEP defined by

    pCEP := 1 − c2µ2/(c1µ1) ∈ (0, 1).

Its name is justified by

    c1µ1(1 − pCEP) = c1µ1 · c2µ2/(c1µ1) = c2µ2,

that means pCEP is the value for which the estimate

    c1µ1(1 − p) = c1µ1 · P(one customer is waiting in queue 1)

is equal to c2µ2. If c2 > c1 under c1µ1 > c2µ2, then pCEP is monotone decreasing in c2 for fixed c1, µ1, µ2.
The following figure illustrates the relative costs of threshold policies with respect to the expected cost under complete information, that means on the y-axis we have

    γ := V^threshold / V^complete information,

for different thresholds, denoted on the x-axis. The parameter values were chosen as λ1 = λ2 = 0.1, µ1 = 0.3, µ2 = 0.4, c1 = 2, c2 = 1, β = 0.9, i1(0) = 1, i2(0) = 2, hence pCEP = 1/3.
[Figure 6: Relative Costs under Threshold-Strategies — γ plotted against the threshold, ranging over 0.1, . . . , 0.9; γ varies between roughly 1.02 and 1.075.]
The results of various numerical studies can be summarized as follows:
• If the threshold is near 1 then γ is higher, since the server undervalues the fact that one customer may still be waiting in queue 1. On the other hand, if the threshold is near 0 the server overrates this fact and probably serves an empty queue.

• The certainty equivalence principle threshold control performs very well among all threshold strategies, and mostly γ = V^threshold / V^complete information attains its minimum for this control.

• If c2 > c1 under c1µ1 ≥ c2µ2, then strategies with a higher threshold fit better, which is reasonable, since the waiting costs for the last waiting customer in queue 1 are significant compared to the waiting costs in queue 2.

• γ is decreasing in i1, which is due to the fact that X^1_t (denoting the length of queue 1) spends more time in the second observation group I(2) = {2, 3, . . .}, for which the (complete information) cµ-rule is optimal.
5.3.2 Double-Threshold-Strategy

We have seen that the estimator pt is monotone decreasing if queue 1 is not served. If the threshold is reached, a threshold strategy will stop the service of queue 1; the estimator then decreases, falls below the threshold again, and service restarts. Therefore it seems more reasonable to wait a certain time until the estimator pt reaches a lower level p_*. This level is less than the first threshold p^*. Once the lower level is reached, the server switches back to queue 1 until the upper threshold p^* is reached again. Through the waiting period it becomes more likely that a customer is waiting in queue 1. We call this kind of strategy a double-threshold-strategy. It is illustrated in the following figure:
[Figure 7: Double-Threshold-Strategy — on the p-axis from 0 to 1 with p_* < p^*: queue 1 is served (pt increases) until the upper threshold p^* is reached; then queue 2 is served (pt decreases) until the lower threshold p_* is reached.]
For p_* = p^* the double-threshold-strategy reduces to the normal threshold-strategy. Since the double-threshold-strategy evolves like a threshold-strategy, if the lower threshold was reached at time τ we find in this case a closed formula for pt equivalent to that of the previous section, except that the process does not start in 0 (apart from the case where the observation has just changed) but in p_*. Hence pt = φ_{t−τ}(p_*), where

    φt(p) = ( −µ1 + ρ tanh( ½ tρ + ½ ln((ρ + µ1 + 2λ1 p)/(ρ − µ1 − 2λ1 p)) ) ) / (2λ1).
If the estimator pt reaches the threshold p^* at time σ, then queue 2 is served and pt decreases until p_* is reached again or a change in the observation occurs. If queue 1 is not served the estimator evolves as pt = φ_{t−σ}(p^*), where

    φt(p) = 1 / (λ1 t + 1/p).
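Both flows can be inverted in closed form, which yields the deterministic durations of the two phases of a double-threshold cycle; the sketch below assumes 0 < p_* ≤ p^* and that p^* lies below the stationary point (ρ − µ1)/(2λ1) of the served flow, so that it is actually reached.

    import math

    # Durations of one double-threshold cycle on the estimator: serve queue 1
    # from p_low up to p_high (tanh flow), then queue 2 from p_high down to
    # p_low (flow 1/(lam1*t + 1/p)).  Assumes 0 < p_low <= p_high and p_high
    # below the stationary point (rho - mu1)/(2*lam1) of the served flow.
    def cycle_times(p_low, p_high, lam1, mu1):
        rho = math.sqrt(4.0 * mu1 * lam1 + mu1 ** 2)
        shift = lambda p: 0.5 * math.log(
            (rho + mu1 + 2.0 * lam1 * p) / (rho - mu1 - 2.0 * lam1 * p))
        t_up = 2.0 * (shift(p_high) - shift(p_low)) / rho   # serving phase
        t_down = (1.0 / p_low - 1.0 / p_high) / lam1        # waiting phase
        return t_up, t_down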
In the next figure we illustrate the results of our numerical studies. Only strategies with p_* ≤ p^* are considered. On one horizontal axis p^* and on the other p_* is marked. Again we plot the relative cost γ = V^double-threshold / V^complete information on the vertical axis.
[Figure 8: Relative Costs under Double-Threshold-Strategies — γ as a function of (p_*, p^*), each ranging over 0.1, . . . , 0.9; γ varies between roughly 1.015 and 1.07.]
Our impression from the numerical investigation is that double-threshold-strategies perform better than single-threshold-strategies. But in none of our simulations did a double-threshold-strategy beat the certainty equivalence principle threshold strategy.
6 Conclusion
Control models with partial information have been treated in several publications over the last years. In particular, the Hidden-Markov-Model was investigated in various applications of financial portfolio optimization. Nevertheless, coarsening of the observations due to a group representation of the states has not been considered anywhere, although it appears very often in the real world. This thesis closes this gap and introduces the notion of information structures. For the unobservable part of the state process a conditional probability is used as estimator and an explicit filter equation is derived. Additionally, we transform the optimization problem based on the unobservable process under incomplete information into one with complete information. We rigorously prove the equivalence of these two problems, a step often neglected by many authors in their works. Furthermore we investigate the dependence of the optimal value on the information structure.
Besides discussing properties and characteristics of the conditional probabilities and the value function, we propose two methods for the solution of the transformed complete information model. We extend the Hamilton-Jacobi-Bellman equation and the corresponding verification technique by using the Clarke derivative. An advantage of this approach is that we require weaker assumptions than the strong ones of the classical case; these are fulfilled for our value function by its concavity. The second approach makes use of the piecewise-deterministic behaviour of the estimator process. For this purpose we define a time-discrete Markovian-Decision-Process (MDP) whose value function coincides with the value function of the original problem. Additionally, one can construct from an optimal policy of the MDP an optimal control for the original model. Hence we can use all tools of the established MDP-theory.
Combining all the developed results and applying them to a parallel queueing model, we analyze a setup with unknown Bayesian service rates in a mathematically rigorous way. Interesting results arise, for example the separation property and the explicit characterization of the value function. Furthermore we show the existence of an optimal control which serves one queue exclusively almost everywhere. If one service rate is known, then the optimal control is a pure one. The last result also holds true for general bandit problems. Moreover we find sufficient conditions for the optimality of controls. The symmetric case is completely solved with the optimality of a threshold strategy. These results close a gap in the present queueing research; in particular the proofs demonstrate the power of our proposed solution procedures.
Appendix

A Tools for Theorem 3.5
Here we append the lemmas and their proofs needed in the proof of theorem 3.5, where we derive the filter equation for X̂tZt and hence for pt. Define for a process (Ht)

    △Ht := Ht − Ht−

and denote by [H, H̃]t its quadratic covariation with the process (H̃t).
Lemma A.1 It holds:

    [X, Z]t = Σ_{0<s≤t} Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) △N^Z_s(µ, ν) X^i_{s−}.

Proof:

    [X, Z]t = Σ_{0<s≤t} △Xs △Zs
            = Σ_{0<s≤t} Σ_{i=1}^n Σ_{j=1}^n (ej − ei) △N^X_s(i, j) · Σ_{µ=1}^d Σ_{ν=1}^d (gν − gµ) △N^Z_s(µ, ν)
        (∗) = Σ_{0<s≤t} Σ_{i=1}^n Σ_{j=1}^n (ej − ei) Σ_{µ=1}^d Σ_{ν=1}^d δijµν △N^Z_s(µ, ν) X^i_{s−} · Σ_{µ=1}^d Σ_{ν=1}^d (gν − gµ) △N^Z_s(µ, ν)
       (∗∗) = Σ_{0<s≤t} Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) △N^Z_s(µ, ν) X^i_{s−},

where we used in (∗) the construction of N^X_t(i, j) and the assumption that Ñ^X_t and N^Z_t do not jump at the same time. In (∗∗) we used △N^Z_s(µ, ν) ∈ {0, 1} for all µ and ν, which results in

    △N^Z_s(µ, ν) · △N^Z_s(µ′, ν′) = { 0                 if µ ≠ µ′ or ν ≠ ν′,
                                     { △N^Z_s(µ, ν)     else.

If δijµν ≡ 0 for all i, j, µ, ν (this means there are no common jumps of Xt and Zt), we obtain [X, Z]t ≡ 0.
Remark A.2 An analogous representation for the quadratic covariation is

    [X, Z]t = ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) X^i_{s−} dN^Z_s(µ, ν).
Lemma A.3 The n × d-matrix XtZt, where one entry is one and all others are zero, has the following representation:

    XtZt = X0Z0 + ∫_0^t Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) X^i_s Z^µ_s ds
         + ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) X^i_{s−} dN^Z_s(µ, ν)
         + ∫_0^t X_{s−} dM^Z_s + ∫_0^t Z_{s−} dM^X_s.

Proof: The proof is an easy application of the Itô formula (convention: by ∫ Z dX we mean ∫ (dX) Z):

    XtZt = X0Z0 + ∫_0^t X_{s−} dZs + ∫_0^t Z_{s−} dXs + [X, Z]t
         = X0Z0 + ∫_0^t X_{s−} (Q^Z Zs ds + dM^Z_s) + ∫_0^t Z_{s−} (Q^X(Zs) Xs ds + dM^X_s) + [X, Z]t
         = X0Z0 + ∫_0^t Xs Q^Z Zs ds + ∫_0^t ( (Q̃^X_1, . . . , Q̃^X_d) + (Q̃^Z_1, . . . , Q̃^Z_d) ) Zs Xs Zs ds
           + [X, Z]t + ∫_0^t X_{s−} dM^Z_s + ∫_0^t Z_{s−} dM^X_s
     (∗) = X0Z0 + ∫_0^t ( Xs Q^Z Zs + Σ_{µ=1}^d (Q̃^X_µ + Q̃^Z_µ) Xs Z^µ_s gµ ) ds
           + [X, Z]t + ∫_0^t X_{s−} dM^Z_s + ∫_0^t Z_{s−} dM^X_s
         = X0Z0 + ∫_0^t Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) X^i_s Z^µ_s ds
           + [X, Z]t + ∫_0^t X_{s−} dM^Z_s + ∫_0^t Z_{s−} dM^X_s.

In (∗) we used

    (a1, . . . , ad) Z X Z = Σ_{µ=1}^d aµ Z^µ Σ_{i=1}^n ei X^i Σ_{r=1}^d gr Z^r
                           = Σ_{i=1}^n Σ_{µ=1}^d Σ_{r=1}^d X^i aµ Z^µ Z^r ei gr
                           = Σ_{i=1}^n Σ_{µ=1}^d X^i aµ Z^µ ei gµ = Σ_{µ=1}^d aµ Z^µ X gµ,

where the second-to-last equality holds since exactly one entry of Z ∈ SZ is one and the others are zero, in particular Σ_{r=1}^d Σ_{ν=1}^d Z^r Z^ν = Σ_{r=1}^d Z^r. Lemma A.1 completes the proof.
Since we estimate the process XtZt by means of N^Y_t(k, l), which is equivalent to the estimation procedure based on Yt, we need a formula for the quadratic covariation between the unobservable process (XZ) and the observation N^Y_t(k, l). The next lemma contains this formula.
Lemma A.4 It holds for all k, l ∈ {1, . . . , m}:

    [(XZ), N^Y(k, l)]t = ∫_0^t { Σ_{i∈I(k)} Σ_{j∈I(l)} (ej − ei) X^i_{s−} Zs Y^k_{s−} dÑ^X_s(i, j)
        + Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej gν − ei gµ) X^i_{s−} Z^µ_{s−} Y^k_{s−} dN^Z_s(µ, ν) }.
Proof: By the construction of the processes Zt, Xt and Yt, N^Y_t(k, l) and N^X_t(i, j) can only jump at the same time if i ∈ I(k) and j ∈ I(l). N^X_t(i, j) jumps if Ñ^X_t(i, j) jumps (and then Zt does not jump) or if the jump is triggered by a jump of Zt. Consequently we get:

    [(XZ), N^Y(k, l)]t = Σ_{0<s≤t} △(XsZs) △N^Y_s(k, l)
        = Σ_{0<s≤t} { Σ_{i=1}^n Σ_{j=1}^n (ej − ei) X^i_{s−} △Ñ^X_s(i, j) Zs △N^Y_s(k, l)
          + Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej gν − ei gµ) X^i_{s−} Z^µ_{s−} △N^Z_s(µ, ν) △N^Y_s(k, l) }
        = ∫_0^t { Σ_{i∈I(k)} Σ_{j∈I(l)} (ej − ei) X^i_{s−} Zs Y^k_{s−} dÑ^X_s(i, j)
          + Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej gν − ei gµ) X^i_{s−} Z^µ_{s−} Y^k_{s−} dN^Z_s(µ, ν) },

where the last equality follows as (∗∗) in the proof of lemma A.1.
We are now interested in the expectation of XtZtN^Y_t(k, l), which we need for the derivation of the filter equation. Note that in the proof of the next lemma we even derive an explicit representation for XtZtN^Y_t(k, l).
Lemma A.5 It holds:

    E[XtZtN^Y_t(k, l)]
    = E[ ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} ej Σ_{µ=1}^d ( gµ q̃^X_{ij,µ} + Σ_{ν=1}^d δijµν gν q^Z_{µν} ) X^i_{s−} Z^µ_{s−} Y^k_{s−} ds
      + ∫_0^t N^Y_{s−}(k, l) { Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) X^i_s Z^µ_s
        + Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) q^Z_{µν} X^i_{s−} Z^µ_{s−} } ds ].
Proof: First we apply Itô's formula (note that N^Y_0(k, l) = 0) and use the results from the lemmas above, hence

    XtZtN^Y_t(k, l) = ∫_0^t X_{s−}Z_{s−} dN^Y_s(k, l) + ∫_0^t N^Y_{s−}(k, l) d(XsZs) + [XZ, N^Y(k, l)]t
    = ∫_0^t X_{s−}Z_{s−} dN^Y_s(k, l)
      + ∫_0^t N^Y_{s−}(k, l) Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) X^i_s Z^µ_s ds
      + ∫_0^t N^Y_{s−}(k, l) Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) X^i_{s−} dN^Z_s(µ, ν)
      + ∫_0^t N^Y_{s−}(k, l) X_{s−} dM^Z_s + ∫_0^t N^Y_{s−}(k, l) Z_{s−} dM^X_s
      + ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} (ej − ei) X^i_{s−} Zs Y^k_{s−} dÑ^X_s(i, j)
      + ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej gν − ei gµ) X^i_{s−} Z^µ_{s−} Y^k_{s−} dN^Z_s(µ, ν).

Taking expectation on both sides, noticing that the expectation over the martingale integrals is zero and using the property of the intensity of N^Y_t(k, l),

    E[ dN^Y_s(k, l) ] = E[ q^Y_{kl}(Zs, Xs) Y^k_{s−} ds ] = E[ Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d q^X_{ij,µ} X^i_{s−} Z^µ_{s−} Y^k_{s−} ds ],

we get:

    E[XtZtN^Y_t(k, l)]
    = E[ ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d q^X_{ij,µ} X_{s−}Z_{s−} X^i_{s−} Z^µ_{s−} Y^k_{s−} ds
      + ∫_0^t N^Y_{s−}(k, l) { Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) X^i_s Z^µ_s
        + Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) q^Z_{µν} X^i_{s−} Z^µ_{s−} } ds
      + ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d (ej − ei) X^i_{s−} q̃^X_{ij,µ} Z^µ_{s−} Zs Y^k_{s−} ds   [Zs = gµ]
      + ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej gν − ei gµ) X^i_{s−} Z^µ_{s−} Y^k_{s−} q^Z_{µν} Z^µ_{s−} ds ]
(∗) = E[ ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d ei gµ q^X_{ij,µ} X^i_{s−} Z^µ_{s−} Y^k_{s−} ds
      + ∫_0^t N^Y_{s−}(k, l) { . . . } ds
      + ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d ej gµ q̃^X_{ij,µ} X^i_{s−} Z^µ_{s−} Y^k_{s−} ds
      + ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d Σ_{ν=1}^d δijµν ej gν q^Z_{µν} X^i_{s−} Z^µ_{s−} Y^k_{s−} ds
      − ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} Σ_{µ=1}^d ei gµ ( q̃^X_{ij,µ} + Σ_{ν=1}^d δijµν q^Z_{µν} ) X^i_{s−} Z^µ_{s−} Y^k_{s−} ds ]
                                          [ q̃^X_{ij,µ} + Σ_{ν=1}^d δijµν q^Z_{µν} = q^X_{ij,µ} ]
    = E[ ∫_0^t Σ_{i∈I(k)} Σ_{j∈I(l)} ej Σ_{µ=1}^d ( gµ q̃^X_{ij,µ} + Σ_{ν=1}^d δijµν gν q^Z_{µν} ) X^i_{s−} Z^µ_{s−} Y^k_{s−} ds
      + ∫_0^t N^Y_{s−}(k, l) { Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) X^i_s Z^µ_s
        + Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) q^Z_{µν} X^i_{s−} Z^µ_{s−} } ds ].

In (∗) we used Z^µ_{s−} Z^µ_{s−} = Z^µ_{s−} ∈ {0, 1} by construction of SZ and X_{s−}Z_{s−} X^i_{s−} Z^µ_{s−} = ei gµ X^i_{s−} Z^µ_{s−}.
Next we calculate, as in lemma A.3, a representation for X̂tZt := E[XtZt | F^Y_t]. By (X̂tZt)_{i·} we mean the i-th row of X̂tZt and by (X̂tZt)_{·µ} the µ-th column.
Lemma A.6 It holds:

    X̂tZt = X̂0Z0 + ∫_0^t Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) (X̂sZs)_{iµ} ds
          + ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) (X̂sZs)_{iµ} q^Z_{µν} ds + M̂t,

where M̂t is an F^Y_t-martingale with expectation zero. It has the representation

    M̂t = ∫_0^t Σ_{k=1}^m Σ_{l=1}^m φ(k,l)(s) ( dN^Y_s(k, l) − q^Y_{kl}(X̂sZs) Y^k_s ds ),

where φ(k,l)(t) := φ(k,l)(X̂_{t−}Z_{t−}) is F^Y_t-predictable.

Proof: Using the definition of X̂tZt and lemma A.3 we get with Itô's formula:

    X̂tZt = E[ X0Z0 + ∫_0^t Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) (XsZs)_{iµ} ds
            + ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) X^i_{s−} dN^Z_s(µ, ν)
            + ∫_0^t X_{s−} dM^Z_s + ∫_0^t Z_{s−} dM^X_s | F^Y_t ]
          = E[ X0Z0 + ∫_0^t Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) (XsZs)_{iµ} ds
            + ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) X^i_{s−} ( dN^Z_s(µ, ν) − q^Z_{µν} Z^µ_{s−} ds )
            + ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) X^i_{s−} q^Z_{µν} Z^µ_{s−} ds
            + ∫_0^t X_{s−} dM^Z_s + ∫_0^t Z_{s−} dM^X_s | F^Y_t ]
          = X̂0Z0 + ∫_0^t Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) (X̂sZs)_{iµ} ds
            + ∫_0^t Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) (X̂sZs)_{iµ} q^Z_{µν} ds + M̂t,

where the last equality holds due to Wong and Hajek (1985) and since the three martingales are summarized in M̂t. The representation of this martingale is standard, see for example Brémaud (1981).
Now we compute the expectation of X̂tZtN^Y_t(k, l) and the representation of this expression as in lemma A.5.

Lemma A.7 It holds:

    E[X̂tZtN^Y_t(k, l)]
    = E[ ∫_0^t ( X̂_{s−}Z_{s−} + φ(k,l)(s) ) q^Y_{kl}(X̂_{s−}Z_{s−}) Y^k_{s−} ds
      + ∫_0^t N^Y_{s−}(k, l) { Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) (X̂sZs)_{iµ}
        + Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) (X̂sZs)_{iµ} q^Z_{µν} } ds ].

Proof: Using Itô's formula (with N^Y_0(k, l) = 0) and the fact that X̂tZt and N^Y_t(k, l) only jump at the same time, we get:

    X̂tZtN^Y_t(k, l) = ∫_0^t X̂_{s−}Z_{s−} dN^Y_s(k, l) + ∫_0^t N^Y_{s−}(k, l) d(X̂sZs) + [X̂Z, N^Y(k, l)]t
    = ∫_0^t N^Y_{s−}(k, l) { Σ_{i=1}^n Σ_{µ=1}^d ( ei Q^Z gµ + (Q̃^X_µ + Q̃^Z_µ) ei gµ ) (X̂sZs)_{iµ}
        + Σ_{i=1}^n Σ_{j=1}^n Σ_{µ=1}^d Σ_{ν=1}^d δijµν (ej − ei)(gν − gµ) (X̂sZs)_{iµ} q^Z_{µν} } ds
      + ∫_0^t N^Y_{s−}(k, l) dM̂s + ∫_0^t X̂_{s−}Z_{s−} dN^Y_s(k, l) + Σ_{0<s≤t} φ(k,l)(s) △N^Y_s(k, l).

Taking expectation on both sides we get, with the definition of the F^Y_t-intensity of N^Y_t(k, l):

    E[X̂tZtN^Y_t(k, l)] = E[ ∫_0^t ( X̂_{s−}Z_{s−} + φ(k,l)(s) ) q^Y_{kl}(X̂_{s−}Z_{s−}) Y^k_{s−} ds + ∫_0^t N^Y_{s−}(k, l) { . . . } ds ].
0
112
B
Proof of Theorem 5.25
Proof of Theorem 5.25
Here we add the detailed proof of theorem 5.25. Define
2 i1 > 0, i2 > 0
f (i1 , i2 , p) := 1 i2 = 0
2 i1 = 0
and
1
2
g(i1 , i2 , p) :=
1
2
i1
i1
i2
i1
> 0, i2 > 0, t ∈ B
> 0, i2 > 0, t ∈
/B
=0
=0
where B ⊂ [0, ∞). Remember that the existence of an optimal pure control is justified by theorem 5.22. Consider then, with ft := f(X^1_t, X^2_t, φ^f_t(p)) and gt := g(X^1_t, X^2_t, φ^g_t(p)), the strategies

    π = ((ft)_{t≥0}, (gt)_{t≥0}, (ft)_{t≥0}, . . . , (ft)_{t≥0}) ∈ F^n,
    π̃ = ((gt)_{t≥0}, (ft)_{t≥0}, (ft)_{t≥0}, . . . , (ft)_{t≥0}) ∈ F^n.

Without loss of generality assume B = [0, ε]. Denote by V_{n−1,gf} the expected discounted reward over (n − 1) periods under the strategy ((gt), (ft), . . . , (ft)) ∈ F^{n−1}, with the same interpretation for V_{n−2,f} under ((ft), (ft), . . . , (ft)) ∈ F^{n−2}. Remember r(i1, i2) = (c1 i1 + c2 i2)/(α + β) and compute
    V_{n,π}(i1, i2, p) = Tf V_{n−1,gf}(i1, i2, p)
    = r(i1, i2) + ∫_0^∞ e^{−(α+β)t} { V_{n−1,gf}(i1, i2, φ^f_t(p))(α − µ2) + V_{n−1,gf}(i1, i2 − 1, φ^f_t(p)) µ2 } dt   [φ^f_t(p) = p]
    = r(i1, i2) + ∫_0^∞ e^{−(α+β)t} { r(i1, i2)(α − µ2) + r(i1, i2 − 1) µ2
      + [ ∫_0^ε e^{−(α+β)s} ( V_{n−2,f}(i1, i2, φ^g_s(p))(α − µ1(φ^g_s(p)))
            + V_{n−2,f}(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) µ1(φ^g_s(p)) ) ds
        + ∫_ε^∞ e^{−(α+β)s} ( V_{n−2,f}(i1, i2, φ^g_ε(p))(α − µ2) + V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) µ2 ) ds ] (α − µ2)
      + [ ∫_0^ε e^{−(α+β)s} ( V_{n−2,f}(i1, i2 − 1, φ^g_s(p))(α − µ1(φ^g_s(p)))
            + V_{n−2,f}(i1 − 1, i2 − 1, φ^g_s(p) + Φ(φ^g_s(p))) µ1(φ^g_s(p)) ) ds
        + ∫_ε^∞ e^{−(α+β)s} ( V_{n−2,f}(i1, i2 − 1, φ^g_ε(p))(α − µ2) + V_{n−2,f}(i1, i2 − 2, φ^g_ε(p)) µ2 ) ds ] µ2 } dt
    = r(i1, i2) + ∫_0^ε e^{−(α+β)t} ( r(i1, i2) α − c2µ2/(α+β) ) dt + ∫_ε^∞ e^{−(α+β)t} ( r(i1, i2) α − c2µ2/(α+β) ) dt
      + ∫_0^∞ e^{−(α+β)t} { ∫_0^ε e^{−(α+β)s} [ α ( V_{n−2,f}(i1, i2, φ^g_s(p))(α − µ2) + V_{n−2,f}(i1, i2 − 1, φ^g_s(p)) µ2 )
        − µ1(φ^g_s(p)) ( V_{n−2,f}(i1, i2, φ^g_s(p)) − V_{n−2,f}(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) ) (α − µ2)
        − µ1(φ^g_s(p)) ( V_{n−2,f}(i1, i2 − 1, φ^g_s(p)) − V_{n−2,f}(i1 − 1, i2 − 1, φ^g_s(p) + Φ(φ^g_s(p))) ) µ2 ] ds } dt
      + ∫_0^∞ e^{−(α+β)t} { ∫_ε^∞ e^{−(α+β)s} [ α ( V_{n−2,f}(i1, i2, φ^g_ε(p))(α − µ2) + V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) µ2 )
        − µ2 ( V_{n−2,f}(i1, i2, φ^g_ε(p)) − V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) ) (α − µ2)
        − µ2 ( V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) − V_{n−2,f}(i1, i2 − 2, φ^g_ε(p)) ) µ2 ] ds } dt.
In a completely analogous way we get

    V_{n,π̃}(i1, i2, p) = Tg V_{n−1,ff}(i1, i2, p)
    = r(i1, i2)
      + ∫_0^ε e^{−(α+β)t} { V_{n−1,ff}(i1, i2, φ^g_t(p))(α − µ1(φ^g_t(p)))
          + V_{n−1,ff}(i1 − 1, i2, φ^g_t(p) + Φ(φ^g_t(p))) µ1(φ^g_t(p)) } dt
      + ∫_ε^∞ e^{−(α+β)t} { V_{n−1,ff}(i1, i2, φ^g_ε(p))(α − µ2) + V_{n−1,ff}(i1, i2 − 1, φ^g_ε(p)) µ2 } dt
    = r(i1, i2) + ∫_0^ε e^{−(α+β)t} ( r(i1, i2)(α − µ1(φ^g_t(p))) + r(i1 − 1, i2) µ1(φ^g_t(p)) ) dt
      + ∫_ε^∞ e^{−(α+β)t} ( r(i1, i2)(α − µ2) + r(i1, i2 − 1) µ2 ) dt
      + ∫_0^ε e^{−(α+β)t} { ∫_0^∞ e^{−(α+β)s} (α − µ1(φ^g_t(p))) ·
          · [ V_{n−2,f}(i1, i2, φ^g_t(p))(α − µ2) + V_{n−2,f}(i1, i2 − 1, φ^g_t(p)) µ2 ] ds
        + ∫_0^∞ e^{−(α+β)s} µ1(φ^g_t(p)) [ V_{n−2,f}(i1 − 1, i2, φ^g_t(p) + Φ(φ^g_t(p)))(α − µ2)
          + V_{n−2,f}(i1 − 1, i2 − 1, φ^g_t(p) + Φ(φ^g_t(p))) µ2 ] ds } dt
      + ∫_ε^∞ e^{−(α+β)t} { ∫_0^∞ e^{−(α+β)s} (α − µ2) [ V_{n−2,f}(i1, i2, φ^g_ε(p))(α − µ2)
          + V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) µ2 ] ds
        + ∫_0^∞ e^{−(α+β)s} µ2 [ V_{n−2,f}(i1, i2 − 1, φ^g_ε(p))(α − µ2) + V_{n−2,f}(i1, i2 − 2, φ^g_ε(p)) µ2 ] ds } dt
    = r(i1, i2)
      + ∫_0^ε e^{−(α+β)t} ( r(i1, i2) α − c1µ1(φ^g_t(p))/(α+β) ) dt + ∫_ε^∞ e^{−(α+β)t} ( r(i1, i2) α − c2µ2/(α+β) ) dt
      + ∫_0^ε e^{−(α+β)t} { ∫_0^∞ e^{−(α+β)s} [ α ( V_{n−2,f}(i1, i2, φ^g_t(p))(α − µ2) + V_{n−2,f}(i1, i2 − 1, φ^g_t(p)) µ2 )
        − µ1(φ^g_t(p)) ( V_{n−2,f}(i1, i2, φ^g_t(p)) − V_{n−2,f}(i1 − 1, i2, φ^g_t(p) + Φ(φ^g_t(p))) ) (α − µ2)
        − µ1(φ^g_t(p)) ( V_{n−2,f}(i1, i2 − 1, φ^g_t(p)) − V_{n−2,f}(i1 − 1, i2 − 1, φ^g_t(p) + Φ(φ^g_t(p))) ) µ2 ] ds } dt
      + ∫_ε^∞ e^{−(α+β)t} { ∫_0^∞ e^{−(α+β)s} [ α ( V_{n−2,f}(i1, i2, φ^g_ε(p))(α − µ2) + V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) µ2 )
        − µ2 ( V_{n−2,f}(i1, i2, φ^g_ε(p)) − V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) ) (α − µ2)
        − µ2 ( V_{n−2,f}(i1, i2 − 1, φ^g_ε(p)) − V_{n−2,f}(i1, i2 − 2, φ^g_ε(p)) ) µ2 ] ds } dt.
Consider then, as in (5.1), the difference of V_{n,π̃}(i1, i2, p) and V_{n,π}(i1, i2, p):

    V_{n,π̃}(i1, i2, p) − V_{n,π}(i1, i2, p) = ∫_0^ε e^{−(α+β)t} ( c2µ2 − c1µ1(φ^g_t(p)) )/(α+β) dt ≥ 0.   (B.1)

This inequality is true since c1µ1(φ^g_t(p)) ≤ c1µ1(M^{i1−1} p) < c2µ2. For this assertion we have to use φ^g_t(p) ≥ p ≥ Mp (see lemma 5.10) and µ1^A < µ1^B by (5.10).
We now have to prove that ((ft)_{t≥0}, . . . , (ft)_{t≥0}) ∈ F^n is optimal for all n; then we know that lim_{n→∞} ((ft)_{t≥0})^n = ((ft)_{t≥0})^∞ is optimal for the infinite horizon MDP. Starting with n = 0 and V0(i1, i2, p) = 0, each decision rule f is a minimizer of V0(i1, i2, p). Assume now that (ft)_{t≥0} is a minimizer of V0(i1, i2, p), . . . , V_{n−1}(i1, i2, p) for p > p^>(i1); thus ((ft)_{t≥0}, . . . , (ft)_{t≥0}) ∈ F^n is optimal for the n-period model in (i1, i2, p). We have to show that (ft)_{t≥0} is a minimizer of Vn(i1, i2, p) for p > p^>(i1) again.
    Tg Vn(i1, i2, p)
    = ∫_0^∞ e^{−(α+β)t} { (c1 i1 + c2 i2)
      + ∫_0^ε e^{−(α+β)s} ( Vn(i1, i2, φ^g_s(p))(α − µ1(φ^g_s(p)))
          + Vn(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) µ1(φ^g_s(p)) ) ds
      + ∫_ε^∞ e^{−(α+β)s} ( Vn(i1, i2, φ^g_ε(p))(α − µ2) + Vn(i1, i2 − 1, φ^g_ε(p)) µ2 ) ds } dt
(∗) = ∫_0^∞ e^{−(α+β)t} { (c1 i1 + c2 i2)
      + ∫_0^ε e^{−(α+β)s} ( V_{n,(f,...,f)}(i1, i2, φ^g_s(p))(α − µ1(φ^g_s(p)))
          + V_{n,(f,...,f)}(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) µ1(φ^g_s(p)) ) ds
      + ∫_ε^∞ e^{−(α+β)s} ( V_{n,(f,...,f)}(i1, i2, φ^g_ε(p))(α − µ2) + V_{n,(f,...,f)}(i1, i2 − 1, φ^g_ε(p)) µ2 ) ds } dt
    = Tg V_{n,(f,...,f)}(i1, i2, p) = V_{n+1,(g,f,...,f)}(i1, i2, p) ≥ V_{n+1,(f,g,f,...,f)}(i1, i2, p)   [by (B.1)]
    = ∫_0^∞ e^{−(α+β)t} { (c1 i1 + c2 i2)
      + ∫_0^∞ e^{−(α+β)s} ( V_{n,(g,f,...,f)}(i1, i2, φ^f_s(p))(α − µ2)
          + V_{n,(g,f,...,f)}(i1, i2 − 1, φ^f_s(p)) µ2 ) ds } dt   [φ^f_s(p) = p]
    = Tf V_{n,(g,f,...,f)}(i1, i2, p) ≥ Tf Vn(i1, i2, p).

Thus (ft) is a minimizer of Vn(i1, i2, p). In (∗) we used

    Vn(i1, i2, φ^g_s(p)) = V_{n,(f,...,f)}(i1, i2, φ^g_s(p)),

which is true by the induction hypothesis, since φ^g_s(p) ≥ p, and

    Vn(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) = V_{n,(f,...,f)}(i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))),
    Vn(i1, i2 − 1, φ^g_s(p)) = V_{n,(f,...,f)}(i1, i2 − 1, φ^g_s(p))

by the recursion in the state space. The preconditions of this theorem are fulfilled for n ∈ N in the states (i1 − 1, i2, φ^g_s(p) + Φ(φ^g_s(p))) and (i1, i2 − 1, φ^g_s(p)) as well, by lemma 5.28.
Bibliography
Altman, E., Jimenez, T., Nunez-Queija, R., and Yechiali, U. (2003). Optimal Routing
among ./M/1 Queues with Partial Information. INRIA Research Report No. 4985.
Altman, E., Marquez, R., and Yechiali, U. (2004). Admission and Routing Control with
Partial Information and Limited Buffers. Working Paper.
Asmussen, S. (2003). Applied Probabilities and Queues. Springer-Verlag.
Bank, P. and ElKaroui, N. (2004). A Stochastic Representation Theorem with Applications
to Optimization and Obstacle Problems. The Annals of Probability, 32 (1B):1030–1067.
Bank, P. and Küchler, C. (2007). On Gittins’ Index Theorem in Continuous Time. Stochastic Processes and their Applications, 117 (9):1357–1371.
Bellman, R. (1977). Dynamic Programming. Princeton University Press.
Bensoussan, A., Cakanyildirim, M., and Sethi, S. (2003). Partially Observed Inventory
Systems. Proceedings of the 44th IEEE Conference on Decision and Control, pages
1023–1028.
Bertsekas, D. and Shreve, S. (1978). Stochastic Optimal Control: The Discrete Time Case.
Academic Press.
Bhulai, S. (2002). Markov Decision Processes - The Control of High-Dimensional Systems.
PhD-thesis, Vrije Universiteit Amsterdam.
Borisov, A. (2007). The Wonham Filter under Uncertainty: a Game-Theoretic Approach.
submitted to Proceedings of NET-COOP 2007.
Borisov, A. and Stefanovich, A. (2005). Optimal Filtering for HMM Governed by Special
Jump Processes. Proceedings of the 44th Conference on Decision and Control, pages
5935–5940.
Braun, M. (1993). Differential Equations and Their Applications. Springer-Verlag.
Brémaud, P. (1981). Point Processes and Queues: Martingale Dynamics. Springer-Verlag.
Ceci, C. and Gerardi, A. (2000). Filtering of a Markov Jump Process with Counting
Observation. Applied Mathematics and Optimization, 42:1–18.
Clarke, F. (1983). Optimization and Nonsmooth Analysis. Wiley.
Davis, M. (1993). Markov Models and Optimization. Chapman & Hall.
Dempster, M. (1989). Optimal Control of Piecewise Deterministic Markov Processes. Applied Stochastic Analysis, 5:303–325.
118
Bibliography
Dempster, M. and Ye, J. (1995). Impulse Control of Piecewise Deterministic Markov
Processes. The Annals of Applied Probability, 5 (2):399–423.
Donchev, D. (1998). On the Two-Armed Bandit Problem with Non-Observed Poissonian
Switching of Arms. Mathematical Methods of Operations Research, 47:401–422.
Donchev, D. (1999). Exact Solution of the Bellman Equation for a β-discounted Reward
in a Two-Armed Bandit with Switching Arms. Journal of Applied Mathematics and
Stochastic Analysis, 12 (2):151–160.
Donchev, D. and Yushkevich, A. (1996). Average Optimality in a Poisson Bandit with
Switching Arms. Mathematical Methods of Operations Research, 45:265–280.
ElKaroui, N. and Karatzas, I. (1997). Synchronization and Optimization for Multi-Armed
Bandit Problems in Continuous Time. Computational and Applied Mathematics, 16:117–
152.
Elliott, R., Aggoun, R., and Moore, J. (1997). Hidden Markov Models: Estimation and
Control. Springer-Verlag.
Fleming, W. and Rishel, R. (1975). Deterministic and Stochastic Optimal Control. Springer-Verlag.
Fleming, W. and Soner, H. (1993). Controlled Markov Processes and Viscosity Solutions.
Springer-Verlag.
Forwick, L. (1997). Optimale Kontrolle Stückweise Deterministischer Prozesse. PhD-Thesis, Universität Bonn.
Framstad, N., Øksendal, B., and Sulem, A. (2004). Sufficient Stochastic Maximum Principle for the Optimal Control of Jump Diffusions and Applications to Finance. Journal
of Optimization Theory and Applications, 121:77–98.
Gihman, I. and Skorohod, A. (1979). Controlled Stochastic Processes. Springer-Verlag.
Hanson, F. (2007). Applied Stochastic Processes and Control for Jump Diffusions: Modeling, Analysis, and Computation. Society for Industrial Mathematics.
Haussmann, U. (1986). A Stochastic Maximum Principle for Optimal Control of Diffusions.
Longman Scientific & Technical.
Hernandez-Lerma, O. (1989). Adaptive Markov Control Processes. Springer-Verlag.
Hernandez-Lerma, O. and Lasserre, J. (1996). Discrete-Time Markov Control Processes.
Springer-Verlag.
Bibliography
119
Honhon, D. and Seshadri, S. (2007). Admission Control with Incomplete Information to
a Finite Buffer Queue. Probability in the Engineering and Informational Sciences, 21
(1):19–46.
Hordijk, A. and Koole, G. (1992). On the Shortest Queue Policy for the Tandem Parallel
Queue. Probability in the Engineering and Informational Sciences, 6:63–79.
Karatzas, I. and Shreve, S. (2001). Methods of Mathematical Finance. Springer.
Kaspi, H. and Mandelbaum, A. (1995). Lévy Bandits: Multi-Armed Bandits Driven by
Lévy Processes. The Annals of Applied Probability, 5 (2):541–565.
Kaspi, H. and Mandelbaum, A. (1998). Multi-Armed Bandits in Discrete and Continuous
Time. Stochastic Processes and their Applications, 8 (4):1270–1290.
Kitaev, M. and Rykov, V. (1995). Controlled Queueing Systems. CRC-Press.
Koole, G. (1998). A Transformation Method for Stochastic Control Problems with Partial
Information. Systems and Control Letters, 35:301–308.
Kushner, H. and Dupuis, P. (2001). Numerical Methods for Stochastic Control Problems
in Continuous Time. Springer-Verlag.
Last, G. and Brandt, A. (1995). Marked Point Processes on the Real Line. Springer-Verlag.
Lin, K. and Ross, S. (2003). Admission Control with Incomplete Information of a Queueing
System. INFORMS, Operations Research, 51:645–654.
Liptser, R. and Shiryayev, A. (2004a). Statistics of Random Processes I: General Theory.
Springer-Verlag.
Liptser, R. and Shiryayev, A. (2004b). Statistics of Random Processes II: Applications.
Springer-Verlag.
Massey, W. and Whitt, W. (1998). Uniform Acceleration Expansions for Markov Chains
with Time-Varying Rates. The Annals of Applied Probability, 8 (4):1130–1155.
Miller, B., Avrachenkov, K., Stepanyan, K., and Miller, G. (2005). Flow Control as a
Stochastic Optimal Control Problem with Incomplete Information. Problems of Information Transmission, 41 (2):150–170.
Øksendal, B. and Sulem, A. (2005). Applied Stochastic Control of Jump Diffusions. Springer-Verlag.
Presman, E. and Sonin, I. (1990). Sequential Control with Incomplete Information. Academic Press.
120
Bibliography
Presman, K. (1990). Poisson Version of the Two-Armed Bandit Problem with Discounting.
Theory of Probability and Its Applications, 35 (2):307–317.
Raman, A., DeHoratius, N., and Ton, Z. (2001). Execution: The Missing Link in Retail
Operations. California Management Review, 43 (2):136–152.
Rishel, R. (1978). The Minimum Principle, Separation Principle, and Dynamic Programming for Partially Observed Jump Markov Processes. IEEE Transactions on Automatic
Control, 23:1009–1014.
Rockafellar, R. (1996). Convex Analysis. Princeton University Press.
Rogers, L. and Williams, D. (2003). Diffusions, Markov Processes and Martingales. Cambridge University Press.
Winter, J. (2007). Finite horizon control problems under partial information. Proceedings
of NET-COOP 2007, pages 120–128.
Wong, E. and Hajek, B. (1985). Stochastic Processes in Engineering Systems. Springer-Verlag.
Yong, J. and Zhou, X. (1999). Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer-Verlag.
List of Tables and List of Figures
List of Tables

    1  Simulation for the Symmetric Case . . . . . . . . . . . . . . . . . . .  85
    2  Simulation for One Service Rate known . . . . . . . . . . . . . . . . .  92

List of Figures

    1  Parallel Queueing Model . . . . . . . . . . . . . . . . . . . . . . . .   1
    2  Simulation of Conditional Probabilities . . . . . . . . . . . . . . . .  30
    3  Parallel Queueing Model . . . . . . . . . . . . . . . . . . . . . . . .  61
    4  Optimal Control in a Waiting-Cost Model without Arrivals for fixed i1 .  92
    5  Optimal Control in a Waiting-Cost Model without Arrivals . . . . . . .   92
    6  Relative Costs under Threshold-Strategies . . . . . . . . . . . . . . . 101
    7  Double-Threshold-Strategy . . . . . . . . . . . . . . . . . . . . . . . 102
    8  Relative Costs under Double-Threshold-Strategies . . . . . . . . . . . 103
German Summary (Zusammenfassung)
Informationsbeschaffung und -verarbeitung werden zu immer wichtigeren Bestandteilen
eines Entscheidungsprozesses. Durch die wachsende Komplexität, aber auch durch die
wachsenden Möglichkeiten Informationen zu beschaffen, steigen entsprechend Beschaffungs- und Verarbeitungskosten (Zeitaufwand, Analyse, Vergleiche, ...) hierfür. Dadurch
stehen Entscheidungsträger vor der Frage, welche Informationen für eine Entscheidung relevant sind bzw. wie hoch die erwartete Kostensteigerung beim Fehlen dieser Information
ist. Folglich gilt es jeweils abzuwägen, ob die zusätzliche Information überhaupt einen
(monetären) Vorteil und Nutzen bringt. Ein klassisches Beispiel aus dem alltäglichen
Leben ist der Supermarkt. Um Zeit zu sparen wird an den Kassen nicht mehr jeder Joghurt
separat gescannt, sondern es wird nur noch die Gesamtzahl der Joghurts gezählt und dann
ein zufälliger Joghurt auf dem Band als Referenz über den Kassenscanner gezogen. Für
den Preis des Kunden ist dies nicht entscheidend, aber das Lagerhaltungssystem, das mit
dem Abrechnungssystem der Kassen gekoppelt ist, kann nun nicht mehr feststellen, wie
viele Erdbeer- oder Pfirsichjoghurts im Laden vorhanden sind. Dadurch müssen für die
Nachbestellung entweder alle Joghurts im Laden getrennt nach Sorten gezählt oder die
Bestellentscheidung mit einem Informationsdefizit getroffen werden.
Dieses einfache, alltägliche Beispiel illustriert bereits die Problematik und lässt sich auf
viele weitere reale Probleme übertragen. Neben dem Lagerhaltungsmodell seien hier nur
Fragestellungen aus dem Bereich des Internets und der Telekommunikation, der Steuerung
von Call-Centern oder dem Verhalten auf Finanzmärkten genannt. Die unvollständige Information kann dabei nach zwei verschiedenen Einflussarten klassifiziert werden. Die erste
ist, dass ein Umweltprozess, der den eigentlichen Zustandsprozess beeinflusst (bspw. hängt
das stochastische Verhalten des Zustandsprozesses von der Umwelt ab), nicht beobachtbar
ist. Als Beispiel mag hier die weltweite Wirtschaftslage als Einfluss auf das Kreditrating
eines einzelnen Unternehmens dienen. Die zweite Variante ist, dass der Zustandsprozess an
sich nicht beobachtbar ist. In der Literatur wird bisher meist der Fall betrachtet, dass ein
mit dem Zustandsprozess korrelierter (eindimensionaler) Prozess beobachtbar ist. In der
Praxis dagegen können jedoch häufig nur Gruppen von Zuständen unterschieden werden.
Dieser Fall ist in der Literatur bislang kaum untersucht worden.
The present thesis closes this gap for Markovian jump processes. In Chapter 2 we first construct the three-component state process with discrete state space. The first component, the environment process, is a Markovian jump process that influences the actual state process in two ways. First, the environment process determines the generator, and hence the stochastic behaviour, of the state Markov process (hidden Markov model). Second, changes in the state of the environment process are allowed to cause immediate jumps in the actual state process. This is motivated by the fact that, for example, a deterioration of the economic situation (environment process) also worsens the creditworthiness of an individual company (state process). The third component of the process is defined by the information structure, for which we assume that only groups of states of the state process are observable (cf. Definition 2.1). We illustrate this construction by means of various special cases, including the hidden Markov model and a 0-1-observation. We then introduce our optimization problem (P), in which the expected discounted costs, which depend on the behaviour of the three-component controlled process, are to be minimized over an infinite horizon; only controls based on the available observations are admissible. This optimization problem cannot be solved directly, since the state process is not completely observable.
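In schematic form (the notation here is ours, not the thesis'; the precise formulation is given in Chapter 2), problem (P) reads

\[
(P)\qquad \inf_{\pi}\ \mathbb{E}^{\pi}\Bigl[\,\int_0^{\infty} e^{-\beta t}\, c\bigl(Y_t, X_t, \pi_t\bigr)\, dt\Bigr],
\]

where Y denotes the environment process, X the state process, c the cost rate, β > 0 the discount rate, and the infimum is taken over all controls π adapted to the observation filtration.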
In Chapter 3 we define a problem (P_red) equivalent to (P) which is a problem under complete information and is therefore directly solvable. In the reduction theorem, Theorem 3.13, we show that the optimal controls and the optimal values of the two problems coincide. Beforehand, in Theorem 3.5 we derive an explicit representation of the conditional probability that the environment and state process occupy a given state, given the observations up to the present, and we discuss properties of this filter equation (3.10). In the reduced problem, this piecewise-deterministic estimator process replaces the unknown environment and state process. Analogously, the expected costs can be represented as a function of the estimator process. We conclude the chapter by discussing the dependence of the optimal value on the information structure and by establishing first properties of the value function, such as concavity in the estimator.
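To indicate the typical structure of such a filter, the following is a schematic sketch for a hidden Markov jump process with generator Q = (q_{ji}) observed through a single point process with state-dependent intensity λ(i); the actual filter equation (3.10) additionally has to account for the control and for the grouped observation structure and is more involved. Writing p_t(i) for the conditional probability of state i,

\begin{align*}
\frac{d}{dt}\, p_t(i) &= \sum_{j} p_t(j)\, q_{ji} \;-\; p_t(i)\Bigl(\lambda(i) - \sum_{j} p_t(j)\,\lambda(j)\Bigr) && \text{between observed jumps},\\
p_{\tau}(i) &= \frac{p_{\tau-}(i)\,\lambda(i)}{\sum_{j} p_{\tau-}(j)\,\lambda(j)} && \text{at an observed jump at time } \tau.
\end{align*}

The deterministic flow between observation epochs and the discrete update at observation times are exactly what make such an estimator piecewise deterministic.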
Chapter 4 is devoted to solving the reduced problem (P_red) under complete information. First, we show in Theorem 4.3 that the value function solves a generalized Hamilton-Jacobi-Bellman equation; the generalization consists in weakening the differentiability assumption to differentiability in the sense of Clarke, an assumption satisfied by the concave value function of (P_red). We also prove necessary and sufficient conditions on the optimal control, which lead in Theorem 4.4 to the generalized verification technique. A second solution approach is the formulation of a discrete-time Markov decision process (MDP). Here we exploit the piecewise-deterministic behaviour of the estimator process, and the action space is described by a function space. Theorem 4.7 establishes the connection between this MDP and (P_red): in particular, the optimal values coincide, and from the optimal policy of the MDP, whose existence we discuss in Theorem 4.14, an optimal control of (P_red) can be derived. Finally, Theorem 4.9 clarifies the relation between the two solution approaches.
In the concluding Chapter 5 we consider a queueing model. A server can split its service capacity between two queues, where each waiting customer incurs a cost rate c_i depending on the queue i, and the random service times depend on the service rate µ_i. In the case of complete information it is known that the cµ-rule is optimal, i.e., it is optimal to serve the queue with the larger product c_i µ_i, provided a customer is waiting there. We first show that the cµ-rule remains optimal as long as the information structure is sufficiently fine.
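Written out (in our notation, with x_i denoting the number of waiting customers in queue i), the cµ-rule assigns the full service capacity according to

\[
u^*(x_1, x_2) \;\in\; \operatorname*{arg\,max}_{i \in \{1,2\}:\ x_i \geq 1} \ c_i\, \mu_i ,
\]

i.e., among the nonempty queues, the one with the larger product c_i µ_i is served exclusively.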
Subsequently we analyze the case in which each µ_i can take one of two values µ_i^A or µ_i^B (the Bayesian case), where the server does not know which value is realized. Based on the foundations laid in Chapter 3, we first derive an explicit representation of the estimator process and discuss its properties. We then define the (uniformized) discrete-time Markov decision process in the sense of Section 4.2 and use it to prove the separation property of the value function in Theorem 5.13.
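To illustrate the Bayesian mechanics, the following is a minimal sketch, assuming that the server observes the full duration of each completed service and that this duration is exponential with the unknown rate; the estimator actually derived in Section 5.2.1 is driven by the grouped observations of the thesis' model and is more involved. All names in the snippet are ours.

```python
import math

def bayes_update(p, duration, mu_a, mu_b):
    """Posterior probability that the true service rate is mu_a,
    given the prior probability p and one observed service duration
    (standard two-point Bayes update for an exponential model)."""
    like_a = p * mu_a * math.exp(-mu_a * duration)          # prior weight times density under mu_a
    like_b = (1.0 - p) * mu_b * math.exp(-mu_b * duration)  # prior weight times density under mu_b
    return like_a / (like_a + like_b)

# Example: short service times push the belief toward the larger rate.
p = 0.5
for d in [0.2, 0.3, 0.1]:                 # three observed service durations
    p = bayes_update(p, d, mu_a=4.0, mu_b=1.0)
print(round(p, 3))                        # well above 1/2: the fast rate mu_a is likely
```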
Afterwards we prove, by means of the generalized verification technique of Chapter 4, that the optimal strategy almost always serves one queue exclusively. For the symmetric case, i.e. µ_1^A = µ_2^A and µ_1^B = µ_2^B, we show for c_1 = c_2 = 1 the optimality of a control-limit rule with threshold p* = 1/2. In the case that one service rate is known, we prove that it is always optimal to serve only one queue and give sufficient conditions for the optimal control. As in the symmetric case, the optimal control exhibits the stay-on-a-winner property known from bandit theory. These results are illustrated by simulation studies. Finally, instead of minimizing costs, we consider a model with a reward for every served customer. We solve this problem completely by means of a Gittins index and thereby complement results from the theory of continuous-time bandit problems.
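For orientation, the classical Gittins index of a continuous-time bandit in state x, with reward rate r and discount rate β, has the standard form (our notation; the index employed in Section 5.2.6 is the analogue adapted to the Bayesian service model)

\[
\nu(x) \;=\; \sup_{\tau > 0}\ \frac{\mathbb{E}_x\bigl[\int_0^{\tau} e^{-\beta t}\, r(X_t)\, dt\bigr]}{\mathbb{E}_x\bigl[\int_0^{\tau} e^{-\beta t}\, dt\bigr]},
\]

where the supremum runs over stopping times τ; it is then optimal to always engage the bandit (here: the queue) with the currently largest index.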
If, on the other hand, it is not observable whether or not a customer is waiting in a queue, then no closed-form expression for the estimator process is available. In Section 5.3 we numerically compare two plausible controls for such a 0-1-information structure.
Thanks
• Prof. Dr. Ulrich Rieder for your suggestions and advice, for the fruitful discussions with you, for the enormous support during the final stretch, for the opportunity to write my dissertation at your chair, and for the freedom in research and teaching that enabled me to work independently and on my own responsibility
• Prof. Dr. Dieter Kalin for your interest in my work, for acting as second examiner, and for our time together at the institute
• Frank, Marc, Thomas, Harald and all other colleagues in the faculty for the good working atmosphere, the exchange of ideas, and the fun outside the doctorate
• my family for your support and backing
• my friends for the life outside the university, which brought me variety and ever new momentum for my doctorate
• Daniela for your love, patience, motivation, and uplifting encouragement