Continuous Time Markov Decision Processes:
Theory, Applications and Computational Algorithms

Peter Buchholz
Informatik IV, TU Dortmund, Germany

© Peter Buchholz 2011
Overview 1

- Continuous Time Markov Decision Processes (CTMDPs)
  - Definition
  - Formalization
  - Applications
- Infinite Horizons
  - Result Measures
  - Optimal Policies
  - Computational Methods

Overview 2

- Finite Horizons
  - Result Measures
  - Optimal Policies
  - Computational Methods
- Advanced Topics
  - Model Checking CTMDPs
  - Infinite State Spaces
  - Transition Rate Bounds
  - …
Continuous Time Markov Chains (CTMCs) with Rewards

[Figure: a CTMC with states 1-4; every transition is labeled with its rate and, in parentheses, an impulse reward, e.g. 1.0 (3.0), 0.5 (0.1), 0.01 (0.0); states carry rate rewards.]

Components:
- States
- Initial Probabilities
- Transitions
- Transition Rates
- Rate Rewards
- Impulse Rewards
Behavior of the CTMC with rewards:

- The process stays an exponentially distributed time in a state
  (if it is not absorbing)
- It performs a transition into a successor state
  (according to the transition rates)
- It accumulates (rate) rewards per time unit in a state and
  impulse rewards per transition

Result measures:
- Average reward accumulated per unit of time in the long run
- Discounted reward
- Accumulated reward during a finite interval
- Accumulated reward before entering a state or subset of states
Extension: Choose between different actions in a state

[Figure: the CTMC from the previous slide extended with alternative (red/blue) transition sets; a decision between red or blue has to be made in states 2 and 3.]

This yields a Continuous Time Markov Decision Process.
Markov Decision Process (with finite state and action spaces)

- State space S = {1,…,n} (S ⊆ ℕ in the countable case)
- Set of decisions D_i = {1,…,m_i} for i ∈ S
- Vector of transition rates q_i^u ∈ ℝ^{1,n},
  where q_i^u(j) < ∞ is the transition rate from i to j (i ≠ j, i,j ∈ S) under
  decision u ∈ D_i
  - exponential sojourn time in state i with rate −q_i^u(i) = Σ_{j≠i} q_i^u(j)
    if decision u is taken, afterwards transition into j with probability
    q_i^u(j) / (−q_i^u(i))
- r^u(i) is the (non-negative, decision dependent) reward in state i,
  we assume r^u(i) < ∞
- s^u(i,j) is the (non-negative, decision dependent) reward of a transition
  from state i into state j; we assume s^u(i,j) < ∞ and s^u(i,j) = 0 for i = j or
  q_i^u(j) = 0
Goal: Analysis and control of the system in the interval [0,T] (T = ∞ is included)

- d_t is the decision vector at time t, where d_t(i) ∈ D_i
- Decision space D = ⨉_{i=1,…,n} D_i of size ∏_{i=1,…,n} m_i
- Q^d ∈ ℝ^{n,n} with Q^d(i,j) = q_i^{d(i)}(j) is the transition matrix of the CTMC under
  decision vector d
- r^d ∈ ℝ^{n,1} is the rate reward vector under decision vector d
- S^d ∈ ℝ^{n,n} is the impulse reward matrix under decision vector d
- p_0 is the initial distribution of the CTMDP at time t = 0
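As an illustration of this formalization, the following minimal sketch (all names are my own, not from the slides) stores the per-state rate vectors q_i^u and reward rates r^u(i) and assembles Q^d and r^d for a given decision vector d:

```python
import numpy as np

class CTMDP:
    """Finite CTMDP: for each state i and decision u in D_i we store the
    rate vector q_i^u (one row of length n) and the reward rate r^u(i)."""
    def __init__(self, n):
        self.n = n
        self.rates = [dict() for _ in range(n)]    # rates[i][u] = q_i^u as length-n array
        self.rewards = [dict() for _ in range(n)]  # rewards[i][u] = r^u(i)

    def add_action(self, i, u, q_iu, r_iu):
        q_iu = np.array(q_iu, dtype=float)
        q_iu[i] = -(q_iu.sum() - q_iu[i])          # diagonal entry: negative exit rate
        self.rates[i][u] = q_iu
        self.rewards[i][u] = float(r_iu)

    def build_Qd_rd(self, d):
        """Assemble Q^d and r^d for a decision vector d (d[i] in D_i)."""
        Qd = np.vstack([self.rates[i][d[i]] for i in range(self.n)])
        rd = np.array([self.rewards[i][d[i]] for i in range(self.n)])
        return Qd, rd
```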
Control of CTMDPs via policies:

A policy π is a measurable function from [0,T] into D
(set of all policies: M)

- decisions can depend on the time and the state (but not on the history)
- π_t defines d_t, the decision vector taken at time t
- π_{t,T} is the policy π restricted to the interval [t,T]

Policy π is
- piecewise constant, iff 0 = t_0 < t_1 < … < t_m = T exist such that d_t = d_t'
  for t,t' ∈ (t_{i-1},t_i]
- constant, iff d is independent of t
  (i.e., decisions depend only on the state)
Other forms of policies:

- randomized policy: π_t defines a probability distribution over D
- history dependent: π_t depends on (x_t', a_t') for 0 ≤ t' < t
  (restricted forms of history dependency exist)
- reward dependent: π_t depends on the reward accumulated in [0,t)

Policy types can be combined:
e.g., piecewise constant history dependent, constant randomized, …
CTMDP with a fixed policy π: Stochastic Process (X_t, A_t)

- X_t state process
- A_t action/decision process

Both processes together define
- G_t gain/reward process (i.e., accumulated reward in [0,t))

Behavior of G_t:
- Changes with rate r^a(i) if X_t = i and A_t = a
- Makes a jump of height s^a(i,j) if X_t jumps at time t from i to j and A_t = a
Example: admission control at a gate (accept or reject arriving customers)

[Figure: a queue with a gate that accepts or rejects arrivals; sample paths of the state x and the gain g over time t.]

Policy:
- accept if t < 2 and popul < 3
- reject otherwise

Rewards:
- 3 − popul per time unit
- 1 per service
Mathematical Basics:

Assume for the moment that policy π is known.

Evolution of the state process X_t: Let 0 ≤ t ≤ u ≤ T

$$V^\pi_{t,t} = I \quad\text{and}\quad \frac{d}{du} V^\pi_{t,u} = V^\pi_{t,u}\, Q^{d_u}$$

V^π_{t,u}(i,j) := probability of being in j at u, if the process is in i at t.

Let V^π_t = V^π_{0,t}. We have p_u = p_t V^π_{t,u} and p_t = p_0 V^π_t.
Evolution of the gain process G_t:

Let W^{d_t} be an n x n matrix with

$$W^{d_t}(i,j) = \begin{cases} r^{d_t(i)}(i) & \text{if } i = j\\ s^{d_t(i)}(i,j)\, q_i^{d_t(i)}(j) & \text{otherwise} \end{cases}
\qquad w^{d_t}(i) = \sum_{j=1}^n W^{d_t}(i,j)$$

Then

$$-\frac{d}{dt}\, g^\pi_t = w^{d_t} + Q^{d_t} g^\pi_t \qquad\text{(backwards in time!)}$$

g^π_{t,T} is the accumulated gain in the interval [t,T] and g^π_T the final gain at time T,
such that

$$g^\pi_{t,T} = V^\pi_{t,T}\, g^\pi_T + \int_t^T V^\pi_{u,T}\, w^\pi\, du$$
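For a fixed decision vector the backward equation above is just a linear ODE, so g^π can be obtained with any standard ODE solver. A minimal sketch for a constant decision vector on [0,T], using the Q^d and w^d from the earlier sketch (the function name is illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

def gain_vector(Qd, wd, T, gT):
    """Integrate -dg/dt = w^d + Q^d g backwards from the terminal condition g_T."""
    def rhs(t, g):                     # solve_ivp follows the given t direction,
        return -(wd + Qd @ g)          # so we integrate from T down to 0
    sol = solve_ivp(rhs, (T, 0.0), gT, rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]                # g_0: gain accumulated in [0,T] plus terminal gain
```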
Informatik IV
For
o ((T − t) → ∞ usua
usuallyy aalso
so gπt,T
o ds
t T (i) → ∞ holds
Results for [0,
are not meaningful in this case!
Alternative ways of defining the gain vector
Alternative ways of defining the gain vector
 Time averaged gain
Ã
!
Z T
1
π
π
gπt,T =
Vt,T
gTπ +
Vu,T
wπ du for T > t
T −t
t
 Discounted gain
π
gπt,T = e−β(T −t) Vt,T
gTπ +
Z
T
t
π
e−β(u−t) Vu,T
wπ du
for discount factor  > 0
© Peter Buchholz 2011
CTMDP Basics
15
From analysis to optimization problems:

Find a policy π (from a specific class) that maximizes/minimizes the gain.

Optimal gain

$$g^+_{0,T} = \sup_{\pi\in M}\left( g^\pi_{0,T} \right) \qquad\left(\text{or } g^-_{0,T} = \inf_{\pi\in M}\left( g^\pi_{0,T} \right)\right)$$

Optimal policy

$$\pi^+ = \arg\sup_{\pi\in M}\left( g^\pi_{0,T} \right) \qquad\left(\text{or } \pi^- = \arg\inf_{\pi\in M}\left( g^\pi_{0,T} \right)\right)$$

π^{+/-} need not be unique!

Policy π is ε-optimal iff

$$\left\| g^{\pi^*}_{0,T} - g^\pi_{0,T} \right\| \le \epsilon$$
Examples (queueing networks)

[Figure: a queueing network with a router and servers that can be switched on/off.]

- Routing
- Resource allocation/deallocation
- Scheduling
Examples (reliability, availability)

[Figure: a system of components G1-G5 / A1-A5 cycling between working, failure and maintenance.]

Find a maintenance policy to minimize system down time.
Examples (security)

[Figure: an attack tree (specifies sequences of attack steps) connected to the system via and/or gates.]

- Adversary: Find attack steps to reach the goal
- Defender: Find mechanisms to defend the system
Examples (OR)

Airline yield management
[Figure: time dependent arrival rate λ(t), accept/reject decisions, cancellations and no-shows.]

Inventory control
[Figure: orders and deliveries into an inventory, sold goods.]
Examples (AI)

[Figure: an agent observing the environment through sensors and acting on it through actors.]

Agent and environment are modeled as CTMDPs.
The optimal policy corresponds to an "optimal" behavior of the agent.
CTMDPs on Infinite Horizons

We consider a CTMDP in the interval [0,∞)

- Result Measures
  - Reward to Absorption
  - Average Reward
  - Discounted Reward
- Optimal Policies
- Computational Methods
Reward to Absorption

Assumptions (set of transient states S_t, absorbing state(s) S_a):

- w^d(i) ≥ 0 for i ∈ S_t and w^d(i) = 0 for i ∈ S_a, for all d ∈ D
- $\lim_{t\to\infty} \sum_{j\in S_a} V^\pi_t(i,j) = 1$ for all i ∈ S and all π ∈ M

Goal:

Find a policy π such that

$$g^\pi = \int_0^\infty V^\pi_t\, w^\pi\, dt$$

is minimal/maximal in all components.
Average Reward

Find a policy π such that

$$g^\pi = \lim_{T\to\infty} \frac{1}{T}\left( \int_0^T V^\pi_t\, w^\pi\, dt \right)$$

is maximal/minimal in all components.

No further assumptions necessary ($\lim_{t\to\infty} V^\pi_t$ exists in our case!)

Discounted Reward

Find a policy π such that

$$g^\pi = \lim_{T\to\infty}\left( \int_0^T e^{-\beta t}\, V^\pi_t\, w^\pi\, dt \right)$$

is maximal/minimal in all components (discount factor β ≥ 0).
Optimal Policies:

Stationary policies π_1 and π_2 exist such that

$$g^{\pi_1} = \sup_{\pi\in M}\left( g^\pi \right) \quad\text{and}\quad g^{\pi_2} = \inf_{\pi\in M}\left( g^\pi \right)$$

in the cases we consider here.

The optimal policies might not be unique, such that other criteria can
be applied to rank policies with identical gain vectors
(we do not go in this direction here).
Computational Methods

- Basic step is the transformation of the CTMDP into an equivalent
  DTMDP (plus a Poisson process) using uniformization
- Afterwards, methods for computing the optimal policy/gain vector in
  DTMDPs can be applied (the Poisson process is not needed)
The uniformization approach:

For the CTMC of a fixed decision vector d:

1) Poisson process with rate α ≥ max_{d∈D} max_{i∈S} (−Q^d(i,i)) for the timing
2) DTMC P^d = Q^d/α + I for the transitions
3) Transformed rewards w'^d(i) = w^d(i)/α (total reward) or w^d(i)/(α + β) (discounted reward),
   discount factor β' = α / (α + β)

[Figure: a three-state CTMC with rates 1.0, 2.0, 3.0 and its uniformized DTMC with probabilities 1/3, 2/3 and a pseudo self-transition.]
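A minimal sketch of this transformation, assuming the Q^d matrices and w^d vectors are given as lists indexed by decision vector (in practice one would uniformize the per-state action rows instead; all names are illustrative):

```python
import numpy as np

def uniformize(Q_list, w_list, beta=0.0):
    """Uniformize a CTMDP: returns P^d = Q^d/alpha + I, transformed rewards
    w'^d = w^d/(alpha + beta) and the discrete discount factor beta'."""
    alpha = max(np.max(-np.diag(Q)) for Q in Q_list)   # alpha >= largest exit rate
    P_list = [Q / alpha + np.eye(Q.shape[0]) for Q in Q_list]
    w_prime = [w / (alpha + beta) for w in w_list]     # equals w/alpha when beta == 0
    beta_prime = alpha / (alpha + beta)
    return P_list, w_prime, beta_prime
```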
Uniformization transforms a CTMDP = (Q^d, p_0, w^d, S, D) into a
DTMDP = (P^d, p_0, w'^d, S, D) with an identical

- optimal gain vector g* and
- optimal stationary policy π* (described by vector d*)

The optimal gain vector is the solution of the following equation for the
discounted and total reward:

$$g^*(i) = \max_{u\in D_i}\left( w'^u(i) + \beta' \sum_{j=1}^n P^u(i,j)\, g^*(j) \right) \quad\text{for } i = 1,\dots,n$$

Vector d* results from setting d*(i) to the argmax in the equation.
For the optimization of average rewards, we restrict the model class to
unichain models:

A DTMDP is unichain if for every d ∈ D matrix P^d contains a single
recurrent class of states (plus possibly some transient states).

For unichain DTMDPs the average reward is constant for all states and
observes the following equations:

$$\rho^* + h^*(i) = \max_{u\in D_i}\left( w'^u(i) + \sum_{j=1}^n P^u(i,j)\, h^*(j) \right) \quad\text{for } i = 1,\dots,n$$
Numerical methods to compute the optimal gain vector + policy:

- linear programming
- value iteration
- policy iteration
- combinations of value and policy iteration

Practical problems:
- Curse of dimensionality, state space explosion
- Slow convergence, long solution times
Linear Programming

Discounted Reward Case:

g* is the largest g that satisfies

$$g \le \max_{d\in D}\left( w'^d + \beta' P^d g \right), \qquad g^* = w'^{d^*} + \beta' P^{d^*} g^*$$

LP:

$$\max \sum_{i\in S} x_i \quad\text{subject to}\quad x_i \le w'^u(i) + \beta' \sum_{j=1}^n P^u(i,j)\, x_j \quad\text{for all } i\in S,\, u\in D_i$$

Average Reward Case:

$$\max \sum_{i\in S}\sum_{u\in D_i} x_i^u\, w'^u(i) \quad\text{subject to}\quad
\sum_{u\in D_i} x_i^u - \sum_{j\in S}\sum_{u\in D_j} P^u(j,i)\, x_j^u = 0 \quad\text{for all } i\in S,$$
$$\sum_{i\in S}\sum_{u\in D_i} x_i^u = 1, \qquad x_i^u \ge 0 \quad\text{for all } i\in S,\, u\in D_i$$
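As an illustration, the average reward LP above can be handed directly to a generic LP solver. A sketch with scipy.optimize.linprog, where P[i][u] is the row P^u(i,·) as a NumPy array and w[i][u] the reward w'^u(i) (helper names are my own):

```python
import numpy as np
from scipy.optimize import linprog

def average_reward_lp(P, w):
    """Solve the average-reward LP over occupation measures x_i^u;
    returns (rho*, d*)."""
    n = len(P)
    idx = [(i, u) for i in range(n) for u in P[i]]        # one variable x_i^u per pair
    c = -np.array([w[i][u] for i, u in idx])              # linprog minimizes, so negate
    A_eq = np.zeros((n + 1, len(idx)))
    for k, (i, u) in enumerate(idx):
        A_eq[:n, k] -= P[i][u]                            # -P^u(i,j) enters constraint row j
        A_eq[i, k] += 1.0                                 # +x_i^u in constraint row i
        A_eq[n, k] = 1.0                                  # normalization sum x_i^u = 1
    b_eq = np.zeros(n + 1)
    b_eq[n] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    x = {pair: res.x[k] for k, pair in enumerate(idx)}
    d_star = [max(P[i], key=lambda u: x[(i, u)]) for i in range(n)]
    return -res.fun, d_star
```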
- d* and g* (π* and h*) can be derived from the results of the LP (the x_i^u)
- An LP with Σ_{i=1}^n |D_i| variables and n, Σ_{i=1}^n |D_i| or n+1 constraints
  has to be solved (this can be time and memory consuming!)

Usually Linear Programming is not the best choice to compute optimal
policies for MDP problems!
Value Iteration

Initialize k = 0 and g^(0) ≥ 0, then iterate until convergence:

$$g^{(k+1)}(i) = \max_{u\in D_i}\left( w'^u(i) + \beta' \sum_{j=1}^n P^u(i,j)\, g^{(k)}(j) \right)$$

or, for the average reward,

$$h^{(k+1)}(i) = \max_{u\in D_i}\left( w'^u(i) + \sum_{j=1}^n P^u(i,j)\, h^{(k)}(j) \right)
- \max_{u\in D_n}\left( w'^u(n) + \sum_{j=1}^n P^u(n,j)\, h^{(k)}(j) \right)$$

for all i = 1,…,n  (repeated vector matrix products).
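A compact sketch of the discounted iteration (illustrative names; P and w are keyed by state and decision as in the earlier sketches):

```python
import numpy as np

def value_iteration(P, w, beta_prime, eps=1e-8, max_iter=100000):
    """Discounted value iteration; returns the gain vector g and decision vector d."""
    n = len(P)
    g = np.zeros(n)
    d = [None] * n
    for _ in range(max_iter):
        g_new = np.empty(n)
        for i in range(n):
            vals = {u: w[i][u] + beta_prime * P[i][u] @ g for u in P[i]}
            d[i] = max(vals, key=vals.get)          # maximizing decision in state i
            g_new[i] = vals[d[i]]
        if np.max(np.abs(g_new - g)) < eps:         # simple convergence test
            return g_new, d
        g = g_new
    return g, d
```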
- Easy to implement
- Convergence towards the optimal gain vector and policy
  (stop if the policy does not change and ||g^(k) − g^(k-1)|| < ε)
- Effort per iteration: $n \cdot nz \cdot \sum_{i=1}^n m_i$,
  where nz is the average number of non-zeros in a row of P^d
- Often slow convergence and a huge effort even for CTMDPs of a
  moderate size
Policy Iteration

Initialize k = 0 and an initial policy d_0, then iterate until convergence:

$$g^{(k)} = w'^{d_k} + \beta' P^{d_k} g^{(k)}
\qquad\text{or}\qquad
h^{(k)} + \rho_k e = w'^{d_k} + P^{d_k} h^{(k)} \text{ with } h^{(k)}(n) = 0$$

(repeated solution of sets of linear equations)

$$d_{k+1} = \arg\max_{d\in D}\left( w'^d + \beta' P^d g^{(k)} \right)
\qquad\text{or}\qquad
d_{k+1} = \arg\max_{d\in D}\left( w'^d + P^d h^{(k)} \right)$$

until d_{k+1} = d_k.
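A sketch of the discounted variant with exact policy evaluation via a dense linear solve (names as in the earlier sketches):

```python
import numpy as np

def policy_iteration(P, w, beta_prime, d0=None):
    """Discounted policy iteration; returns the optimal gain and decision vectors."""
    n = len(P)
    d = d0 or [next(iter(P[i])) for i in range(n)]       # arbitrary initial policy
    while True:
        Pd = np.vstack([P[i][d[i]] for i in range(n)])   # P^{d_k}
        wd = np.array([w[i][d[i]] for i in range(n)])    # w'^{d_k}
        g = np.linalg.solve(np.eye(n) - beta_prime * Pd, wd)   # policy evaluation
        d_new = [max(P[i], key=lambda u: w[i][u] + beta_prime * P[i][u] @ g)
                 for i in range(n)]                      # policy improvement
        if d_new == d:
            return g, d
        d = d_new
```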
- The optimal policy and gain vector are computed after finitely many steps
- Effort per iteration: $O(n^3) + O\!\left(n \cdot nz \cdot \sum_{i=1}^n m_i\right)$
- For larger state spaces a huge effort
  (solution of a set of linear equations of order n)
Improvements by combining policy and value iteration

Initialize k = 0 and an initial policy d_0, then iterate until convergence:

$$g^{(k)} \Leftarrow \text{iterate}\left( (I - \beta' P^{d_k})\, g = w'^{d_k} \right)
\quad\text{or}\quad
h^{(k)} + \rho_k \Leftarrow \text{iterate}\left( (I - P^{d_k})\, h = w'^{d_k} \right) \text{ subject to } h^{(k)}(n) = 0$$

where iterate is some advanced iteration technique like SOR, ML, GMRES, …, and

$$d_{k+1} = \arg\max_{d\in D}\left( w'^d + \beta' P^d g^{(k)} \right)
\quad\text{or}\quad
d_{k+1} = \arg\max_{d\in D}\left( w'^d + P^d h^{(k)} \right)$$

until d_{k+1} = d_k and ||g^(k) − g^(k-1)|| < ε or ||h^(k) − h^(k-1)|| < ε.
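As one possible instance of such a combination, the exact evaluation step of the previous sketch can be replaced by a few Krylov iterations, e.g. GMRES from SciPy; a minimal sketch with preconditioning (ILU etc.) omitted:

```python
import numpy as np
from scipy.sparse.linalg import gmres

def evaluate_policy_approx(Pd, wd, beta_prime, g_prev, maxiter=20):
    """Approximately solve (I - beta' P^d) g = w'^d, warm-started with the previous g."""
    n = Pd.shape[0]
    A = np.eye(n) - beta_prime * Pd
    g, info = gmres(A, wd, x0=g_prev, maxiter=maxiter)
    return g            # info != 0 indicates that the tolerance was not reached
```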
Example:

Determination of the best non-preemptive scheduling strategy in steady state.

- Poisson or IPP arrivals
- Exponential class switching delays and service times
- Finite buffer capacity over all classes
- Nonlinear reward function: $(n_1 + n_2 + n_3)^{1.2} + n_1 + 1.5 n_2 + 2 n_3$

Methods:
- Value iteration
- Policy iteration
- Combined approach with BiCGStab or GMRES + ILU preconditioning
Exponential interarrival times (fast convergence of all solvers):

[Plot: solution times of the different solvers; no need for advanced solvers, value iteration works fine!]

IPP interarrival times (systems are much harder to solve):

[Plot: solution times of the different solvers; advanced solvers are much more efficient for larger configurations!]
Some remarks:

- Analysis for infinite horizons is in principle well understood
- Complex models with larger state spaces require sophisticated
  solution methods
- Work on numerical methods is less developed than in CTMC/DTMC
  analysis
  - Which method to use?
  - Which preconditioner?
  - How many iterations?
  - …
CTMDPs on Finite Horizons

We consider a CTMDP in the interval [0,T], T < ∞

- Result Measures
  - Accumulated Reward
  - Accumulated Discounted Reward
- Optimal Policies
- Computational Methods
Many problems are defined naturally on a finite horizon, e.g.,

- yield management (for a specific tour)
- scheduling (for a specific set of processes)
- maintenance (for a system with finite operational periods)
- …

Use of the optimal stationary policy is suboptimal
(e.g., maintenance for a machine just before it is shut down).
Optimal policies for DTMDPs on a horizon of T steps

Compute iteratively

$$g^{(k+1)}(i) = \max_{u\in D_i}\left( w^u(i) + \beta \sum_{j=1}^n P^u(i,j)\, g^{(k)}(j) \right)$$

for k = 1,2,…,T, starting with g^(0)(i) = 0 for all i = 1,…,n,
where β = 1.0 for the case without discounting.

Effort: $O\!\left(T \cdot n \cdot nz \cdot \sum_{i=1}^n m_i\right)$
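This recursion is plain backward induction; a short sketch (illustrative names, P and w as in the earlier sketches):

```python
import numpy as np

def finite_horizon_dtmdp(P, w, T, beta=1.0):
    """Backward induction over T steps; returns g^(T) and the step-dependent policy."""
    n = len(P)
    g = np.zeros(n)
    policy = []                                   # policy[k][i]: decision at step T-k in state i
    for _ in range(T):
        g_new = np.empty(n)
        d = [None] * n
        for i in range(n):
            vals = {u: w[i][u] + beta * P[i][u] @ g for u in P[i]}
            d[i] = max(vals, key=vals.get)
            g_new[i] = vals[d[i]]
        policy.append(d)
        g = g_new
    return g, policy
```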
How about applying the DTMC resulting from uniformization for the
transient analysis of CTMDPs?

- Times between transitions in the uniformized DTMC are exponentially
  distributed with rate α rather than constant
- Substitution of the distribution by its mean, as done in the stationary
  case, does not work
- It has been shown that the approach computes the optimal policy for
  uniformization parameter α → ∞
  (but this results in O(αT) steps to cover the interval [0,T])
Uniformization revisited

Poisson process with rate α ≥ max_{d∈D} max_{i∈S} (−Q^d(i,i)) for the timing.

Probability for k events in an interval of length t:

$$\gamma(\alpha t, k) = \frac{e^{-\alpha t} (\alpha t)^k}{k!}$$

Each event equals a transition in the DTMC.

Probability for more than k events:

$$1 - \sum_{l=0}^{k} \gamma(\alpha t, l)$$

[Figure: the uniformized DTMC from the earlier slide with probabilities 1/3, 2/3 and a pseudo transition.]
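The truncated Poisson weights γ(αδ, k) and the remaining tail mass can be computed directly, for instance via scipy.stats.poisson (function name is illustrative):

```python
from scipy.stats import poisson

def poisson_weights(alpha, delta, K):
    """gamma(alpha*delta, k) for k = 0..K plus the truncated tail mass."""
    mu = alpha * delta
    gammas = [poisson.pmf(k, mu) for k in range(K + 1)]
    tail = 1.0 - sum(gammas)          # probability of more than K events
    return gammas, tail
```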
Known results for CTMDPs on finite horizons
(by Miller 1968, Lembersky 1974, Lippman 1976; all fairly old!)

- A policy is optimal if it maximizes, for almost all t ∈ [0,T],

$$\max_{\pi\in M}\left( Q^\pi g_t + w^\pi \right) \quad\text{where}\quad -\frac{d}{dt}\, g_t = Q^{d_t} g_t + w^{d_t} \text{ and } g_T \ge 0$$

- There exists a piecewise constant policy π* which results in vector g*_t
  and maximizes the equation
  (a policy is piecewise constant, if m < ∞ and
  0 = t_0 < t_1 < … < t_m = T exist and d_i is the decision vector in [t_{i-1},t_i)),
  i.e., the optimal policy depends on the time and the state but changes
  only finitely often in [0,T]
Selection of an optimal policy (using results of Miller 1968)

Assume that g*_t is known; then the following selection procedure selects d*,
which is optimal in (t−δ, t) for some δ > 0.

Define the sets

F_1(g*_t) = {d ∈ D | d maximizes q^(1)(d)}
F_2(g*_t) = {d ∈ F_1(g*_t) | d maximizes −q^(2)(d)}
…
F_{n+1}(g*_t) = {d ∈ F_n(g*_t) | d maximizes (−1)^n q^(n+1)(d)}

where q^(1)(d) = Q^d g*_t + w^d and
q^(j)(d) = Q^d q^(j-1), where q^(j-1) = q^(j-1)(d) for any d ∈ F_{j-1}(g*_t).

Select the lexicographically smallest vector from F_{n+1}(g*_t).
The constructive proof in Miller 1968 defines the basis for an algorithm:

1. Set t' = T and initialize g_T ;
2. Select d_t' as described ;
3. Obtain g_t for t ≤ t' ≤ T by solving −(d/dt) g_t = Q^{d_t} g_t + w^{d_t}
   with terminal condition g_t' ;
4. Set t'' = inf{ t | d_t satisfies the selection procedure } ;
5. If t'' ≤ 0 terminate, else goto 2. with t' = t'' ;

Not implementable!
From exact to approximate optimal policies:

A policy π is ε-optimal if ||g*_t − g^π_t|| ≤ ε for all t ∈ [0,T].

Discretization approach:

Define for step size h the matrices P^d_h = I + h Q^d
(for h small enough this defines a stochastic matrix).

Let g_{t−h} = P^d_h g_t + h w^d + o(h)
(representation of a DTMDP for policy optimization).

- For h → 0 the computed policy is ε-optimal with ε → 0
- The value of ε is unknown; effort $O\!\left(h^{-1} \cdot n \cdot nz \cdot \sum_{i=1}^n m_i\right)$
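A sketch of one backward step of this discretization, with the maximization done per state (names are illustrative; Q and w keyed by state and decision as before):

```python
import numpy as np

def discretized_backward_step(Q, w, g, h):
    """One step g_{t-h} ~ max_d (P^d_h g_t + h w^d) with P^d_h = I + h Q^d."""
    n = len(Q)
    g_new = np.empty(n)
    d = [None] * n
    for i in range(n):
        # row i of (I + h Q^d) g_t plus h * w^d(i)
        vals = {u: g[i] + h * (Q[i][u] @ g + w[i][u]) for u in Q[i]}
        d[i] = max(vals, key=vals.get)
        g_new[i] = vals[d[i]]
    return g_new, d
```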
Idea of the uniformization based approach

[Figure: reward over time in a subinterval (t−δ, t); idealized, non-realizable policies bound the reward of d*, the locally best realizable policy, from above.]
A uniformization based approach

Basic steps:
1. Start with t = T and g_T = g^-_T = g^+_T = 0 ;
2. Compute an optimal decision vector d based on g^-_t ;
3. Compute a lower bound for g^-_{t−δ} using decision d in (t−δ, t) ;
4. Compute an upper bound g^+_{t−δ} for any policy in (t−δ, t) ;
5. Choose δ such that ||g^+_{t−δ} − g^-_{t−δ}|| < ε (t − δ) / T ;
6. Store d and t−δ ;
7. If t−δ > 0 then set t = t−δ and goto 2; else terminate ;

All this has to be computable!
Computation of the lower bound g^-_{t−δ} using decision d in [t−δ, t):

d is the optimal decision at t based on g^-_t.

Since d is constant, this equals the transient analysis of a CTMC,
which can be computed using uniformization.

Consider up to K Poisson steps (probabilities known!).

[Figure: the interval [t−δ, t]; the bound combines the reward gained at the end of the interval (g^-_t), the reward accumulated in the interval, and a lower bound for the reward accumulated in k+1, k+2, … steps.]
Formally, we get:

$$v^{(k)}_- = P^d v^{(k-1)}_- \text{ with } v^{(0)}_- = g^-_t
\qquad\text{and}\qquad
w^{(k)}_- = P^d w^{(k-1)}_- \text{ with } w^{(0)}_- = w^d$$

Then

$$g^-_{t-\delta} = \sum_{k=0}^{K} \gamma(\alpha\delta, k)\, v^{(k)}_-
+ \frac{1}{\alpha} \sum_{k=0}^{K} \left( 1 - \sum_{l=0}^{k} \gamma(\alpha\delta, l) \right) w^{(k)}_-
+ \eta(\alpha\delta, w^d, g^-_t)$$

where η(αδ, w^d, g^-_t) bounds the missing Poisson probabilities.

Effort:
- for the vector computation in O(K·n·nz)
- for the evaluation of a new δ in O(K·n)
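A sketch of this lower-bound computation for a fixed decision vector d; the tail term η is simply dropped here, which is still a valid lower bound for non-negative rewards but cruder than the bound on the slide (Pd and wd are the uniformized matrix and reward vector, names illustrative):

```python
import numpy as np
from scipy.stats import poisson

def lower_bound(Pd, wd, g_t, alpha, delta, K):
    """Lower bound for g^-_{t-delta} under a fixed decision vector d."""
    mu = alpha * delta
    gammas = poisson.pmf(np.arange(K + 1), mu)
    v = g_t.copy()                       # v^(0) = g^-_t
    w = wd.copy()                        # w^(0) = w^d
    g_lb = gammas[0] * v + (1.0 - gammas[0]) * w / alpha
    cum = gammas[0]
    for k in range(1, K + 1):
        v = Pd @ v                       # v^(k) = P^d v^(k-1)
        w = Pd @ w                       # w^(k) = P^d w^(k-1)
        cum += gammas[k]
        g_lb += gammas[k] * v + (1.0 - cum) * w / alpha
    return g_lb
```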
Computation of the upper bound g^+_{t−δ}, based on two ideas:

1. Compute separate policies for the accumulated reward and the reward gained
   at the end
2. Assume that we can choose a new policy at every jump of the Poisson process

Compute for k = 0, 1, …, K steps in the interval.

[Figure: the interval [t−δ, t]; at every Poisson jump new decision vectors d (for the reward gained at the end, g^-_t) and d' (for the reward accumulated in the interval) may be chosen; the reward gained in k+1, k+2, … steps is bounded from above.]
Formally,

$$v^{(k)}_+ = \max_{d\in D}\left( P^d v^{(k-1)}_+ \right) \text{ with } v^{(0)}_+ = g^+_t
\qquad\text{and}\qquad
w^{(k)}_+ = \max_{d\in D}\left( P^d w^{(k-1)}_+ \right) \text{ with } w^{(0)}_+ = \max_{d\in D}\left( w^d \right)$$

(identical for all intervals).

For g^+_t ≥ g*_t we obtain

$$g^+_{t-\delta} = \sum_{k=0}^{K} \gamma(\alpha\delta, k)\, v^{(k)}_+
+ \frac{1}{\alpha} \sum_{k=0}^{K} \left( 1 - \sum_{l=0}^{k} \gamma(\alpha\delta, l) \right) w^{(k)}_+
+ \nu(\alpha\delta, w^d, g^+_t) \;\ge\; g^*_{t-\delta}$$

where ν(αδ, w^d, g^+_t) bounds the truncated Poisson probabilities.

This policy is better than any realizable policy!
Effort:
- for the vector computation in $O\!\left(K \cdot n \cdot nz \cdot \sum_{i=1}^n m_i\right)$
- for the evaluation of a new δ in O(K·n)
  (can be applied in a line search to find an appropriate δ)

The error is proportional to the length of the subinterval:
choose δ such that ||g^+_{t−δ} − g^-_{t−δ}|| < ε (t − δ) / T.
If δ is known and s is independent of the decision vector, the
upper bound can be improved by computing a non-realizable bounding policy for
accumulated and gained reward:

$$x^{(k)}_+ = \max_{d_1,\dots,d_k \in D}\left( \prod_{i=1}^{k} P^{d_i} \left( \eta_k\, g^+_t + \zeta_k\, w \right) \right)$$

where η_k and ζ_k weight the gained and the accumulated reward.

We have then

$$\sum_{k=0}^{\infty} x^{(k)}_+ \ge g^*_{t-\delta}$$

Effort: $O\!\left(K^2 \cdot n \cdot nz \cdot \sum_{i=1}^n m_i\right)$
- Overall effort $O\!\left(K \cdot n \cdot nz \cdot \sum_{i=1}^n m_i\right)$
  if the optimal policy is selected from F_i(g^-_t) for some small i
  (usually the case)
- Local error O(δ²) ⇒ global error O(δ) ⇒
  for any ε > 0 (theoretically) the appropriate policy can be computed
Example:

[Figure: a five-state CTMDP with rates such as 1.0, 0.2, 10.0, 0.01; decisions b (blue) and r (red) in states 2 and 3.]

For T ≥ 70.5058:
- (b,b) in [T − 4.11656, T]
- (b,r) in (T − 70.5058, T − 4.11656]
- (r,r) in [0, T − 70.5058)

g_0^T = (20.931, 20.095, 19.138, 20.107, 8.6077)
(bounds with ε = 1.0e-6)
Example: 4 processors, 2 buses, 3 memories, 1 repair unit; prioritized repair.
Computation of the availability. CTMDP with 60 states.

Availability in [0,100] and availability at T = 100:

| ε      | iter ([0,100]) | lower bound | upper bound | iter (T=100) | lower bound | upper bound |
|--------|----------------|-------------|-------------|--------------|-------------|-------------|
| 1.0e-2 | 1080           | 0.986336    | 0.995790    | 872          | 0.987109    | 0.995386    |
| 1.0e-4 | 2026           | 0.995633    | 0.995722    | 1419         | 0.995188    | 0.995282    |
| 1.0e-6 | 5896           | 0.995721    | 0.995721    | 21089        | 0.995279    | 0.995280    |
| 1.0e-8 | 34273          | 0.995721    | 0.995721    | 1513361      | 0.995280    | 0.995280    |
Example:

[Figure: a queueing system with finite capacity Cap = 10, arrival rate λ = 1 and service rates μ = 1 and μ = 0.5.]

CTMDP with 121 states. Goal: maximization of the throughput in [0,100].

Throughput in [0,100]:

| ε      | iter    | lower bound | upper bound | Sw. time |
|--------|---------|-------------|-------------|----------|
| 1.0e+0 | 6098    | 97.4599     | 98.4556     | 68.2561  |
| 1.0e-1 | 45866   | 97.4878     | 97.5875     | 68.3037  |
| 1.0e-2 | 334637  | 97.4879     | 97.4979     | 68.3099  |
| 1.0e-3 | 2981067 | 97.4881     | 97.4891     | 68.3102  |
Effort

- Linear in t and in ε⁻¹ (and in n·nz)
- The influence of K depends on the model
  - K too small results in small time steps due to the
    truncated Poisson probabilities
  - K too large results in many unnecessary
    computations that are truncated due to the difference
    in the policy bounds
- Adaptive approach: choose K such that the fraction of
  the error due to the truncation of the Poisson
  probabilities remains constant
Extensions

- The method can be applied to discounted rewards after
  introducing small modifications
- The method can be applied to CTMDPs with time dependent
  but piecewise constant rates and rewards
- The method can be extended to CTMDPs with time
  dependent rates and rewards
- The method can be extended to countable state spaces
Advanced Topics

- Model Checking CTMDPs
- Infinite State Spaces
- CTMDPs with bounds on the transition rates
- Equivalence of CTMDPs
- Partially observable CTMDPs
- …
CSL model checking of CTMDPs
(joint work with Holger Hermanns, Ernst Moritz Hahn, Lijun Zhang)

- Model checking of CTMCs is very popular
- Extension for CTMDPs: define formulas that hold for a set of states
  and all schedulers / some scheduler
- Model checking means to compute for every state whether a formula
  holds/does not hold
- Validation of path formulas requires the computation of minimal/maximal
  gain vectors for finite or infinite horizons
Syntax of CSL for CTMDPs:

$$\Phi := a \mid \neg\Phi \mid \Phi \wedge \Phi \mid P_J(\Phi\, U^I\, \Phi) \mid S_I(\Phi) \mid I^t_J(\Phi) \mid C^I_J(\Phi)$$

where I and J are closed intervals and t is some time point;
a | ¬Φ | Φ ∧ Φ have the usual interpretation.
s |= P_J(Φ U^I Ψ) if the probability of all paths that start in state s and observe Φ
    until Ψ in I falls into J for all policies

s |= S_I(Φ) if the process starts in state s and has a time averaged stationary
    reward over states observing Φ that lies in I for all policies

s |= I^t_J(Φ) if the process starts in state s and has an instantaneous reward at
    time t in states observing Φ that lies in J for all policies

s |= C^I_J(Φ) if the process starts in state s and has an accumulated reward
    over states observing Φ in the interval I that lies in J for all policies
Validation of formulas, an example:

$$s \models C^{[t_0,T]}_{[p_1,p_2]}(\Phi) \quad\text{iff}\quad a^-(s) \ge p_1 \wedge a^+(s) \le p_2$$

$$\text{where}\quad a^- = \inf_{\pi\in M}\left( V^\pi_{0,t_0}\, g^\pi_{t_0,T}|_\Phi \right)
\quad\text{and}\quad a^+ = \sup_{\pi\in M}\left( V^\pi_{0,t_0}\, g^\pi_{t_0,T}|_\Phi \right)$$

Two step approach to compute V^π_{0,t_0} and g^π_{t_0,T}|_Φ:

1. Computation of V^π_{0,t_0} with the standard approach, i.e.

$$V^\pi_{t,t} = I \quad\text{and}\quad \frac{d}{du} V^\pi_{t,u} = V^\pi_{t,u}\, Q^{d_u}$$

2. Computation of g^π_{t_0,T}|_Φ using the vectors w^π|_Φ and g_T|_Φ, then

$$g^\pi_{t,T}|_\Phi = V^\pi_{t,T}\, g^\pi_T|_\Phi + \int_t^T V^\pi_{u,T}\, w^\pi|_\Phi\, du$$

where h|_Φ(s) = h(s) if s |= Φ and 0 else.

(Separate error bounds for both quantities.)
A small example

s_4 is made absorbing to compute the probability of reaching s_4 in the
interval [3,7]:

$$\sup_{\pi\in M} P^\pi_{s_1}\!\left( true\; U^{[3,7]}\; s_4 \right) = P^{max}_{s_1}\!\left( \Diamond^{[3,7]}\, s_4 \right)$$

Table 1: Bounds for reaching s_4 in [3,7], i.e., P^max_{s_1}(◊^[3,7] s_4); time bounded probability (lower/upper) and iteration counts for two overall error bounds ε.

|        | ε = 1.0e-3 |         |       |       | ε = 6.0e-4 |         |       |       |
| ε1     | lower      | upper   | iter1 | iter2 | lower      | upper   | iter1 | iter2 |
|--------|------------|---------|-------|-------|------------|---------|-------|-------|
| 9.0e-4 | 0.97170    | 0.97186 | 207   | 90    | —          | —       | —     | —     |
| 5.0e-4 | 0.97172    | 0.97186 | 270   | 89    | 0.97176    | 0.97185 | 270   | 93    |
| 1.0e-4 | 0.97175    | 0.97185 | 774   | 88    | 0.97178    | 0.97185 | 774   | 91    |
| 1.0e-5 | 0.97175    | 0.97185 | 5038  | 88    | 0.97179    | 0.97185 | 5038  | 91    |
Infinite (countable) state spaces

- Infinite horizon
  - Discounted reward
  - Average reward
- Finite horizon

We assume that the control space per state remains finite and that
all rewards and transition rates are bounded.
Infinite horizon discounted case

- We assume that the optimality equations for the original model
  (shown earlier for the discounted/total reward) have a solution
- We compute results on Ŝ = {1,…,n} ⊂ S
- Let Q̂^d be the n x n submatrix of Q^d and ŵ^d the n dimensional
  subvector of w^d restricted to states from Ŝ
- Define P̂^d = Q̂^d/α + I and ŵ'^d = ŵ^d/(α + β) (as in the uniformization approach above)

This defines a DTMDP with n states.
Value iteration:

- The modified DTMDP can be analyzed with value iteration (as shown earlier)
- If

$$\lim_{k\to\infty} \left\| \hat P^{d_k|\hat S}\, g^{(k-1)}\big|_{\hat S} - \left( P^{d_k}\, g^{(k-1)} \right)\Big|_{\hat S} \right\|_\infty < \epsilon
\quad\text{then}\quad
\left\| \hat g^* - g^*|_{\hat S} \right\|_\infty < \frac{\epsilon}{1-\beta'}$$

  where ĝ*, g* are the vectors to which value iteration applied to the
  reduced and the original MDP converges
- d_k(i) equals the decision in state i during the k-th iteration of value
  iteration applied to the finite system if i ∈ Ŝ and is arbitrary else
- g^(k)(i) equals the value in the k-th iteration of value iteration
  applied to the finite system if i ∈ Ŝ and is an upper bound for the
  value function otherwise
Policy iteration:

If for some policy π the vector g̃^(k)|_Ŝ is the gain vector restricted to states
from Ŝ and ĝ^(k) is the approximate solution computed for the finite
system, such that ||g̃^(k)|_Ŝ − ĝ^(k)||_∞ < δ and

$$\left\| \hat P^{d_{k+1}|\hat S}\, \hat g^{(k)} - \left( P^{d_{k+1}}\, \bar g^{(k)} \right)\Big|_{\hat S} \right\|_\infty < \epsilon$$

where d_{k+1} results from ĝ^(k) and ḡ^(k) is an arbitrary extension of ĝ^(k) to S, then

$$\lim_{k\to\infty} \left\| \hat g^{(k)} - g^*|_{\hat S} \right\|_\infty < \frac{\epsilon + 2\beta'\delta}{(1-\beta')^2}$$
Infinite horizon average case

- Additional conditions are required to assure the existence of an optimal
  policy (many different conditions exist in the literature)
- Let for some policy π:
  - C^π be the expected gain of a cycle that starts in state 1 and ends when
    entering state 1 again
  - N^π be the expected number of visited states between two visits of state 1
- If for all policies C^π is finite and the N^π are uniformly bounded, then
  the Bellman equations have an optimal solution

Solution often via simulation.
Finite horizon but infinite state spaces
(not much known from an algorithmic perspective!)

Assumptions:
- S_0 = {i | p_0(i) > 0} with |S_0| < ∞
- Transition rates are bounded by α < ∞

Define
- S_k = { j | (Σ_{d∈D} P^d)^k (i,j) > 0 for some i ∈ S_0 }
- S_T(ε) = ∪_{k=0,…,K} S_k where K = min{ K | Σ_{k=1,…,K} γ(αT, k) ≥ 1−ε }
Algorithm for approximating/bounding the accumulated reward in [0,T]

1. Define some finite subset Ŝ of the countable state space S
   (e.g., using S_T(ε) for some appropriate ε)
2. Define a new CTMDP with state space Ŝ ∪ {0}:

Matrices:

$$\hat Q^d(i,j) = \begin{cases}
Q^d(i,j) & \text{if } i, j \ne 0\\
0 & \text{if } i = 0\\
\sum_{h\notin \hat S} Q^d(i,h) & \text{if } i \ne 0,\ j = 0
\end{cases}$$

Vectors:

$$\hat w^d_\pm(i) = \begin{cases}
w^d(i) & \text{if } i \ne 0\\
\max/\min_{j\in S,\, u\in D_j}\left( w^u(j) \right) & \text{if } i = 0
\end{cases}
\qquad
\hat g^\pm_T(i) = \begin{cases}
g^\pm_T(i) & \text{if } i \ne 0\\
\max/\min_{j\in S}\left( g^\pm_T(j) \right) & \text{if } i = 0
\end{cases}$$

3. Solve the resulting CTMDP to obtain bounds for the original one.
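A sketch of step 2 for a sparsely stored model (rates[i][u] maps successor states to rates); only one of the two bound variants is produced, selected via the w_bound argument, and all names are illustrative:

```python
import numpy as np

def truncate_ctmdp(rates, rewards, S_hat, w_bound):
    """Build the truncated CTMDP on S_hat plus an extra lump state 0.
    w_bound is the max (upper bound) or min (lower bound) reward over all states."""
    index = {s: k + 1 for k, s in enumerate(S_hat)}        # position 0 is the lump state
    n = len(S_hat) + 1
    Q = {i: {} for i in range(n)}
    w = {i: {} for i in range(n)}
    Q[0][0] = np.zeros(n)                                  # state 0 is absorbing
    w[0][0] = w_bound
    for s in S_hat:
        i = index[s]
        for u, row in rates[s].items():
            q = np.zeros(n)
            for succ, rate in row.items():
                q[index.get(succ, 0)] += rate              # rates leaving S_hat go to state 0
            q[i] -= q.sum()                                # diagonal = negative exit rate
            Q[i][u] = q
            w[i][u] = rewards[s][u]
    return Q, w
```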
Bounds on the transition rates

- Transition rates Q^d(i,j) and reward vectors w^d are not exactly known,
  but we know L^d(i,j) ≤ Q^d(i,j) ≤ U^d(i,j) and l^d ≤ w^d ≤ u^d
  (sometimes known as Bounded Parameter MDPs, see e.g. Givan et al. 2000);
  a realistic model if parameters result from measurements
- Goal: Find a policy that maximizes the minimal/maximal gain over an
  infinite or finite interval
Infinite horizon case:

We assume that the CTMDP is unichain for all L^d.

- Uniformization can be used to transform the CTMDP into an equivalent
  DTMDP (as we did before)
- For a fixed decision vector d the minimal/maximal average reward is
  obtained by a matrix P ∈ [L,U] where in every row all except one
  element are equal to the corresponding element in matrix U or L
  - only finitely many possibilities exist
  - determination of the bounds is again an MDP problem
Infinite horizon case (continued):

- The overall solution is the combination of two nested MDP problems
  (a Markov two person game)
- Some solution algorithms exist (stochastic min/max control) but
  advanced numerical techniques are rarely used
  (room for improvements remains!)

Finite horizon case:

- Inherently complex; to the best of my knowledge almost no results
  (even for the simpler DTMDP case!)
Thank you!
Bibliography
(incomplete and biased selection, but there is too much to be exhaustive!)

Textbooks:
- D. P. Bertsekas. Dynamic Programming and Optimal Control, vol. I & II, 2nd ed. Athena Scientific, 2005, 2007.
- Q. Hu, W. Yue. Markov Decision Processes with their Applications. Springer, 2008.
- X. Guo, O. Hernandez-Lerma. Continuous-Time Markov Decision Processes: Theory and Applications. Springer, 2009.
- M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
Papers:
- R. F. Serfozo. An Equivalence Between Continuous and Discrete Markov Decision Processes. Operations Research 37 (3), 1979, 616-620.
- B. L. Miller. Finite State Continuous Time Markov Decision Processes with a Finite Planning Horizon. SIAM Journal on Control 6 (2), 1968, 266-280.
- M. R. Lembersky. On Maximal Rewards and ε-Optimal Policies in Continuous Time Markov Decision Processes. The Annals of Statistics 2 (1), 1974, 159-169.
- S. A. Lippman. Countable-State, Continuous-Time Dynamic Programming with Structure. Operations Research 24 (3), 1976, 477-490.
- R. Givan, S. Leach, T. Dean. Bounded-parameter Markov decision processes. Artificial Intelligence 122 (1/2), 2000, 71-109.
Own papers:
- P. Buchholz, I. Schulz. Numerical Analysis of Continuous Time Markov Decision Processes over Finite Horizons. Computers & OR 38, 2011, 651-659.
- P. Buchholz, E. M. Hahn, H. Hermanns, L. Zhang. Model Checking of CTMDPs. Proc. CAV 2011.
- P. Buchholz. Bounding Reward Measures of Markov Models using Markov Decision Processes. To appear in Numerical Linear Algebra with Applications.