chap8_Mar19 - CSE

Chapter 8
Continuous Time Markov Chains
Definition
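• A discrete-state continuous-time stochastic process {X(t) | t ≥ 0}
is called a Markov chain if,
for t0 < t1 < t2 < … < tn < t, the conditional pmf
satisfies the relation
$P(X(t) = x \mid X(t_n) = x_n, \ldots, X(t_0) = x_0) = P(X(t) = x \mid X(t_n) = x_n)$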
• A CTMC is characterized by state changes that
can occur at any arbitrary time
• Index space is continuous.
• The state space is discrete valued.
Continuous Time Markov Chain
(CTMC)
• A CTMC can be completely described by:
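– Initial state probability vector for X(t0): $\pi_k(t_0) = P(X(t_0) = k)$, k = 0, 1, 2, …
– Transition probabilities: $p_{ij}(v,t) = P(X(t) = j \mid X(v) = i)$, 0 ≤ v ≤ t
– Also, $\pi_j(t) = P(X(t) = j) = \sum_i p_{ij}(v,t)\,\pi_i(v)$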
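Homogeneous CTMCs
• {X(t) | t ≥ 0} is a time-homogeneous CTMC iff $p_{ij}(v,t)$ depends only on the difference t − v.
Or, the conditional pmf satisfies: $p_{ij}(t) = P(X(t+u) = j \mid X(u) = i)$ for all u ≥ 0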
• A CTMC is said to be irreducible if every state can
be reached from every other state, with a non-zero
probability.
• A state is said to be absorbing if no other state can
be reached from it with non-zero probability.
CTMC Chapman-Kolmogorov
Equation
• It can also be written as :
• In matrix form, $\frac{d\pi(t)}{dt} = \pi(t)\,Q$, where the matrix Q is called the infinitesimal
generator matrix (or simply the generator matrix).
CTMC Steady-state Solution
• Steady-state solution of a CTMC: $\pi Q = 0$, $\sum_j \pi_j = 1$
• Irreducible CTMCs having positive steady-state values {πj}
are called recurrent non-null.
• Performance measures may be computed by
assigning reward rates to states and computing
expected steady state reward rates
• Accumulated reward (over an interval of time)
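The steady-state computation and the reward-rate idea can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `steady_state` and `expected_reward_rate` are hypothetical, the generator matrix Q is assumed to be a small dense list of lists for an irreducible chain, and the linear system πQ = 0 is solved by replacing one (redundant) balance equation with the normalization condition.

```python
def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 for a small irreducible CTMC."""
    n = len(Q)
    # Transpose so that row j is the balance equation sum_i pi_i * Q[i][j] = 0.
    A = [[float(Q[i][j]) for i in range(n)] for j in range(n)]
    b = [0.0] * n
    # The balance equations are linearly dependent, so the last one is
    # replaced by the normalization condition sum(pi) = 1.
    A[n - 1] = [1.0] * n
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi


def expected_reward_rate(pi, rewards):
    """Expected steady-state reward rate: sum_j r_j * pi_j."""
    return sum(p * r for p, r in zip(pi, rewards))


# Example (assumed rates): state 0 = down, state 1 = up.
lam, mu = 0.001, 0.1
pi = steady_state([[-mu, mu], [lam, -lam]])
availability = expected_reward_rate(pi, [0, 1])  # reward 1 in the up state
```

Assigning reward 1 to up states and 0 to down states turns the expected reward rate into steady-state availability; other reward vectors give other measures from the same π.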
Continuous Time Birth-Death
Process
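• The CTMC {X(t) | t ≥ 0} with state space {0,1,2,…} forms a
B-D process if there exist rates λi, i = 0,1,2,… and μi, i = 1,2,…
such that the only transitions are i → i+1 at rate λi and i → i−1 at rate μi,
where λi is the birth rate (≥ 0) and μi is the death rate (≥ 0).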
Continuous Time Birth-Death Process (contd.)
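In steady state, rate of flow into each state = rate of flow out of it.
Steady State Equations
$\lambda_0\pi_0 = \mu_1\pi_1$
$(\lambda_k + \mu_k)\pi_k = \lambda_{k-1}\pi_{k-1} + \mu_{k+1}\pi_{k+1}, \quad k \ge 1$
These are called balance eqs. Re-arranging above,
$\lambda_k\pi_k - \mu_{k+1}\pi_{k+1} = \lambda_{k-1}\pi_{k-1} - \mu_k\pi_k = \cdots = \lambda_0\pi_0 - \mu_1\pi_1 = 0$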
M/M/1 Queue
• Arrivals form a Poisson process with rate λ, i.e., interarrival times are i.i.d. EXP(λ).
• Service times are i.i.d. EXP(μ).
• N(t): birth-death proc., λk=λ; μk=μ.
• Define, ρ=λ/μ (traffic intensity, in Erlangs)
M/M/1 queue (contd.)
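• From the balance flow equations, we get $\pi_k = (1-\rho)\rho^k$, k = 0, 1, 2, …
• ρ < 1 (for reasons of stability).
• Expected # of customers, $E[N] = \sum_{k=0}^{\infty} k\,\pi_k = \frac{\rho}{1-\rho}$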
M/M/1 queue (contd.)
• This measure can be viewed as a weighted
average, $E[N] = \sum_k r_k\,\pi_k$ with weights $r_k = k$.
• By choosing suitable weights for the states of a
CTMC, we can get most measures of interest, and
the resulting model is known as the MRM (Markov
Reward Model).
• Other measures:
– Average queue length (E[n])
– Average (expected) response time
– Average (expected) wait time etc.
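The M/M/1 measures listed above can be computed directly from λ and μ. The following is a minimal sketch using the standard M/M/1 steady-state results (π_k = (1−ρ)ρ^k, E[N] = ρ/(1−ρ)) together with Little's law; the function name `mm1_measures` and the dictionary keys are hypothetical.

```python
def mm1_measures(lam, mu):
    """Steady-state measures of an M/M/1 queue (requires lam < mu)."""
    if lam <= 0 or mu <= 0:
        raise ValueError("rates must be positive")
    rho = lam / mu                          # traffic intensity
    if rho >= 1:
        raise ValueError("queue is unstable: rho must be < 1")
    mean_jobs = rho / (1 - rho)             # E[N]
    mean_response = mean_jobs / lam         # E[R] from Little's law
    mean_wait = mean_response - 1 / mu      # E[W] = E[R] - E[S]
    return {
        "rho": rho,
        "mean_jobs": mean_jobs,
        "mean_response": mean_response,
        "mean_wait": mean_wait,
    }


# Example with assumed rates: 2 arrivals and 4 service completions per unit time.
measures = mm1_measures(2.0, 4.0)
```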
M/M/1 queue: Little’s formula
• Let the random variable R denote the response time
(defined as the time elapsed from the instant of job arrival until its
completion)
Little’s law states
E[R] = E[N]/λ
• Here E[N] = ρ/(1−ρ), so E[R] = 1/(μ(1−ρ))
• Response time (R) = wait time (W) + service time
(S)
E[W] = E[R] − E[S] = 1/(μ(1−ρ)) − 1/μ = ρ/(μ(1−ρ)).
Response time distribution (tagged job
approach)
• Assuming FCFS and steady-state conditions
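– If there are already N = n jobs in the system, the arriving
(n+1)st job will experience a response time R =
S* + S’1 + S2 + … + Sn
– S*: service time for the (n+1)st job; S’1: residual
service time for the job currently undergoing service (#1);
S2, …, Sn: service times of the jobs waiting in the queue.
– Because of the memoryless property, these times are all
EXP(μ).
– Hence, for N = n, the LST of R is
$L_{R|N}(s \mid n) = \left(\frac{\mu}{s+\mu}\right)^{n+1}$
– Therefore,
$L_R(s) = \sum_{n=0}^{\infty} (1-\rho)\rho^n \left(\frac{\mu}{s+\mu}\right)^{n+1} = \frac{\mu(1-\rho)}{s+\mu(1-\rho)}$,
i.e., R is EXP(μ(1−ρ)).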
M/M/m queue
• m-servers service the queue.
[Figure: Poisson arrivals (rate λ) join a single queue served by m servers, numbered 1 through m, each with service rate μ.]
M/M/m Queue Solution
M/M/m Queue performance measures
• Average queue length E[N]: rk = k
M/M/m Queue performance measures
• Server utilization: let the random variable M denote the number of busy servers.
With k customers in the system, M = k for 0 <= k <= m, and
M = m for k > m.
• An arriving customer has to join the queue when all m servers are busy.
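The M/M/m measures can be sketched numerically with the standard textbook steady-state solution (the slides' own solution formulas did not survive extraction, so the formulas below are an assumption about what they showed). The function name `mmm_measures` and the returned keys are hypothetical; `a = λ/μ` is the offered load and `ρ = λ/(mμ)` the per-server utilization.

```python
from math import factorial


def mmm_measures(lam, mu, m):
    """Steady-state measures of an M/M/m queue (requires lam < m * mu)."""
    a = lam / mu                 # offered load in Erlangs
    rho = a / m                  # per-server utilization
    if rho >= 1:
        raise ValueError("queue is unstable: lam must be < m * mu")
    # Unnormalized probability mass of all states with k >= m customers.
    tail = a ** m / (factorial(m) * (1 - rho))
    pi0 = 1.0 / (sum(a ** k / factorial(k) for k in range(m)) + tail)
    p_queue = tail * pi0         # probability an arrival must join the queue
    mean_busy = a                # E[M], expected number of busy servers
    mean_jobs = a + p_queue * rho / (1 - rho)   # E[N]
    return {
        "pi0": pi0,
        "p_queue": p_queue,
        "mean_busy_servers": mean_busy,
        "utilization": rho,
        "mean_jobs": mean_jobs,
        "mean_response": mean_jobs / lam,       # Little's law
    }


# Example with assumed rates: two servers, lam = 1, mu = 1.
two_servers = mmm_measures(1.0, 1.0, 2)
```

With m = 1 the expressions reduce to the M/M/1 results, which is a convenient sanity check.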
Poisson stream behavior
• M/M/m: input/output both form Poisson streams.
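• m=2 case
– Case 1: Two independent queues: two separate Poisson streams, each of rate λ/2,
feed 2 separate M/M/1 queues, each with service rate μ.
– Case 2: M/M/2 case: the two Poisson streams of rate λ/2 combine into a single
Poisson stream of rate λ, which feeds a common queue served by servers 1 and 2,
each with service rate μ.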
Comparative performance
• Case 1: For each M/M/1 queue,
• Case 2: Common queue M/M/2
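The comparison can be worked out from the standard M/M/1 and M/M/2 mean response times. The formulas below are textbook results rather than text recovered from the slides, so treat them as an assumed reconstruction; λ is the total arrival rate and μ the service rate of each server.

```latex
\begin{align*}
\text{Case 1 (two M/M/1 queues, each with arrival rate } \lambda/2\text{):}\quad
  E[R_1] &= \frac{1}{\mu - \lambda/2} = \frac{2}{2\mu - \lambda} \\
\text{Case 2 (common queue, M/M/2, } \rho = \lambda/(2\mu)\text{):}\quad
  E[R_2] &= \frac{1}{\mu\,(1-\rho^2)} = \frac{4\mu}{4\mu^2 - \lambda^2} \\
\frac{E[R_1]}{E[R_2]} &= \frac{2}{2\mu-\lambda}\cdot\frac{(2\mu-\lambda)(2\mu+\lambda)}{4\mu}
  = 1 + \frac{\lambda}{2\mu} > 1
\end{align*}
```

Under these assumptions the common queue always gives the smaller mean response time, and the advantage grows with the load λ/(2μ).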
M/M/1/n Queue
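• Finite queue size, finite buffer space ⇒
finite state space.
[Figure: Poisson stream of rate λ arriving at a single server of rate μ with maximum capacity n; arrivals exceeding capacity are rejected. State diagram: states 0, 1, 2, …, n with birth rate λ and death rate μ between adjacent states.]
• Steady State Solution: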
M/M/1/n Queue Performance Measures
• Mean queue length (expected # of jobs in
the system).
– rk = k,
• Loss probability
– rn = 1, rk = 0, k=0,1,..,n-1
• Throughput
– rk = μ, k=1,2,..,n; r0 = 0 (or, rk = λ, k=0,1,2,
..,n-1; rn = 0)
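The reward assignments above can be checked numerically. This sketch assumes the standard M/M/1/n steady-state solution, π_k proportional to ρ^k for k = 0..n, and evaluates each measure as a reward-weighted sum; the function name `mm1n_measures` and the returned keys are hypothetical.

```python
def mm1n_measures(lam, mu, n):
    """Steady-state measures of an M/M/1/n queue via reward rates."""
    rho = lam / mu
    # Unnormalized steady-state weights rho^k, k = 0..n (valid for any rho > 0).
    weights = [rho ** k for k in range(n + 1)]
    total = sum(weights)
    pi = [w / total for w in weights]

    def reward(rates):
        # Expected steady-state reward rate: sum_k r_k * pi_k.
        return sum(r * p for r, p in zip(rates, pi))

    mean_jobs = reward(list(range(n + 1)))            # r_k = k
    loss_prob = reward([0] * n + [1])                 # r_n = 1, else 0
    throughput = reward([0] + [mu] * n)               # r_k = mu for k >= 1
    throughput_alt = reward([lam] * n + [0])          # r_k = lam for k < n
    return {
        "pi": pi,
        "mean_jobs": mean_jobs,
        "loss_prob": loss_prob,
        "throughput": throughput,
        "throughput_alt": throughput_alt,
    }


# Example with assumed values: lam = mu = 1 and capacity n = 4.
result = mm1n_measures(1.0, 1.0, 4)
```

The two throughput reward assignments agree because, in steady state, the rate of accepted arrivals equals the rate of departures.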
M/M/1/n: Response time distribution
• Response time distribution: Job may be rejected (or
accepted)
– Unconditional
– Conditional (conditioned on the job being accepted):
• Reward assignment: for the kth state, response time
experienced by the tagged task is sum of k-service times,
each of which is EXP(μ), i.e., k-stage Erlang.
– Unconditional
– Conditional
Special cases of Birth-Death Process
• Pure birth processes
– Poisson process
– Software Reliability Growth Model: NHPP
• Number of software failures occurring in (0, t] is N(t), and
N(t) is Poisson with $\lambda(t) = abe^{-bt}$ and $m(t) = E[N(t)] = a(1 - e^{-bt})$
• Instantaneous failure intensity, $\lambda(t) = b[a - m(t)]$
– Transient solution may be found using Laplace
transforms
• Pure death processes
– No-repairs
Markov Availability Model
2-State Markov Availability Model
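[Figure: state diagram with states 1 (UP) and 0 (DN); failure transition 1 → 0 at rate λ, repair transition 0 → 1 at rate μ; 1/λ = MTTF, 1/μ = MTTR.]
1) Steady-state balance equations for each state:
– Rate of flow IN = rate of flow OUT
• State 1: $\mu\pi_0 = \lambda\pi_1$
• State 0: $\lambda\pi_1 = \mu\pi_0$
2 unknowns, 2 equations, but there is only one independent
equation.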
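2-State Markov Availability Model
(Continued)
Need an additional equation: $\pi_0 + \pi_1 = 1$
$\pi_1 + \frac{\lambda}{\mu}\pi_1 = 1 \Rightarrow \pi_1 = \frac{1}{1 + \lambda/\mu} = \frac{\mu}{\lambda+\mu}$
$\pi_1 = A_{ss} = \frac{\mu}{\lambda+\mu} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{MTTF}{MTTF + MTTR}$
$1 - A_{ss} = \frac{\lambda}{\lambda+\mu} = \frac{1/\mu}{1/\lambda + 1/\mu} = \frac{MTTR}{MTTF + MTTR}$
Downtime in minutes per year = $\frac{MTTR}{MTTF + MTTR} \times 8760 \times 60$
$A_{ss} = 0.99999 \Rightarrow 1 - A_{ss} = 10^{-5} \Rightarrow DTMY = 5.256$ min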
2-State Markov Availability Model
(Continued)
2) Transient Availability
for each state:
– Rate of buildup = rate of flow IN - rate of flow OUT
$\frac{dP_1}{dt} = \mu P_0(t) - \lambda P_1(t)$
since $P_0(t) + P_1(t) = 1$ we have
$\frac{dP_1}{dt} = \mu(1 - P_1(t)) - \lambda P_1(t)$
$\frac{dP_1}{dt} + (\mu + \lambda)P_1(t) = \mu$
Assuming P1(0) = 1, this equation can be solved to obtain
$P_1(t) = A(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}e^{-(\lambda+\mu)t}$
2-State Markov Availability Model
(Continued)
3) $R(t) = e^{-\lambda t}$
4) Steady State Availability:
$\lim_{t \to \infty} A(t) = A_{ss} = \frac{\mu}{\lambda+\mu}$
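As a cross-check on the 2-state model, the transient availability can be evaluated both from the closed form A(t) = μ/(λ+μ) + (λ/(λ+μ))e^{−(λ+μ)t} and by numerically integrating dP1/dt = μ(1 − P1) − λP1 with P1(0) = 1. This is a rough sketch only: the function names are hypothetical, the rates in the example are assumed values, and forward Euler is chosen for simplicity rather than accuracy.

```python
import math


def availability_closed_form(lam, mu, t):
    """A(t) for the 2-state model, starting in the UP state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def availability_euler(lam, mu, t_end, steps=20000):
    """Forward-Euler integration of dP1/dt = mu*(1 - P1) - lam*P1, P1(0) = 1."""
    p1 = 1.0
    dt = t_end / steps
    for _ in range(steps):
        p1 += dt * (mu * (1.0 - p1) - lam * p1)
    return p1


def steady_state_availability(lam, mu):
    """Limit of A(t) as t grows: mu / (lam + mu) = MTTF / (MTTF + MTTR)."""
    return mu / (lam + mu)


# Example with assumed rates: failure rate 0.01, repair rate 1.0 (per hour).
a_exact = availability_closed_form(0.01, 1.0, 5.0)
a_numeric = availability_euler(0.01, 1.0, 5.0)
```

The two values should agree closely, and both approach the steady-state availability as t grows.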
Using SHARPE to Solve the models
Markov availability model
• Assume we have a two-component parallel
redundant system with repair rate μ.
• Assume that the failure rate of both the components
is λ.
• When both the components have failed, the system
is considered to have failed.
Markov availability model
(Continued)
• Let the number of properly functioning
components be the state of the system.
The state space is {0,1,2} where 0 is the
system down state.
• We wish to examine effects of shared vs.
non-shared repair.
Markov availability model
(Continued)
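[Figure: two state diagrams over states 2, 1, 0. Both have failure transitions 2 → 1 at rate 2λ and 1 → 0 at rate λ. Non-shared (independent) repair: 1 → 2 at rate μ and 0 → 1 at rate 2μ. Shared repair: 1 → 2 at rate μ and 0 → 1 at rate μ.]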
Markov availability model
(Continued)
• Note: Non-shared case can be modeled & solved
using a RBD or a FTREE but shared case needs the
use of Markov chains.
Steady-state balance equations
• For any state:
Rate of flow in = Rate of flow out
Consider the shared case
$2\lambda\pi_2 = \mu\pi_1$
$(\lambda + \mu)\pi_1 = 2\lambda\pi_2 + \mu\pi_0$
$\lambda\pi_1 = \mu\pi_0$
πi: steady state probability that system is in state i
Steady-state balance equations
(Continued)
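• Hence $\pi_2 = \frac{\mu}{2\lambda}\pi_1$, $\pi_1 = \frac{\mu}{\lambda}\pi_0$
Since $\pi_0 + \pi_1 + \pi_2 = 1$
We have $\pi_0 + \frac{\mu}{\lambda}\pi_0 + \frac{\mu}{\lambda}\cdot\frac{\mu}{2\lambda}\pi_0 = 1$
or $\pi_0 = \frac{1}{1 + \frac{\mu}{\lambda} + \frac{\mu^2}{2\lambda^2}}$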
Steady-state balance equations
(Continued)
• Steady-state unavailability = π0 = 1 − A_shared
Similarly for non-shared case,
steady-state unavailability = 1 − A_non-shared
$1 - A_{non\text{-}shared} = \frac{1}{1 + \frac{2\mu}{\lambda} + \frac{\mu^2}{\lambda^2}}$
• Downtime in minutes per year = (1 - A)* 8760*60
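The effect of shared versus non-shared repair can be compared numerically. The sketch below uses the closed forms that follow from the balance equations of the two models, π0 = 1/(1 + μ/λ + μ²/(2λ²)) for shared repair and π0 = 1/(1 + 2μ/λ + μ²/λ²) for non-shared repair; the function names and the example rates are hypothetical.

```python
def unavailability_shared(lam, mu):
    """Steady-state pi_0 with a single shared repair facility."""
    r = mu / lam
    return 1.0 / (1.0 + r + r * r / 2.0)


def unavailability_nonshared(lam, mu):
    """Steady-state pi_0 with independent repair of each component."""
    r = mu / lam
    return 1.0 / (1.0 + 2.0 * r + r * r)


def downtime_minutes_per_year(unavailability):
    """Downtime in minutes per year = (1 - A) * 8760 * 60."""
    return unavailability * 8760 * 60


# Example with assumed rates: one failure per 1000 hours, repair rate 0.5 per hour.
lam, mu = 0.001, 0.5
dt_shared = downtime_minutes_per_year(unavailability_shared(lam, mu))
dt_nonshared = downtime_minutes_per_year(unavailability_nonshared(lam, mu))
```

For any positive rates the non-shared denominator is larger, so independent repair always yields the lower unavailability.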
Steady-state balance equations
Homework
• Return to the 2 control and 3 voice channels example and
assume that the control channel failure rate is λc, voice channel
failure rate is λv.
• Repair rates are μc and μv, respectively. Assuming a single
shared repair facility and control channel having preemptive
repair priority over voice channels, draw the state diagram of a
Markov availability model. Using SHARPE GUI, solve the
Markov chain for steady-state and instantaneous availability.
Markov Reliability Model
Markov reliability model
with repair
• Consider the 2-component parallel system but disallow
repair from system down state
• Note that state 0 is now an absorbing state. The state
diagram is given in the following figure.
• This reliability model with repair cannot be modeled using
a reliability block diagram or a fault tree. We need to
resort to Markov chains. (This is a form of dependency
since in order to repair a component you need to know the
status of the other component).
Markov reliability model
with repair (Continued)
Absorbing state
• Markov chain has an absorbing state. In the
steady-state, system will be in state 0 with
probability 1. Hence transient analysis is of
interest. States 1 and 2 are transient states.
Markov reliability model
with repair (Continued)
Assume that the initial state of the Markov chain
is 2, that is, P2(0) = 1, Pk (0) = 0 for k = 0, 1.
Then the system of differential Equations is written
based on:
rate of buildup = rate of flow in - rate of flow out
for each state
Markov reliability model
with repair (Continued)
$\frac{dP_2(t)}{dt} = -2\lambda P_2(t) + \mu P_1(t)$
$\frac{dP_1(t)}{dt} = 2\lambda P_2(t) - (\lambda + \mu)P_1(t)$
$\frac{dP_0(t)}{dt} = \lambda P_1(t)$
Markov reliability model
with repair (Continued)
After solving these equations, we get
R(t) = P2(t) + P1(t)
Recalling that $MTTF = \int_0^\infty R(t)\,dt$, we get:
$MTTF = \frac{3}{2\lambda} + \frac{\mu}{2\lambda^2}$
Markov reliability model
with repair (Continued)
Note that the MTTF of the two-component
parallel redundant system, in the absence
of a repair facility (i.e., μ = 0), would have
been equal to the first term,
3/(2λ), in the above expression.
Therefore, the effect of a repair facility is to
increase the mean life by μ/(2λ²), or by a
factor
$1 + \frac{\mu/(2\lambda^2)}{3/(2\lambda)} = 1 + \frac{\mu}{3\lambda}$
Markov Reliability Model
with Imperfect Coverage
Markov model
with imperfect coverage
Next consider a modification of the above
example proposed by Arnold as a model of
duplex processors of an electronic
switching system. We assume that not all
faults are recoverable and that c is the
coverage factor which denotes the
conditional probability that the system
recovers given that a fault has occurred.
The state diagram is now given by the
following picture:
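Now allow for imperfect coverage
[Figure: state diagram of the 2-component model with coverage factor c. From state 2, a covered fault (rate 2λc) leads to state 1 and an uncovered fault (rate 2λ(1−c)) leads directly to state 0; from state 1, a failure (rate λ) leads to state 0 and a repair (rate μ) leads back to state 2.]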
Markov model
with imperfect coverage
(Continued)
Assume that the initial state is 2 so that:
$P_2(0) = 1, \quad P_0(0) = P_1(0) = 0$
Then the system of differential equations is:
$\frac{dP_2(t)}{dt} = -2\lambda c P_2(t) - 2\lambda(1-c)P_2(t) + \mu P_1(t)$
$\frac{dP_1(t)}{dt} = 2\lambda c P_2(t) - (\lambda + \mu)P_1(t)$
$\frac{dP_0(t)}{dt} = 2\lambda(1-c)P_2(t) + \lambda P_1(t)$
Markov model
with imperfect coverage
(Continued)
After solving the differential equations we obtain:
R(t) = P2(t) + P1(t)
From R(t), we can obtain the system MTTF:
$MTTF = \frac{\lambda(1 + 2c) + \mu}{2\lambda[\lambda + \mu(1-c)]}$
It should be clear that the system MTTF and system reliability are
critically dependent on the coverage factor.
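The sensitivity to coverage can be explored with a short calculation. The sketch below evaluates MTTF = [λ(1+2c) + μ] / (2λ[λ + μ(1−c)]) and, as an independent check, recomputes it from first-step (mean time to absorption) equations for the same chain: T2 = 1/(2λ) + c·T1 and T1 = 1/(λ+μ) + (μ/(λ+μ))·T2. The function names and the example rates are hypothetical.

```python
def mttf_closed_form(lam, mu, c=1.0):
    """MTTF of the 2-component parallel system with repair and coverage c."""
    return (lam * (1.0 + 2.0 * c) + mu) / (2.0 * lam * (lam + mu * (1.0 - c)))


def mttf_first_step(lam, mu, c=1.0):
    """Same MTTF from mean-time-to-absorption equations, starting in state 2.

    T2 = 1/(2*lam) + c * T1
    T1 = 1/(lam + mu) + (mu / (lam + mu)) * T2
    """
    out1 = lam + mu
    # Substitute T1 into the T2 equation and solve for T2.
    return (1.0 / (2.0 * lam) + c / out1) / (1.0 - c * mu / out1)


# Example with assumed rates: lam = 0.001 per hour, mu = 0.1 per hour.
for coverage in (1.0, 0.99, 0.9):
    print(coverage, mttf_closed_form(0.001, 0.1, coverage))
```

With c = 1 the expression reduces to 3/(2λ) + μ/(2λ²), and with c = 0 it reduces to 1/(2λ), since the first fault then always brings the system down.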
2-component Availability model with
detection delay
• 2-component availability model
– Steady state availability Ass = 1-π0
• The failure detection stage takes a random time, EXP(δ)
– Down states are ‘0’ and ‘1D’ ⇒ Ass = 1 − π0 − π1D
Therefore, steady state unavailability U(δ) is given by
2-component availability model
with finite coverage
• Coverage factor = c (probability that the
fault is covered)
• ‘1C’ state is a re-boot (down) state.
2-components availability model
: delay+finite coverage
• Model has detection delay+coverage factor
• Down states are ‘0’, ‘1C’ and ‘1D’.
Preventive Maintenance example
• Prolonged usage of a component may lead to
increased failure rate (i.e. IFR situation)
• Hence, life time may be modeled as HypoEXP()
distribution, say 2-stage Hypo.
• Component is inspected randomly. The time between
inspections is random, following EXP(λi).
Inspection completion time is EXP(μi).
• What does inspection do?
– First stage of life – no action
– Second stage of life – repair
• That is, preventive maintenance
• State = <#stage, faulty>
[Figure: state diagram over states <0,0>, <1,0>, <0,1>, <1,1>, and <1,2>, with transition rates involving λ1, λ2, λi, μi, and μ.]
Performance Models
• Example: 2-servers with different service times.
– State = <n1, n2>
• Performance: Average no. of jobs in the system, E[n1+n2]
– Reward rate rn1, n2 = n1+n2
– Except for the <0,0>, in all other states, viz., <k,0> and
<k,1>, there are k jobs in the system.
SOURCES OF COVERAGE
DATA
• Measurement Data from an Operational system:
Large amount of data needed;
Improved Instrumentation Needed
• Fault/Error Injection Experiments
Costly yet badly needed: tools from
CMU, Illinois, Toulouse
SOURCES OF COVERAGE
DATA (Continued)
• A Fault/Error Handling Submodel
Phases of FEHM:
Detection, Location, Retry, Reconfig, Reboot
Estimate Duration & Prob. of success of each
phase
IBM(EDFI), HARP(FEHM), Draper(FDIR)
Homework 6:
Modify the Markov model with imperfect coverage to
allow for finite time to detect as well as imperfect
detection. You will need to add an extra state, say D.
The rate at which detection occurs is δ. Draw the state
diagram and using SHARPE GUI investigate the effects
of detection delay on system reliability and mean time to
failure.