SENSITIVITY ANALYSIS OF RELIABILITY AND PERFORMABILITY MEASURES FOR MULTIPROCESSOR SYSTEMS *

James T. Blake, Andrew L. Reibman†, and Kishor S. Trivedi

Department of Computer Science
Duke University
Durham, North Carolina 27706

* This work was supported in part by the Air Force Office of Scientific Research under grant AFOSR-84-0132.
† Now with AT&T Bell Laboratories, Holmdel, NJ 07733.

Abstract

Traditional evaluation techniques for multiprocessor systems use Markov chains and Markov reward models to compute measures such as mean time to failure, reliability, performance, and performability. In this paper, we discuss the extension of Markov models to include parametric sensitivity analysis. Using such analysis, we can guide system optimization, identify parts of a system model sensitive to error, and find system reliability and performability bottlenecks.

As an example, we consider three models of a 16-processor, 16-memory system. A network provides communication between the processors and the memories. Two crossbar-network models and the Omega network are considered. For these models, we examine the sensitivity of the mean time to failure, unreliability, and performability to changes in component failure rates. We use the sensitivities to identify bottlenecks in the three system models.

Index terms: Multistage interconnection networks, Omega networks, performability, reliability, sensitivity analysis.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1988 ACM 0-89791-254-3/88/0005/0177 $1.50

1  Introduction

As the use of multiprocessor systems increases, the reliability and performance characteristics of the various design options for realizing these systems must be carefully analyzed. Several papers examine the reliability [4,17,18] and performability [2,13] of multiprocessor systems. In addition to computing traditional measures of system performance, it is often interesting to determine the performance or reliability "bottleneck" of a system or to optimize system architectures. Towards this end, we discuss parametric sensitivity analysis [5], the computation of derivatives of system measures with respect to various model parameters [6,7,19]. These derivatives can be used to guide system optimization. Parameters with large sensitivities usually deserve close attention in the quest to improve system characteristics. These parameters may also indicate elements of a model that are particularly prone to error [20].

The paper is organized as follows: A brief summary of Markov and Markov reward models for performance and reliability modeling is given in Section 2. Section 3 discusses the computation of parametric sensitivities. Three reliability/performability models for a multiprocessor system (MPS) are given in Section 4. In Section 5, we present numerical results for the parametric sensitivity of these models.

2  Markov Reliability Models

The evolution of a degradable system through various configurations with different sets of operational components can be represented by a discrete-state, continuous-time Markov chain (CTMC), {Z(t), t ≥ 0}, with state space Ω = {1, 2, ..., k}. For each i, j ∈ Ω, let q_ij be the transition rate from state i to state j, and define

    q_ii = - Σ_{j=1, j≠i}^{k} q_ij.
Then, Q = [q_ij] is the k × k transition rate matrix. We let P_i(t) = Prob[Z(t) = i] be the probability that the system is in state i at time t. The transient state probability row-vector P(t) can be computed by solving a matrix differential equation [22],

    P'(t) = P(t)Q.                                        (1)

Methods for computing P(t) are compared in [15].

We can divide the state space into two sets: UP, the set of operational states, and DOWN, the set of failure or down states. If all DOWN states are absorbing (failure) states, we can obtain the system reliability from the state probabilities,

    R(t) = Σ_{i∈UP} P_i(t).

Associated with each state of the CTMC is a reward rate that represents the performance level of the system in that state. The CTMC and the reward rates are combined to form a Markov reward model [8]. Each state represents a different system configuration. Transitions to states with smaller reward rates (lower performance levels) are component failure transitions, and, in repairable systems, transitions to states with higher performance levels are repair transitions.

The choice of a performance measure for determining reward rates is a function of the system to be evaluated. Often a raw measure of system capacity such as the instruction execution rate is useful. For an interconnection network (IN), one appropriate performance measure is bandwidth. At other times, a queueing-theoretic performance model may be used to compute the reward rates. Since the time-scale of the performance-related events (bandwidth) is much faster than the time-scale of the reliability-related events (component failures), steady-state values from performance models are used to specify the performance levels or reward rates for each state.

We let r_i denote the reward rate associated with state i, and call r the reward vector. The reward rate of the system at time t is given by the process X(t) = r_{Z(t)}. The expected reward rate at time t is

    E[X(t)] = Σ_i r_i P_i(t).

This quantity is also called the computation availability [2].

If we let Y(t) be the amount of reward accumulated (the amount of work done) by a system during the interval (0, t), then

    Y(t) = ∫_0^t X(u) du.                                 (2)

Furthermore, if we use bandwidth to construct the reward vector, then from equation (2), Y(t) represents the number of requests that the IN is capable of satisfying by time t. The expected accumulated reward is

    E[Y(t)] = E[∫_0^t X(u) du] = Σ_i r_i L_i(t),          (3)

where we let L_i(t) = ∫_0^t P_i(u) du. The row vector L(t) can be computed by solving the system of differential equations

    L'(t) = L(t)Q + P(0).                                 (4)

Methods of solving this system of equations are discussed in [16].

A special case of the expected accumulated reward is the mean time to failure (MTTF). The MTTF is defined as

    MTTF = ∫_0^∞ R(t) dt.                                 (5)

The MTTF is a special case of E[Y(∞)], with reward rate 0.0 assigned to all DOWN states (which are assumed to be absorbing) and reward rate 1.0 assigned to all UP states. To compute the MTTF, we solve for τ in

    τ Q_U = -P_U(0),                                      (6)

where P_U(0) is the partition of P(0) corresponding to the UP states only. The matrix Q_U is obtained by deleting the rows and columns in Q corresponding to DOWN states. Any linear algebraic system solver can be used to solve this system of equations. Although one might like to use direct methods like Gaussian elimination, for large, sparse models, iterative methods are more practical [21]. The matrix -Q_U is a non-singular, diagonally-dominant M-matrix. Thus, if we use an iterative method such as Gauss-Seidel, SOR, or optimal SOR to solve (6), it is guaranteed to converge to the solution [23]. Then,

    MTTF = Σ_{i∈UP} τ_i.                                  (7)

3  Sensitivity Analysis

The results obtained from a model are sensitive to many factors. For example, the effect of a change in distribution on a stochastic model is often considered. In this paper, we concentrate our attention on parametric sensitivity analysis, a technique to compute the effect of changes in the rate constants of a Markov model on the measures of interest. Parametric sensitivity analysis helps: (1) guide system optimization, (2) find reliability, performance, and performability bottlenecks in the system, and (3) identify the model parameters that could produce significant modeling errors.
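Before turning to the sensitivities themselves, the base computation of equations (6) and (7) is a single linear solve over the UP states. A minimal sketch in Python with NumPy (the three-state chain and its rates are hypothetical, not one of the paper's models):

```python
import numpy as np

# Hypothetical three-state CTMC: state 0 = two units up (failure rate
# 2*lam), state 1 = one unit up (rate lam), state 2 = DOWN (absorbing).
lam = 1e-3                              # assumed per-unit failure rate
Q = np.array([[-2*lam, 2*lam, 0.0],
              [ 0.0,   -lam,  lam],
              [ 0.0,    0.0,  0.0]])
P0 = np.array([1.0, 0.0, 0.0])          # system starts fully operational
up = [0, 1]                             # indices of the UP states

# Equation (6): tau * Q_U = -P_U(0), with DOWN rows and columns deleted.
# Transposing turns the row-vector system into a standard linear solve.
Q_up = Q[np.ix_(up, up)]
tau = np.linalg.solve(Q_up.T, -P0[up])  # tau_i = expected time in state i

mttf = tau.sum()                        # equation (7)
print(mttf)                             # analytically 1/(2*lam) + 1/lam = 1500
```

The same two lines scale to the 170-state and 3970-state chains of Section 5, where sparse iterative solvers such as Gauss-Seidel or SOR replace the dense solve.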
One approach to parametric sensitivity analysis is to use upper and lower bounds on each parameter in the model to compute optimistic and conservative bounds on system reliability [20]. Our approach is to compute the derivative of the measures of interest with respect to the model parameters [6,19]. A bound on the perturbed solution can then be computed with a simple Taylor series approximation.

We assume that the transition rates q_ij are functions of some parameter λ. For a given value of λ, we want to compute the derivative of various measures with respect to λ (e.g., ∂P_i(t)/∂λ). If we let S(t) be the row vector of the sensitivities ∂P_i(t)/∂λ, then from equation (1) we obtain

    S'(t) = S(t)Q + P(t)V,                                (8)

where V is the derivative of Q with respect to λ. Assuming the initial conditions do not depend on λ, we have

    S(0) = ∂P(0)/∂λ = lim_{t→0} ∂P(t)/∂λ = 0.

We can then solve equations (1) and (8) simultaneously using

    [P'(t), S'(t)] = [P(t), S(t)] [ Q  V ]
                                  [ 0  Q ],               (9)

subject to the initial condition

    [P(0), S(0)] = [P(0), 0].                             (10)

Let η be the number of non-zero entries in Q, and let η_v be the number of non-zero entries in V. For acyclic models, an efficient algorithm that requires O(2η + η_v) floating-point operations (FLOPS) is discussed in [12]. For more general models with cycles, we can use an explicit integration technique like Runge-Kutta. The execution time of explicit methods like Runge-Kutta is O((2η + η_v)(q + v)t) FLOPS, where q = max_i |q_ii| and v = max_i |v_ii|. To solve this system with Uniformization [7], we choose q ≥ max_i |q_ii| and Q* = Q/q + I. Then,

    S(t) = d/dλ Σ_{i=0}^∞ P(0)(Q*)^i e^{-qt} (qt)^i / i!
         = Σ_{i=0}^∞ Π'(i) e^{-qt} (qt)^i / i!,           (11)

where

    Π'(i) = ∂/∂λ (Π(i-1) Q*) = Π'(i-1) Q* + Π(i-1) ∂Q*/∂λ,   (12)

and

    Π(i) = Π(i-1) Q*,  Π(0) = P(0).                       (13)

If the CTMC's initial conditions do not depend on λ, then Π'(0) = 0. Also note that ∂Q*/∂λ = V/q. With a sparse matrix implementation, Uniformization requires O((2η + η_v)qt) FLOPS. Both Runge-Kutta's and Uniformization's performance degrades linearly as q (or v) grows. Problems with values of q that are large relative to the length of the solution interval are called stiff. Large values of q (and v) are common in systems with repair or reconfiguration. An attractive alternative for such stiff problems is an implicit integration technique with execution time O(2η + η_v) [15].

We can derive the sensitivity of E[X(t)] from the sensitivities of the state probabilities:

    ∂E[X(t)]/∂λ = Σ_{i∈Ω} (∂r_i/∂λ) P_i(t) + Σ_{i∈Ω} r_i S_i(t).   (14)

Similarly, we can derive the sensitivity of E[Y(t)] by differentiating equation (3):

    ∂E[Y(t)]/∂λ = Σ_i (∂r_i/∂λ) L_i(t) + Σ_i r_i ∫_0^t S_i(u) du.   (15)

As in the instantaneous measures case, methods for computing the cumulative state probability sensitivity vector ∫_0^t S(u) du include numerical integration, the ACE algorithm for acyclic models, and Uniformization.

For the special case of mean time to failure, if we differentiate equation (6), we can solve for s in

    s Q_U = -τ V_U,                                       (16)

where V_U is the partition of V corresponding to the UP states and τ is the solution obtained from equation (6). Then,

    ∂MTTF/∂λ = Σ_{i∈UP} s_i.                              (17)

This linear system can be solved using the same algorithms used to solve equation (6).
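The Uniformization recursion of equations (11)-(13) and the MTTF-sensitivity solve of equations (16)-(17) can be sketched as follows (Python with NumPy; the three-state chain, its rates, and the parameter lam are hypothetical, not one of the paper's models):

```python
import numpy as np

# Uniformization for the state-probability sensitivities, following
# equations (11)-(13), and the MTTF sensitivity via equations (16)-(17).
lam = 1e-3
Q = np.array([[-2*lam, 2*lam, 0.0],      # Q depends linearly on lam here,
              [ 0.0,   -lam,  lam],      # so V = dQ/d(lam) is constant
              [ 0.0,    0.0,  0.0]])
V = np.array([[-2.0, 2.0, 0.0],
              [ 0.0, -1.0, 1.0],
              [ 0.0,  0.0, 0.0]])
P0 = np.array([1.0, 0.0, 0.0])

def transient_sensitivity(Q, V, P0, t, tol=1e-12):
    """Return (P(t), S(t)) with S(t) = dP(t)/d(lambda), V = dQ/d(lambda)."""
    q = max(abs(np.diag(Q)))             # uniformization rate q >= |q_ii|
    Qstar = Q / q + np.eye(len(P0))
    pi = P0.astype(float)                # Pi(0) = P(0)       (equation 13)
    dpi = np.zeros_like(pi)              # Pi'(0) = 0
    P, S = np.zeros_like(pi), np.zeros_like(pi)
    w, i = np.exp(-q * t), 0             # Poisson weight e^{-qt}(qt)^i/i!
    while w > tol or i < q * t:
        P += w * pi
        S += w * dpi
        i += 1
        dpi = dpi @ Qstar + pi @ (V / q) # equation (12); dQ*/d(lam) = V/q
        pi = pi @ Qstar                  # equation (13)
        w *= q * t / i
    return P, S

P_t, S_t = transient_sensitivity(Q, V, P0, t=100.0)

# MTTF sensitivity: tau from equation (6), then s from equation (16);
# dMTTF/d(lam) is the sum of s_i over the UP states (equation 17).
up = [0, 1]
Q_up, V_up = Q[np.ix_(up, up)], V[np.ix_(up, up)]
tau = np.linalg.solve(Q_up.T, -P0[up])
s = np.linalg.solve(Q_up.T, -(tau @ V_up))
print(s.sum())                           # analytically -3/(2*lam**2) = -1.5e6
```

A finite-difference check on P(t) is a cheap way to validate S(t) for small models before trusting it on large ones.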
Having computed the derivative of some measure, say MTTF, with respect to various system parameters λ_i, there are at least three distinct ways to use the results. The first application is to provide error bounds on the solution when given bounds on the input parameters. Assume that each of the parameters λ_i is contained in an uncertainty interval of width Δλ_i. We can then approximately determine an uncertainty interval ΔMTTF,

    ΔMTTF ≈ Σ_i Δλ_i (∂MTTF/∂λ_i).                        (18)

A second use of parametric sensitivities is the identification of portions of a model that need refinement. There is some cost involved in reducing the size of the intervals Δλ_i, since it requires taking additional measurements or performing more detailed analysis. We assume the cost (or time) of a reduction in Δλ_i is proportional to Δλ_i/λ_i and let

    I = argmax_i λ_i |∂MTTF/∂λ_i|,                        (19)

where argmax_i x_i denotes the value of i that maximizes x_i. Then, refining the I-th parameter is the most cost-effective way to improve the accuracy of the model.

A third application of parametric sensitivities is system optimization and bottleneck analysis. Assume that there are N_i copies of component i in the system, and that the failure rate of component i is λ_i. Furthermore, assume the cost of the i-th subsystem is given by some function c_i N_i λ_i^{-α_i}. Define the optimization problem:

    Maximize:    MTTF
    Subject To:  Σ_i c_i N_i λ_i^{-α_i} ≤ COST.           (20)

Using the method of Lagrange multipliers [1], the optimal values of λ_i satisfy:

    (λ_i^{α_i+1} / (c_i N_i α_i)) (∂MTTF/∂λ_i) = constant.   (21)

Let

    I* = argmax_i (λ_i^{α_i+1} / (c_i N_i α_i)) |∂MTTF/∂λ_i|.   (22)

Then, the most cost-effective point to make an incremental investment is in subsystem type I*. In other words, the system bottleneck from the MTTF point of view is subsystem I*. In our numerical examples, we will use this definition of bottleneck. For convenience, we also assume that c_i = α_i = 1 for all i, although other cost functions could be used. At the conclusion of the numerical results section, we compare these results with those obtained using a second scaling approach.

4  Multiprocessor Example

Consider a multiprocessor system (MPS) that consists of 16 processors (Ps), 16 shared memories (Ms), and an interconnection network (IN) that connects the processors to the memories. We consider three different interconnection network models. First, the IN may be modeled as one large crossbar switch, as in the C.mmp system [18] (see Figure 1). We refer to this model as SYS_s.

Figure 1: Multiprocessor System Using a Crossbar Switch as a Single Component Interconnection Network.

A second, more detailed network model, referred to as SYS_d, is shown in Figure 2. Here, the IN is composed of sixteen 1 × 16 demultiplexers and sixteen 16 × 1 multiplexers [18]. In this arrangement, each processor is connected to a demultiplexer and each memory is connected to a multiplexer. Functionally, this is equivalent to the crossbar switch, but it provides additional fault-tolerance. In SYS_s, any switch failure results in the complete disconnection of all processors and memories. In SYS_d, a multiplexer failure disconnects only the associated processor or memory.

Figure 2: Multiprocessor System Using a Crossbar Switch Composed of Multiplexers/Demultiplexers as the Interconnection Network.

In the third model, the IN is an Omega network [11]. This network has two stages and is constructed from eight 4 × 4 switching elements (SEs). This is an economical alternative to a crossbar switch, since the complexity of the crossbar is O(N^2) whereas that of the Omega network is O(N log N) [10]. The MPS using the Omega IN is shown in Figure 3. We refer to this model as SYS_Ω.

Figure 3: Multiprocessor System Using an Omega Network with 4 × 4 Switching Elements as the Interconnection Network.

For each model of the 16 × 16 MPS, we construct a Markov reliability model. For SYS_s, we enumerate all possible system states. For SYS_d, we modify the CTMC used in SYS_s by adjusting the failure rates of the processors and memories to account for their associated demultiplexers/multiplexers. In SYS_Ω, the CTMC must account for the failure behavior of the processors, memories, and the SEs to which they are connected. By exploiting the symmetry in the Omega network, the state description can be accomplished with an 8-tuple. The initial state is (44444444), where position i (1 ≤ i ≤ 4) represents the number of functioning processors connected to an operational SE in position i; positions 5 ≤ i ≤ 8 represent the memories similarly. This Markov chain embodies the concept of bulk failures: for a given i, either a processor or memory may fail, and the value at position i decreases by one, or an SE may fail, and the value at position i becomes zero. For SYS_Ω, state lumping [9] reduces the state space of the CTMC. To extend the conditions for lumpability to Markov reward models, we require that every pair of states u and v in the "lumped state" have identical reward rates (i.e., r_u = r_v).

4.1  Reliability

We use the switch fault model for reliability analysis. The primary assumption in this model is that each component is an atomic structure. Therefore, the failure of any device in a component causes a total failure of the component. For brevity, we also assume that system failure occurs only as a result of the accumulation of component failures (exhaustion of redundancy). This "perfect coverage" assumption can easily be extended to incorporate "imperfect coverage."

Gate count is used to "equalize" the failure rates of the three models of the IN. From [10], an n × n crossbar switch requires 4n(n-1) gates, where n is the number of inputs and outputs. An n × 1 multiplexer requires 2(n-1) gates, where n is the number of inputs to the multiplexer; a demultiplexer also requires 2(n-1) gates by similar reasoning. These gate counts are based on a switching-element construction that utilizes a tree-like arrangement of gates. For the 16 × 16 MPS, there are 960 gates in the simple 16 × 16 crossbar switch, 30 gates in a demultiplexer/multiplexer, and 48 gates in the 4 × 4 SE (assuming the SE uses a crossbar construction). Since the switch-fault model assumption is being used, if δ_s denotes the failure rate of the 16 × 16 crossbar switch, then δ_s/960 is the gate failure rate, δ_d = δ_s/32 is the demultiplexer/multiplexer failure rate, and δ_Ω = δ_s/20 is the 4 × 4 SE failure rate.

4.2  Performance

For each state in the system reliability model, we need to determine an associated reward rate (performance level). We use the average number of busy memories (memory bandwidth) as the reward rate for each system configuration. This is an appropriate choice of performance metric for the MPS, since the efficiency of the system is limited by the ability of the processors to randomly access the available memories.

In SYS_s and SYS_d, the networks are non-blocking, i.e., contention for the memories occurs at the memory ports. In contrast, SYS_Ω has a blocking network, since contention also occurs inside the IN. If two or more processors compete for the same output link of an SE, only one request is successful and the remaining requests are dropped.

In determining the bandwidth of a given configuration of the multiprocessor system, the independent-reference model assumptions stated in [14] for the analysis of circuit-switched networks are used.
Let p_in denote the probability that a processor issues a request during a particular memory request cycle, and let p_out denote the probability that a particular memory receives a request at its input link. Since it is assumed that requests are not buffered in the IN, nor are multiple requests accepted at a memory on any cycle, computation of the memory bandwidth for the MPS is accomplished in a straightforward manner.

Over time, components of the MPS fail, and as a result, the performance of the system decreases. For the crossbar network, implemented as a single element in SYS_s or using multiplexers in SYS_d, we use the model developed by Bhandarkar [3] to obtain the average number of busy memories, or memory bandwidth, for each degraded system configuration. For the Omega network, we use an extension of the performance model in [14].

For the n × n crossbar switch, the probability that a particular processor requests a particular memory in a given network cycle is p_in/n. The probability that a particular memory is selected by at least one processor is

    p_out = 1 - (1 - p_in/n)^n.                           (23)

The bandwidth for the system is just p_out times n; hence

    BW_xbar = n (1 - (1 - p_in/n)^n).                     (24)

In the presence of memory or processor failures, this equation must be modified, since the number of operational memories is not, in general, equal to the number of operational processors. In [3], a detailed combinatorial and Markovian analysis was performed to determine the bandwidth in the asymmetric case. Let i denote the number of operational processors and j denote the number of operational memories. Further, let l = min{i, j} and m = max{i, j}. Then, for p_in = 1.0, Bhandarkar found that the average number of busy memories in the system is accurately predicted by the formula m(1 - (1 - 1/m)^l). Therefore, the reward rate associated with a particular structure state (i, j) is

    BW_xbar = m (1 - (1 - 1/m)^l).                        (25)

Next, consider the N × N Omega network with switching elements of size n × n. Number the stage to which the processors are attached as stage 1, and the last stage, to which the memories are attached, as stage v. The switching elements are n × n crossbars, and the output request probability of a particular link of a switching element can be denoted p_i. This value is also the probability that there is an input request for an SE in the next stage. A recurrence relation exists for computing these request probabilities:

    p_{i+1} = 1 - (1 - p_i/n)^n.                          (26)

Note that p_1 = p_in is the probability that there is a request at the first stage, and p_v = p_out (the probability that there is a request for a particular memory). The bandwidth is computed as the product of the request probability for a particular memory and the number of memories [14]:

    BW_Ω = N (1 - (1 - p_{v-1}/n)^n).                     (27)

In the presence of failures, this equation must be modified to account for graceful degradation. In the first stage of the network, consider a particular input link to an n × n SE, say link 0 in Figure 4, and denote its request probability by p_in,0. It may request any particular output link with equal probability, so it does not request a specific link with probability (1 - p_in,0/n). Similarly, input link 1 does not request the same link with probability (1 - p_in,1/n). If the processor attached to an input link has failed, then p_in,j = 0. Now, the probability of a request for a specific output link, say i, as a result of the (perhaps unequal) request probabilities of the input links is computed as

    p_out,i = 1 - Π_{j=0}^{n-1} (1 - p_in,j/n).           (28)

The bandwidth of the SE is then

    BW_SE = n p_out,i   if the SE has not failed,
            0           otherwise.                        (29)

Figure 4: n × n Switching Element.

The outputs of this SE serve as inputs to the SEs in the next stage. At the final stage of the Omega network, some memories may be inoperable. The network bandwidth is computed as the sum of the request probabilities for the operational memories. Let N_o denote the set of operational memories; then

    BW_Ω = Σ_{j∈N_o} (p_out)_j.                           (30)
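The bandwidth formulas translate directly into code. The sketch below (Python; the helper names are ours) evaluates equation (25) for the crossbar models and the stage-by-stage recurrence of equations (26)-(28) for the Omega network; the fully-operational values match the bandwidths of 10.3 and 8.4 reported in Section 5.

```python
# Equation (25): average number of busy memories for a crossbar with i
# operational processors and j operational memories (p_in = 1.0).
def bw_crossbar(i, j):
    l, m = min(i, j), max(i, j)
    return m * (1.0 - (1.0 - 1.0/m)**l) if l > 0 else 0.0

# Equation (28): request probability on each output link of one n x n SE,
# given the request probabilities on its input links (0.0 for failed ones).
def se_output_prob(p_in_links, n):
    prod = 1.0
    for p in p_in_links:
        prod *= 1.0 - p / n
    return 1.0 - prod

# Fully-operational 16 x 16 crossbar (SYS_s and SYS_d):
print(round(bw_crossbar(16, 16), 1))    # 10.3, as in Table 1

# Fully-operational Omega network (SYS_Omega): two stages of 4 x 4 SEs,
# 16 memories; equation (26) is the same recurrence with equal inputs.
p1 = 1.0                                # request probability at stage 1
p2 = se_output_prob([p1] * 4, 4)        # into the second stage
p_out = se_output_prob([p2] * 4, 4)     # at the memory ports
print(round(16 * p_out, 1))             # 8.4, as in Table 1
```

Degraded configurations are handled by zeroing the input probabilities of failed processors and dropping failed SEs and memories, as described above.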
In the next section, equation (25) is used to compute the reward rate for each state of the SYS_s and SYS_d models. Similarly, equation (30) is used to compute the reward rate for each state of the SYS_Ω model.

5  Numerical Results

In this section, we examine the transient reliability, mean time to failure, and expected reward rate at time t for our three 16 × 16 MPS models. For each model, we compute the sensitivities of these three measures to changes in the component failure rates.

A given task for a multiprocessor system may require U processors and V memories, where U and V are both less than the total resources available on a fully-operational MPS. So, given the task, a multiprocessor system is operational as long as U processors can access V memories. As in [18], we consider the case where 4 processors and 4 memories are required (K = U = V = 4). For SYS_s and SYS_d, the corresponding CTMCs have 170 states. For SYS_Ω, the CTMC has more than 64000 states before state lumping, and 3970 states after lumping. We have also computed results for the system that requires 12 processors and 12 memories (K = 12). For brevity, most of the data for this second case is omitted.

We use failure data from the analysis of the C.mmp system [18]. By a parts count method, Siewiorek determined the failure rates per hour for the components to be:

    Processor   0.0000689
    Memory      0.0002241
    Switch      0.0002024

Like Siewiorek, we assume that component lifetimes are exponentially distributed.

In the rest of this section, we first consider some single-valued measures of network performance and reliability. We then consider time-dependent system reliability and its sensitivity. Finally, we consider E[X(t)], the expected reward rate (bandwidth) at time t, a measure of degradable system performance.

5.1  Single-Valued Measures

In Table 1, we use three single-valued measures to compare the three MPS models. Using equations (25) and (30), the bandwidth is computed using the approach described in Section 4.2. If we consider bandwidth alone, SYS_s and SYS_d are indistinguishable, and SYS_Ω is the least preferred choice. Next, system MTTF is computed using equations (6) and (7). Based on the MTTF, SYS_d is the most reliable system, and SYS_s is the least reliable. The cost of processors and memories is the same for all three models, so we use the cost of the IN to contrast the models. The cost of the IN is computed using a gate count. SYS_Ω requires less than one-half the number of gates needed by the other two systems.

    Architecture   Bandwidth   MTTF (K=12)   MTTF (K=4)   Cost
    SYS_s          10.3        1322.3        3611.8       960
    SYS_d          10.3        1537.9        6708.6       960
    SYS_Ω          8.4         1497.2        6575.5       384

    Table 1: Comparison of MPS Models.

Next, we consider the sensitivity of the MTTF estimates given in Table 1 to changes in component failure rates. For each model, using equation (17), we compute the sensitivity of MTTF with respect to the processor failure rate, memory failure rate, and switching-element failure rate. Note that each system has a different number of switching elements, and these SEs have different failure rates. To find the system bottlenecks, we use the cost model described in Section 3 with α_i = c_i = 1. The parametric sensitivities are multiplied by a factor of λ_i/N_i. The results are shown in Table 2; the bottleneck for each system configuration is marked with an asterisk. Because SYS_s is most sensitive to switch failures, for this model the switch is the reliability bottleneck. The memories are the bottleneck for the other two models.
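Since c_i = α_i = 1, the bottleneck of each model is simply the parameter with the largest-magnitude scaled sensitivity. A minimal sketch (Python; the values are transcribed from the K = 4 columns of Table 2):

```python
# Scaled MTTF sensitivities for K = 4, transcribed from Table 2.
scaled = {
    "SYS_s":     {"processors": -2.06,  "memories": -2297.36, "network": -89899.40},
    "SYS_d":     {"processors": -20.14, "memories": -9069.45, "network": -3.57},
    "SYS_Omega": {"processors": -34.84, "memories": -8655.67, "network": -39.74},
}

# The bottleneck of each model is the parameter whose scaled sensitivity
# has the largest magnitude (Section 5.1).
bottlenecks = {sys: max(sens, key=lambda p: abs(sens[p]))
               for sys, sens in scaled.items()}
print(bottlenecks)
# {'SYS_s': 'network', 'SYS_d': 'memories', 'SYS_Omega': 'memories'}
```

The result matches the conclusion drawn in the text: the switch for SYS_s, the memories for SYS_d and SYS_Ω.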
    MPS      Processors          Memories                 Network
             K=12      K=4      K=12        K=4          K=12        K=4
    SYS_s    -21.29    -2.06    -1462.78    -2297.36     -46es.e4 *  -89899.40 *
    SYS_d    -35.01    -20.14   -1974.08 *  -9069.45 *   -0.93       -3.57
    SYS_Ω    -35.45    -34.84   -1863.69 *  -8655.67 *   -10.64      -39.74

    Table 2: Scaled Sensitivity of MTTF with Respect to Parameters (× (λ_i/N_i) × 10^8). The bottleneck entries for each system are marked with an asterisk.

5.2  Reliability

In Figure 5, we plot the time-dependent reliability curves for the three MPS models. Failure rates for the IN are determined using the gate-count method discussed in Section 4.1. We restrict our attention to the case K = 4. Because SYS_s is vulnerable to a single-point switch failure, R_s(t) is significantly less than R_d(t) or R_Ω(t). Modeling the IN at the demultiplexer/multiplexer level increases the predicted reliability, since the failure of individual components is not catastrophic. Also, observe that R_Ω(t) < R_d(t).

Figure 5: Comparison of the Reliabilities of the Three MPS Models for K = 4.

Figure 6: Scaled Parametric Sensitivity of Unreliability, Simple Crossbar Model (SYS_s).
In the K = 12 case (not
shown), the degree of separation between the reliability
tiplying by a factor of ,k~/Ni.
of S Y S o and the other two models is less pronounced.
sitivities have an opposite sign t h a n the sensitivities
Scaled parametric sensitivities for S Y S s and S Y S n
of system unreliability; an increase in failure rate in-
We note t h a t the sen-
creases unreliability but decreases the expected reward
are plotted in Figures 6 and 7. We omit the plot for
rate.
SYSd, because it is almost identical to the plot for
S Y S n . These parametric sensitivities are scaled by multiplying by the factor X~/N~. Regardless of mission time,
We also note that, unlike the sensitivity of un-
reliability, the processor failure rate sensitivity curve
is visible. A l t h o u g h it is unlikely t h a t enough processors would ever fail to cause total system failure, a few
all three systems are insensitive to small changes in the
processor failure rate. For S Y S , , switch failure is the
processor failures might occur, reducing system perfor-
reliability bottleneck. For S Y S d and S Y S a , increased
mance. In S Y S , , the switch is the performability bot-
fault-tolerance in the switch makes the memories the
tleneck. Because S Y S d and S Y S n have fault-tolerant
reliability bottleneck, regardless of mission time.
switches, regardless of mission time, memories are their
5.3
performability bottleneck.
Performability
For K = 4, Figure 8 plots E[X(t)], the expected system
b a n d w i d t h at t i m e t.
5.4
SYSd is superior to the other
A n a l y s i s w i t h a n A l t e r n a t e Sensitivity Measure
two systems. For small values of t, S Y S s is superior to
S Y S n , while the converse is true for m o d e r a t e values
As we m e n t i o n e d in Section 3, a second use of para-
of t.
metric sensitivities is in the identification of portions
This crossover occurs because for small K , up
to three SEs can fail in S Y S n without causing system
of a model t h a t need. refinement.
failure; while for S Y S , when the IN fails, the system is
cost function, as in the three previous subsections, here
Instead of using a
we consider relative changes, A),i/,k ~. This quantity is
down.
P a r a m e t r i c sensitivities for E[X(t)] of the M P S mod-
obtained by scaling the parametric sensitivities (multi-
els are plotted in Figures 9 and 10. We omit the plot
plying each S(t) by ,kl). Using this approach changes
for S Y S a because it is almost identical to the plot for
the results obtained for S Y S , .
S Y S n . These parametric sensitivities are scaled by mul-
measure used in Section 5.1, the M T T F of S Y S , was
184
W i t h the "cost-based"
12
1,4x10 "5
~~ f
~(--- Memories
¢.n-
SYS s
SYSd
E
&
0
i
6000
Time (Hours)
12000
0
Time (Hours)
20000
Figure 8: Comparison of the Expected Reward Rates at
time t for the Three MPS Models for K = 4.
Figure 7: Scaled Parametric Sensitivity of Unreliability
- - Omega Network Model (SYS~).
most sensitive to switch failures for both K = 4 and
K -- 12.
10000
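The difference between the cost-based scaling (λ_i/N_i) and the relative scaling (λ_i) can be seen in a small numerical sketch. The sensitivities, failure rates, and component counts below are illustrative values, not figures from the paper; they only show how the two scalings can rank different components as the bottleneck, as happens for the crossbar model at K = 12:

```python
import numpy as np

# Illustrative raw sensitivities dMTTF/dlam_i for three hypothetical
# component types -- these numbers are NOT taken from the paper's models.
names = ["processor", "memory", "switch"]
raw   = np.array([-2.0e6, -1.2e7, -4.0e7])   # raw sensitivities dMTTF/dlam_i
lam   = np.array([ 1.0e-4, 2.0e-4, 5.0e-5])  # assumed failure rates (per hour)
N     = np.array([ 16,     16,     1     ])  # component counts

# Cost-based scaling (Section 5.1): weight each sensitivity by lam_i/N_i.
cost_scaled = raw * lam / N
# Alternate scaling (Section 5.4): weight by lam_i, i.e. a relative change.
rel_scaled = raw * lam

# The most negative entry marks the component whose improvement
# (or more accurate measurement) matters most under each scaling.
print(names[int(np.argmin(cost_scaled))])  # prints "switch"
print(names[int(np.argmin(rel_scaled))])   # prints "memory"
```

With these numbers the two scalings disagree: the cost-based measure points at the switch, while the relative measure points at the memories, mirroring how the alternate measure can reclassify a model's bottleneck.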
With the alternate scaling used here, the MTTF of SYS_c is most sensitive to switch failures for K = 4, but for K = 12 it is most sensitive to memory failures. If we want to improve the MTTF model for SYS_c, then K is also a factor in determining which component of the model should be refined.

If we repeat the reliability sensitivity analysis with the alternate scaling, SYS_c is initially most sensitive to switch failures, but as mission time increases, exhaustion of memory redundancy becomes a greater problem. For t >= 4000, SYS_c reliability is most sensitive to changes in the memory failure rate. For E[X(t)] of SYS_c, a similar crossover is observable at t = 4000. If we want to improve the reliability or performability models for SYS_c for small t, the failure rate of the switch should be determined more accurately. For large values of t, the failure rate of the memory system should be determined more accurately.

6 Conclusions

System modelers often rely on single-valued measures like MTTF. This oversimplification may hide important differences between candidate architectures. Time-dependent reliability analysis provides additional data, but unless a whole series of models is run, it does not suggest where to spend additional design effort. In this paper, we discussed the use of Markov reward models and parametric sensitivity analysis. Markov reward models allow us to model the performance of degradable systems. Parametric sensitivity analysis helps identify critical system components or portions of the model that are particularly sensitive to error.

To demonstrate the use of parametric sensitivity analysis in the evaluation of competing system designs, we considered three models of a multiprocessor system constructed from processors, shared memories, and an interconnection network. For each model, we computed the parametric sensitivity of mean time to failure, system unreliability, and time-dependent expected reward rates. By scaling with respect to a cost function, we were able to identify the reliability, performability, and MTTF bottlenecks in each system. The three models produced different results. The differences between the models highlight the need for detailed models and show the role of analytic modeling in choosing design alternatives and guiding design refinements.

Figure 9: Scaled Parametric Sensitivity of Performance Level -- Simple Crossbar Model (SYS_c).

Figure 10: Scaled Parametric Sensitivity of Performance Level -- Omega Network Model (SYS_Ω).

References

[1] M. Avriel. Nonlinear Programming: Analysis and Methods. Prentice-Hall, Englewood Cliffs, NJ, 1976.

[2] M. D. Beaudry. Performance Related Reliability for Computer Systems. IEEE Transactions on Computers, C-27(6):540-547, June 1978.

[3] D. P. Bhandarkar. Analysis of Memory Interference in Multiprocessors. IEEE Transactions on Computers, C-24(9):897-908, September 1975.

[4] C. R. Das and L. N. Bhuyan. Reliability Simulation of Multiprocessor Systems. In Proceedings of the International Conference on Parallel Processing, pages 591-598, August 1985.

[5] P. M. Frank. Introduction to System Sensitivity Theory. Academic Press, New York, 1978.

[6] A. Goyal, S. Lavenberg, and K. Trivedi. Probabilistic Modeling of Computer System Availability. Annals of Operations Research, 8:285-306, March 1987.

[7] P. Heidelberger and A. Goyal. Sensitivity Analysis of Continuous Time Markov Chains Using Uniformization. In P. J. Courtois, G. Iazeolla, and O. J. Boxma, editors, Proceedings of the 2nd International Workshop on Applied Mathematics and Performance/Reliability Models of Computer/Communication Systems, pages 93-104, Rome, Italy, May 1987.

[8] R. A. Howard. Dynamic Probabilistic Systems, Volume II: Semi-Markov and Decision Processes. John Wiley and Sons, New York, 1971.

[9] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand-Reinhold, Princeton, NJ, 1960.

[10] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[11] D. H. Lawrie. Access and Alignment of Data in an Array Processor. IEEE Transactions on Computers, C-24:1145-1155, December 1975.

[12] R. A. Marie, A. L. Reibman, and K. S. Trivedi. Transient Solution of Acyclic Markov Chains. Performance Evaluation, 7(3):175-194, 1987.

[13] J. F. Meyer. On Evaluating the Performability of Degradable Computing Systems. IEEE Transactions on Computers, C-29(8):720-731, August 1980.

[14] J. H. Patel. Performance of Processor-Memory Interconnections for Multiprocessors. IEEE Transactions on Computers, C-30(10):771-780, October 1981.

[15] A. Reibman and K. Trivedi. Numerical Transient Analysis of Markov Models. Computers and Operations Research, 15(1):19-36, 1988.

[16] A. L. Reibman and K. S. Trivedi. Transient Analysis of Cumulative Measures of Markov Chain Behavior. 1987. Submitted for publication.

[17] D. P. Siewiorek. Multiprocessors: Reliability Modeling and Graceful Degradation. In Infotech State of the Art Conference on System Reliability, pages 48-73, Infotech International, London, 1977.

[18] D. P. Siewiorek, V. Kini, R. Joobbani, and H. Bellis. A Case Study of C.mmp, Cm*, and C.vmp: Part II -- Predicting and Calibrating Reliability of Multiprocessors. Proceedings of the IEEE, 66(10):1200-1220, October 1978.

[19] M. Smotherman. Parametric Error Analysis and Coverage Approximations in Reliability Modeling. PhD thesis, Department of Computer Science, University of North Carolina, Chapel Hill, NC, 1984.

[20] M. Smotherman, R. Geist, and K. Trivedi. Provably Conservative Approximations to Complex Reliability Models. IEEE Transactions on Computers, C-35(4):333-338, April 1986.

[21] W. Stewart and A. Goyal. Matrix Methods in Large Dependability Models. Research Report RC-11485, IBM, November 1985.

[22] K. S. Trivedi. Probability and Statistics with Reliability, Queueing and Computer Science Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.

[23] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1962.