Trinity College Dublin 1. The Problem 2. Motivating Examples 3

!
Reliability of Complex Networks
!
Louis JM Aslett ([email protected]) and Simon P Wilson ([email protected])
Trinity College Dublin
!""#$%&'#"()*+,,'"-(&.+(/0%1&
!45/67(/308#9'*0:(;1'<%3(=>&.(/+8&+0
!
"#! $#%! &'()! )#! *+,-+! ).+! #//#0)%(,),+*! #11+0+2! 3$! 4
9.'77+(8+*:!
!!
;+'77$! %(2+0*)'(2! ).+! /#)+(),'7! 1#0! $#%0! 3%*,(+**!
49#(#6$:!!
!
=+'0!7+'2,(8!,()+0('),#('7!*/+'>+0*!#(!).+*+!)#/,9*!*%9.
!
• "0@!;#3+0)!A@!",B#(C!D+'2+0!#1!).+!E7,6')+!E.'
).+!G7#3'7!4(5,0#(6+()!H'9,7,)$!IG4HJ?!!
• K()+7!*%*)',('3,7,)$!8%0%C!<)+/.+(!='0/+0?!!
• L,90#*#1)M*!E7#%2!E#6/%),(8!+B/+0)!"+30'!E.0'/
1. The Problem
2. Motivating Examples
3. Research Questions
Telecommunications networks are being utilised
ever more by governments, businesses and individuals to conduct mission critical activities. This
increasing reliance on telecoms infrastructure creates a strong demand for highly reliable networks
which suffer minimal outages: any outage can
have potentially serious economic impact on both
network providers and users.
Hardware/software interaction: 15th January 1990, telephones across the United States were inoperable
for most of the day due to a software bug in AT&T’s network causing a cascading switch failure.
The ultimate focus is on modelling the reliability of
a complex physical network where failure occurs
through the dependence or interaction between its
components – hardware, software and human –
that can cause a small failure in the network to escalate to a serious outage.
Bayesian networks (BN): enable efficient modelling of dependence in complex systems.
Components
Fans
Radio
Installation
Cable
Components
Antenna
Temperature
Supply
!
I. improving models
for hardware
K;E<4FC!
,(! 9#77'3#0'),#(! &,).! <%*)',('37+! 4(+08$! K0
L,90#*#1)C! KOF4D! '(2! ).+! P<! 463'**$! ,(! K0+7'(2! &,77!
QRRS! <$6/#*,%6! )#! *.#&9'*+! ).+! /#)+(),'7! #1! G0++(!
*+9)#0C! '*! &+77! '*! ).+! )'7+()! '(2! /#)+(),'7! #1! ).+! .
*%//#0)+2!3$!K;E<4F!'(2!<4K@!!
II. modelling hardware/software interactions
III. accounting for human error
• 1,174 flights cancelled and 85,000 passengers disrupted
Figure 1: AT&T Global Network
Telecom Importance: In 2004 the FCC ruled that carrier outage reports be kept secret in the interests of
“national defence and public safety”, due to the government’s view of the critical importance of reliable
telecom networks.
IV. unified model forF.,*!,*!'(!,2+'7!#//#0)%(,)$!1#0!'($#(+!&.#!&,*.+*!)#!8
hardware / software / hu#1! ).+! )0+(2*! '(2! ).+! )'7+()! ).')! &,77! 20,5+! ).+!
man interactions #//#0)%(,)$! ,(! K0+7'(2@! ! F#! 0+8,*)+0! 1#0! ).,*! +B9,),(8!
L9H++7$!')!7691++7$V,09*+)@,+!
!
Work has been underway
less than a year and is
currently focused on I/II.
!"#$%&'
"#$%&!'(%()#*&!+,-.*$/!0,#!1*$(.*(!2.3$.((#$.3!).4!5(*&.,/,36!
()*+,-'./-0
5. Proposed Solution
A BN is composed of a qualitative part (a directed
acyclic graph, see diagram above) and quantitative
part (conditional probability distributions). Parts
of the BN model may be specified, say through observed data, and the model updated.
Markov processes: provide a natural way to model
redundant systems which can be repaired without
affecting service. For example:
Full Operation
Duplex
Detected Failure
one unit failed
System Down
We have decided on the following milestones:
• 80% of Federal Aviation Authority communication between New York airports blocked
Power Amp
Service
Uncovered Failure
primary failed
System Down
N(2! 6++)! ).+! (+&! 8+(+0'),#(! #1! 0+*+'09.+0*! &.#! &,7
80#&).!'(2!).+!0+)%0(!)#!/0#*/+0,)$:!!
• 5 million blocked calls
• Kennedy, La Guardia and Newark airports closed for 4
hours
4. Current Techniques
Quality
Human error: 17th September 1991, all of New York lost service for 8 hours. Power was lost, primary
backup systems failed and engineers failed to respond to emergency alarms in time before secondary
power was exhausted. The effect was immense:
Covered Operation
one unit failed
Simplex
Failover failed
one unit failed
System Down
Uncovered Operation
backup failed
Simplex
No Operation
duplex failure
System Down
A redundant system may start fully operational,
become ‘degraded’ when there is a covered failure and return to full operation after repair without
service actually being affected.
The arcs in a Bayesian network represent ‘causality’, so there is no natural graph structure to explicitly represent a repairable redundant system because the arcs of a Markov process do not carry the
same interpretation. On the other hand, Markov
processes are distributionally inflexible with all failures assumed to be exponential, so extending this
approach to a unified hardware/software model is
not possible because software reliability is highly
non-exponential. We propose the use of Phase-type
(PH) distributions to overcome these current limitations evident in the literature.
A PH-distribution models the time until entering an
absorbing state of a continuous-time Markov process. For example, given the matrix of transition
rates

S s0
0 0
!


=

λ11
λ21
λ31
0
λ12
λ22
λ32
0
λ13
λ23
λ33
0
λ14
λ24
λ34
0





the time to entering state 4, which represents failure
of a repairable redundant sub-system, has density
fT (t) = αT exp(St)s0
where α is the vector of initial state probabilities
and exp is the matrix exponential operator.
This novel approach of using PH-distributions
as the conditional probability distributions in a
Bayesian network enables embedding an entire
Markov process as a single node within the BN,
improving the modelling accuracy of redundant
repairable sub-systems and moreover maintaining
the inferential and distributional flexibility afforded
by BNs.
In addition, PH-distributions present interesting inferential challenges in their own right, since there is
no closed form expression for the likelihood


n
∞
j
Y
X
(St
)
i 
T
s0
L(λ | t1 , . . . , tn ) =
α
j!
i=1
j=0
Thus at the time of writing, the task being addressed is application of the EM-algorithm within
BN inference, or use of alternative techniques to incorporate PH-distributions into the framework.
Funding
The authors would like to acknowledge the kind support of IRCSET, who awarded Louis Aslett a postgraduate scholarship to pursue his PhD.