Availability Quantification Fundamentals

Predicting Network Availability
Section 2
PS-544
2989_05_2001_c5
© 2001, Cisco Systems, Inc. All rights reserved.
Introduction
•Have you ever ridden a bicycle?
•Have you ever had a flat tire?
•How long did it take to fix that flat?
• 5 Minutes?
• An Hour?
• Two Weeks?
•How long do you think it would take Lance Armstrong, racing in the Tour
de France, to change a flat tire?
Availability and Downtime
Number of Nines   % Availability   Average Annual Downtime
Two               99.0%            87.7 Hours
Three             99.9%            8.8 Hours
Four              99.99%           52.6 Minutes
Five              99.999%          5.3 Minutes
Seven             99.99999%        3.2 Seconds
Nine              99.9999999%      0.032 Seconds
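A minimal sketch of how the downtime column can be derived from the availability column, assuming a 365.25-day year (525,960 minutes, the figure used later in this presentation). The function names are illustrative only and are not part of any Cisco tool.

```python
MINUTES_PER_YEAR = 525_960  # 365.25 days * 24 hours * 60 minutes


def availability_for_nines(nines: int) -> float:
    """Availability with the given number of nines, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def annual_downtime_minutes(availability: float) -> float:
    """Average minutes of downtime per year for an availability in [0, 1]."""
    return MINUTES_PER_YEAR * (1.0 - availability)


for nines in (2, 3, 4, 5, 7, 9):
    minutes = annual_downtime_minutes(availability_for_nines(nines))
    print(f"{nines} nines: {minutes / 60:.4f} hours = "
          f"{minutes:.4f} minutes = {minutes * 60:.4f} seconds per year")
```

Printing all three units makes it easy to pick the one that matches each row of the chart.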
•In the chart, you could find the number of nines for bicycle flats - if you
got those flat tires one time each year!
•This could be the beginning of a discussion about bicycle
availability….but it isn’t.
•Hello Everyone…my name is Chris Oggerino - I’m with Cisco Systems,
Inc.
•Today we are here to talk about something very important to you:
• Network Availability
•While this subject is large, we are going to focus on three major things:
• Prediction of network availability, simplification of the math and the
use of tools
• The Five contributors to network downtime and how to include them
in availability predictions
• The process for analyzing a network’s availability from your
customer’s perspective.
•You may notice a few differences between what I present and your
handouts. That is a result of optimizing the presentation for you after the
printing - I apologize for any inconvenience this may cause.
Agenda
• Scope, Definitions and Equations
• Hardware Components
• Software
• Network Fail-over Mechanism
• Loss of Power—Environmental Concerns
• Human Error and Operational Process
• Process of Prediction (Learn, Discern, Divide
& Conquer)
• An Example
•The agenda items for this presentation are listed in the slide and what
they mean is:
• First, we have to limit what we talk about - we only have 2 hours.
• Next - we’re going to cover each of the basic items that contribute
to network downtime and we’re going to talk about how you include
those in your network availability predictions
• Finally, we’re going to look at the processes for predicting
availability and we’ll do an example or two - depending on time.
Scope
Predicting Availability & Downtime
[Diagram: process cycle with the stages Prediction, Measurement, GAP Analysis (Compare Results), Improvement Planning, Change/Change Management, and Evaluate Results. Prediction is labeled with MTBF, MTTR, and Percent Method; Measurement is labeled with Operating Hours, Failures, and DPM; an Algebraic Conversion links the two methods.]
•Building highly available networks is a complex process
•We are going to focus on the “prediction” part of the process for the
next couple of hours.
•We are going to use “The Percent Method” of predicting availability
today, because MTBF and MTTR plug directly into the percent availability
equation.
•My opinion is that the DPM method is far superior to the percent method
when you are measuring network availability…but we are not in that
phase right now.
•You should note that it is a simple exercise in algebra to convert back
and forth between the two methods. Even still, I like to use one for
prediction and one for measurement. In the GAP Analysis phase, you
would convert them back and forth to see how your results compared to
your predictions.
Scope
Standard Disclaimers!
• We’re limiting ourselves to prediction
• We will be using the percent method for
describing availability
• The numbers here are for demonstration
purposes
• Learn, Discern, Divide & Conquer will be
used (in parts) before it is fully explained
in our process section
•The numbers used in this presentation are for demonstration purposes.
You will want to make sure you have accurate up to date numbers for
your calculations.
•In some cases, simplified networks and other shortcuts are used to
facilitate the learning process.
•The key things to remember from this presentation are the following two
thoughts:
• 1) There are 5 contributors to downtime. Each of them should be
considered for inclusion in your availability studies on your
networks.
• 2) A simple process for computing availability can be phrased,
“Learn, Discern, Divide, and Conquer”
Definitions
Reliability, Availability and Serviceability
R. A. S.
•Reliability
• Reliability determines how many times each year you have
operations workers fixing things
•Availability
• Availability determines your customers’ perspective of your
network’s quality
•Serviceability
• Serviceability determines how long each outage takes to fix and
thus affects operational costs and availability
Definitions
MTBF, MTTF and MTTR
• MTBF—Mean Time Between Failures
• MTTF—Mean Time to Fail
• MTTR—Mean Time to Repair
• MTBF:
•MTBF will be used in the availability equation.
•Often confused with MTTF(Mean Time to Failure).
•MTBF information is available from most manufacturers
• MTTF
•MTTF and MTBF are often confused.
•Many times, you get MTBF numbers from manufacturers
which are actually MTTF
•For our purposes, we use both interchangeably because it
just doesn’t matter on reasonably available network
components.
• MTTR:
•MTTR is estimated based on service contract.
•MTTR will be used in the availability equation
Definitions
The Availability Equation
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
• You can simply read, “The uptime divided by the
total time” to create the percentage time your
network is operational
•The Availability Equation
• Technically, this should be MTTF divided by (MTTF + MTTR), where
MTTF + MTTR is equal to MTBF.
• However, for our purposes, it won’t matter unless the availability of
the parts is down in the 1 nine area - and you are buying Cisco, right? So, we don’t have to worry about that.
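A small sketch of why the distinction rarely matters, assuming MTBF = MTTF + MTTR as stated above. The two sample components (100,000 hours / 4 hours and 10 hours / 1 hour) are hypothetical values chosen only to contrast a highly available part with one in the “1 nine area”.

```python
def availability_slide(mtbf_hours: float, mttr_hours: float) -> float:
    """Form used in this presentation: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_strict(mtbf_hours: float, mttr_hours: float) -> float:
    """Strict form: MTTF / (MTTF + MTTR), taking MTBF = MTTF + MTTR."""
    mttf_hours = mtbf_hours - mttr_hours
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical components: (MTBF hours, MTTR hours)
for mtbf, mttr in [(100_000, 4), (10, 1)]:
    gap = availability_slide(mtbf, mttr) - availability_strict(mtbf, mttr)
    print(f"MTBF={mtbf}, MTTR={mttr}: difference between forms = {gap:.10f}")
```

For the highly available part the two forms agree to many decimal places; only for the very unreliable part does the gap become visible.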
Definitions
The Serial Availability Equation
$$\text{Serial Availability} = \prod_{i=1}^{n} \text{Component Availability}(i)$$
i = Index of component number
n = Number of components
•The Serial Equation is simply the multiplication of all the availabilities
together.
•That Greek symbol - that’s the capital “Pi” symbol and it means “Product”
or “Multiply all this stuff together”.
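As a sketch, the serial equation is a one-line product in code. The function name is illustrative; the two sample values are the motherboard and power supply availabilities worked out later in this presentation.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Multiply the availability of every component that must be up."""
    return math.prod(availabilities)


# Two components in series: both are required for the system to work.
print(f"{serial_availability([0.999976, 0.999973]):.6f}")
```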
Definitions
The Parallel Availability Equation
$$\text{Parallel Availability} = 1 - \left[ \prod_{i=1}^{n} \left( 1 - \text{Component Availability}(i) \right) \right]$$
i = Index of component number
n = Number of components
•In the Parallel equation, what you are actually doing is multiplying the “unavailabilities” together and then subtracting the result of all that from 1.
•This does assume a simple parallel system as opposed to a more
complex N+1 system for which we have an equation coming soon.
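A matching sketch for the simple parallel case, where the construct is down only if every component is down at once. The function name is illustrative.

```python
import math
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """1 minus the product of the component unavailabilities."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


# Two identical components in parallel: either one keeps the system up.
print(f"{parallel_availability([0.999973, 0.999973]):.12f}")
```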
Definitions
The “N+1” Availability Equation
$$\text{Availability} = n A^{n-1} (1 - A) + A^{n}$$
A = availability of individual devices
n = number of devices
•This generalized equation for computing N+1 availability is not something
we are going to spend a lot of time with.
•We’re going to use a spreadsheet tool written by Cisco called, “SHARC”
for this stuff.
•But you should have at least been exposed to the equations so you know
what we’re really doing by putting numbers into the spreadsheet!
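For readers without the spreadsheet, here is a rough sketch of the N+1 equation. It reads the formula as “all n devices up, or exactly one of the n devices down”, which is my interpretation of the slide rather than something it spells out. With n = 2 it reduces to the simple parallel case.

```python
def n_plus_1_availability(a: float, n: int) -> float:
    """n * A^(n-1) * (1 - A) + A^n for n identical devices of availability A."""
    one_device_down = n * a ** (n - 1) * (1.0 - a)
    all_devices_up = a ** n
    return one_device_down + all_devices_up


# Sanity check: two devices, one of which is enough, equals 1 - (1 - A)^2.
a = 0.9
print(n_plus_1_availability(a, 2), 1.0 - (1.0 - a) ** 2)
```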
Contributors to Downtime
Hardware
•Whenever you see this symbol - the dynamite, it means we are beginning
a new section on something that can cause network downtime. There are
4 more after this one.
•If you have wandered off during the calculations…this indicates we’re
getting back to a higher level perspective.
Hardware
MTBF and MTTR
Cisco Uses Industry
Standards to Compute
Hardware MTBF
We Can Use Reasonable
Estimates for MTTR
•As you remember from our definitions in the first section, we are going to
need MTBF and MTTR in order to compute availability of a hardware
device using the standard availability equation.
•MTBF for Cisco hardware is done according to the Telcordia TR-332
standard. This is an industry standard and we simply use the Relex
software as do most companies.
•MTTR is something we can estimate. I like to assume that once we figure
out a board has broken, it doesn’t take all that long to swap it out.
Especially compared to the time it takes to get the new one. So for MTTR,
we can use typical scenarios such as “on-site spares”, 4 hours on-site
parts service, or next day advanced replacement. Each of these leads to
MTTRs such as 2 hours, 4 hours or 24 hours.
•As long as our method is consistent, comparison of network design and
component selection is accurate - even if the actual numbers are not
perfect.
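To see how much the service scenario matters, here is a small sketch that holds MTBF fixed and varies only MTTR. The MTBF of 326,928 hours is borrowed from the board example later in this presentation; pairing each scenario with 2, 4, and 24 hours follows the order in which they are listed above.

```python
MINUTES_PER_YEAR = 525_960  # 365.25-day year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)


MTBF_HOURS = 326_928  # example board MTBF, taken from a later slide

# Service scenario -> assumed MTTR in hours
scenarios = {
    "on-site spares": 2,
    "4-hour on-site parts service": 4,
    "next-day advanced replacement": 24,
}

downtime_by_scenario = {
    name: MINUTES_PER_YEAR * (1.0 - availability(MTBF_HOURS, mttr))
    for name, mttr in scenarios.items()
}

for name, minutes in downtime_by_scenario.items():
    print(f"{name}: {minutes:.2f} minutes/year")
```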
Hardware
Computing Component MTBF
• At the board level, we assume that all components
are required for proper operation
•Without going into the gory detail about FITs (failures per billion
hours) for each component and how temperature and use affect the
results for each component, we can state that the Relex software gives us
an industry-accepted standard MTBF for each Cisco component.
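A deliberately simplified sketch of the idea, assuming one FIT is one failure per 10^9 device-hours and that every component is required (so failure rates simply add). The FIT values below are made up for illustration; a real TR-332 calculation also applies temperature, stress, and quality factors that are not modeled here.

```python
HOURS_PER_FIT_UNIT = 1_000_000_000  # 1 FIT = 1 failure per 10^9 hours


def board_mtbf_hours(component_fits: list[float]) -> float:
    """MTBF of a board whose components are all required (failure rates add)."""
    total_fit = sum(component_fits)
    return HOURS_PER_FIT_UNIT / total_fit


# Hypothetical FIT values for three components on one board.
example_fits = [1200.0, 800.0, 1000.0]
print(f"Board MTBF: {board_mtbf_hours(example_fits):,.0f} hours")
```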
System Availability
Learn
Product    MTBF (hours)   MTTR (hours)
CVA 122    326,928        8
CVA PWR    300,000        8
•In order to figure out the availability of a system, you need to gather up
the MTBF and MTTR for each of the parts of that system.
•I like to put them into a little table - perhaps into a spreadsheet.
•This becomes a bigger requirement for more complex systems.
System Availability
Discern
[Reliability block diagram: Motherboard in series with Power Supply]
• From the list of components, determine those that
are required for proper system operation—we call
this development of the Reliability Block Diagram
•In figuring out availability of a system (or a network as you will soon see),
you start with the smallest components and you work your way up to the
bigger picture.
•I like to call this method the learn and discern method. Later, you will see
that we Learn and Discern at the network level as well.
•This process I call “discern” is the process used to create a Reliability
block Diagram of the system (or network) you wish to study. More on that
later - in the process section.
•For now, it’s time to get into the actual mathematics of computing
hardware availability.
Before We Go On…
•Who went to high school or college in the 70’s? Anybody remember the
HP Vs. TI rivalry at that time?
•How many of you are HP calculator users?
•How many of you are Texas Instruments calculator users?
•OK - well we got a lot of math today so I need a couple volunteers to help
me out!
•I need an HP volunteer
•And a TI Volunteer.
System Availability
The Parts
Component Availability:
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
$$\text{Motherboard} = \frac{326{,}928}{326{,}928 + 8} = 0.999976$$
$$\text{Power Supply} = \frac{300{,}000}{300{,}000 + 8} = 0.999973$$
•Let’s start the first example on computing availability for a device based
on the hardware.
•We use the availability equation, the MTBF and the MTTR for each
component.
•We’re assuming that the plastic enclosure doesn’t really cause us
downtime for our work here.
•As you can see, we have arrived at two different availability percentages
for the two different components in our simple system.
•We are now ready to combine those together for an end to end system
hardware availability prediction.
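For those without a calculator in hand, the same two component results can be reproduced with a short sketch. The MTBF and MTTR figures are the ones from the Learn table; the dictionary layout is just one way to hold them.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Part name -> (MTBF hours, MTTR hours), from the Learn table
parts = {
    "Motherboard (CVA 122)": (326_928, 8),
    "Power Supply (CVA PWR)": (300_000, 8),
}

component_availability = {
    name: availability(mtbf, mttr) for name, (mtbf, mttr) in parts.items()
}

for name, value in component_availability.items():
    print(f"{name}: {value:.6f}")
```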
System Availability
Combine Parts
End-to-End System Availability:
$$\text{Serial Availability} = \prod_{i=1}^{n} \text{Availability}(i)$$
$$\text{Serial Availability} = \prod_{i=1}^{2} \text{Availability}(i) = (0.999976) \times (0.999973) = 0.999949$$
•By multiplying the availability above times the number of minutes in a
year, then subtracting that from the minutes in a year, we can compute
26.82 minutes of average annual downtime.
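The downtime arithmetic described above can be sketched as follows, assuming the 525,960-minute year used later in this presentation. Variable names are illustrative.

```python
MINUTES_PER_YEAR = 525_960  # 365.25-day year

motherboard = 0.999976
power_supply = 0.999973

# Both parts are required, so multiply (serial equation).
system_availability = motherboard * power_supply

# Uptime minutes subtracted from total minutes gives downtime minutes.
uptime_minutes = system_availability * MINUTES_PER_YEAR
downtime_minutes = MINUTES_PER_YEAR - uptime_minutes

print(f"Availability: {system_availability:.6f}")
print(f"Average annual downtime: {downtime_minutes:.2f} minutes")
```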
Redundant System Availability
Learn and Discern
[Reliability block diagram: two Power Supplies in parallel, in series with the Motherboard]
•Continuing along the same lines, let’s do a “learn & discern” on the same
system - but with dual redundant power supplies….
•Skipping the picture - here is what the RBD would look like.
Redundant System Availability
The Parts
Component Availability:
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
$$\text{Motherboard} = \frac{326{,}928}{326{,}928 + 8} = 0.999976$$
$$\text{Power Supply} = \frac{300{,}000}{300{,}000 + 8} = 0.999973$$
•As a reminder - we still need the component availability results in this
next step - computing the parts.
Redundant System Availability
Combine Parts
$$\text{Parallel Availability} = 1 - \left[ \prod_{i=1}^{n} \left( 1 - \text{Component Availability}(i) \right) \right]$$
$$\text{Parallel Availability} = 1 - \left[ \prod_{i=1}^{2} \left( 1 - 0.999973 \right) \right]$$
Power Availability = 0.999999 (Truncated)
End to End = 0.999999 x 0.999976
End to End = 0.999975
•And now we combine parts.
•First we combine parallel parts, then we put those results into end to end
serial equations.
•(This is part of the divide and conquer method we explore further later)
•Our previous availability was 0.999949 with 26.82 minutes per year of
downtime.
•With dual redundant power supplies, we increase our availability to
0.999975 and thus reduce our annual downtime to 13.15 minutes - a
savings of about 13 minutes.
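A sketch of the same two-step calculation. It follows the slide by truncating the parallel result to six digits before the serial step, and also computes the untruncated figure so the effect of that truncation is visible. The helper names are my own.

```python
import math

MINUTES_PER_YEAR = 525_960  # 365.25-day year


def parallel(availabilities):
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def serial(availabilities):
    return math.prod(availabilities)


def truncate(value: float, digits: int = 6) -> float:
    """Drop (not round) everything past the given number of decimal digits."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale


def downtime_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1.0 - availability)


power_supply = 0.999973
motherboard = 0.999976

power_exact = parallel([power_supply, power_supply])
slide_result = serial([truncate(power_exact), motherboard])   # as on the slide
exact_result = serial([power_exact, motherboard])             # no truncation

print(f"Slide method: {slide_result:.6f}, {downtime_minutes(slide_result):.2f} min/year")
print(f"Untruncated:  {exact_result:.6f}, {downtime_minutes(exact_result):.2f} min/year")
```

Truncating makes the prediction slightly pessimistic, which is usually the safer direction for a prediction to err.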
Contributors to Downtime
Software
•Let’s consider the software contribution to network downtime.
Software
MTBF and MTTR
Cisco Has Patented a New
Method for Producing
Software MTBF
We Can Use Reasonable
Estimates for MTTR
•Show Command Method
• Michael Shorts and I found that you could estimate MTBF with
the “mean survival time” method by simply looking at large numbers
of routers using Cisco’s NATKIT tool. Our results were
approximately 10k and 45k hours for new and old IOS respectively.
•Inter-failure Analysis Method
• Scott Cherf has been watching the routers on Cisco’s internal
network for years, measuring the time between problems. He has
actually patented his method. His results on similar equipment and
versions were approximately 10k and 45k hours for new and old IOS
respectively.
•Two methods, One Result
• These two methods producing the same answers leads us to
believe that we have two points on a curve for IOS MTBF, which is
shown on the next page and is intuitively obvious.
Software
Observed MTBF
[Chart: Incidence of Failure declining as software matures. Observed MTBF is 10,000 hours at FCS, 25,000 hours at 1 Year, and 45,000 hours at General Deployment.]
•As you can see, the results are exactly what you would expect:
• As SW gets more mature, it has fewer “problems”
• The curve shows high infant problem rates
• The curve eventually flattens as the major bugs are found and “new
and unusual uses” of the software are required to find more.
Software
Estimating MTTR
• Assumed crash and run
• Automatic reload configured
• Time from crash to normal operation:
Smaller routers
6 minutes average = 0.1 hours MTTR
Larger routers
12 minutes average = 0.2 hours MTTR
•In computing MTTR for SW, we assume that you will configure the
product to “auto-reboot-on-crash” so that failures are minimized….at least
for the prediction part of high availability.
•We all know that in Real Life, you are going to end up doing that, but
then reconfiguring for a mem-dump in order to solve the problem - which
is going to increase downtime at least once in a while.
Software
Availability
Environment                  MTBF (hours)   MTTR (hours)   Availability
Small Router, Young IOS      10,000         0.1            0.999990
Small Router, 1 Year IOS     25,000         0.1            0.999996
Small Router, GD IOS         45,000         0.1            0.999998
Large Router, Young IOS      10,000         0.2            0.999980
Large Router, 1 Year IOS     25,000         0.2            0.999992
Large Router, GD IOS         45,000         0.2            0.999996
•After all the research and assumptions are compiled, we believe that
these numbers are reasonable estimates of what you might expect from
Cisco software in fairly “normal” uses.
• Normal of course meaning:
•No overloading the box
•Reasonable routing tables
•Good configurations
•etc
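The software availability table can be regenerated from the MTBF and MTTR estimates with a short sketch like this one. The MTBF and MTTR values are the ones from the slides; the dictionary structure is just a convenient way to hold them.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Crash-to-normal-operation time, in hours
MTTR_HOURS = {"Small Router": 0.1, "Large Router": 0.2}
# Observed software MTBF by maturity, in hours
MTBF_HOURS = {"Young IOS": 10_000, "1 Year IOS": 25_000, "GD IOS": 45_000}

table = {
    (router, ios): round(availability(mtbf, mttr), 6)
    for router, mttr in MTTR_HOURS.items()
    for ios, mtbf in MTBF_HOURS.items()
}

for (router, ios), value in table.items():
    print(f"{router}, {ios}: {value:.6f}")
```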
Software in the RBD
Learn and Discern
[Reliability block diagram: two Power Supplies in parallel, in series with IOS and the Motherboard]
•Above you see our small CPE router example RBD with IOS inserted into
the required components for successful operation.
•As with the other components, once you have an availability for it, the
software is just included like anything else.
Software and System Availability
The Parts
$$\text{IOS Availability} = \frac{25{,}000}{25{,}000 + 0.1} = 0.999996$$
$$\text{Motherboard} = \frac{326{,}928}{326{,}928 + 8} = 0.999976$$
$$\text{Power Supply} = \frac{300{,}000}{300{,}000 + 8} = 0.999973$$
•As a reminder - we still need the component availability results in this
next step - computing the parts.
Including Software’s Contribution
End to End Availability
Redundant Power = 0.999999
Motherboard = 0.999976
IOS Software = 0.999996
$$\text{Serial Availability} = \prod_{i=1}^{n} \text{Availability}(i)$$
End to End = 0.999999 x 0.999976 x 0.999996
End to End = 0.999971
(Hardware and Software Contributors)
•And now we combine parts.
•As you can see, we are really just multiplying the previous result by the
result of our software availability.
•The resulting downtime is 15.25 minutes which is a couple minutes more
downtime than the 13 we had previously calculated without including
software.
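A sketch of this step, using the three rounded values from the slide and assuming a 525,960-minute year. Software is treated as one more serial component.

```python
import math

MINUTES_PER_YEAR = 525_960  # 365.25-day year

redundant_power = 0.999999  # truncated parallel result from the previous step
motherboard = 0.999976
ios_software = 0.999996     # small router, 1 year IOS

end_to_end = math.prod([redundant_power, motherboard, ios_software])
downtime_minutes = MINUTES_PER_YEAR * (1.0 - end_to_end)

print(f"End to end: {end_to_end:.6f}")
print(f"Average annual downtime: {downtime_minutes:.2f} minutes")
```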
What If We Could Make
the Motherboard Redundant—With IOS
[Reliability block diagram: two IOS + Motherboard pairs in parallel, in series with two Power Supplies in parallel]
•Continuing along the same lines, let’s do a “learn & discern” on the same
system - but with dual redundant CPUs
•Skipping the picture - here is what the RBD would look like.
•As you can see, we have an entire section of redundancy that leads to
another entire section of redundancy. A key part of “divide and conquer” is
to calculate these two areas separately, then multiply those results
together in a simple serial equation based on “CPU Services” and “Power
Services”.
Software and System Availability
The Parts
$$\text{IOS Availability} = \frac{25{,}000}{25{,}000 + 0.1} = 0.999996$$
$$\text{Motherboard} = \frac{326{,}928}{326{,}928 + 8} = 0.999976$$
$$\text{Power Supply} = \frac{300{,}000}{300{,}000 + 8} = 0.999973$$
•Again, we begin by listing the component availabilities.
A Preview on Process
Serial/Parallel Constructs
Parallel
$$\text{Availability} = 1 - \left[ \prod_{i=1}^{n} \left( 1 - \text{Component Availability}(i) \right) \right]$$
Serial
$$\text{Availability} = \prod_{i=1}^{n} \text{Component Availability}(i)$$
•As a reminder, here are the two equations we will be using.
Do the Math—
Compute Serial/Parallel Constructs
$$\text{Availability} = \prod_{i=1}^{n} \text{Component Availability}(i)$$
HW/SW Processor = 0.999996 x 0.999976
= 0.999972
$$\text{Availability} = 1 - \left[ \prod_{i=1}^{n} \left( 1 - \text{Component Availability}(i) \right) \right]$$
$$\text{Processing} = 1 - \left[ \prod_{i=1}^{2} \left( 1 - 0.999972 \right) \right] = 1 - (1 - 0.999972)^{2} = 0.999999 \text{ (Truncated)}$$
•First, we use the serial equation to calculate the availability of a CPU with
SW to get the .999972 intermediate result.
•Then we combine two CPUs with SW using the parallel equation to come
up with a zillion 9’s (about 8 or so) of availability - which we’ll truncate
down to 6 nines.
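The serial-then-parallel order of operations can be sketched like this. The input values are the rounded component availabilities from the slide; the helper names are mine.

```python
import math


def serial(availabilities):
    return math.prod(availabilities)


def parallel(availabilities):
    return 1.0 - math.prod(1.0 - a for a in availabilities)


ios_software = 0.999996
motherboard = 0.999976

# Step 1: one CPU needs both its motherboard and its IOS (serial).
cpu_with_software = round(serial([ios_software, motherboard]), 6)

# Step 2: two such CPUs back each other up (parallel).
processing = parallel([cpu_with_software, cpu_with_software])

print(f"HW/SW Processor: {cpu_with_software:.6f}")
print(f"Processing (untruncated): {processing:.12f}")
```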
Redundant System
Availability Combine Parts
Redundant Power = 0.999999
Redundant CPU = 0.999999
$$\text{Serial Availability} = \prod_{i=1}^{n} \text{Component Availability}(i)$$
End to End = 0.999999 x 0.999999
End to End = 0.999998
About 1 Minute Per Year Downtime - Much Better
•And now we combine parts…including that parallel part!
•As you can see, we are taking the results of the previous step and putting
them into this step…that is what I call dividing the big picture into smaller,
conquerable sections, then recombining those results to get the end to
end results and conquer the entire system (or network).
•Our first availability result was 0.999949 with 26.82 minutes per year of
downtime.
•With dual redundant power supplies, we increased our availability to
0.999975 and thus reduce our annual downtime to 13.15 minutes - a
savings of about 13 minutes.
•Now, with redundant power and CPU, we are able to get to 5 - nearly 6 - nines of availability and about 1 minute per year of
downtime!
•Of course, this has now become a $20,000 CPE device - but uh - it’s very
available!!!
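To put the three designs side by side, here is a small sketch that converts each end-to-end availability from the slides into annual downtime, assuming a 525,960-minute year. The design labels are my own shorthand.

```python
MINUTES_PER_YEAR = 525_960  # 365.25-day year

# Design -> end-to-end availability as stated on the slides
designs = {
    "single power supply, single CPU": 0.999949,
    "redundant power, single CPU": 0.999975,
    "redundant power, redundant CPU with IOS": 0.999998,
}

downtime = {
    name: MINUTES_PER_YEAR * (1.0 - availability)
    for name, availability in designs.items()
}

for name, minutes in downtime.items():
    print(f"{name}: {minutes:.2f} minutes/year")
```

Note that the first two figures are hardware-only, while the third also includes the IOS software contribution.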
Contributors to Downtime
Environmental Concerns and Power
•OK - Let’s move on to Environmental / Power Contribution to Downtime
Contributors to downtime
Loss of Power
Power
•The key thing that the environment can cause is power loss.
•Floods, fires, earthquakes may drop your building - but then no one will
really notice that they can’t surf eBay!
•It’s when a small thing causes a power outage - but the surrounding area is
still intact - that’s when expectations might exceed delivery!
•It is important to recognize that voice providers may have some
requirements during the toughest of environmental problems - but we’re
really concerned with the math here - not the policy.
Power
MTBF/MTTR
• MTBF—based on prior observation
• MTTR—based on prior observation
• Your local energy provider
• www.nerc.com
•NERC - North American Electric Reliability Council
•NERC’s web site, www.nerc.com, provides some examples of power
failures by year in their databases. They have a group named
“Disturbances Analysis Working Group,” which has documented major
power outages over time. The link to their data is:
• http://www.nerc.com/dawg/database.html.
Power
MTBF/MTTR
Availability = 1 - Unavailability
Minutes per Year = 525,960
Annual Power Downtime = 29 Minutes
$$\text{Power Unavailability} = \frac{29}{525{,}960} = 0.0000551$$
$$\text{Power Availability} = 1 - 0.0000551 = 0.999945$$
•Let’s assume we went to NERC’s web site, and we found that our area
has had an average annual power loss of 29 minutes per year for the last
20 years.
•Now, remember availability is a percentage of uptime to total time - so
the balance - unavailability would be 1 - availability.
•If we account for leap years in our minutes per year as we do above and
assume 29 minutes per year, then our calculations for power availability
would look like this.
•Given an availability of the “power component” we need to consider how
to include it in our network / product availability studies.
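The conversion from observed outage minutes to a power availability figure can be sketched as below. The 29 minutes per year is the assumed example value from above, not real NERC data.

```python
MINUTES_PER_YEAR = 525_960  # 365.25 days, which accounts for leap years


def availability_from_outage(minutes_down_per_year: float) -> float:
    """Availability = 1 - unavailability, with unavailability = downtime / total time."""
    unavailability = minutes_down_per_year / MINUTES_PER_YEAR
    return 1.0 - unavailability


power_availability = availability_from_outage(29)
print(f"Power unavailability: {1.0 - power_availability:.7f}")
print(f"Power availability:   {power_availability:.6f}")
```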
Power
Learn and Discern
[Diagram: Power in series with the Complete System / Network Segment]
•Given an availability of power in a geographic region, we must now
LEARN about the network and DISCERN how the annual power downtime will affect
it.
•Generally, I like to apply power availability on a “per site” basis, such as
1x per CPE site and 1x per head-end site.
•You can think of this in terms of where you would put battery backups or
generators (or both!) to mitigate the problem.
•This diagram is intended to show that power will affect (in a serial
fashion) the complete system or network segment at a particular site.
•Let’s do a couple examples to clarify
Power
Learn
[Diagram: Home Network connected through a WAN Connection to the Service Provider Network; the home router’s Plug goes to a Battery Backup, which goes To Regular Power]
•For our example on adding power loss into our equations, let us take a
simple example.
•We will consider the power contribution to downtime in a household that
uses our CPE router for connection to the Internet.
•Our goal will be to get a percentage availability to the internet for any PC
connected to the router.
•As you can see in the diagram, we need to account for the delivery of
service from our service provider, our home router and power. The power
is somewhat redundant in that we have a battery backup.
•We will assume that battery backup failure does not cause failure of the
router unless it happens during a power failure. This assumes a parallel
construction of the primary and backup power supplies - which may or
may not be the case.
Power
Discern
[Reliability block diagram between endpoints A and B: Service Provider and Home Router in series with a parallel pair of Battery Backup and Regular Power]
•As is our standard process now, we create an RBD to show how we view
the components to be considered. We will assume that software and other
contributors are included and we are only adding in the power
computations.
Power
The Parts
Item               Description                                                                          Value/Result
Service Provider   Total availability provided by the service provider to the Internet                  .99999
Home Router        The availability of the home router and its power supply                             .9999
Power Company      Power from the power company at 29 minutes per year of downtime is 525931 ÷ 525960   .999945
Battery Backup     The availability of the battery backup device                                        .999
•We will assume the values in the table for our calculations.
•29 minutes per year (average) power outage works out to about four and
a half nines of power availability.
•Mitigating that with a battery backup capable of eight hours is like running
the power and the backup device in parallel. This is how we will perform this
calculation - this time.
•The idea may be somewhat oversimplified for the most technical of
analysis, but should suffice for most purposes.
Power
Divide
Step 1: Parallel Power Computations
Power Company = .999945
Battery Backup = .999
Power Availability = 1 – [(1– .999) *(1 – .999945)]
= .999999 (Truncated Digits)
Power
Conquer
Step 2: Total Availability
Power = .999999
Service Provider = .99999
Home Router = .9999
Total Availability = .999999 x .99999 x .9999
= .999889
Average Annual Downtime = 525,960 (1 – .999889)
= 58.38 minutes
•As you can see, the numbers work out to about an hour of downtime
each year.
•Did any of the calculator holders perform the calculations for how much
downtime there would have been without the battery backup and using a
power supply with the same availability?
•My calculations are:
• .999945 * .99999 * .9999 = .999835
• Downtime = 525,960 * (1 - .999835)
• = 86.78 Minutes Per Year
•This shows that even a cheap $99.00 battery backup will likely reduce
your power contribution to downtime a lot, assuming that a failure of this
device doesn’t cause power loss.
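Both cases can be checked with a short sketch: with the battery backup (power computed as a parallel construct and truncated to six digits, as on the slide) and without it (the power company alone in series). The input values come from the parts table; the helper names are illustrative.

```python
import math

MINUTES_PER_YEAR = 525_960  # 365.25-day year


def serial(availabilities):
    return math.prod(availabilities)


def parallel(availabilities):
    return 1.0 - math.prod(1.0 - a for a in availabilities)


def truncate(value: float, digits: int = 6) -> float:
    scale = 10 ** digits
    return math.floor(value * scale) / scale


def downtime_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1.0 - availability)


power_company = 0.999945
battery_backup = 0.999
service_provider = 0.99999
home_router = 0.9999

# Divide: power company and battery backup in parallel.
power = truncate(parallel([power_company, battery_backup]))

# Conquer: power, service provider, and home router in series.
with_battery = downtime_minutes(serial([power, service_provider, home_router]))
without_battery = downtime_minutes(
    serial([power_company, service_provider, home_router])
)

print(f"With battery backup:    {with_battery:.2f} minutes/year")
print(f"Without battery backup: {without_battery:.2f} minutes/year")
```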
Contributors to Downtime
Human Error/Operations Process
•Human Error
Human Error/Operations Process
We All Make Mistakes
Process Issue | Availability | Annual Downtime
Lack of Rollback Planning process for large upgrades | 0.998178 | 16 Hours
Lack of process for controlling IP addresses for new PCs | .99932 | 6 Hours
Lack of test process before introducing new product into a production network | .999 | 8 3/4 Hours
Lack of password control process (security breaches) | .995 | 44 Hours
Allowing changes to routers without considerable process and testing | .999 | 8 3/4 Hours
•In the table above are some examples of downtime caused by human
error. I bet we all have stories about folks that did something and then
suffered downtime.
•How many of you are familiar with an event involving network downtime
due to human error?
Human Error and Operations Process
Cisco’s ANS Team Can Help
with Minimizing Downtime
Due to Human Error and
Operations Process
• Human error and operations process
downtime is the toughest to predict
•One service available from Cisco is called “ANS” (previously NSA).
•Those guys work with customers BEFORE major network upgrades or
growth in order to reduce potential hits to availability.
Human Error and Operations Process
Methodology
• Estimate MTTR per likely event
• Estimate time between events and
call it MTBF
• Calculate availability result of event
• Multiply event availabilities together
• Use result in site calculations as
serial component
•This is a listing of the major steps I use to account for human error and
process issues when performing an availability analysis.
Human Error and Operations Process
Computations
[Diagram: Home 1 and Home 2 each connect through the Service Provider to
the Internet.]
•Let’s do an example using a couple CPE sites and an SP central site that
connects customers to the Internet.
Human Error and Operations Process
Computations
[Diagram: RBD of four blocks in series: I (Internet), SP (Service
Provider), CPE (CPE Router), U (Home User). The Internet and Service
Provider blocks are marked as common to all users; the CPE Router and
Home User blocks are marked as unique to each user.]
•It is important to scope the possibility of “user contribution” to downtime.
•Note that if a user at the CPE site does something to break the network it will only affect them.
•If an SP employee makes a mistake, it could take down large numbers of
users.
Human Error and Operations Process
The Parts
Network Component | Availability or Frequency | Annual Downtime
The Service Provider Network | .99999 | 5.2 Minutes
Home Networks 1 and 2 | .9999 | 52 Minutes
The Internet | .999999 | .86 Minutes
Error in Service Provider Network | 35,064 Hours | 60 Minutes
Error in Home Network 1 | 12,000 Hours | 120 Minutes
•The Central Site error will be a mistake made by the service provider.
The service provider mistake will take all of the customers out of service
for one hour. This mistake will happen one time every four years as
shown in the table.
•The CPE error will occur when the user in Home Network 1 does
something that makes their network unable to connect to the service
provider network for two hours. Let us assume this error happens one
time per 12,000 hours.
Human Error and Operations Process
Initial Computations
Base Internet Availability
Internet * SP * CPE1 = .999999 *.99999 *.9999
= .999889
Annual Downtime = [525,960 *(1 – .999889)]
= 58.4 Minutes
•First we start off by computing the availability of the network without
considering any human error.
Human Error and Operations Process
CPE Computations
Downtime for CPE Human Error
MTBF = 12,000 Hours
MTTR = 2 Hours
Availability (H.E.) = 12,000 / (12,000 + 2)
= .99983
Annual Downtime = [525,960 * (1– .99983)]
= 87.6 Minutes
•Let’s compute the downtime and availability of CPE “human error” on
this network.
•It appears as though this user is going to cause themselves about 87.6
minutes per year of downtime.
Human Error and Operations Process
CPE With Error Computations
Total Availability = Base Availability * Human Error Availability
Total Availability = 0.999889 * 0.99983
Total Availability = 0.99972
Total Annual Downtime = 525,960 * (1– .99972)
= 147 Minutes
•And then we combine them using the serial equation methodology.
•So far, we are up to 147 minutes per year of downtime and we’re not
really taking everything into consideration.
•We still need to compute the SP Human Error Contribution to downtime.
Human Error and Operations Process
SP Error Computations
Availability as a Result of SP Mistake:
Availability = 35,064 / (35,064 + 1)
= .99997
Total Availability = .99997 * .99972
= .99969
Total Downtime = [525,960 * (1 – .99969)]
= 163 Minutes
•And here you see it with the addition of the SP “human error” mistake.
•Figuring the availability of the human error at the SP to be .99997, we
multiply that by our previous result and add a little more downtime
to get to 163 minutes per year.
•We’re starting to get pretty far away from “5 nines” here, so let’s keep this
human error thing in mind when we operate and upgrade our networks.
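The whole human error walkthrough can be replayed in one short script. This is a sketch using the MTBF and MTTR figures from the slides; the function name `availability` is an illustrative assumption. Carried at full precision, the total should land near 161 minutes rather than the 163 minutes on the slide, because the slides round each intermediate availability.

```python
MINUTES_PER_YEAR = 525_960

def availability(mtbf_hours, mttr_hours):
    """Availability of one kind of event from its MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Base availability without human error: Internet * SP * CPE1
base = 0.999999 * 0.99999 * 0.9999

# Human error treated as two more serial components
cpe_error = availability(12_000, 2)   # user mistake at Home Network 1
sp_error = availability(35_064, 1)    # service provider mistake, once every four years

total = base * cpe_error * sp_error
downtime = MINUTES_PER_YEAR * (1 - total)

print(f"CPE human error availability: {cpe_error:.5f}")
print(f"SP human error availability:  {sp_error:.5f}")
print(f"Total annual downtime:        {downtime:.1f} minutes")
```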
Contributors to Downtime
Redundancy Protocols
•OK - let’s bring everyone back on track and start looking at the high level
perspective again.
•How long does it take OSPF to fail over when two routers are in parallel
and one crashes?
•OK, I know you need more information: - 10 routes!
•What about 1,000 routes?
Redundancy Protocols
Switchover Times
• Determine amount of time required for fail-over
mechanism to recover from a failed device
• Determine number of device failures
• Compute availability based on time to recover
(MTTR) and number of failures (annually) derived
from MTBF
• Include this in appropriate network segment
calculations as a serial component to the
availability analysis
•Here is a list of the major steps for including fail-over protocol times in
your availability calculations.
•The number of failures over any particular period of time is simply that
time divided by the MTBF. You can add the MTTR to the MTBF if you want,
since the sum is the total time between parts replacements (MTPR).
•For 1 year, that is 8766 / MTBF.
•If MTBF is lower than 8766 hours, then you would have more than 1
failure per year.
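As a quick illustration of the failure-count arithmetic in these notes, here is a minimal sketch. The function name `failures_per_period` and the 4,383-hour MTBF in the second call are made up for the example.

```python
HOURS_PER_YEAR = 8766

def failures_per_period(period_hours, mtbf_hours, mttr_hours=0.0):
    """Expected failures in a period; MTTR is optional, as the notes say."""
    return period_hours / (mtbf_hours + mttr_hours)

# A device whose MTBF + MTTR adds up to exactly one year fails once a year.
once_a_year = failures_per_period(HOURS_PER_YEAR, 8757.234, 8.766)

# A hypothetical device with an MTBF of half a year fails twice a year.
twice_a_year = failures_per_period(HOURS_PER_YEAR, 4383)

print(once_a_year, twice_a_year)
```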
Redundancy Protocols
Cisco Supports All the Standard
Redundancy Protocols and Some
Extras that Are Proprietary
MTTR and MTBF Are
Going to Be Derived
Using Basic Algebra
•Since Cisco supports all of the major fail-over protocols and a few
proprietary protocols, you can be sure that times from microseconds to
minutes will be considered.
•The configuration of your network and/or systems will affect these
computations.
Redundancy Protocols
Learn & Discern
[Diagram: RBD from endpoint A to endpoint B: Router 1, then Routers 2 and
3 in parallel, then Routers 4 and 5 in parallel, then Router 6.]
•Let us perform a quick example. We will assume that we have a network
which produces the above RBD.
•As you can see, there are 2 sets of devices in parallel; 4 of the routers
are, in fact, parallel devices. Therefore, our process will involve
computing the parallel parts first and then the serial parts to come up
with our predictions.
Redundancy Protocols
The Parts
• OSPF routing
• MTBF = 8757.234 hours
• MTTR = 8.766 hours
• 35 seconds for an OSPF
“complete fail-over”
• Data only network
•Our example will run OSPF routing protocol and the MTBF and MTTR
figures for the network along with some basic computations are listed
here.
•We magically used MTBF and MTTR numbers that made it so each
device would fail 1 time annually!!!
•And even more amazing, they have 3 nines of availability!
Redundancy Protocols
Divide & Conquer
Parallel Parts
Router Pairs 2, 3 and 4, 5 = 1 – [(1 – .999) * (1 – .999)]
= .999999
•For each of the segments depicting parallel routers, the availability is
computed using the parallel equation as shown here.
•We’re assuming the two sets of pairs are the same.
Redundancy Protocols
Divide & Conquer
Without OSPF
Router 1 | .999
Router 2, 3 | .999999
Router 4, 5 | .999999
Router 6 | .999
End to End | .998
Base Downtime = 525,960 * (1 – .998)
= 1051.92 Minutes
•Since all 6 routers are the same, we compute end to end by using the
serial equation and including our “paired” results.
•Our base downtime looks a little high - but then 3 nines is a little low for a
router!
Redundancy Protocols
Divide & Conquer
Failures/Year = Hours in Period / (MTBF + MTTR)
MTBF = 8757.234
MTTR = 8.766
Hours/Year = 8766
Failures/Year = 8766 / (8757.234 + 8.766)
= 1 Failure per Year per Router
OSPF Fail-Over Time = 35 Seconds
= .58 Minutes
•Now since we magically chose a router that fails exactly 1 time each
year, it is fairly simple to calculate the number of failures per year
and the resulting OSPF fail-overs that will occur.
•35 seconds is equal to .58 minutes, so each router contributes .58
minutes of fail-over time per year.
Redundancy Protocols
Conquer
Router Pair Availability = .999999
OSPF Downtime = 1.16 Minutes
OSPF Availability = (525,960 – 1.16) / 525,960
= .999998
Router Pair Availability = .999998 * .999999
= .999997
End to End = .999 * .999997 * .999997 * .999
= .99799
Downtime w/OSPF = 525,960 (1– .99799)
= 1057.2 Minutes
•Since a router pair availability is the only thing that will be “failed over”,
we can simply affect the availability of the router pairs by multiplying that
result by the availability of the fail-over protocol.
•And so you see that with routing protocol fail-over time considered, we
get a few more minutes of downtime than we calculated without it. If you
remember, our previous result (excluding fail-over time) was 1051.92
minutes, a difference of about 5 minutes once all the rounding errors
are introduced!
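To see how much of that difference is rounding, the fail-over example can be recomputed without truncating any intermediate result. This is a sketch under the slide's assumptions (one failure per router per year, 35 seconds per fail-over, both routers of a pair counted); the variable and function names are illustrative. At full precision the fail-over contribution should come out closer to 2.3 minutes, roughly 1.16 minutes for each of the two router pairs.

```python
MINUTES_PER_YEAR = 525_960

def parallel(a, b):
    return 1 - (1 - a) * (1 - b)

router = 0.999
failover_minutes = 35 / 60            # one OSPF fail-over
failures_per_router_per_year = 1

# Hardware/software availability of one redundant pair
pair_hw = parallel(router, router)

# Fail-over downtime per pair: two routers, each failing once a year
pair_failover_downtime = 2 * failures_per_router_per_year * failover_minutes
failover_availability = (MINUTES_PER_YEAR - pair_failover_downtime) / MINUTES_PER_YEAR

# Fail-over availability is a serial factor on each pair
pair_with_failover = pair_hw * failover_availability

base = router * pair_hw * pair_hw * router
with_ospf = router * pair_with_failover * pair_with_failover * router

base_downtime = MINUTES_PER_YEAR * (1 - base)
ospf_downtime = MINUTES_PER_YEAR * (1 - with_ospf)
extra = ospf_downtime - base_downtime

print(f"Without fail-over: {base_downtime:.2f} minutes/year")
print(f"With fail-over:    {ospf_downtime:.2f} minutes/year")
print(f"Added by OSPF:     {extra:.2f} minutes/year")
```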
The Process of Availability Prediction
Divide and Conquer/Learn and Discern
• So how do we put
this all together
•OK - the bombs are done - we’re ready to start wrapping things up with
the overall process and an example.
The Process of Availability Prediction
Divide and Conquer/Learn and Discern
• Complex tasks will be
broken into smaller,
simpler tasks:
• “Learn and Discern”
• “Divide and Conquer”
•The divide-and-conquer algorithm is represented by the following steps:
•Step 1 Determine the scenarios and create RBDs for each scenario to be
analyzed. Make sure to include redundancies.
•Step 2 Perform calculations for each network component in the RBD.
•Step 3 For each scenario— Perform calculations on serial sections,
contained within parallel sections, to determine an availability figure for
the section.
• — Perform calculations on parallel sections into an availability
figure for the parallel section.
• — Repeat as required until the end-to-end result can be achieved
via a single serial end-to-end calculation.
•Step 4 For each scenario, multiply all sections (including results from
Step 3) into the end-to-end availability result.
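The steps above lend themselves to a small recursive evaluator. The sketch below is one possible way to express them, not the method used by any Cisco tool; the nested-tuple representation and the name `evaluate` are assumptions made for illustration. The example input is the call management section from this presentation: one Call Manager in parallel with a serial chain of router, switch, and second Call Manager.

```python
from math import prod

def evaluate(block):
    """Reduce a nested RBD to a single availability figure.

    A block is either a number (one component) or a tuple of
    ("serial" or "parallel", [sub-blocks]).
    """
    if isinstance(block, (int, float)):
        return float(block)
    kind, parts = block
    values = [evaluate(part) for part in parts]   # conquer the sub-sections first
    if kind == "serial":
        return prod(values)
    if kind == "parallel":
        return 1 - prod(1 - v for v in values)
    raise ValueError(f"unknown block type: {kind}")

# Serial section (R2, S2, C2) contained within a parallel section with C1
call_management = ("parallel", [
    0.999,                                   # C1
    ("serial", [0.99985, 0.9998, 0.999]),    # R2, S2, C2
])

cm = evaluate(call_management)
print(f"Call management availability: {cm:.8f}")
```

Because the evaluator recurses into sub-blocks before combining them, serial sections inside parallel sections are reduced first, which matches the order in Step 3.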
The Process of Availability Prediction
Learn…
[Diagram: small two-building VoIP network with IP phones P1 through P4,
switches S1 and S2, routers R1 and R2, Call Managers C1 and C2, and a
PSTN connection; one building is labeled Building B.]
•Let’s take a look at a small network that could represent (partially) what it
might look like if we had a couple of small buildings running VoIP
telephones.
The Process of Availability Prediction
Discern
[Diagram: two RBDs]
On-Net to On-Net: C1, P1, S1, P2, R1, S2, R2, C2
On-Net to Off-Net: C1, P1, S1, PSTN, R1, R2, S2, Phone, C2
•The first thing to notice is that there are a variety of different RBDs
possible depending on who an individual user is attempting to call.
•Above you can see that the required equipment for an “on-net to on-net”
call is different when compared to an “on-net to off-net” call.
The Process of Availability Prediction
Divide
System | Availability
IP Phone | 0.99995
Switch | 0.99980
Router | 0.99985
Call Manager | 0.999
PSTN | 0.9997
Redundant Call Management | ?????
•After “learn and discern”, it’s a good idea to list the availability of the
parts.
•Since “redundant call management” is one of the things we will be
computing, let’s leave it blank for now.
The Process of Availability Prediction
Conquer each System
• Learn
Learn about the system or network topology
• Discern
Do an RBD showing the required elements of the
system or network topology
• Divide
Divide the RBD into logical sections
• Conquer
Compute the results of each section in the RBD so they
can be combined in end-to-end computations
•Another reminder on process. Let’s repeat it one more time - learn,
discern, divide and conquer.
The Process of Availability Prediction
Conquer Serial/Parallel Sections
• For each redundant part of
the network:
Learn, discern, divide, conquer
(recursion is rampant here)
• Repeat until ready for end-to-end
in serial equation
•Now we are into the recursion. As we divide up the larger network into
smaller parts, we may end up having to “Learn, Discern, Divide and
Conquer” subsections.
The Process of Availability Prediction
Conquer “On-Net to Off-Net”
Parallel Call Management
[Diagram: C1 in parallel with the serial chain R2, S2, C2]
Availability (CM1) = 0.999
Availability (CM2) = 0.99985 x 0.9998 x 0.999
= 0.99865
Call Management Availability = 1 – [(1 – 0.999) x (1 – 0.99865)]
= .99999865
•Above you can see the work to compute the availability of the section of
our network that provides “call management” function.
•Above is an entire learn, discern, divide and conquer process providing a
result for “call management” services on this little example network.
The Process of Availability Prediction
Conquer “On-Net to Off-Net”
On-Net to Off-Net RBD: C1, P1, S1, PSTN, R1, R2, S2, P2, C2
On-Net to Off-Net RBD (After Divide and Conquer): P1, S1, R1, CM*, PSTN, P2
•Let’s finish the process for the “On-Net to Off-Net” scenario.
•With the previous work done, we can simplify our RBD from the top of the
diagram to the bottom of the diagram. As you can see, the bottom
diagram is now a simple serial equation away from the answer.
•The previous result is going to be “CM*” above.
The Process of Availability Prediction
Conquer “On-Net to Off-Net”
RBD: P1, S1, R1, CM*, PSTN, P2
Availability = P1 x S1 x R1 x CM* x PSTN x P2
Phone 1 | .99996
Switch 1 | .99980
Router 1 | .99985
Call Management | .999999
PSTN | .9997
Phone 2 | .99996
Availability = 0.9993 End to End
•With all the parts listed and multiplied together, we get our end to end
answer - for this scenario.
•We would of course have to repeat the aggregation of the parts to
perform the calculations for the “On-Net to On-Net” scenario, but we will
have to skip that in the interest of time.
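One way to organize the per-scenario work is to keep a table of part availabilities and describe each scenario as a path through it. The sketch below uses the values from the slide for the "On-Net to Off-Net" scenario; the dictionary layout and names are assumptions for illustration, and further scenarios would be added as additional paths once their RBDs are reduced.

```python
from math import prod

# Availability of each block in the reduced RBD (values from the slide)
parts = {
    "P1": 0.99996,     # Phone 1
    "S1": 0.99980,     # Switch 1
    "R1": 0.99985,     # Router 1
    "CM*": 0.999999,   # redundant call management, already reduced
    "PSTN": 0.9997,
    "P2": 0.99996,     # Phone 2
}

# Each scenario is a serial path through the parts table
scenarios = {
    "on-net to off-net": ["P1", "S1", "R1", "CM*", "PSTN", "P2"],
}

results = {name: prod(parts[p] for p in path) for name, path in scenarios.items()}

for name, value in results.items():
    print(f"{name}: {value:.4f}")
```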
Example
Our Last Section
• We’re on the
home stretch now
•We are now ready to use our learn, discern, divide and conquer process
along with our knowledge about all five contributors to downtime in order
to perform an example availability analysis.
•In the interest of time, this is going to be as simple as we can make it and
still show all the parts.
Let’s Do an Example Data over Cable
• Data over cable (HFC)
• Simple example of providing
internet service
• From the home to the internet (backbone)
• Head-end failures have large impacts
•Popular Technology
•A Weak point - re-ranging
•Fixing a weak point with HCCP
Data over Cable Learn
[Diagram: cable network with CPE, cable plant, upconverter, a head-end
with two uBR 7246 routers, and the backbone.]
•Let’s assume that a cable network looks like this diagram. The data flow
is from the home PC to the Internet, represented by the gray circle on the
right side of the diagram.
Data over Cable
Discern
Cable Example Reliability Block Diagram
[Diagram: RBD of CPE, HFC, two parallel head-end (H-E) routers, and two
parallel backbone (B-B) routers]
• Regions for power, human error, and redundancy protocols are boxed
•Learning from the diagram and discerning our path leads us to an RBD
that looks like this. We have a CPE section, an HFC section, a central site
section and an Internet section.
Data over Cable
Divide
• CPE
Hardware, software, power, no human error, no
redundancy protocols
• HFC
Assumed availability of .99998
• Head-end and backbone
Hardware, software, power, human error,
redundancy protocols
• Internet
Assumed availability of .99997
•For the CPE site, we are not going to compute human error. If we
computed the amount of downtime caused by human error at CPE sites,
where users turn the machines off, our calculators might run out of
batteries!! Just kidding, but we have to have some fun here.
•At our central site, at the point where the HFC connects to the Head-End
routers, we are going to do two calculations: With and Without HCCP
protocol.
Data over Cable
Conquer the CPE (1)
[Diagram: serial RBD of IOS, CVA 122, CVAPWR, and PWR]
Part | MTBF | MTTR
CVA 122 | 325,928 | 8
CVAPWR | 300,000 | 8
IOS | 25,000 | 0.1
Power Availability = .999945
•Since our CPE device is not “redundant”, we don’t need to calculate any
fail-over mechanism contribution to downtime.
•Above, you can see the listing for the parts of the CPE and the RBD
for it.
Data over Cable
Conquer the CPE (2)
CVA 122 Availability = .999976
CVAPWR Availability = .999973
IOS Availability = .999996
Power Availability = .999945
CPE Availability = .999976 x .999973 x .999996 x .999945
= .99989
•Inputting the MTBF and MTTR numbers into the availability equation
produces the equation above.
•This is simply the serial availability equation applied to this scenario.
•The .99989 result (nearly 4 nines) is going to be used later in our end-to-end calculations.
•Did anyone notice that using the CVA battery backup power supply
would significantly increase the availability of the CPE portion of our
network? It would virtually eliminate the 29 minutes of downtime due to
power outages.
•Next time, we’ll have to include that battery backup.
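The CPE numbers can be rebuilt from the MTBF and MTTR table. This is a sketch; it assumes the MTBF and MTTR figures are both in hours (the slide does not state units, and only the ratio matters), and the names `availability` and `cpe_parts` are illustrative.

```python
from math import prod

MINUTES_PER_YEAR = 525_960

def availability(mtbf, mttr):
    return mtbf / (mtbf + mttr)

# (MTBF, MTTR) per part, from the CPE slide; units assumed to be hours
cpe_parts = {
    "CVA 122": (325_928, 8),
    "CVAPWR": (300_000, 8),
    "IOS": (25_000, 0.1),
}
power = 0.999945  # power company only, no battery backup

part_availability = {name: availability(*figures) for name, figures in cpe_parts.items()}
cpe = prod(part_availability.values()) * power

for name, value in part_availability.items():
    print(f"{name}: {value:.6f}")
print(f"CPE availability: {cpe:.5f}")
print(f"Downtime from power alone: {MINUTES_PER_YEAR * (1 - power):.1f} minutes/year")
```

The last line reproduces the roughly 29 minutes of power-related downtime mentioned in the notes.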
Data over Cable
Conquer the Head-End
UBR7246 HW/SW Availability = 0.99992
Parallel UBR7246 Routers (HW/SW) = 1 – [(1 – .99992)^2]
= .9999999 (Truncated)
• Hardware, software done—still must consider
power, fail-over and human error—those will be
done in a later step
•As you can see from the calculations above, an oversimplified prediction
of downtime from UBR7246 routers by including only hardware and
software does not provide an accurate answer.
•Those of you with the calculators will have calculated that we will only
have a few seconds per year of downtime resulting from a pair of
UBR7246s.
•As you will find out later, this result is going to depend heavily on the
fail-over mechanism used between them.
•NOTE: These calculations were done using the “Cisco SHARC”
spreadsheet which also calculates new MTBF and MTTR numbers
for us. Although they are not included here - they will be used in
subsequent sections.
Data over Cable
Conquer the Backbone
Cisco 12000 HW/SW Availability = 0.9999
Parallel 12000 Routers (HW/SW) = 1 – [(1 – .9999)^2]
= .9999999 (Truncated)
• Again, we see HW/SW calculations for a
redundant component coming in at near
perfect—and again, we will need to consider the
remaining contributors to downtime before we
draw any conclusions
•As with the UBR’s, we first do some easy math and get availability of the
hardware and the software.
•NOTE: These calculations were done using the “Cisco SHARC”
spreadsheet which also calculates new MTBF and MTTR numbers
for us. Although they are not included here - they will be used in
subsequent sections.
Data over Cable
Conquering Fail-Over (1)
Head End System MTBF = 8,516
BackBone System MTBF = 52,044
Annual Head-End Failovers = 8,766 / 8,516
= 1.03
Annual BackBone Failovers = 8,766 / 52,044
= 0.168
•The first step in determining the contribution that fail-over protocols will
provide is to determine the predicted number of them that will happen.
•Here you see that there will be 1.03 failures per year from the 7246s and
.168 failures per year from the 12000s. It’s interesting that devices with
similar availabilities can have such different MTBFs.
•This is due to redundancy and MTTR inherent in the devices.
Data over Cable
Conquering Fail-Over(2)
OSPF Fail-Over = .5 minutes
Re-range = 5 minutes
1.03 X 5 minutes = 5.15 minutes
1.03 X .5 minutes = .515 minutes
.168 X .5 minutes = .084 minutes
•If we assume a primary / backup relationship, whenever a primary fails,
we will suffer some downtime until the backup takes over.
•Given that the UBR7246 fail-over will take 5.15 minutes for the re-ranging
of the CPE’s and .515 minutes for the OSPF routing resumption
(annually), we can use the 5.15 minutes since that will be the gating factor
for our users. This is a little bit of a kludging of the process - but it’s
accurate enough for our purposes of including fail-over times where
previously we might have overlooked them.
•The 12000’s are simpler in that the .084 minutes of downtime will be the
factor.
•Our next step will be to use these annual times to compute annual
availability such that we can include “fail-over availability” into our end to
end calculations.
Data over Cable Conquering Fail-Over (3)
HE Fail-Over Availability = (525,960 – 5.15) / 525,960
= .99999
BB Fail-Over Availability = (525,960 – .084) / 525,960
= .999999 (Truncated)
•It’s a simple matter to compute the availability of the fail-over contribution
to downtime.
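The three fail-over steps (count the fail-overs, total the annual minutes, convert to an availability) can be folded into one helper. This is a sketch; the function name `failover_availability` is an assumption. Following the notes, the head-end uses the 5-minute re-range time as the gating factor, and the backbone uses the .5-minute OSPF fail-over.

```python
HOURS_PER_YEAR = 8766
MINUTES_PER_YEAR = 525_960

def failover_availability(system_mtbf_hours, minutes_per_failover):
    """Availability lost to switchover time alone."""
    failovers_per_year = HOURS_PER_YEAR / system_mtbf_hours
    annual_minutes = failovers_per_year * minutes_per_failover
    return (MINUTES_PER_YEAR - annual_minutes) / MINUTES_PER_YEAR

head_end = failover_availability(8_516, 5)      # re-ranging of the CPEs
backbone = failover_availability(52_044, 0.5)   # OSPF fail-over

print(f"HE fail-over availability: {head_end:.7f}")
print(f"BB fail-over availability: {backbone:.7f}")
```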
Data over Cable Conquering Power Loss
• In order to save time, we are going to
assume that the power mitigation
techniques at this company will mitigate
any power loss to less than 5 minutes per
year and we will simply include .99999 as
our power contribution to downtime in our
end to end calculations
•Power mitigation through battery backups and with generator support at
Service Provider central sites is common.
•With that in mind, I feel comfortable assuming a nice highly available
value for this.
Data over Cable Conquering Human Error
• In order to save time, we will simply
assume no more than 15 minutes per
year of downtime caused by human
error and we will use .999971 as our
human error availability
•It is probably more in real life.
VoIP over HFC End to End
End to End
End to End = .99989 (CPE)
x .99998 (HFC)
x .999999 (HE, HW/SW)
x .999999 (BB, HW/SW)
x .99999 (HE, Fail-over)
x .999999 (BB, Fail-over)
x .99999 (Power)
x .999971 (Human Error)
x .99997 (Internet)
End to End = .99979
Downtime = 525,960 * (1 – .99979)
= 110.45 minutes
•Because multiplication is commutative and associative, we can simply
multiply all the availabilities together, in any order, in our end-to-end
serial equation as above.
•Based on the assumptions in this example, our end user would
experience approximately 110.45 minutes per year of downtime.
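The end-to-end product is easy to script, and doing so also shows which contributor costs the most. This is a sketch using the nine figures from the slide; the dictionary keys are illustrative labels. Without rounding the product to five digits first, the downtime should come out near 111.5 minutes rather than 110.45.

```python
from math import prod

MINUTES_PER_YEAR = 525_960

contributors = {
    "CPE": 0.99989,
    "HFC": 0.99998,
    "HE HW/SW": 0.999999,
    "BB HW/SW": 0.999999,
    "HE fail-over": 0.99999,
    "BB fail-over": 0.999999,
    "Power": 0.99999,
    "Human error": 0.999971,
    "Internet": 0.99997,
}

end_to_end = prod(contributors.values())
downtime = MINUTES_PER_YEAR * (1 - end_to_end)

# Lowest availability first, i.e. largest downtime contributor first
ranked = sorted(contributors, key=contributors.get)

print(f"End to end: {end_to_end:.5f}")
print(f"Downtime:   {downtime:.1f} minutes/year")
for name in ranked[:3]:
    minutes = MINUTES_PER_YEAR * (1 - contributors[name])
    print(f"  {name}: about {minutes:.1f} minutes/year on its own")
```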