Week 3 - Network Survivability

Network Survivability,
Reliability and Availability:
Protection & Restoration
Zilong Ye, Ph.D.
Reliability




Reliability is the probability that a system or component will
operate without any service-affecting failure for a period of
time t
Reliability is a monotonically decreasing probability function
of time, R(t)
A specific reliability number always implies an assumed
duration of time
Reliability is about
- How soon the next failure (and the repair expense that follows) might be incurred
- but reliability itself does not consider
• the repeated cycles of failure,
• repair time, and
• return to service,
which together determine the availability of an ongoing service
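As a rough illustration of a monotonically decreasing R(t), the sketch below assumes a constant failure rate (an exponential lifetime model, which the slides do not specify), so that R(t) = exp(-t / MTTF). The function name and the 10,000-hour MTTF are made up for the example.

```python
import math

def reliability(t_hours, mttf_hours):
    """R(t) under an ASSUMED constant failure rate: R(t) = exp(-t / MTTF)."""
    return math.exp(-t_hours / mttf_hours)

mttf = 10_000.0  # hypothetical MTTF in hours, not taken from the slides
for t in (0, 1_000, 10_000, 50_000):
    # The longer the assumed mission time, the lower the reliability figure.
    print(f"R({t} h) = {reliability(t, mttf):.4f}")
```

The same component therefore has many "reliability numbers", one per assumed duration, which is why a reliability figure is meaningless without its time period.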
Availability


Availability is the probability that a system will be
found in the operating state at a random time in
the future
Availability inherently reflects a statistical
equilibrium between
- failure processes: mean time to (or between) failure
(MTTF/MTBF) and
- repair processes: mean time to repair (MTTR)
- in maintained repairable systems that are returned to
the operating state following any failure
$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$
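A small sketch of the availability ratio and the downtime it implies. The MTTF/MTTR values are hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year

def availability(mttf, mttr):
    """Steady-state availability = MTTF / (MTTF + MTTR); both in the same unit."""
    return mttf / (mttf + mttr)

def downtime_minutes_per_year(avail):
    """Expected downtime implied by an availability figure."""
    return (1.0 - avail) * MINUTES_PER_YEAR

# Hypothetical system: fails about once a year (8,760 h) and takes 4 h to repair.
a = availability(mttf=8_760.0, mttr=4.0)
print(f"availability = {a:.5f}, downtime = {downtime_minutes_per_year(a):.0f} min/yr")

# Downtime budget for 2-nines through 6-nines.
for nines in range(2, 7):
    print(f"{nines}-nines: {downtime_minutes_per_year(1 - 10 ** -nines):.2f} min/yr")
```

With these assumptions, 5-nines works out to roughly 5.26 minutes per year, so rule-of-thumb figures such as "5 min/yr" are rounded.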
Quantification of Availability
Percent Availability   N-Nines   Downtime (Minutes/Year, approx.)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
99.99%                 4-Nines   50 Min/Yr
99.999%                5-Nines   5 Min/Yr
99.9999%               6-Nines   0.5 Min/Yr
Survivability

Survivability of a network as a whole is
- the average fraction of failed working
capacity that can be restored by a
specified mechanism within the spare
capacity provided in a network
- A failed link may be fully restored (100% of its capacity) or only partially restored (<100% of its capacity)
Market Drivers for Survivability



Customer Relations
Competitive Advantage
Revenue
- Negative - Tariff Rebates
- Positive - Premium Services
• Business Customers
• Medical Institutions
• Government Agencies


Impact on Operations
Minimize Liability
Failure Types & Other Motivations

Types of failure:
- Components: links, nodes, channels in WDM, active components, software…
- Human error: backhoe fiber cuts
• Fiber inside oil/gas pipelines is less likely to be cut
- Systems: entire central offices (COs) can fail due to catastrophic events
- Deliberate attacks
Network Survivability: drivers

Availability: 99.999% (5 nines) → about 5 min of downtime per year

Since a network is made up of several components, the ONLY way
to reach 5-nines is to add survivability in the face of failures…
- Survivability = continued services in the presence of failures

Protection switching or restoration: mechanisms used to
ensure survivability
- Add redundant capacity, detect faults and automatically re-route traffic
around the failure

Protection: fast time-scale: 10s-100s of ms…
- implemented in a distributed manner to ensure fast restoration

Restoration: related term, but slower time-scale
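The claim that a multi-component network cannot reach 5-nines without redundancy can be checked with a back-of-the-envelope sketch. It assumes independent failures, and the component count and per-component availability are hypothetical.

```python
def series(*avail):
    """Availability when ALL components must work (no redundancy)."""
    p = 1.0
    for a in avail:
        p *= a
    return p

def parallel(*avail):
    """Availability when AT LEAST ONE redundant alternative must work."""
    q = 1.0
    for a in avail:
        q *= 1.0 - a
    return 1.0 - q

# Ten hypothetical 4-nines components in a row on one path.
path = series(*[0.9999] * 10)
# The same path duplicated over a disjoint route (independent failures assumed).
protected = parallel(path, path)

print(f"unprotected path: {path:.6f}")
print(f"with a disjoint backup: {protected:.8f}")
```

In this toy case the unprotected path falls to about 3-nines, while adding a disjoint backup brings it above 5-nines.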
Types of Fault-Recovery Mechanisms

Protection
- Backup resources (routes and wavelengths) precomputed and reserved in advance (before a failure
occurs) – simple but 50% overhead
- Faster recovery time
- What if pre-reserved resources also fail?

Restoration
- Routes and wavelengths discovered dynamically after detection of a failure
- Resource allocation based on current network state information
- More resource efficient
- Can recover as long as there are redundant resources
- Slower recovery time
Restoration

Path Restoration
- Route can be computed after failure

Link Restoration
- Path is discovered at the end nodes of the failed link
- More practical than path restoration

Advantages & Disadvantages of Restoration
- Can usually recover from multiple element faults
- More efficient use of resources
- Complex
- Slow: requires extra processing time to set up the path and reserve resources
Comparison between Protection & Restoration

Characteristic: Protection -- resources are reserved before the failure and may go unused; Restoration -- resources are reserved and used after the failure
Route: Protection -- predetermined; Restoration -- can be dynamically computed
Resource Efficiency: Protection -- Low; Restoration -- High
Comparison between Protection & Restoration (Cont'd)

Time used: Protection -- Short; Restoration -- Long
Reliability: Protection -- mainly for single faults; Restoration -- can survive multiple faults
Implementation: Protection -- Simple; Restoration -- Complex
Network Survivability Architectures
[Figure: taxonomy of network survivability architectures]
- Restoration
• Self-healing Network
• Re-Configurable Network
• Mesh Restoration Architectures
- Protection
• Protection Switching
• Linear Protection Architectures
• Ring Protection Architectures
- Path-based, Link-based, Segment-based
Restoration in Mesh Networks
[Figure: two mesh networks of DCS nodes. One shows the Reconfigurable (or Rerouting) Restoration Architecture (centralized), with a Central Controller; the other shows the Self Healing (distributed) Restoration Architecture, with a DC at each DCS and probing after restoration.]
DC = Distributed Controller
Protection Switching Terminology

1+1 architectures - permanent bridge at the source, select at the sink

m:n architectures - m entities provide protection for n
working entities where m is less than or equal to n
- allows unprotected extra traffic on the m entities
- most common - SONET linear 1:1 and 1:n

Coordination Protocol - provides coordination between
controllers in source and sink
- Required for all m:n architectures
- Not required for 1+1 architectures
Basic Ideas: Working and Backup Paths
Protection Switching: Terminology


Dedicated vs Shared: working connection
assigned dedicated or shared protection
bandwidth
- 1+1 is dedicated, 1:n is shared
Revertive vs Non-revertive: after failure is fixed,
traffic is automatically or manually switched back
- Shared protection schemes are usually
revertive
Different protection and restoration schemes
[Figure: taxonomy of Protection/Restoration Schemes]
- Restoration
• Link Restoration
• Path Restoration
- Protection
• Ring Protection: Link Protection, Path Protection
• Mesh Protection: Link Protection, Path Protection
• Ring/Mesh Protection: Link Protection, Path Protection
Path vs Link Protection
[Figure: mesh of DCS nodes showing a Working Path, Line or Link Protection, and a Protection Path]
• Control: Centralized or Distributed
• Route Calculation: Preplanned or Dynamic
• Type of Alternate Routing: Line or Path
Types of Fault-Recovery Mechanisms

Path Protection
- Two link- (or node-) disjoint paths: primary (working) and backup (protection) path
- Traffic is rerouted through the link-disjoint backup route once a link failure occurs on the working path
- Usually, fewer resources required than link protection (using shorter routes)
- Lower end-to-end propagation delay for the recovered route
- Backup path pre-reserved or pre-set up
- Backup paths of different connections may or may not share common wavelengths on common links
Types of Fault-Recovery Mechanisms

Dedicated Path Protection
- Do not allow sharing among backup paths (resources)
- Backup paths pre-configured
- No switch configuration necessary along the backup
path when a failure occurs
- Fast recovery time
- Resources not efficiently utilized (100% redundancy)
Types of Fault-Recovery Mechanisms

Shared Path Protection
- Allow sharing among backup paths subject to certain
constraints
• If primary/active paths (APs) are link disjoint, their backup paths (BPs) may share a common link and wavelength
- Backup paths configured when a failure occurs: since backup paths may be shared, resources cannot be committed to a particular primary in advance
- Slower recovery time
- Resource utilization much better
- More signaling required to recover from the failure
Shared Path Protection

If and only if two APs are disjoint, their BPs can share backup bandwidth (BBW) on a common link (i.e., total backup bandwidth = max{w1, w2}).
[Figure: AP1 (bandwidth w1) from S1 to D1 and AP2 (bandwidth w2) from S2 to D2; BP1 and BP2 both traverse link L, which reserves max{w1, w2}]
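The max{w1, w2} rule generalizes to any number of connections: under a single-link-failure assumption, a link must reserve the largest total bandwidth that any one failure could reroute onto it. The sketch below illustrates that bookkeeping; the data layout, node names "X" and "Y", and bandwidth values are hypothetical.

```python
from collections import defaultdict

def backup_bandwidth(connections):
    """connections: list of (ap_links, bp_links, bandwidth).
    Returns {link: backup bandwidth to reserve}, ASSUMING at most one link fails."""
    # need[b][f] = bandwidth that lands on backup link b if link f fails
    need = defaultdict(lambda: defaultdict(float))
    for ap_links, bp_links, w in connections:
        for failed in ap_links:
            for b in bp_links:
                need[b][failed] += w
    # Only one failure at a time, so reserve the worst case over failures.
    return {b: max(per_failure.values()) for b, per_failure in need.items()}

L = ("X", "Y")  # the common backup link
disjoint_aps = [
    ([("S1", "D1")], [("S1", "X"), L, ("Y", "D1")], 3.0),  # AP1, BP1, w1
    ([("S2", "D2")], [("S2", "X"), L, ("Y", "D2")], 5.0),  # AP2, BP2, w2
]
print(backup_bandwidth(disjoint_aps)[L])  # max{w1, w2}

# If the two APs shared a link, one failure would hit both, so no sharing.
shared_ap_link = [
    ([("S", "D")], [("S", "X"), L, ("Y", "D")], 3.0),
    ([("S", "D")], [("S", "X"), L, ("Y", "D")], 5.0),
]
print(backup_bandwidth(shared_ap_link)[L])  # w1 + w2
```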
Types of Fault-Recovery Mechanisms

Link Protection
- A lightpath is set up on a primary path
- For each link on the primary path, a backup detour is
reserved around the link
- No sharing – dedicated-link protection
• Wavelength used on backup loop dedicated to
specific link to be protected
- Shared-link protection
- Note, different connections on the same link might have
different backup detours for that link
Solution 1: Active-Path First




Find an active path (AP) first
Then find a disjoint backup path (BP)
How?
Remove the physical links and resources that the active path traverses, and then rerun the routing algorithm to find the backup path.
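A minimal sketch of active-path-first, assuming the "routing algorithm" is Dijkstra's shortest path on an undirected graph with positive link costs. The function names and toy topologies are illustrative only.

```python
import heapq

def undirected(links):
    """Build {u: {v: cost}} from {(u, v): cost}."""
    adj = {}
    for (u, v), c in links.items():
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    return adj

def shortest_path(adj, s, t):
    """Dijkstra; returns the node list of a cheapest s-t path, or None."""
    dist, prev, done = {s: 0}, {}, set()
    heap = [(0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == t:
            break
        for v, c in adj.get(u, {}).items():
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    if t not in dist:
        return None
    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    return path[::-1]

def active_path_first(adj, s, t):
    """Find the AP, remove its links, then rerun routing for the BP."""
    ap = shortest_path(adj, s, t)
    if ap is None:
        return None, None
    ap_links = {frozenset(e) for e in zip(ap, ap[1:])}
    pruned = {u: {v: c for v, c in nbrs.items() if frozenset((u, v)) not in ap_links}
              for u, nbrs in adj.items()}
    return ap, shortest_path(pruned, s, t)

square = undirected({("s", "a"): 1, ("a", "t"): 1, ("s", "b"): 2, ("b", "t"): 2})
print(active_path_first(square, "s", "t"))

# A "trap" topology: the shortest AP s-a-b-t blocks every disjoint BP,
# although the disjoint pair s-a-t / s-b-t exists.
trap = undirected({("s", "a"): 1, ("a", "b"): 1, ("b", "t"): 1,
                   ("s", "b"): 3, ("a", "t"): 3})
print(active_path_first(trap, "s", "t"))
```

The second topology shows the weakness of the greedy approach: choosing the cheapest AP first can leave no disjoint BP even when a disjoint pair exists.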
Solution 2: Joint Path Selection




Select the active path and backup path in a joint manner.
Joint optimization performs better than active-path-first schemes in terms of the amount of network resources required
How?
Use Suurballe’s algorithm to compute two link-disjoint paths between (s, d) simultaneously
Suurballe’s Algorithm

Given a graph G=(V, E), find a pair of edge-disjoint paths from s to t such that the total edge cost of the two paths is minimal among all such path pairs
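The sketch below solves the same problem (a minimum-total-cost pair of edge-disjoint paths), but note that it is not Suurballe's reduced-cost Dijkstra formulation. It uses the simpler Bhandari-style variant: two successive shortest-path searches with Bellman-Ford on a residual graph, where links of the first path may be "undone" at negative cost. It assumes directed arcs with positive costs and that both s and t appear in the graph. An undirected link is modeled as two opposite arcs. All names are illustrative.

```python
from math import inf

def bellman_ford(nodes, arcs, s, t):
    """arcs: list of (u, v, cost). Returns arc indices of a cheapest s-t path, or None."""
    dist = {n: inf for n in nodes}
    pred = {n: None for n in nodes}
    dist[s] = 0
    for _ in range(len(nodes) - 1):
        changed = False
        for i, (u, v, c) in enumerate(arcs):
            if dist[u] + c < dist[v]:
                dist[v], pred[v], changed = dist[u] + c, i, True
        if not changed:
            break
    if dist[t] == inf:
        return None
    path, n = [], t
    while n != s:
        path.append(pred[n])
        n = arcs[pred[n]][0]
    return path[::-1]

def min_cost_disjoint_pair(cost, s, t):
    """cost: {(u, v): c > 0}. Returns two arc-disjoint node paths, or None."""
    nodes = {n for arc in cost for n in arc}
    used = set()
    for _ in range(2):
        # Residual graph: a used arc can only be traversed backwards, at cost -c.
        arcs = [(v, u, -c) if (u, v) in used else (u, v, c)
                for (u, v), c in cost.items()]
        path = bellman_ford(nodes, arcs, s, t)
        if path is None:
            return None
        for i in path:
            u, v, c = arcs[i]
            if c < 0:
                used.discard((v, u))  # cancel an arc of the first path
            else:
                used.add((u, v))
    # Split the remaining arcs into two s-t paths.
    paths, remaining = [], set(used)
    for _ in range(2):
        node, p = s, [s]
        while node != t:
            nxt = next(v for (u, v) in remaining if u == node)
            remaining.discard((node, nxt))
            p.append(nxt)
            node = nxt
        paths.append(p)
    return paths

links = {("s", "a"): 1, ("a", "b"): 1, ("b", "t"): 1, ("s", "b"): 3, ("a", "t"): 3}
both_ways = {**links, **{(v, u): c for (u, v), c in links.items()}}
pair = min_cost_disjoint_pair(both_ways, "s", "t")
print(sorted(pair))
```

On this toy graph the cheapest single path is s-a-b-t, which has no disjoint partner. The joint search instead returns the pair s-a-t and s-b-t.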
Dedicated-path protection – Heuristic Algorithms

Remove links that do not have free wavelengths
Apply Suurballe’s algorithm to find a pair of paths
Choose the shorter path as the primary path and the longer path as the backup
Assign a wavelength using First-Fit to each path
Guarantees the minimum total bandwidth (TBW) = active BW (ABW) + backup BW (BBW) for this request
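The First-Fit step can be sketched as follows, assuming wavelength continuity (no converters, so a path needs the same wavelength on every link) and wavelengths indexed from 0. The data layout and example values are hypothetical.

```python
def first_fit(path_links, free):
    """Return the lowest-indexed wavelength that is free on every link of the
    path, or None. free: {link: set of free wavelength indices}."""
    if not path_links:
        return None
    common = set.intersection(*(free[link] for link in path_links))
    return min(common) if common else None

def reserve(path_links, wavelength, free):
    """Mark the chosen wavelength as busy on each link of the path."""
    for link in path_links:
        free[link].discard(wavelength)

free = {
    ("s", "a"): {0, 2, 3}, ("a", "t"): {1, 2, 3},   # links of the primary path
    ("s", "b"): {0, 1},    ("b", "t"): {0, 3},      # links of the backup path
}
primary = [("s", "a"), ("a", "t")]
backup = [("s", "b"), ("b", "t")]

for name, path in (("primary", primary), ("backup", backup)):
    w = first_fit(path, free)
    if w is not None:
        reserve(path, w, free)
    print(name, "-> wavelength", w)
```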
Shared-Path Protection
Heuristic 1
- Use Suurballe’s algorithm to generate two routes
- Assign wavelengths while trying to share the wavelengths on the backup paths as much as possible
- Does not perform well in backup path sharing, since routing does not consider wavelength information and so ignores the backup bandwidth sharing potential
A Fast and Efficient Heuristic 3

Challenges
- Jointly optimizing an AP/BP pair with shared path protection is NP-hard
• using ILP is notoriously time consuming.
• also, ILP only guarantees minimal TBW for each request, but not minimal TBW for all requests.
- Heuristics such as active path first (APF) can only achieve sub-optimal results:
• APF does not consider the yet-to-be-incurred backup cost along the BP when selecting a (shortest) AP
Potential Backup Cost (PBC)


Uses a shortest path algorithm to find the AP first
But, in selecting the AP, each capable link e (R_e ≥ w) is assigned a cost of w + PBC_e(w), where the second term is the potential backup cost (PBC)
Then finds a shortest BP
Combines the best of ILP and APF based approaches
See Xu et al., Journal of Lightwave Technology, vol. 25, no. 8, pp. 2251-2259, 2007
Protection in SRLG networks
Shared Risk Link Group (SRLG)



Widely recognized as an important concept in
survivable optical networks
A group of network links that share a common
physical resource (cable, conduit etc.)
Due to layered structure:
- Physical layer: Fiber spans (cable, conduit, etc.)
- Optical layer: Optical links and nodes (a subset of the nodes in the physical layer)
Layered Architecture of Optical Network
[Figure: (a) Optical Layer: nodes 1-5 connected by optical links (e.g., e6). (b) Physical Layer: nodes 1-5 connected by fiber spans (e.g., g3, g5, g6, g7, g8).]
Protection in SRLG networks


Finding an SRLG-disjoint path pair is more complicated than finding a link/node-disjoint path pair. In fact, the former is an NP-complete problem.
If backup bandwidth (BBW) sharing is considered, the SRLG protection problem becomes even more complicated.
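Checking whether a given pair is SRLG-disjoint is easy; it is finding such a pair that is hard. A minimal sketch of the check, with a hypothetical link-to-SRLG mapping (the link and span names are illustrative and not read off the figure):

```python
def path_srlgs(path_links, srlg_of):
    """Union of the risk groups touched by a path.
    srlg_of: {optical link: set of SRLG ids (e.g., fiber spans)}."""
    groups = set()
    for link in path_links:
        groups |= srlg_of.get(link, set())
    return groups

def srlg_disjoint(ap_links, bp_links, srlg_of):
    """True if no physical resource is shared between the two paths."""
    return path_srlgs(ap_links, srlg_of).isdisjoint(path_srlgs(bp_links, srlg_of))

# Hypothetical mapping: two different optical links ride the same span g7.
srlg_of = {
    ("1", "2"): {"g3"},
    ("1", "5"): {"g7"},
    ("1", "4"): {"g7", "g8"},
    ("4", "3"): {"g5"},
    ("5", "3"): {"g6"},
}
ap = [("1", "5"), ("5", "3")]
bp = [("1", "4"), ("4", "3")]
print(set(ap).isdisjoint(bp))          # link-disjoint?
print(srlg_disjoint(ap, bp, srlg_of))  # SRLG-disjoint?
```

Here the two paths share no optical link, yet a single cut of span g7 would take down both, which is exactly the case link-disjoint routing misses.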
APF and Trap

Active Path First (APF), followed by an SRLG-disjoint BP
- attractive alternative (policy-based routing, optimal AP)
- But may fail to find such a BP more frequently (up to
30% of the time statistically) when considering SRLGs

Trap: can’t find an SRLG-disjoint BP
- Real Traps: unavoidable, topology-induced
- Avoidable Traps: algorithm-induced.

Only a few APF algorithms so far deal with avoidable traps.
Other APF-based Heuristic

K Shortest Paths (KSP)
- Finds the first K shortest paths between the source and destination as candidate APs, and
- then tests them in increasing order of their costs, until an SRLG-disjoint BP is found or all of them have been tested.
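A toy sketch of the KSP heuristic. To stay short it enumerates all simple paths by brute force and keeps the K cheapest, a stand-in for a real K-shortest-paths routine such as Yen's algorithm that is only sensible for very small graphs. The topology, costs, and conduit names are hypothetical.

```python
def undirected(links):
    adj = {}
    for (u, v), c in links.items():
        adj.setdefault(u, {})[v] = c
        adj.setdefault(v, {})[u] = c
    return adj

def simple_paths(adj, s, t, path=None):
    """Yield every loop-free s-t path (brute force, toy graphs only)."""
    path = path or [s]
    if path[-1] == t:
        yield path
        return
    for v in adj.get(path[-1], {}):
        if v not in path:
            yield from simple_paths(adj, s, t, path + [v])

def path_cost(adj, path):
    return sum(adj[u][v] for u, v in zip(path, path[1:]))

def risks(path, srlg_of):
    """Each link is its own risk, plus any SRLGs it belongs to."""
    out = set()
    for u, v in zip(path, path[1:]):
        link = frozenset((u, v))
        out |= {link} | srlg_of.get(link, set())
    return out

def srlg_disjoint_bp(adj, srlg_of, ap, s, t):
    """Cheapest path that shares no link and no SRLG with the AP, or None."""
    ap_risks = risks(ap, srlg_of)
    pruned = {u: {v: c for v, c in nbrs.items()
                  if not risks([u, v], srlg_of) & ap_risks}
              for u, nbrs in adj.items()}
    return min(simple_paths(pruned, s, t),
               key=lambda p: path_cost(adj, p), default=None)

def ksp_protect(adj, srlg_of, s, t, k):
    """Test the k cheapest candidate APs in increasing order of cost."""
    candidates = sorted(simple_paths(adj, s, t),
                        key=lambda p: path_cost(adj, p))[:k]
    for ap in candidates:
        bp = srlg_disjoint_bp(adj, srlg_of, ap, s, t)
        if bp is not None:
            return ap, bp
    return None

adj = undirected({("s", "a"): 1, ("a", "t"): 1, ("s", "b"): 2,
                  ("b", "t"): 2, ("s", "c"): 2, ("c", "t"): 3})
srlg_of = {frozenset(("a", "t")): {"conduit1"}, frozenset(("b", "t")): {"conduit1"},
           frozenset(("s", "a")): {"conduit2"}, frozenset(("s", "c")): {"conduit2"}}
print(ksp_protect(adj, srlg_of, "s", "t", k=1))  # plain APF falls into the trap
print(ksp_protect(adj, srlg_of, "s", "t", k=2))  # the second candidate succeeds
```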
Proposed Trap Avoidance (TA) Algorithm



Similar to KSP: iteratively test candidate APs and find one that has an SRLG-disjoint BP
But TA constructs one AP at a time, and modifies it into a new AP for testing only if necessary
TA uses a more intelligent method to avoid the most “risky” link when modifying the AP
- KSP is oblivious/blind to “bad” links
- See Xu et al., IEEE/OSA Journal of Lightwave Technology (JLT), Special Issue on Optical Networks, Vol. 21, No. 11, pp. 2683-2693, 2003
Find a Candidate BP



All the directed links along the AP are assigned a cost of infinity to prevent them from being used by the BP.
All the remaining links that share at least one SRLG with any link on the AP (including the links along the reversed AP) are assigned a large value M as cost.
This discourages, but does not forbid, any shortest-path algorithm from using such “M” links for the candidate BP.
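The cost assignment above can be sketched as a small function. This is only an illustration of the three cost classes, not the published TA algorithm: the choice M = 1 + (sum of all link costs), the data layout, and the rule that every link counts as its own risk (so that reversed AP links get cost M) are assumptions made here.

```python
from math import inf

def candidate_bp_costs(cost, srlg_of, ap):
    """cost: {(u, v): c} directed links; srlg_of: {frozenset({u, v}): SRLG ids};
    ap: node list of the active path. Returns the modified cost map."""
    # ASSUMPTION: "large" means larger than any path built from ordinary links.
    M = 1 + sum(cost.values())

    def risks(arc):
        # A link is treated as its own risk in both directions.
        return srlg_of.get(frozenset(arc), set()) | {frozenset(arc)}

    ap_arcs = set(zip(ap, ap[1:]))
    ap_risks = set().union(*(risks(arc) for arc in ap_arcs))
    new_cost = {}
    for arc, c in cost.items():
        if arc in ap_arcs:
            new_cost[arc] = inf      # forbidden for the BP
        elif risks(arc) & ap_risks:
            new_cost[arc] = M        # discouraged, but not forbidden
        else:
            new_cost[arc] = c        # ordinary link
    return new_cost

links = {(1, 2): 1, (2, 3): 1, (1, 4): 2, (4, 3): 2}
cost = {**links, **{(v, u): c for (u, v), c in links.items()}}
srlg_of = {frozenset((1, 2)): {"g1"}, frozenset((1, 4)): {"g1"}}  # hypothetical
new_cost = candidate_bp_costs(cost, srlg_of, ap=[1, 2, 3])
for arc in sorted(new_cost):
    print(arc, new_cost[arc])
```

Running a shortest-path search on the modified costs then yields a candidate BP. If its cost reaches M, it uses at least one risky link and is not SRLG-disjoint from the AP.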
PROtection using MultIple SEgments (PROMISE)
Basic idea of PROMISE
• Logically “divide” an AP into several, possibly overlapping sub-paths called active segments (AS’s), and then protect each AS with a backup segment (BS)
[Figure: an AP with active segments AS 1 and AS 2 and backup segments BS 1 and BS 2]
Applications for PROMISE


First proposed for Non-SRLG networks
Particularly effective in dealing with either real or
avoidable traps
- Recall that traps are more likely in SRLG networks
Most Bandwidth Efficient: I

Reason: the flexibility it offers in choosing the
appropriate AS’s and corresponding BS’s.
- AP1 and AP2 are not link disjoint, so their BPs
cannot share backup bandwidth
- But in PROMISE, BS 1,1 and BS 2,1 can share backup bandwidth
[Figure: AP1 with active segments AS 1,1 and AS 1,2 and backup segment BS 1,1; AP2 with backup segment BS 2,1]
Most Bandwidth Efficient: II


Inter-Sharing: Sharing between BS’s for
different connections
Intra-Sharing: Sharing between BS’s of the
same connection, e.g. BS1 and BS2 share
backup bandwidth on link c
[Figure: an AP divided into active segments AS1, AS2, and AS3, protected by backup segments BS1, BS2, and BS3; nodes are labeled 1-6 and links a-g]
Faster Recovery and More Resilient


Faster Recovery: Protects each AS using a shorter BS
instead of protecting the entire AP using a longer BP (as
in path protection)
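More Resilient/Robust: Tolerates more multiple-failure scenarios than path protection (with the same or lower bandwidth consumption).
Overall Reliability
[Figure: AP from s to d with links x and y, protected by backup segments BS1 (p) and BS2 (q)]
• Probability that each link is operational: x = y = p = q = 0.8 (the formulas below use 0.8 as the probability that a link works, not that it fails)
Path Protection: $1 - (1 - 0.8^2)^2 \approx 0.87$
PROMISE: $(1 - (1 - 0.8)^2)^2 \approx 0.92$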
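The two reliability figures can be reproduced with a few lines, assuming independent link failures and treating x, y, p, q as per-link probabilities of being operational. The function names are illustrative.

```python
def path_protection(x, y, p, q):
    """AP uses links x and y; the end-to-end BP uses links p and q.
    The connection survives if the whole AP or the whole BP is up."""
    ap_ok = x * y
    bp_ok = p * q
    return 1 - (1 - ap_ok) * (1 - bp_ok)

def promise(x, y, p, q):
    """Each active segment (x, then y) has its own backup segment (p, then q).
    The connection survives if every segment has a working AS or BS."""
    seg1_ok = 1 - (1 - x) * (1 - p)
    seg2_ok = 1 - (1 - y) * (1 - q)
    return seg1_ok * seg2_ok

print(f"path protection: {path_protection(0.8, 0.8, 0.8, 0.8):.4f}")
print(f"PROMISE:         {promise(0.8, 0.8, 0.8, 0.8):.4f}")
```

The gap comes from double failures such as x and q both failing: they break both end-to-end paths, but each segment still has one working alternative.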
Other Benefits of PROMISE

Can Succeed When Other Approaches Fail
- Routing policies, QoS constraints (e.g., hop limit on the AP
and BP), or just APF
- Real/Avoidable Traps in SRLG networks

Can readily be applied to MPLS networks by extending the existing protocols for local repair/recovery in MPLS networks
Key Challenges in PROMISE


Joint optimization of AP selection and the set of protecting
BS's is extremely complex
Even if AP is found first as in APF-based heuristic,
- How to optimally divide AP into AS’s (then corresponding BS's)
- Harder than modeling the general multi-commodity flow problem:
number of BS’s, and the source and destination for each BS are
not known beforehand.