Business continuity engineering: design fundamentals (rules, elements and methodologies)
Sergio De Falco - Italy
Excerpt from the text of the same title - copyright: ISBN 978-88-91084-29-3
The full book (Italian language) can be bought only on-line: www.ilmiolibro.it
_______________________________________________________________

Sergio currently works as an independent ICT consultant and information system designer, and has gained considerable multinational experience working for many years for IBM and other large ICT companies.

Any design of an information system, regardless of the level of continuity to be ensured, must first of all identify the desired functionalities and then use them to plan the architecture and physical structure in terms of hardware (server, storage, peripherals), software (basic and application), networking, site preparation and IT security. A knowledge of the basic rules, parameters and know-how of general ICT design is taken for granted and is preparatory to what follows. The design techniques described here are strictly related to the specific implementation of information systems that must guarantee a high level of business continuity.

1. What we mean by business continuity designing

In extreme synthesis it means "designing an information system that operates with a very, very small number of interruptions and with a very, very high security level of the data processed". There are 3 main parameters that characterize this continuity:
> Availability
> Recovery Time Objective (RTO)
> Recovery Point Objective (RPO)

AVAILABILITY is defined as the percentage ratio between the "regular operation time" (Tr) of a system and its "mission time" (Tm):

A = Tr/Tm x 100

Its value normally lies between 90.000 % and 99.999 %. For most companies an information system availability of three nines is considered sufficient.

Availability               Failure time per year
90.000 % - one nine        36.5 days
99.000 % - two nines       3.65 days
99.900 % - three nines     8.76 hours
99.990 % - four nines      52.56 minutes
99.999 % - five nines      5.26 minutes

Availability, so defined, depends:
-on the overall architecture of the information system;
-on the reliability of all its components, i.e. on the probability that in a given period of time these components do not fail; reliability in turn depends on the following parameters: λ (failure rate), MTTF (Mean Time To Failure), MTTR (Mean Time To Repair) and MTBF (Mean Time Between Failures);
-on the provided environmental and digital protections, described by two corresponding indexes: Environmental Immunity IMMe and Digital Immunity IMMd, used as corrective factors for the global availability index.

RTO - Recovery Time Objective, i.e. the maximum time allowed for full recovery of the system. This depends:
-on methods and technologies for saving data;
-on the efficiency of support and maintenance services;
-on the implemented disaster recovery platforms.

RPO - Recovery Point Objective, i.e. the maximum interval of time allowed between the production of data and its safe saving; consequently it provides a measure of the maximum amount of data that the system may lose due to sudden failures. This depends:
-on methods and technologies for saving data.
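To make the mapping between the "nines" and the failure times in the availability table above concrete, the conversion can be computed directly. A minimal sketch in Python, where the 8,760-hour year used in the table is the only input assumption:

    HOURS_PER_YEAR = 365 * 24  # 8760 h, the same basis as the table above

    def downtime_per_year(availability_pct: float) -> float:
        """Expected failure time per year, in hours, for a given availability %."""
        return (1 - availability_pct / 100) * HOURS_PER_YEAR

    for a in (90.0, 99.0, 99.9, 99.99, 99.999):
        print(f"{a:7.3f} % -> {downtime_per_year(a):9.2f} h/year")

Running it reproduces the table: 99.900 % gives 8.76 hours of failure per year, and 99.999 % about 0.09 hours, i.e. roughly 5.26 minutes.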
2. A short reminder of the structure of standard information systems

A reminder of the standard structure of information systems, usually represented as a stack of interconnected tangible and intangible elements, may be useful:
-the application software is at the top of the stack, i.e. the set of programs that ensure the performance of the functions required by the user, the only "raison d'etre" of the entire system;
-immediately below there is the basic software, i.e. the set of programs that enables connection of the application software with the hardware;
-then the hardware, i.e. the equipment that physically performs the required functions;
-finally the network, connecting all the physical devices, which allows data flow;
-in order to operate properly and regularly, all these different components require side support services such as hardware and software assistance and maintenance, as well as underlying environmental protection equipment.
These are therefore the only objects of the "design" at issue.

3. How to proceed step by step

In order to start the required design, the first step is to set down the desired values of the three basic parameters described above: Availability, RTO and RPO. To do this, the following is necessary:
- a preliminary survey of all the operating, functional and organizational characteristics of the company; this analysis is called BIA (Business Impact Analysis);
- an accurate analysis of the risks and threats to which the company is exposed; this analysis is called RA (Risk Assessment);
- identification, together with customer management, of the balance point between the amount of investment in "continuity" and the value (economic and non-economic) of the damage resulting from a possible crash of the system.

4. Business Impact Analysis and Risk Assessment

A. analysis of the type of business of the company, to check if it is one of those that need high continuity of operation, due to its business nature;
B. identification of the main applications, to check if some of them are mission-critical or real-time;
C. analysis of the architecture of the existing information system, to verify its validity in light of the new objectives of business continuity and to assess whether partial or complete redesign is necessary;
D. survey of the existing hardware configuration, to identify any components that are still usable, those which must be replaced entirely, those that must be integrated and those that must be strengthened;
E. identification of all potential risks, both environmental and digital, which the company information system may encounter;
F. identification of all company locations, to see if there are risks and threats for the locations other than the headquarters;
G. briefings with customer management, to see if there are specific needs and requirements in order to achieve the desired business continuity objective.

Once all the listed pre-project activities have been carried out we can then: set the desired values for Availability (A), Recovery Time Objective (RTO) and Recovery Point Objective (RPO); proceed with the redesign of the architecture, the sizing of its new physical structure, the configuration of the necessary technical support services and the planning of measures to counter environmental and digital risks.
5. Obtaining the required Availability: calculation formulas

It is well known that the Availability index of any physical system can be calculated using the formula:

A = (MTBF - MTTR)/MTBF

For information systems this value must be corrected through the previously described immunity indexes IMMe and IMMd, to take account of environmental and digital threats:

A = [(MTBF - MTTR)/MTBF] x IMMe x IMMd

If we are not dealing with a system made up of only one component, but with a complex one made up of several components, as is the case for information systems, the global Availability index (Atot) of such a system will depend on the Availability indexes (An) of these components, calculated according to the formulas:
-for a serial structure: Atot = A1 x A2 x ... x An
-for a parallel structure: Atot = 1 - (1-A1) x (1-A2) x ... x (1-An)

The MTBF values of the single components (server, storage, switch, firewall, etc) can be derived from their data sheets (note that they are often difficult to obtain, because ICT manufacturers are generally reluctant to make them available unless expressly required), the MTTR values can be imposed as a "requirement" of the SLA (Service Level Agreement), while the IMMe and IMMd indexes depend on the measures set to combat both the environmental and the digital risks. Therefore, calling As the value of A set as the specific project requirement:
-the architecture and the physical structure of the information system we are designing
-the device repair times, depending on IT department organization
-the measures to counter the environmental and digital risks
will be determined on the basis of an ITERATIVE PROCEDURE, by subsequent adjustments, so that the value of the final Availability satisfies the condition:

Afinal ≥ As
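The three formulas above translate directly into code. A minimal sketch (the MTBF/MTTR figures in the example are illustrative, not taken from any data sheet):

    def availability(mtbf_h: float, mttr_h: float) -> float:
        """A = (MTBF - MTTR) / MTBF, for a single component."""
        return (mtbf_h - mttr_h) / mtbf_h

    def serial(*components: float) -> float:
        """Atot of components in series: the product of their availabilities."""
        a = 1.0
        for x in components:
            a *= x
        return a

    def parallel(*components: float) -> float:
        """Atot of redundant components: 1 minus the product of unavailabilities."""
        q = 1.0
        for x in components:
            q *= (1.0 - x)
        return 1.0 - q

    # Example: two redundant servers feeding a single switch, with the two
    # corrective immunity factors applied at the end.
    a_server = availability(mtbf_h=300_000, mttr_h=8)
    a_switch = availability(mtbf_h=250_000, mttr_h=8)
    a_total = serial(parallel(a_server, a_server), a_switch) * 0.99999 * 0.99999
    print(f"Atot = {a_total:.5f}")

Helpers of this kind are reused in the worked example at the end of the text.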
6. Obtaining the required Availability: architecture

a. Server consolidation and virtualization
Concentrating on a single machine, or on a limited number of machines, the various functions present in an information system, rather than distributing them among a wide variety of equipment, beyond the advantages in terms of savings of various kinds (energy, climate, floor space, etc), is particularly advantageous in terms of business continuity: the smaller the number of elements in a process chain, the lower the risk of failures occurring in a given period of time, simply because there are fewer elements susceptible to failure. It should however be noted that simple server "consolidation", in the absence of "virtualization", is only possible for homogeneous operating systems. Closely related to the concept of server consolidation is thus server virtualization, i.e. the creation, by means of specific virtualization software, of several logical machines on a single physical machine. In this regard, it should be observed that the virtualization software constitutes a new ring in the technological chain of the system, and therefore the use of this technique is convenient only if the number of servers to consolidate and virtualize is considerable. The flip side of server consolidation, which should not be ignored and should be assessed case by case, is that if the hardware where consolidation is implemented goes out of service, all the servers consolidated therein go out of service, and not just one. In other words we are creating a "single point of failure", which always brings a high level of risk. One final observation regarding this architectural choice: when the virtualization is done in a "cloud", or in "outsourcing" c/o specialized operators, we not only have the advantage of reducing the amount of equipment to be implemented, but also that of relying on specialized structures, which are generally more reliable.

b. Redundancy
Redundancy, i.e. the doubling or even the multiplying of one or more elements of a system, when in fail-over, increases the availability of the duplicated elements, according to the formula for parallel structures previously highlighted.
-central hardware (server and storage) is the most critical point of the whole information system. In this case redundancy is done by coupling 2 or even more machines, constantly aligning data and programs via parallelization software. This technique is called fail-over clustering. The automatic intervention is realized through a cable called heartbeat, which connects the devices in fail-over and transmits a continuous synchronization signal. As long as the signal on the heartbeat cable is regular, the redundant device remains in stand-by and does not operate, whereas it automatically intervenes when this signal is lacking (see the sketch at the end of this section).
-basic software is another very critical element of the technological chain which constitutes the system. Generally defined as the part of the software closest to the hardware, basic software is the set of programs that governs and controls the operations of the entire computer, providing the link between the hardware and the application programs. In this case redundancy is achieved through the availability of versions which are perfectly and constantly aligned with the one in operation. However, even if these parallel versions can be quickly activated, since they are not in fail-over they are not highly effective.
-application software is the set of programs that ensures the performance of the operational functions for which the entire information system was built. In this case redundancy, as for the basic software, is achieved through the availability of versions that are perfectly and constantly aligned with the current one.
-as regards antivirus devices, when made redundant, it is good practice to install software designed by a different manufacturer on the redundant device, to widen the spectrum of overall action: malware not covered by one manufacturer will be intercepted by the software of another.
-the intelligent networking devices (routers, switches and firewalls), when necessary, as for example in the case of applications such as VOIP, will be made redundant in fail-over.

c. Dissemination of Application Servers
This architectural solution is the exact opposite of server consolidation: the different applications are deployed on several servers, located in different sites, far from each other. This eliminates the "single point of failure" problem and reduces the risk that a crisis, physical or digital, at one site blocks the entire company's information system. A peculiar characteristic of this architecture is not so much the distribution of applications across several machines, as the territorial dissemination of these machines across several sites.

d. Consolidation-dissemination
A mixed architecture provides for the consolidation of certain servers and the dissemination of others, based on the overall configuration of the company information system, on how critical the various applications are and on the interrelationships between them.
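As a toy illustration of the heartbeat mechanism of point b., here is a minimal sketch of a stand-by node that stays passive while heartbeats arrive regularly and promotes itself when they stop; the 5-second timeout and the method names are illustrative assumptions, not taken from any product:

    import time

    class StandbyNode:
        """Stand-by member of a fail-over pair, driven by heartbeat signals."""

        def __init__(self, timeout_s: float = 5.0):
            self.timeout_s = timeout_s
            self.last_heartbeat = time.monotonic()
            self.active = False

        def on_heartbeat(self):
            """Called whenever a synchronization signal arrives on the heartbeat link."""
            self.last_heartbeat = time.monotonic()

        def check(self):
            """Called periodically: take over if the heartbeat has been silent too long."""
            if not self.active and time.monotonic() - self.last_heartbeat > self.timeout_s:
                self.active = True
                print("heartbeat lost: stand-by node taking over")

    node = StandbyNode()
    # The active node would call node.on_heartbeat() at every signal,
    # while a local scheduler calls node.check() every second or so.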
7. Obtaining the required Availability: structure

The quality of the products (design, materials and test procedures) is the element that impacts in a decisive way on the reliability of the products themselves, and therefore on the reliability of the whole system where they are used. Indeed, hardware and networking devices of high quality, with high MTBF, provide high reliability indexes and consequently high levels of availability. Here is a short overview of high quality commercial products.

a. Fault tolerant servers
Fault tolerance does not mean "immunity" against faults, but that fault tolerant devices have been designed to continue to operate even in the presence of failures of some of their components. "Fault tolerant" servers are significantly more expensive even than high-end ordinary servers. They have an internal highly redundant architecture, they use RAID disks to allow automatic data recovery, they mount ECC (Error Correction Code) RAM, which is much more expensive than ordinary RAM, they use high quality components, tested in the factory against much more stringent requirements than ordinary ones and, finally, they use basic software designed to automatically switch from failed components to back-up ones.

b. Blade servers
Blade servers are another family of very robust and reliable devices, since they have a unique ultracompact structure which holds all the common components: power, processing and control units. A blade server is a set consisting of a frame that houses a number of modular electronic circuit boards (blades), each of which is a real server, complete with all the elements necessary to guarantee the basic server functions. The number, type and physical layout of these elements on the board vary from manufacturer to manufacturer, as does the number of blades housed in each frame (from 8 in midsize servers up to 16 in large servers). Each of these blades contains processors, RAM, a network controller and input/output ports. Very often these blades also have one or two ATA or SCSI disks, while an FC or iSCSI bus adapter allows external storage to be used. Power and cooling are provided directly by the chassis. The possibility of booting the system from the outside allows a higher density of processors to be obtained, eliminating the local disks, but, above all, offers higher reliability, since the operating system can be loaded from multiple disks, and even higher operational flexibility, since it becomes much easier to install virtualization software able to load virtual machines with different operating systems and different applications on the different blades.

c. SAN - Storage Area Network
The Storage Area Network is a mass storage solution consisting of one or more arrays of RAID hard disks networked via fiber optic connections (switches/routers and cables using the FC - Fibre Channel or iSCSI - Internet SCSI protocol) at Gigabit/sec speed, with an architecture that makes all storage devices directly available to any server on the company LAN. A storage platform of this type has the following big advantages:
- all the computing power of the servers is used for their intended functions, since the data reside on the SAN;
- no overloading of the LAN, since all storage traffic is managed by networking devices inside the SAN.

d. UTM - Unified Threat Management
UTM devices are all-inclusive devices able to provide a number of security functions. They are basically advanced firewalls.
e. High end switches and routers
These are networking equipment which, in addition to many other advanced features, present very low values of latency and jitter.

f. High reliability application software
Software packages offer high reliability characteristics when:
-they are widespread, with a large number of users, and therefore characterized by a large field test;
-they are legal, and therefore not subject to any block by control institutions;
-they are produced by financially, technically and organizationally solid companies, which consequently reduces the overall risk of their failure and disappearance from the market, and guarantees higher product quality and high levels of service and maintenance.

8. Obtaining the required Availability: organization

a. Security Manager
The Security Manager is a key figure in the management of any information system, and even more so in the management of information systems that must operate in a regime of business continuity. This professional figure, through appropriate monitoring software, remote signaling devices and, in the case of large systems, making use of a staff of technicians, has a number of specific important tasks in relation to business continuity. Here are some of them (a minimal monitoring sketch is given after this section):
-monitoring the regular operation of all implemented devices (hardware, software and networking);
-checking compliance with Service Level Agreements by external and internal service providers;
-planning and activating manual procedures in case of system failure;
-guaranteeing data back-up;
-keeping account of IT incidents;
-keeping track of the application and basic software logs;
-managing authentication credentials;
-managing the training of employees, both technical staff and users;
-managing a list of emergency phone numbers for immediate intervention.
The philosophy behind the establishment of the Security Manager and his/her staff in a company organization is to ensure the principle of unity of command within an extremely critical sector like that of IT security. Having a single body that monitors the drawbacks of the system and makes decisions regarding the resolution of these problems ensures speed, competence and effectiveness of intervention. In essence, the Security Manager does not just prepare the appropriate security measures, but follows and constantly monitors the status of the whole information system, and therefore necessarily has a dedicated and appropriate budget. For all these reasons, in the scale of importance, the Security Manager is considered second only to the Chief Information Officer.

b. Service Level Agreements
Aspects related to 1st and 2nd level assistance and "on demand" maintenance are important parts of any SLA document. An information system that operates in business continuity must impose particularly rigorous supply conditions in terms of response time, availability of spare parts, technical competence and professionalism of the staff, in order to allow compliance with the fixed MTTR values.

c. Quality certifications
ISO 9000 certification, ensuring the quality of all business processes that govern the creation of products and services, also validates the organizational procedures relating to the management of the information system.
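Going back to the Security Manager's first task, the monitoring of implemented devices, here is a minimal sketch of a liveness probe. The device names, addresses and ports are hypothetical, and a real deployment would page the Security Manager by SMS or e-mail, as the text describes, rather than just log:

    import logging
    import socket
    import time

    # Hypothetical inventory: names, addresses and ports are placeholders.
    DEVICES = {
        "core-switch": ("192.0.2.10", 22),
        "san-controller": ("192.0.2.20", 443),
    }

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
        """Crude liveness probe: can we open a TCP connection to the device?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:  # daemon loop: poll the whole inventory once a minute
        for name, (host, port) in DEVICES.items():
            if not is_reachable(host, port):
                logging.error("ALERT: %s (%s:%s) unreachable", name, host, port)
        time.sleep(60)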
9. Obtaining the required Availability: environmental contrast measures (IMMe)

Here is a quick analysis of the risks, and of the related contrast techniques, associated with the environmental conditions of the sites where ICT equipment is located, bearing in mind that the fundamental reference certification for data centers is that provided by the international Uptime Institute.

RISKS
-major natural disasters (devastating earthquakes, floods, large fires, landslides, etc)
-minor earthquakes
-atmospheric electrical discharges
-local flooding
-local fires
-overheating of premises
-premises with excessively high or low moisture
-power surges of non-atmospheric origin
-power supply interruption
-severing of cables or other data network interruptions
-intrusion by unauthorized persons
-riots, strikes, violent demonstrations

COUNTERMEASURES to guarantee an environmental immunity index of 99.999 %

-major natural disasters
Disaster Recovery plan.

-minor earthquakes
If data centers, network nodes, etc are to be located in new premises, then construction should be quake-proof; if, vice versa, they are to be located in existing premises, then these should be properly adapted.

-atmospheric electrical discharges
Protection from lightning is normally provided by means of very simple devices such as lightning rods, which only prevent these discharges from hitting a specific physical area. There are however critical situations where wider protection of these specific areas is required, in the sense that they should be safeguarded not only from being directly hit, but also from the electromagnetic disturbances caused by lightning. There are cases, moreover, where such electromagnetic interferences may be of local origin, such as, for example, hospitals where medical devices operate at high field strengths (CAT, MRI, PET) or industrial facilities where production cycles involve the generation of voltages and electric currents of great intensity (aluminum production factories, glass production factories, etc). In all these cases the protection to be provided is the installation of a classical Faraday cage that completely isolates the protected area electromagnetically.

-local flooding
The first and primary measures to protect ICT devices from the risk of local flooding are:
-not housing this equipment in basements or in premises even slightly below road level;
-placing it on platforms raised at least 15-20 cm;
-sheltering it, even if located in an indoor area, with a sloping roof.
Preventive measures also include installing water detectors, which use sensors to detect the presence of water in the environment and immediately report it through alarm systems with sirens, buzzers, etc, or even directly alert the Security Manager by way of mobile text messages and/or e-mails.

-local fires
The fire hazard is the most common and most dangerous, due to its devastating effects. The measures for fighting this risk can be both passive and active.
PASSIVE PROTECTION:
-floors and coatings made of fire-resistant materials
-physical firewalls
-safety distances between ICT devices
ACTIVE PROTECTION:
-smoke and heat sensors
-smoke and heat extractors
-automatic shutdown devices
-manual extinguishers

-overheating of premises
The set of issues that relate to this environmental aspect is generally referred to with the acronym HVAC (Heating-Ventilation-Air Conditioning), which in the case of large data centers plays an important role, leading to specific and advanced solutions. The overheating of the premises where ICT equipment is located is generated by the heat produced by the ICT equipment itself, by other, non-ICT equipment that may be present, and by the weather outside, which can be particularly aggressive in the summer, or by all these causes together. The optimum range of the environmental temperature is 20°C - 25°C. The climate in the ICT premises is therefore a fundamental aspect of the physical protection of computer equipment. The calculation of the BTU (British Thermal Unit) rating of the air conditioning to be installed should take into account the volume of the premises, the time series of summer temperatures characteristic of the geographical location, the amount of heat produced by all the equipment on the premises according to its nominal values (technical data sheets), and also the possible future installation of new equipment beyond that provided for by the initial project. A well-sized plant normally includes redundancy of the cooling apparatus and a system that automatically reports temperature increases to the Security Manager.
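A back-of-envelope version of the BTU calculation just described can be sketched as follows, covering only the equipment-heat term (volume and summer temperatures, also required above, are left out). The 1 W ≈ 3.412 BTU/h conversion is standard, while the 30 % growth margin is an assumption of this sketch, not a figure from the text:

    WATT_TO_BTU_H = 3.412  # 1 watt of dissipated power ~ 3.412 BTU/h of heat

    def cooling_load_btu_h(equipment_watts: float, growth_margin: float = 0.30) -> float:
        """Rough cooling load from the nameplate power draw of the equipment,
        with headroom for the future installation of new devices."""
        return equipment_watts * WATT_TO_BTU_H * (1 + growth_margin)

    # e.g. a room whose equipment draws 12 kW overall:
    print(f"{cooling_load_btu_h(12_000):,.0f} BTU/h")  # ~53,200 BTU/h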
-premises with excessively high or low moisture
The range within which the values of ambient humidity can be considered acceptable is 30-60%. It is up to the air conditioning system to ensure that these climatic conditions are maintained.

-power surges of non-atmospheric origin
Electrical surges of non-atmospheric origin can have various causes, such as irregularities in the power supply, electrostatic discharge due to the accumulation of electric charges by friction, etc. The protections consist of the usual UPS, which will be discussed more extensively below, and recently also of surge arresters (SPD - Surge Protective Device), once used only on high and medium voltage (HV and MV) electricity networks but now widespread on low voltage (LV) ones as well.

-electrical power failure
A lack of power is of course an extremely serious risk and one of the main threats to business continuity. The usual countermeasure consists in the installation of a static UPS (Uninterruptible Power System) buffering a dynamic UPS (electrical generator), the latter large enough to power the entire data center. The static UPS, having a response time practically equal to "zero", allows the electrical generator, which conversely has a response time of a few minutes, to come into regular operation with no interruption in the power supply. In other words, the static UPS intervenes automatically on the protected devices when the power supply fails and simultaneously sends a signal to the dynamic UPS, which starts up and, once fully operating, replaces the static UPS (a toy model of this handover is sketched below). Greater security is achieved by providing redundancy of the main power system itself.
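A toy model of the static/dynamic UPS handover just described; the two-minute spin-up time and the class and method names are illustrative assumptions:

    import time

    class PowerManager:
        """Static UPS bridges the gap instantly; the generator takes over later."""
        GENERATOR_STARTUP_S = 120  # "a few minutes" in the text; illustrative here

        def __init__(self):
            self.static_ups_carries_load = False

        def on_mains_failure(self):
            self.static_ups_carries_load = True   # response time ~ zero
            print("mains lost: static UPS carrying the load, generator starting")
            time.sleep(self.GENERATOR_STARTUP_S)  # generator spin-up
            self.static_ups_carries_load = False
            print("generator at full power: load transferred, UPS back to buffering")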
-severing of cables or other data network interruptions
Data networks, as is well known, consist of so-called passive materials and of a large number of active components (routers, firewalls, switches) located in multiple, distinct places interconnected by a set of cables, copper or optical fiber, as well as wireless access points, which together constitute the wiring (campus, building and floor). The main risk the wiring runs, particularly for underground cables, is accidental severing due to work not performed with due caution, damage by mice, or other accidental interruptions. The classical countermeasure, in addition to the use of rodent-proof cables, is to duplicate the cabling as much as possible. In the case of campus wiring, the best topologies for ensuring maximum continuity of operation are:
a. the "ring topology", where the head and the tail of the wiring are connected together to form a ring, so that the data can travel in both directions of the ring to reach their destination. In this way, an interruption at any single point of the ring does not cause discontinuity of the connection.
b. the "meshed topology", considerably more expensive, where each node is connected to all the others. In a meshed topology, a LACP (Link Aggregation Control Protocol) switch configuration allows redundant paths and a significant increase in network throughput.
In the presence of two or more data centers, their connection will also be made redundant, with double cable laying such that the two runs do not follow the same path.

-intrusion by unauthorized persons
The simplest and most immediate measure for preventing the risk of intrusion by unauthorized persons is the presence of control personnel at the entrances. More sophisticated measures include installing electronic authentication, such as keypad codes, magnetic badge readers, eye (iris) readers, fingerprint readers and similar devices. Alarm systems and video surveillance can also be added, the configuration of which can vary greatly in terms of size, sophistication and cost.

-riots, violent labor demonstrations, acts of terrorism
The only measures to deal with these types of risk, in addition to those already mentioned in the previous paragraph, are to provide private armed security guards and an on-line connection to the police station.

10. Obtaining the required Availability: digital contrast measures (IMMd)

EXTERNAL RISKS
-malware infections
-hacker attacks
-access by unauthorized users

INTERNAL RISKS
-processor overload
-network overload
-updating/developing new software

COUNTERMEASURES to guarantee a digital immunity index of 99.999 %

-malware infections
a. border antivirus
A border antivirus may be a stand-alone device consisting of hardware with specific antivirus software or, more usually, one function among others of a firewall installed to ensure network security. It is located on the physical border of the network, to intercept and directly block the intrusion of malware from the internet or other external networks.
b. local antivirus
A local antivirus consists of software installed directly on the device to be protected, to block external malware that for some reason has not been blocked beforehand by the border antivirus, as well as malware that spreads internally via the network or via input devices.
c. network antivirus
A network antivirus consists of a hardware device with its own specific software, delegated to the automatic and continuous updating of the local antivirus installed on each host. It therefore does not have a function of "contrast" but only of "upgrade".
d. autoimmune Operating Systems
The Operating Systems most susceptible to malware attacks, and by far the most vulnerable, are those of the MS/Windows family. Unix and Linux family Operating Systems, as well as the Mac (Apple) Operating Systems and the proprietary IBM Operating Systems, are on the other hand solid and virtually immune to viruses.

-hacker attacks
a. firewall
The classic protection against hacker attacks is the firewall: a hardware-software device which guarantees many protection functions and moreover operates as a network divider.
b. cryptography
Encrypting data (from the ancient Greek words kryptos, hidden, and graphein, to write) means hiding them, i.e. making them unintelligible to anyone who is not authorized to access them, thus preventing theft, malicious use and possible locking of the information system. The greater the length of the encryption key (128-256 bits), the greater the security that is achieved. A widely used protocol for secure communications on the internet and intranets is https, asymmetric (private key + public key) and based on SSL (Secure Socket Layer), designed especially to counter MIM (Man In the Middle) attacks.
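As a concrete taste of encryption, here is a round trip with the third-party Python `cryptography` package. Note that this is symmetric, Fernet-style encryption (a 128-bit AES key under the hood); the asymmetric public/private key scheme used by https/SSL described above involves a key pair and certificates instead:

    # pip install cryptography   (third-party package)
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # fresh symmetric key; whoever holds it can decrypt
    f = Fernet(key)

    token = f.encrypt(b"customer record #4711")  # unintelligible without the key
    print(token)
    print(f.decrypt(token))       # b'customer record #4711'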
c. appliance
An appliance is a combination of hardware, Operating System and application software, pre-assembled in the factory and used to perform specific application functions. The term "appliance", in fact, comes from "application equipment" and indicates a device designed for one specific function, which is not flexible and not multi-purpose. Appliances with advanced anti-hacker functions are currently on the market, based on particularly innovative techniques such as:
-"intelligence" of the big data analytics type;
-misleading "simulation" environments designed to deceive and divert hackers. These are also called "honey pots", i.e. environments that appear to contain information and/or devices of possible interest to hackers, but which are actually traps isolated from the real information system. Honey pots are tools generally used in information systems with very high criticality and security requirements, such as military ones.

d. dedicated internet access
Since intrusions by hackers come mainly from the internet, a "last resort" to avoid them is to completely isolate the company network from that world and create an autonomous and different network for internet access, completely separate and disconnected from the internal LAN, with workstations dedicated solely to this function. There remains, however, the risk that internal users create hidden unauthorized connections to the internet directly from their company PC clients, through independent private access via telephone lines or Wi-Fi connections. To counter these potential risks, virtualized thin client solutions must be used, or ordinary clients must be sterilized.

e. thin client virtualization
Thin clients are minimal PCs with no moving components such as hard drives, CDs, fans, etc, which do not even have their own Operating System. Their operability therefore depends on a central server, to which they are constantly connected. Their virtualization consists of configuring in the data center, using appropriate virtualization software, as many virtual PCs as there are physical counterparts.

f. PC sterilization
Sterilizing a PC means deactivating all its inputs, including booting from the local C drive. In other words, it means transforming the PC client into a sort of "thin client", or even into an old "green screen terminal", enabled only for the functions provided by the institutional software loaded on it. The technical difficulties associated with the reactivation of these devices make it unlikely that users can act independently in this regard.

-access by unauthorized users
a. authentication
Authentication credentials (User ID and Password) are a basic protective measure that is always provided, even if it is easy to overcome. Profiling users, also called authorization, is another protective measure that, by assigning specific rights to each user, tends to reduce the possibility of damage due to erroneous or fraudulent actions. Mechanisms called strong authentication systems are based on the recognition of a personal attribute possessed only by the user:
-a physical characteristic (biometric authentication) such as a fingerprint, hand geometry, iris or retina, voice, etc;
-a dynamically generated password (one-time password) produced by a special device customized for each user (token); a minimal sketch of this mechanism is given further below;
-a digital certificate attesting the identity of the user, usually stored on a smart card.
Digital certificates exploit the asymmetric encryption technique based on the use of public keys. In order to use these mechanisms, reference must be made to a PKI (Public Key Infrastructure), i.e. an infrastructure that issues digital certificates and provides for their management (web publishing, revocation, suspension). The use of digital certificates allows the implementation of extremely important objectives in the field of computer security, such as the authenticity, integrity, confidentiality and non-repudiation of messages. In this scenario, each user has a pair of keys (public and private) that identifies him or her. The public key is placed in a directory published by the PKI, which unequivocally attests the user's membership. The private key is instead kept secret by the user.
An access control system often implemented in conjunction with strong authentication is the single sign-on (SSO) server. This technique is designed to facilitate access management in those systems where the user is faced with a multitude of heterogeneous workstations, servers and applications, and would be forced to perform authentication (login) whenever he or she needs to access one of them. In these situations an SSO system presents the user with a single instance of initial identification; then the system, using an internal Security Information Base, provides automatic log-in for all applications or systems. The SSO server manages independently and automatically the logging of new assignments, renewals or cancellations by direct conversation with the equipment and/or applications.

b. Proxy Server
The term "proxy" is a legal term indicating someone who acts on behalf of third parties with a specific delegation. The Proxy Server is a machine that, placed between two separate networks, operates as an intermediary between those two environments, enabling communication between them but masking the host addresses of one from users wishing to access it from the other.

c. No company wireless networks at all
Limiting the deployment of company wireless networks definitely reduces the risk of unauthorized access, especially by BYOD (Bring Your Own Device) equipment. The configuration of the so-called "secure perimeter" has changed significantly because of the large-scale use of mobile devices (laptops, tablets, smartphones, etc), which are now also frequently subject to malware attacks. Company employees increasingly use devices such as those just mentioned, which are owned by the user and not by the company, and tend to evade the general security rules implemented by the company itself.
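The one-time password mechanism mentioned under strong authentication can be sketched with the standard TOTP construction (RFC 6238), using only the Python standard library. The base32 secret below is a well-known test value, not a real credential:

    import base64
    import hashlib
    import hmac
    import struct
    import time

    def totp(secret_b32: str, interval_s: int = 30, digits: int = 6) -> str:
        """Time-based one-time password: HMAC-SHA1 over the current time step."""
        key = base64.b32decode(secret_b32, casefold=True)
        counter = int(time.time()) // interval_s          # 30-second time step
        digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
        offset = digest[-1] & 0x0F                        # dynamic truncation
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # a new 6-digit code every 30 seconds

The token in the user's pocket and the authentication server share the secret and the clock, so both can compute the same short-lived code independently.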
-processor overload
Processing overloads of an occasional type, determined by special and contingent conditions affecting ordinary operation, can weigh heavily on business continuity: though they are not real blocks of the system, they may cause unacceptably long response times. These changed operating conditions may be due:
-to a sudden, unexpected increase in the number of simultaneously active users;
-to the concomitant and random activation of application functions that require a great commitment of hardware resources;
-to particularly complex database queries;
-to application software malfunctions.
a. load balancing
A protection technique widely used to combat this type of risk is the implementation of a load balancer, i.e. an apparatus able to spread the load over multiple cluster-connected machines and thus reduce the impact of possible extemporaneous overloads (a minimal sketch is given at the end of this subsection). A disadvantage of this technique is the fact that the load balancer becomes a "single point of failure" with respect to all the application servers, with the result that, instead of increasing the overall availability, it may even reduce it.
b. parallel computing
Parallel technologies, once confined to very narrow areas such as scientific computing and simulations in the financial, biological and meteorological fields, are now present in the world of business applications at an affordable cost. Parallel computing, also at the Data Base level, significantly shortens response times and safely absorbs sudden overloads, thus avoiding possible stalls of the system.
c. heuristic Data Bases
Heuristic Data Bases are classified as NoSQL (non-relational) Data Bases, which have a different structure from traditional Data Bases since they operate using heuristic algorithms. Developed to allow maximum and easier data integration, they reach such high performance in terms of transaction responsiveness as to make them very useful also for handling huge workloads. A typical example of a DBMS of this type is the MongoDB software, Open Source and free, licensed under the GPL (General Public License). Currently, DBMS of this kind are adopted by the managers of large Web sites and by multinational service companies such as eBay and the New York Times, just to name a few.
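Going back to point a., the essence of a load balancer can be reduced to a few lines. The round-robin policy and the host names are illustrative; real balancers also weigh server load and health:

    import itertools

    class RoundRobinBalancer:
        """Spread requests over a pool of cluster-connected application servers.
        Note: the balancer itself is the single point of failure discussed above."""

        def __init__(self, servers):
            self._pool = itertools.cycle(servers)

        def route(self, request):
            return next(self._pool), request

    lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical hosts
    for i in range(5):
        print(lb.route(f"req-{i}"))  # req-0 -> app-1, req-1 -> app-2, ...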
-network overload
As for servers, networks too can suffer temporary throughput overloads, due to changed and abnormal operating conditions.
For LAN connections:
a. LACP bonding
The most widely used measure countering this type of risk is the switch configuration named LACP (Link Aggregation Control Protocol) which requires, as already mentioned, a meshed topology where each network node is connected to all the others.
For WAN connections:
b. duplicated connectivity
Simultaneous connectivity supplied by two or more carriers. The redundant connection must have characteristics equal to those of the main one, to avoid, in case of its takeover, a performance degradation such as would not allow effective operation.
c. connection to one or more IXPs
An IXP (Internet Exchange Point), according to the official definition by the European Internet Exchange Association (Euro-IX), is a network infrastructure managed by an independent third party to support the data traffic of Internet Service Providers such as carriers, content providers, host providers, etc. Typically an IXP is based on an Ethernet LAN available to its users, who can exchange IP traffic with the other users present on the same IXP. The exchange of traffic is generally done on a VLAN shared by all connected users (public peering) or on a VLAN dedicated to the exchange of traffic between only 2 users (private peering). The connection to the IXP allows each user to use a single geographical stream (physical circuit) to interconnect to a variety of networks of other operators (Autonomous Systems - AS), avoiding the need to provision as many connections as there are ASes with which to exchange traffic. Besides the obvious economic and management benefits, the direct connection between operators via an IXP decreases the "distance" between the networks and therefore, as a result, offers a better service to internet users.

-updating/developing new software
The upgrading of existing software and the development of new software, activities always present in any information system, are particularly risky because of the interference they may generate with the software in normal operation. The usual countermeasure adopted to reduce this risk is to keep the development environment separate from the operating one, with separate and independent hardware-software platforms. Correspondingly, the transfer of updated or new software should be done:
-using time windows like those provided for scheduled maintenance;
-using any back-up server present, if the platform operates in a highly redundant fail-over system.

11. Obtaining the required Recovery Time Objective - RTO

Quickly bringing an information system back to normal operation from a locked state, due to a drawback that occurred during its operation, involves:
a. removal of the incident (hardware failure, software failure, physical attacks, cyber attacks, etc);
b. data recovery;
c. a system restart;
all of which must be performed as quickly as possible. The first and last of the listed activities are linked to company organizational aspects, both with regard to external measures (SLAs for support and maintenance) and to internal interventions (training and competence of the technical data center employees). The second is linked to the technologies and data saving methods implemented.

12. Obtaining the required Recovery Point Objective - RPO

The achievement of the desired RPO is closely related to the data saving technologies provided for the purpose, as shown in the following table:

Rescue methodology          RPO values
Ordinary back-up            hours
DB logging                  minutes
Asynchronous replication    minutes/seconds
Synchronous replication     tending to zero

• "ordinary back-up" refers to the traditional batch process of saving data by copying them to another medium, usually tape;
• "DB logging" means saving on-line, on a different support, only the data base logs which must be safeguarded. These logs, tracing the changes to the records, allow the data base to be rebuilt, at the time of the fault, starting from the last off-line back-up;
• "asynchronous replication" means recording data both on a primary storage and on a secondary one, the second possibly located at a great distance, with decoupled primary and secondary transactions;
• "synchronous replication" means recording data both on a primary storage and on a secondary one. The second, for transmission latency reasons, must be located near the primary one. The replication process is completed only when the data has been written definitively on both storages, primary and secondary; there is therefore no transaction decoupling in this case.
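The difference between the last two rows of the table can be made tangible with a toy model: in the synchronous case the transaction completes only after both writes, while in the asynchronous case the records still sitting in the replication queue are exactly the data at risk. All names here are illustrative:

    import queue
    import threading

    primary, secondary = [], []
    replication_q = queue.Queue()

    def synchronous_write(record):
        """Acknowledged only after BOTH copies are written: RPO tending to zero."""
        primary.append(record)
        secondary.append(record)  # must succeed before the transaction completes

    def asynchronous_write(record):
        """Acknowledged after the primary write; the copy is decoupled (RPO > 0)."""
        primary.append(record)
        replication_q.put(record)  # queued records = maximum possible data loss

    def replicator():  # background process draining the queue to the secondary
        while True:
            secondary.append(replication_q.get())

    threading.Thread(target=replicator, daemon=True).start()
    synchronous_write("txn-1")
    asynchronous_write("txn-2")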
13. A simple and easy "Availability" calculation example

Let us refer to an imaginary IT company that provides hosting services to a large number of customers connected via WAN to the company data center.

[Diagram: architecture solution - customers C1 ... Cn connected via Carrier 1 and Carrier 2 to the IT company primary site and disaster recovery (D.R.) site; each site comprises a UTM, central switches, a network AV, a PC LAN, a blade server, blade-SAN switches and a SAN, with a point-to-point link between the two sites.]

The primary site is completely duplicated by a disaster recovery site whose physical configuration is essentially identical to that of the primary site. The WAN is duplicated by means of two different carriers. Automatic switching from carrier-1 to carrier-2 occurs on the client node, when the switching function configured there detects that connecting device-1 has gone down. Automatic switching from the primary site to the disaster recovery site occurs c/o the carrier, when the switching function configured there detects that the connection device no longer communicates with the primary site.

The production equipment implemented on both sites (primary and disaster recovery) is:
-one UTM;
-two pairs of central switches;
-one Blade Server in a Secure LAN, hosting customer applications;
-one SAN, in a Secure LAN, hosting customer data;
-one network Antivirus in a DMZ;
-a variable number of workstations in a PC LAN;
-a point-to-point link between the Primary site and the Disaster Recovery site, redundant at level 2 of the ISO/OSI model through two fiber pairs connected directly to the two pairs of fail-over switches located respectively at the primary site and at the disaster recovery site.
The function of this link is two-fold:
-to enable synchronous replication of data from the primary site to the DR site;
-to allow redundancy of the Blade Server and SAN present in the primary data center, through their connection in fail-over with the corresponding equipment present at the DR site.
Redundancy of this critical equipment (Blade Server and SAN) is achieved on the remote-mirror disaster recovery platform by configuring, on the trunk ports of the switches, both primary and DR, a suitable VLAN between the counterpart devices defined in cluster on the two sites. All switch redundancy, on the other hand, is implemented locally by connecting two identical devices in fail-over.

[Diagram: serial-parallel structure of the system - users, the two WAN carriers, UTM and central switch pairs at the primary and DR sites, the P-DR link, the blade server and SAN pairs with their blade/san switches, with physical and digital threats acting on the whole chain.]

Equipment and service sizing

Devices, networks, support services and threat countermeasures have all been determined by subsequent adjustments, according to the iterative procedure previously mentioned, so that the obtained Availability value satisfies the required Availability. Here are the final results of the said iterative procedure:

WAN: availability contractually agreed with both carriers: Awan = 0.99900
UTM (Primary and DR): top model device in "high reliability and availability configuration" by Check Point Software Technologies Ltd: MTBF = 370,000 h
Switches: top model Nexus 7000 series in "high reliability and availability configuration" by Cisco Inc: MTBF = 318,572 h
Fiber point-to-point link: availability contractually agreed with one of the selected carriers: Alink = 0.99900
Blade Server: HP C7000 in "high reliability and availability configuration": MTBF = 382,500 h
SAN: NetApp E5560 in "high reliability and availability configuration": MTBF = 316,444 h
Maintenance: SLA contractually agreed with the service providers: MTTR = 8 h
Corrective factors IMMe and IMMd, depending on the countermeasures implemented: 0.99999 and 0.99999
Availability calculation

Referring to the formulas:

Atot-serial = A1 x A2 x ... x An
Atot-parallel = 1 - (1-A1) x (1-A2) x ... x (1-An)
An = (MTBFn - MTTRn)/MTBFn

the chain to be evaluated is:

Atot = Awan x Autm/p x Aswitch/p x Alink x Autm/dr x Aswitch/dr x Ablade x Aswitch-blade/san x Asan x IMMe x IMMd

Awan = 1 - (1-Awan1) x (1-Awan2) = 1 - (1-0.99900) x (1-0.99900) = 0.99999
Alink = 0.99900
Autm/p = Autm/dr = (MTBF-MTTR)/MTBF = (370,000-8)/370,000 = 0.99997
Aswitch1 = Aswitch2 = (MTBF-MTTR)/MTBF = (318,572-8)/318,572 = 0.99997
Aswitches = 1 - (1-Aswitch1) x (1-Aswitch2) = 1 - (1-0.99997) x (1-0.99997) = 1 (better, 0.99999)
Ablade1 = Ablade2 = (MTBF-MTTR)/MTBF = (382,500-8)/382,500 = 0.99997
Ablades = 1 - (1-Ablade1) x (1-Ablade2) = 1 - (1-0.99997) x (1-0.99997) = 1 (better, 0.99999)
Asan1 = Asan2 = (MTBF-MTTR)/MTBF = (316,444-8)/316,444 = 0.99997
Asans = 1 - (1-Asan1) x (1-Asan2) = 1 - (1-0.99997) x (1-0.99997) = 1 (better, 0.99999)

Therefore:

Atot = 0.99999 x 0.99997 x 0.99999 x 0.99900 x 0.99997 x 0.99999 x 0.99999 x 0.99999 x 0.99999 x 0.99999 x 0.99999 = 0.99886

which means about 10 hours per year of probable information system failure, i.e. less than 1 hour per month, for an information system active 24 hours a day.
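The whole worked example can be re-run in a few lines of Python. The result comes out slightly higher than the 0.99886 above (about 0.99894, roughly 9 hours per year) because here the factors are multiplied at full precision, whereas the text conservatively rounds each one down to five decimals first:

    def availability(mtbf_h: float, mttr_h: float = 8) -> float:
        return (mtbf_h - mttr_h) / mtbf_h

    def parallel(a1: float, a2: float) -> float:
        return 1 - (1 - a1) * (1 - a2)

    a_wan    = parallel(0.99900, 0.99900)                 # two carriers
    a_link   = 0.99900                                    # P-DR fiber link
    a_utm    = availability(370_000)                      # UTM, per site
    a_switch = parallel(availability(318_572), availability(318_572))
    a_blade  = parallel(availability(382_500), availability(382_500))
    a_san    = parallel(availability(316_444), availability(316_444))
    IMM_E = IMM_D = 0.99999

    # the serial chain of section 13: WAN, primary site, link, DR site,
    # blade and SAN clusters, plus the two immunity corrective factors
    a_tot = (a_wan * a_utm * a_switch * a_link * a_utm * a_switch
             * a_blade * a_switch * a_san * IMM_E * IMM_D)
    print(f"Atot = {a_tot:.5f} -> {(1 - a_tot) * 8760:.1f} h/year of downtime")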