Business continuity engineering: design fundamentals (rules, elements and methodologies)
Sergio De Falco - Italy
Excerpt from the text of the same title - copyright: ISBN 978-88-91084-29-3
The full book (Italian language) can be bought only on-line: www.ilmiolibro.it
_______________________________________________________________

Sergio currently works as an independent ICT consultant and information system designer, and has gained considerable multinational experience working for many years for IBM and other large ICT companies.

Any design of an information system, regardless of the level of continuity to be ensured, must first of all identify the desired functionalities and then use them to plan the architecture and physical structure in terms of hardware (server, storage, peripherals), software (basic and application), networking, site preparation and IT security. A knowledge of the basic rules, parameters and know-how of general ICT design is taken for granted and is preparatory to what follows. The design techniques described here are strictly related to the specific implementation of information systems that must guarantee a high level of business continuity.

1. What we mean by business continuity designing

In extreme synthesis it means "designing an information system that operates with a very, very small number of interruptions and with a very, very high security level of the data processed". There are 3 main parameters that characterize this continuity:
> Availability
> Recovery Time Objective (RTO)
> Recovery Point Objective (RPO)

AVAILABILITY is defined as the percentage ratio between the "regular operation time" (Tr) of a system and its "mission time" (Tm):

A = Tr/Tm x 100

Its value normally lies between 90.000 % and 99.999 %. For most companies an information system availability of three nines is considered sufficient.

Availability               Failure time per year
90.000 % - one nine        36.5 days
99.000 % - two nines       3.65 days
99.900 % - three nines     8.76 hours
99.990 % - four nines      52.56 minutes
99.999 % - five nines      5.26 minutes

Availability, so defined, depends:
-on the overall architecture of the information system;
-on the reliability of all its components, i.e. on the probability that in a given period of time these components do not fail; reliability in turn depends on the following parameters: λ (failure rate), MTTF (Mean Time To Failure), MTTR (Mean Time To Repair) and MTBF (Mean Time Between Failures);
-on the provided environmental and digital protections, described by two corresponding indexes: Environmental Immunity IMMe and Digital Immunity IMMd, used as corrective factors for the global availability index.

RTO - Recovery Time Objective, i.e. the maximum time allowed for full recovery of the system. This depends:
-on methods and technologies for saving data;
-on the efficiency of support and maintenance services;
-on the implemented disaster recovery platforms.

RPO - Recovery Point Objective, i.e. the maximum interval of time allowed between the production of data and its safe saving; consequently it provides a measure of the maximum amount of data that the system may lose due to sudden failures. This depends:
-on methods and technologies for saving data.
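To make the mapping between the "nines" and the failure times in the availability table above concrete, the conversion can be computed directly. A minimal sketch in Python, where the 8,760-hour year used in the table is the only input assumption:

    HOURS_PER_YEAR = 365 * 24  # 8760 h, the same basis as the table above

    def downtime_per_year(availability_pct: float) -> float:
        """Expected failure time per year, in hours, for a given availability %."""
        return (1 - availability_pct / 100) * HOURS_PER_YEAR

    for a in (90.0, 99.0, 99.9, 99.99, 99.999):
        print(f"{a:7.3f} % -> {downtime_per_year(a):9.2f} h/year")

Running it reproduces the table: 99.900 % gives 8.76 hours of failure per year, and 99.999 % about 0.09 hours, i.e. roughly 5.26 minutes.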
2. A short reminder of the structure of standard information systems

A reminder of the standard structure of information systems, usually represented as a stack of interconnected tangible and intangible elements, may be useful:
-the application software is at the top of the stack, i.e. the set of programs that ensure the performance of the functions required by the user, the only "raison d'etre" of the entire system;
-immediately below there is the basic software, i.e. the set of programs that enables connection of the application software with the hardware;
-then the hardware, i.e. the equipment that physically performs the required functions;
-finally the network, connecting all the physical devices, which allows data flow;
-in order to operate properly and regularly, all these different components require side support services such as hardware and software assistance and maintenance, as well as underlying environmental protection equipment.
These are therefore the only objects of the "design" at issue.

3. How to proceed step by step

In order to start the required design, the first step is to set down the desired values of the three basic parameters described above: Availability, RTO and RPO. To do this, the following is necessary:
- a preliminary survey of all the operating, functional and organizational characteristics of the company; this analysis is called BIA (Business Impact Analysis);
- an accurate analysis of the risks and threats to which the company is exposed; this analysis is called RA (Risk Assessment);
- identification, together with customer management, of the balance point between the amount of investment in "continuity" and the value (economic and non-economic) of the damage resulting from a possible crash of the system.

4. Business Impact Analysis and Risk Assessment

A. analysis of the type of business of the company, to check if it is one of those that need high continuity of operation, due to its business nature;
B. identification of the main applications, to check if some of them are mission-critical or real-time;
C. analysis of the architecture of the existing information system, to verify its validity in light of the new objectives of business continuity and to assess whether partial or complete redesign is necessary;
D. survey of the existing hardware configuration, to identify any components that are still usable, those which must be replaced entirely, those that must be integrated and those that must be strengthened;
E. identification of all potential risks, both environmental and digital, which the company information system may encounter;
F. identification of all company locations, to see if there are risks and threats for the locations other than the headquarters;
G. briefings with customer management, to see if there are specific needs and requirements in order to achieve the desired business continuity objective.

Once all the listed pre-project activities have been carried out we can then: set the desired values for Availability (A), Recovery Time Objective (RTO) and Recovery Point Objective (RPO); proceed with the redesign of the architecture, the sizing of its new physical structure, the configuration of the necessary technical support services and the planning of measures to counter environmental and digital risks.
5. Obtaining the required Availability: calculation formulas

It is well known that the Availability index of any physical system can be calculated using the formula:

A = (MTBF - MTTR)/MTBF

For information systems this value must be corrected through the previously described immunity indexes IMMe and IMMd, to take account of environmental and digital threats:

A = [(MTBF - MTTR)/MTBF] x IMMe x IMMd

If we are not dealing with a system made up of only one component, but with a complex one made up of several components, as is the case for information systems, the global Availability index (Atot) of such a system will depend on the Availability indexes (An) of these components, calculated according to the formulas:
-for a serial structure: Atot = A1 x A2 x ... x An
-for a parallel structure: Atot = 1 - (1-A1) x (1-A2) x ... x (1-An)

The MTBF values of the single components (server, storage, switch, firewall, etc) can be derived from their data sheets (note that they are often difficult to obtain, because ICT manufacturers are generally reluctant to make them available unless expressly required), the MTTR values can be imposed as a "requirement" of the SLA (Service Level Agreement), while the IMMe and IMMd indexes depend on the measures set to combat both the environmental and the digital risks. Therefore, calling As the value of A set as the specific project requirement:
-the architecture and the physical structure of the information system we are designing
-the device repair times, depending on IT department organization
-the measures to counter the environmental and digital risks
will be determined on the basis of an ITERATIVE PROCEDURE, by subsequent adjustments, so that the value of the final Availability satisfies the condition:

Afinal ≥ As
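The three formulas above translate directly into code. A minimal sketch (the MTBF/MTTR figures in the example are illustrative, not taken from any data sheet):

    def availability(mtbf_h: float, mttr_h: float) -> float:
        """A = (MTBF - MTTR) / MTBF, for a single component."""
        return (mtbf_h - mttr_h) / mtbf_h

    def serial(*components: float) -> float:
        """Atot of components in series: the product of their availabilities."""
        a = 1.0
        for x in components:
            a *= x
        return a

    def parallel(*components: float) -> float:
        """Atot of redundant components: 1 minus the product of unavailabilities."""
        q = 1.0
        for x in components:
            q *= (1.0 - x)
        return 1.0 - q

    # Example: two redundant servers feeding a single switch, with the two
    # corrective immunity factors applied at the end.
    a_server = availability(mtbf_h=300_000, mttr_h=8)
    a_switch = availability(mtbf_h=250_000, mttr_h=8)
    a_total = serial(parallel(a_server, a_server), a_switch) * 0.99999 * 0.99999
    print(f"Atot = {a_total:.5f}")

Helpers of this kind are reused in the worked example at the end of the text.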
6. Obtaining the required Availability: architecture

a. Server consolidation and virtualization
Concentrating on a single machine, or on a limited number of machines, the various functions present in an information system, rather than distributing them among a wide variety of equipment, beyond the advantages in terms of savings of various kinds (energy, climate, floor space, etc), is particularly advantageous in terms of business continuity: the smaller the number of elements in a process chain, the lower the risk of failures occurring in a given period of time, simply because there are fewer elements susceptible to failure. It should however be noted that simple server "consolidation", in the absence of "virtualization", is only possible for homogeneous operating systems. Closely related to the concept of server consolidation is thus server virtualization, i.e. the creation, by means of specific virtualization software, of several logical machines on a single physical machine. In this regard, it should be observed that the virtualization software constitutes a new ring in the technological chain of the system, and therefore the use of this technique is convenient only if the number of servers to consolidate and virtualize is considerable. The flip side of server consolidation, which should not be ignored and should be assessed case by case, is that if the hardware where consolidation is implemented goes out of service, all the servers consolidated therein go out of service, and not just one. In other words we are creating a "single point of failure", which always brings a high level of risk. One final observation regarding this architectural choice: when the virtualization is done in a "cloud", or in "outsourcing" c/o specialized operators, we not only have the advantage of reducing the amount of equipment to be implemented, but also that of relying on specialized structures, which are generally more reliable.

b. Redundancy
Redundancy, i.e. the doubling or even the multiplying of one or more elements of a system, when in fail-over, increases the availability of the duplicated elements, according to the formula for parallel structures previously highlighted.
-central hardware (server and storage) is the most critical point of the whole information system. In this case redundancy is done by coupling 2 or even more machines, constantly aligning data and programs via parallelization software. This technique is called fail-over clustering. The automatic intervention is realized through a cable called heartbeat, which connects the devices in fail-over and transmits a continuous synchronization signal. As long as the signal on the heartbeat cable is regular, the redundant device remains in stand-by and does not operate, whereas it automatically intervenes when this signal is lacking (see the sketch at the end of this section).
-basic software is another very critical element of the technological chain which constitutes the system. Generally defined as the part of the software closest to the hardware, basic software is the set of programs that governs and controls the operations of the entire computer, providing the link between the hardware and the application programs. In this case redundancy is achieved through the availability of versions which are perfectly and constantly aligned with the one in operation. However, even if these parallel versions can be quickly activated, since they are not in fail-over they are not highly effective.
-application software is the set of programs that ensures the performance of the operational functions for which the entire information system was built. In this case redundancy, as for the basic software, is achieved through the availability of versions that are perfectly and constantly aligned with the current one.
-as regards antivirus devices, when made redundant, it is good practice to install software designed by a different manufacturer on the redundant device, to widen the spectrum of overall action: malware not covered by one manufacturer will be intercepted by the software of another.
-the intelligent networking devices (routers, switches and firewalls), when necessary, as for example in the case of applications such as VOIP, will be made redundant in fail-over.

c. Dissemination of Application Servers
This architectural solution is the exact opposite of server consolidation: the different applications are deployed on several servers, located in different sites, far from each other. This eliminates the "single point of failure" problem and reduces the risk that a crisis, physical or digital, at one site blocks the entire company's information system. A peculiar characteristic of this architecture is not so much the distribution of applications across several machines, as the territorial dissemination of these machines across several sites.

d. Consolidation-dissemination
A mixed architecture provides for the consolidation of certain servers and the dissemination of others, based on the overall configuration of the company information system, on how critical the various applications are and on the interrelationships between them.
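As a toy illustration of the heartbeat mechanism of point b., here is a minimal sketch of a stand-by node that stays passive while heartbeats arrive regularly and promotes itself when they stop; the 5-second timeout and the method names are illustrative assumptions, not taken from any product:

    import time

    class StandbyNode:
        """Stand-by member of a fail-over pair, driven by heartbeat signals."""

        def __init__(self, timeout_s: float = 5.0):
            self.timeout_s = timeout_s
            self.last_heartbeat = time.monotonic()
            self.active = False

        def on_heartbeat(self):
            """Called whenever a synchronization signal arrives on the heartbeat link."""
            self.last_heartbeat = time.monotonic()

        def check(self):
            """Called periodically: take over if the heartbeat has been silent too long."""
            if not self.active and time.monotonic() - self.last_heartbeat > self.timeout_s:
                self.active = True
                print("heartbeat lost: stand-by node taking over")

    node = StandbyNode()
    # The active node would call node.on_heartbeat() at every signal,
    # while a local scheduler calls node.check() every second or so.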
7. Obtaining the required Availability: structure

The quality of the products (design, materials and test procedures) is the element that impacts in a decisive way on the reliability of the products themselves, and therefore on the reliability of the whole system where they are used. Indeed, hardware and networking devices of high quality, with high MTBF, provide high reliability indexes and consequently high levels of availability. Here is a short overview of high quality commercial products.

a. Fault tolerant servers
Fault tolerance does not mean "immunity" against faults, but that fault tolerant devices have been designed to continue to operate even in the presence of failures of some of their components. "Fault tolerant" servers are significantly more expensive even than high-end ordinary servers. They have an internal highly redundant architecture, they use RAID disks to allow automatic data recovery, they mount ECC (Error Correction Code) RAM, which is much more expensive than ordinary RAM, they use high quality components, tested in the factory against much more stringent requirements than ordinary ones and, finally, they use basic software designed to automatically switch from failed components to back-up ones.

b. Blade servers
Blade servers are another family of very robust and reliable devices, since they have a unique ultracompact structure which holds all the common components: power, processing and control units. A blade server is a set consisting of a frame that houses a number of modular electronic circuit boards (blades), each of which is a real server, complete with all the elements necessary to guarantee the basic server functions. The number, type and physical layout of these elements on the board vary from manufacturer to manufacturer, as does the number of blades housed in each frame (from 8 in midsize servers up to 16 in large servers). Each of these blades contains processors, RAM, a network controller and input/output ports. Very often these blades also have one or two ATA or SCSI disks, while an FC or iSCSI bus adapter allows external storage to be used. Power and cooling are provided directly by the chassis. The possibility of booting the system from the outside allows a higher density of processors to be obtained, eliminating the local disks, but, above all, offers higher reliability, since the operating system can be loaded from multiple disks, and even higher operational flexibility, since it becomes much easier to install virtualization software able to load virtual machines with different operating systems and different applications on the different blades.

c. SAN - Storage Area Network
The Storage Area Network is a mass storage solution consisting of one or more arrays of RAID hard disks networked via fiber optic connections (switches/routers and cables using the FC - Fibre Channel or iSCSI - Internet SCSI protocol) at Gigabit/sec speed, with an architecture that makes all storage devices directly available to any server on the company LAN. A storage platform of this type has the following big advantages:
- all the computing power of the servers is used for their intended functions, since the data reside on the SAN;
- no overloading of the LAN, since all storage traffic is managed by networking devices inside the SAN.

d. UTM - Unified Threat Management
UTM devices are all-inclusive devices able to provide a number of security functions. They are basically advanced firewalls.
e. High end switches and routers
These are networking equipment which, in addition to many other advanced features, present very low values of latency and jitter.

f. High reliability application software
Software packages offer high reliability characteristics when:
-they are widespread, with a large number of users, and therefore characterized by a large field test;
-they are legal, and therefore not subject to any block by control institutions;
-they are produced by financially, technically and organizationally solid companies, which consequently reduces the overall risk of their failure and disappearance from the market, and guarantees higher product quality and high levels of service and maintenance.

8. Obtaining the required Availability: organization

a. Security Manager
The Security Manager is a key figure in the management of any information system, and even more so in the management of information systems that must operate in a regime of business continuity. This professional figure, through appropriate monitoring software, remote signaling devices and, in the case of large systems, making use of a staff of technicians, has a number of specific important tasks in relation to business continuity. Here are some of them (a minimal monitoring sketch is given after this section):
-monitoring the regular operation of all implemented devices (hardware, software and networking);
-checking compliance with Service Level Agreements by external and internal service providers;
-planning and activating manual procedures in case of system failure;
-guaranteeing data back-up;
-keeping account of IT incidents;
-keeping track of the application and basic software logs;
-managing authentication credentials;
-managing the training of employees, both technical staff and users;
-managing a list of emergency phone numbers for immediate intervention.
The philosophy behind the establishment of the Security Manager and his/her staff in a company organization is to ensure the principle of unity of command within an extremely critical sector like that of IT security. Having a single body that monitors the drawbacks of the system and makes decisions regarding the resolution of these problems ensures speed, competence and effectiveness of intervention. In essence, the Security Manager does not just prepare the appropriate security measures, but follows and constantly monitors the status of the whole information system, and therefore necessarily has a dedicated and appropriate budget. For all these reasons, in the scale of importance, the Security Manager is considered second only to the Chief Information Officer.

b. Service Level Agreements
Aspects related to 1st and 2nd level assistance and "on demand" maintenance are important parts of any SLA document. An information system that operates in business continuity must impose particularly rigorous supply conditions in terms of response time, availability of spare parts, technical competence and professionalism of the staff, in order to allow compliance with the fixed MTTR values.

c. Quality certifications
ISO 9000 certification, ensuring the quality of all business processes that govern the creation of products and services, also validates the organizational procedures relating to the management of the information system.
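Going back to the Security Manager's first task, the monitoring of implemented devices, here is a minimal sketch of a liveness probe. The device names, addresses and ports are hypothetical, and a real deployment would page the Security Manager by SMS or e-mail, as the text describes, rather than just log:

    import logging
    import socket
    import time

    # Hypothetical inventory: names, addresses and ports are placeholders.
    DEVICES = {
        "core-switch": ("192.0.2.10", 22),
        "san-controller": ("192.0.2.20", 443),
    }

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
        """Crude liveness probe: can we open a TCP connection to the device?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    while True:  # daemon loop: poll the whole inventory once a minute
        for name, (host, port) in DEVICES.items():
            if not is_reachable(host, port):
                logging.error("ALERT: %s (%s:%s) unreachable", name, host, port)
        time.sleep(60)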
9. Obtaining the required Availability: environmental contrast measures (IMMe)

Here is a quick analysis of the risks, and of the related contrast techniques, associated with the environmental conditions of the sites where ICT equipment is located, bearing in mind that the fundamental reference certification for data centers is that provided by the international Uptime Institute.

RISKS
-major natural disasters (devastating earthquakes, floods, large fires, landslides, etc)
-minor earthquakes
-atmospheric electrical discharges
-local flooding
-local fires
-overheating of premises
-premises with excessively high or low moisture
-power surges of non-atmospheric origin
-power supply interruption
-severing of cables or other data network interruptions
-intrusion by unauthorized persons
-riots, strikes, violent demonstrations

COUNTERMEASURES to guarantee an environmental immunity index of 99.999 %

-major natural disasters
Disaster Recovery plan.

-minor earthquakes
If data centers, network nodes, etc are to be located in new premises, then construction should be quake-proof; if, vice versa, they are to be located in existing premises, then these should be properly adapted.

-atmospheric electrical discharges
Protection from lightning is normally provided by means of very simple devices such as lightning rods, which only prevent these discharges from hitting a specific physical area. There are however critical situations where wider protection of these specific areas is required, in the sense that they should be safeguarded not only from being directly hit, but also from the electromagnetic disturbances caused by lightning. There are cases, moreover, where such electromagnetic interferences may be of local origin, such as, for example, hospitals where medical devices operate at high field strengths (CAT, MRI, PET) or industrial facilities where production cycles involve the generation of voltages and electric currents of great intensity (aluminum production factories, glass production factories, etc). In all these cases the protection to be provided is the installation of a classical Faraday cage that completely isolates the protected area electromagnetically.

-local flooding
The first and primary measures to protect ICT devices from the risk of local flooding are:
-not housing this equipment in basements or in premises even slightly below road level;
-placing it on platforms raised at least 15-20 cm;
-sheltering it, even if located in an indoor area, with a sloping roof.
Preventive measures also include installing water detectors, which use sensors to detect the presence of water in the environment and immediately report it through alarm systems with sirens, buzzers, etc, or even directly alert the Security Manager by way of mobile text messages and/or e-mails.

-local fires
The fire hazard is the most common and most dangerous, due to its devastating effects. The measures for fighting this risk can be both passive and active.
PASSIVE PROTECTION:
-floors and coatings made of fire-resistant materials
-physical firewalls
-safety distances between ICT devices
ACTIVE PROTECTION:
-smoke and heat sensors
-smoke and heat extractors
-automatic shutdown devices
-manual extinguishers

-overheating of premises
The set of issues that relate to this environmental aspect is generally referred to with the acronym HVAC (Heating-Ventilation-Air Conditioning), which in the case of large data centers plays an important role, leading to specific and advanced solutions. The overheating of the premises where ICT equipment is located is generated by the heat produced by the ICT equipment itself, by other, non-ICT equipment that may be present, and by the weather outside, which can be particularly aggressive in the summer, or by all these causes together. The optimum range of the environmental temperature is 20°C - 25°C. The climate in the ICT premises is therefore a fundamental aspect of the physical protection of computer equipment. The calculation of the BTU (British Thermal Unit) rating of the air conditioning to be installed should take into account the volume of the premises, the time series of summer temperatures characteristic of the geographical location, the amount of heat produced by all the equipment on the premises according to its nominal values (technical data sheets), and also the possible future installation of new equipment beyond that provided for by the initial project. A well-sized plant normally includes redundancy of the cooling apparatus and a system that automatically reports temperature increases to the Security Manager.
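A back-of-envelope version of the BTU calculation just described can be sketched as follows, covering only the equipment-heat term (volume and summer temperatures, also required above, are left out). The 1 W ≈ 3.412 BTU/h conversion is standard, while the 30 % growth margin is an assumption of this sketch, not a figure from the text:

    WATT_TO_BTU_H = 3.412  # 1 watt of dissipated power ~ 3.412 BTU/h of heat

    def cooling_load_btu_h(equipment_watts: float, growth_margin: float = 0.30) -> float:
        """Rough cooling load from the nameplate power draw of the equipment,
        with headroom for the future installation of new devices."""
        return equipment_watts * WATT_TO_BTU_H * (1 + growth_margin)

    # e.g. a room whose equipment draws 12 kW overall:
    print(f"{cooling_load_btu_h(12_000):,.0f} BTU/h")  # ~53,200 BTU/h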
-premises with excessively high or low moisture
The range within which the values of ambient humidity can be considered acceptable is 30-60%. It is up to the air conditioning system to ensure that these climatic conditions are maintained.

-power surges of non-atmospheric origin
Electrical surges of non-atmospheric origin can have various causes, such as irregularities in the power supply, electrostatic discharge due to the accumulation of electric charges by friction, etc. The protections consist of the usual UPS, which will be discussed more extensively below, and recently also of surge arresters (SPD - Surge Protective Device), once used only on high and medium voltage (HV and MV) electricity networks but now widespread on low voltage (LV) ones as well.

-electrical power failure
A lack of power is of course an extremely serious risk and one of the main threats to business continuity. The usual countermeasure consists in the installation of a static UPS (Uninterruptible Power System) buffering a dynamic UPS (electrical generator), the latter large enough to power the entire data center. The static UPS, having a response time practically equal to "zero", allows the electrical generator, which conversely has a response time of a few minutes, to come into regular operation with no interruption in the power supply. In other words, the static UPS intervenes automatically on the protected devices when the power supply fails and simultaneously sends a signal to the dynamic UPS, which starts up and, once fully operating, replaces the static UPS (a toy model of this handover is sketched below). Greater security is achieved by providing redundancy of the main power system itself.
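A toy model of the static/dynamic UPS handover just described; the two-minute spin-up time and the class and method names are illustrative assumptions:

    import time

    class PowerManager:
        """Static UPS bridges the gap instantly; the generator takes over later."""
        GENERATOR_STARTUP_S = 120  # "a few minutes" in the text; illustrative here

        def __init__(self):
            self.static_ups_carries_load = False

        def on_mains_failure(self):
            self.static_ups_carries_load = True   # response time ~ zero
            print("mains lost: static UPS carrying the load, generator starting")
            time.sleep(self.GENERATOR_STARTUP_S)  # generator spin-up
            self.static_ups_carries_load = False
            print("generator at full power: load transferred, UPS back to buffering")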
-severing of cables or other data network interruptions
Data networks, as is well known, consist of so-called passive materials and of a large number of active components (routers, firewalls, switches) located in multiple, distinct places interconnected by a set of cables, copper or optical fiber, as well as wireless access points, which together constitute the wiring (campus, building and floor). The main risk the wiring runs, particularly for underground cables, is accidental severing due to work not performed with due caution, damage by mice, or other accidental interruptions. The classical countermeasure, in addition to the use of rodent-proof cables, is to duplicate the cabling as much as possible. In the case of campus wiring, the best topologies for ensuring maximum continuity of operation are:
a. the "ring topology", where the head and the tail of the wiring are connected together to form a ring, so that the data can travel in both directions of the ring to reach their destination. In this way, an interruption at any single point of the ring does not cause discontinuity of the connection.
b. the "meshed topology", considerably more expensive, where each node is connected to all the others. In a meshed topology, a LACP (Link Aggregation Control Protocol) switch configuration allows redundant paths and a significant increase in network throughput.
In the presence of two or more data centers, their connection will also be made redundant, with double cable laying such that the two runs do not follow the same path.

-intrusion by unauthorized persons
The simplest and most immediate measure for preventing the risk of intrusion by unauthorized persons is the presence of control personnel at the entrances. More sophisticated measures include installing electronic authentication, such as keypad codes, magnetic badge readers, eye (iris) readers, fingerprint readers and similar devices. Alarm systems and video surveillance can also be added, the configuration of which can vary greatly in terms of size, sophistication and cost.

-riots, violent labor demonstrations, acts of terrorism
The only measures to deal with these types of risk, in addition to those already mentioned in the previous paragraph, are to provide private armed security guards and an on-line connection to the police station.

10. Obtaining the required Availability: digital contrast measures (IMMd)

EXTERNAL RISKS
-malware infections
-hacker attacks
-access by unauthorized users

INTERNAL RISKS
-processor overload
-network overload
-updating/developing new software

COUNTERMEASURES to guarantee a digital immunity index of 99.999 %

-malware infections
a. border antivirus
A border antivirus may be a stand-alone device consisting of hardware with specific antivirus software or, more usually, one function among others of a firewall installed to ensure network security. It is located on the physical border of the network, to intercept and directly block the intrusion of malware from the internet or other external networks.
b. local antivirus
A local antivirus consists of software installed directly on the device to be protected, to block external malware that for some reason has not been blocked beforehand by the border antivirus, as well as malware that spreads internally via the network or via input devices.
c. network antivirus
A network antivirus consists of a hardware device with its own specific software, delegated to the automatic and continuous updating of the local antivirus installed on each host. It therefore does not have a function of "contrast" but only of "upgrade".
d. autoimmune Operating Systems
The Operating Systems most susceptible to malware attacks, and by far the most vulnerable, are those of the MS/Windows family. Unix and Linux family Operating Systems, as well as the Mac (Apple) Operating Systems and the proprietary IBM Operating Systems, are on the other hand solid and virtually immune to viruses.

-hacker attacks
a. firewall
The classic protection against hacker attacks is the firewall: a hardware-software device which guarantees many protection functions and moreover operates as a network divider.
b. cryptography
Encrypting data (from the ancient Greek words kryptos, hidden, and graphein, to write) means hiding them, i.e. making them unintelligible to anyone who is not authorized to access them, thus preventing theft, malicious use and possible locking of the information system. The greater the length of the encryption key (128-256 bits), the greater the security that is achieved. A widely used protocol for secure communications on the internet and intranets is https, asymmetric (private key + public key) and based on SSL (Secure Socket Layer), designed especially to counter MIM (Man In the Middle) attacks.
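As a concrete taste of encryption, here is a round trip with the third-party Python `cryptography` package. Note that this is symmetric, Fernet-style encryption (a 128-bit AES key under the hood); the asymmetric public/private key scheme used by https/SSL described above involves a key pair and certificates instead:

    # pip install cryptography   (third-party package)
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # fresh symmetric key; whoever holds it can decrypt
    f = Fernet(key)

    token = f.encrypt(b"customer record #4711")  # unintelligible without the key
    print(token)
    print(f.decrypt(token))       # b'customer record #4711'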
c. appliance
An appliance is a combination of hardware, Operating System and application software, pre-assembled in the factory and used to perform specific application functions. The term "appliance", in fact, comes from "application equipment" and indicates a device designed for one specific function, which is not flexible and not multi-purpose. Appliances with advanced anti-hacker functions are currently on the market, based on particularly innovative techniques such as:
-"intelligence" of the big data analytics type;
-misleading "simulation" environments designed to deceive and divert hackers. These are also called "honey pots", i.e. environments that appear to contain information and/or devices of possible interest to hackers, but which are actually traps isolated from the real information system. Honey pots are tools generally used in information systems with very high criticality and security requirements, such as military ones.

d. dedicated internet access
Since intrusions by hackers come mainly from the internet, a "last resort" to avoid them is to completely isolate the company network from that world and create an autonomous and different network for internet access, completely separate and disconnected from the internal LAN, with workstations dedicated solely to this function. There remains, however, the risk that internal users create hidden unauthorized connections to the internet directly from their company PC clients, through independent private access via telephone lines or Wi-Fi connections. To counter these potential risks, virtualized thin client solutions must be used, or ordinary clients must be sterilized.

e. thin client virtualization
Thin clients are minimal PCs with no moving components such as hard drives, CDs, fans, etc, which do not even have their own Operating System. Their operability therefore depends on a central server, to which they are constantly connected. Their virtualization consists of configuring in the data center, using appropriate virtualization software, as many virtual PCs as there are physical counterparts.

f. PC sterilization
Sterilizing a PC means deactivating all its inputs, including booting from the local C drive. In other words, it means transforming the PC client into a sort of "thin client", or even into an old "green screen terminal", enabled only for the functions provided by the institutional software loaded on it. The technical difficulties associated with the reactivation of these devices make it unlikely that users can act independently in this regard.

-access by unauthorized users
a. authentication
Authentication credentials (User ID and Password) are a basic protective measure that is always provided, even if it is easy to overcome. Profiling users, also called authorization, is another protective measure that, by assigning specific rights to each user, tends to reduce the possibility of damage due to erroneous or fraudulent actions. Mechanisms called strong authentication systems are based on the recognition of a personal attribute possessed only by the user:
-a physical characteristic (biometric authentication) such as a fingerprint, hand geometry, iris or retina, voice, etc;
-a dynamically generated password (one-time password) produced by a special device customized for each user (token); a minimal sketch of this mechanism is given further below;
-a digital certificate attesting the identity of the user, usually stored on a smart card.
Digital certificates exploit the asymmetric encryption technique based on the use of public keys. In order to use these mechanisms, reference must be made to a PKI (Public Key Infrastructure), i.e. an infrastructure that issues digital certificates and provides for their management (web publishing, revocation, suspension). The use of digital certificates allows the implementation of extremely important objectives in the field of computer security, such as the authenticity, integrity, confidentiality and non-repudiation of messages. In this scenario, each user has a pair of keys (public and private) that identifies him or her. The public key is placed in a directory published by the PKI, which unequivocally attests the user's membership. The private key is instead kept secret by the user.
An access control system often implemented in conjunction with strong authentication is the single sign-on (SSO) server. This technique is designed to facilitate access management in those systems where the user is faced with a multitude of heterogeneous workstations, servers and applications, and would be forced to perform authentication (login) whenever he or she needs to access one of them. In these situations an SSO system presents the user with a single instance of initial identification; then the system, using an internal Security Information Base, provides automatic log-in for all applications or systems. The SSO server manages independently and automatically the logging of new assignments, renewals or cancellations by direct conversation with the equipment and/or applications.

b. Proxy Server
The term "proxy" is a legal term indicating someone who acts on behalf of third parties with a specific delegation. The Proxy Server is a machine that, placed between two separate networks, operates as an intermediary between those two environments, enabling communication between them but masking the host addresses of one from users wishing to access it from the other.

c. No company wireless networks at all
Limiting the deployment of company wireless networks definitely reduces the risk of unauthorized access, especially by BYOD (Bring Your Own Device) equipment. The configuration of the so-called "secure perimeter" has changed significantly because of the large-scale use of mobile devices (laptops, tablets, smartphones, etc), which are now also frequently subject to malware attacks. Company employees increasingly use devices such as those just mentioned, which are owned by the user and not by the company, and tend to evade the general security rules implemented by the company itself.
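The one-time password mechanism mentioned under strong authentication can be sketched with the standard TOTP construction (RFC 6238), using only the Python standard library. The base32 secret below is a well-known test value, not a real credential:

    import base64
    import hashlib
    import hmac
    import struct
    import time

    def totp(secret_b32: str, interval_s: int = 30, digits: int = 6) -> str:
        """Time-based one-time password: HMAC-SHA1 over the current time step."""
        key = base64.b32decode(secret_b32, casefold=True)
        counter = int(time.time()) // interval_s          # 30-second time step
        digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
        offset = digest[-1] & 0x0F                        # dynamic truncation
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    print(totp("JBSWY3DPEHPK3PXP"))  # a new 6-digit code every 30 seconds

The token in the user's pocket and the authentication server share the secret and the clock, so both can compute the same short-lived code independently.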
-processor overload
Processing overloads of an occasional type, determined by special and contingent conditions affecting ordinary operation, can weigh heavily on business continuity: though they are not real blocks of the system, they may cause unacceptably long response times. These changed operating conditions may be due:
-to a sudden, unexpected increase in the number of simultaneously active users;
-to the concomitant and random activation of application functions that require a great commitment of hardware resources;
-to particularly complex database queries;
-to application software malfunctions.
a. load balancing
A protection technique widely used to combat this type of risk is the implementation of a load balancer, i.e. an apparatus able to spread the load over multiple cluster-connected machines and thus reduce the impact of possible extemporaneous overloads (a minimal sketch is given at the end of this subsection). A disadvantage of this technique is the fact that the load balancer becomes a "single point of failure" with respect to all the application servers, with the result that, instead of increasing the overall availability, it may even reduce it.
b. parallel computing
Parallel technologies, once confined to very narrow areas such as scientific computing and simulations in the financial, biological and meteorological fields, are now present in the world of business applications at an affordable cost. Parallel computing, also at the Data Base level, significantly shortens response times and safely absorbs sudden overloads, thus avoiding possible stalls of the system.
c. heuristic Data Bases
Heuristic Data Bases are classified as NoSQL (non-relational) Data Bases, which have a different structure from traditional Data Bases since they operate using heuristic algorithms. Developed to allow maximum and easier data integration, they reach such high performance in terms of transaction responsiveness as to make them very useful also for handling huge workloads. A typical example of a DBMS of this type is the MongoDB software, Open Source and free, licensed under the GPL (General Public License). Currently, DBMS of this kind are adopted by the managers of large Web sites and by multinational service companies such as eBay and the New York Times, just to name a few.
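Going back to point a., the essence of a load balancer can be reduced to a few lines. The round-robin policy and the host names are illustrative; real balancers also weigh server load and health:

    import itertools

    class RoundRobinBalancer:
        """Spread requests over a pool of cluster-connected application servers.
        Note: the balancer itself is the single point of failure discussed above."""

        def __init__(self, servers):
            self._pool = itertools.cycle(servers)

        def route(self, request):
            return next(self._pool), request

    lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical hosts
    for i in range(5):
        print(lb.route(f"req-{i}"))  # req-0 -> app-1, req-1 -> app-2, ...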
-network overload
As for servers, networks too can suffer temporary throughput overloads, due to changed and abnormal operating conditions.
For LAN connections:
a. LACP bonding
The most widely used measure countering this type of risk is the switch configuration named LACP (Link Aggregation Control Protocol) which requires, as already mentioned, a meshed topology where each network node is connected to all the others.
For WAN connections:
b. duplicated connectivity
Simultaneous connectivity supplied by two or more carriers. The redundant connection must have characteristics equal to those of the main one, to avoid, in case of its takeover, a performance degradation such as would not allow effective operation.
c. connection to one or more IXPs
An IXP (Internet Exchange Point), according to the official definition by the European Internet Exchange Association (Euro-IX), is a network infrastructure managed by an independent third party to support the data traffic of Internet Service Providers such as carriers, content providers, host providers, etc. Typically an IXP is based on an Ethernet LAN available to its users, who can exchange IP traffic with the other users present on the same IXP. The exchange of traffic is generally done on a VLAN shared by all connected users (public peering) or on a VLAN dedicated to the exchange of traffic between only 2 users (private peering). The connection to the IXP allows each user to use a single geographical stream (physical circuit) to interconnect to a variety of networks of other operators (Autonomous Systems - AS), avoiding the need to provision as many connections as there are ASes with which to exchange traffic. Besides the obvious economic and management benefits, the direct connection between operators via an IXP decreases the "distance" between the networks and therefore, as a result, offers a better service to internet users.

-updating/developing new software
The upgrading of existing software and the development of new software, activities always present in any information system, are particularly risky because of the interference they may generate with the software in normal operation. The usual countermeasure adopted to reduce this risk is to keep the development environment separate from the operating one, with separate and independent hardware-software platforms. Correspondingly, the transfer of updated or new software should be done:
-using time windows like those provided for scheduled maintenance;
-using any back-up server present, if the platform operates in a highly redundant fail-over system.

11. Obtaining the required Recovery Time Objective - RTO

Quickly bringing an information system back to normal operation from a locked state, due to a drawback that occurred during its operation, involves:
a. removal of the incident (hardware failure, software failure, physical attacks, cyber attacks, etc);
b. data recovery;
c. a system restart;
all of which must be performed as quickly as possible. The first and last of the listed activities are linked to company organizational aspects, both with regard to external measures (SLAs for support and maintenance) and to internal interventions (training and competence of the technical data center employees). The second is linked to the technologies and data saving methods implemented.

12. Obtaining the required Recovery Point Objective - RPO

The achievement of the desired RPO is closely related to the data saving technologies provided for the purpose, as shown in the following table:

Rescue methodology          RPO values
Ordinary back-up            hours
DB logging                  minutes
Asynchronous replication    minutes/seconds
Synchronous replication     tending to zero

• "ordinary back-up" refers to the traditional batch process of saving data by copying them to another medium, usually tape;
• "DB logging" means saving on-line, on a different support, only the data base logs which must be safeguarded. These logs, tracing the changes to the records, allow the data base to be rebuilt, at the time of the fault, starting from the last off-line back-up;
• "asynchronous replication" means recording data both on a primary storage and on a secondary one, the second possibly located at a great distance, with decoupled primary and secondary transactions;
• "synchronous replication" means recording data both on a primary storage and on a secondary one. The second, for transmission latency reasons, must be located near the primary one. The replication process is completed only when the data has been written definitively on both storages, primary and secondary; there is therefore no transaction decoupling in this case.
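The difference between the last two rows of the table can be made tangible with a toy model: in the synchronous case the transaction completes only after both writes, while in the asynchronous case the records still sitting in the replication queue are exactly the data at risk. All names here are illustrative:

    import queue
    import threading

    primary, secondary = [], []
    replication_q = queue.Queue()

    def synchronous_write(record):
        """Acknowledged only after BOTH copies are written: RPO tending to zero."""
        primary.append(record)
        secondary.append(record)  # must succeed before the transaction completes

    def asynchronous_write(record):
        """Acknowledged after the primary write; the copy is decoupled (RPO > 0)."""
        primary.append(record)
        replication_q.put(record)  # queued records = maximum possible data loss

    def replicator():  # background process draining the queue to the secondary
        while True:
            secondary.append(replication_q.get())

    threading.Thread(target=replicator, daemon=True).start()
    synchronous_write("txn-1")
    asynchronous_write("txn-2")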
13. A simple and easy "Availability" calculation example

Let us refer to an imaginary IT company that provides hosting services to a large number of customers connected via WAN to the company data center.

[Diagram: architecture solution - customers C1 ... Cn connected via Carrier 1 and Carrier 2 to the IT company primary site and disaster recovery (D.R.) site; each site comprises a UTM, central switches, a network AV, a PC LAN, a blade server, blade-SAN switches and a SAN, with a point-to-point link between the two sites.]

The primary site is completely duplicated by a disaster recovery site whose physical configuration is essentially identical to that of the primary site. The WAN is duplicated by means of two different carriers. Automatic switching from carrier-1 to carrier-2 occurs on the client node, when the switching function configured there detects that connecting device-1 has gone down. Automatic switching from the primary site to the disaster recovery site occurs c/o the carrier, when the switching function configured there detects that the connection device no longer communicates with the primary site.

The production equipment implemented on both sites (primary and disaster recovery) is:
-one UTM;
-two pairs of central switches;
-one Blade Server in a Secure LAN, hosting customer applications;
-one SAN, in a Secure LAN, hosting customer data;
-one network Antivirus in a DMZ;
-a variable number of workstations in a PC LAN;
-a point-to-point link between the Primary site and the Disaster Recovery site, redundant at level 2 of the ISO/OSI model through two fiber pairs connected directly to the two pairs of fail-over switches located respectively at the primary site and at the disaster recovery site.
The function of this link is two-fold:
-to enable synchronous replication of data from the primary site to the DR site;
-to allow redundancy of the Blade Server and SAN present in the primary data center, through their connection in fail-over with the corresponding equipment present at the DR site.
Redundancy of this critical equipment (Blade Server and SAN) is achieved on the remote-mirror disaster recovery platform by configuring, on the trunk ports of the switches, both primary and DR, a suitable VLAN between the counterpart devices defined in cluster on the two sites. All switch redundancy, on the other hand, is implemented locally by connecting two identical devices in fail-over.

[Diagram: serial-parallel structure of the system - users, the two WAN carriers, UTM and central switch pairs at the primary and DR sites, the P-DR link, the blade server and SAN pairs with their blade/san switches, with physical and digital threats acting on the whole chain.]

Equipment and service sizing

Devices, networks, support services and threat countermeasures have all been determined by subsequent adjustments, according to the iterative procedure previously mentioned, so that the obtained Availability value satisfies the required Availability. Here are the final results of the said iterative procedure:

WAN: availability contractually agreed with both carriers: Awan = 0.99900
UTM (Primary and DR): top model device in "high reliability and availability configuration" by Check Point Software Technologies Ltd: MTBF = 370,000 h
Switches: top model Nexus 7000 series in "high reliability and availability configuration" by Cisco Inc: MTBF = 318,572 h
Fiber point-to-point link: availability contractually agreed with one of the selected carriers: Alink = 0.99900
Blade Server: HP C7000 in "high reliability and availability configuration": MTBF = 382,500 h
SAN: NetApp E5560 in "high reliability and availability configuration": MTBF = 316,444 h
Maintenance: SLA contractually agreed with the service providers: MTTR = 8 h
Corrective factors IMMe and IMMd, depending on the countermeasures implemented: 0.99999 and 0.99999
Availability calculation

Referring to the formulas:

Atot-serial = A1 x A2 x ... x An
Atot-parallel = 1 - (1-A1) x (1-A2) x ... x (1-An)
An = (MTBFn - MTTRn)/MTBFn

the chain to be evaluated is:

Atot = Awan x Autm/p x Aswitch/p x Alink x Autm/dr x Aswitch/dr x Ablade x Aswitch-blade/san x Asan x IMMe x IMMd

Awan = 1 - (1-Awan1) x (1-Awan2) = 1 - (1-0.99900) x (1-0.99900) = 0.99999
Alink = 0.99900
Autm/p = Autm/dr = (MTBF-MTTR)/MTBF = (370,000-8)/370,000 = 0.99997
Aswitch1 = Aswitch2 = (MTBF-MTTR)/MTBF = (318,572-8)/318,572 = 0.99997
Aswitches = 1 - (1-Aswitch1) x (1-Aswitch2) = 1 - (1-0.99997) x (1-0.99997) = 1 (better, 0.99999)
Ablade1 = Ablade2 = (MTBF-MTTR)/MTBF = (382,500-8)/382,500 = 0.99997
Ablades = 1 - (1-Ablade1) x (1-Ablade2) = 1 - (1-0.99997) x (1-0.99997) = 1 (better, 0.99999)
Asan1 = Asan2 = (MTBF-MTTR)/MTBF = (316,444-8)/316,444 = 0.99997
Asans = 1 - (1-Asan1) x (1-Asan2) = 1 - (1-0.99997) x (1-0.99997) = 1 (better, 0.99999)

Therefore:

Atot = 0.99999 x 0.99997 x 0.99999 x 0.99900 x 0.99997 x 0.99999 x 0.99999 x 0.99999 x 0.99999 x 0.99999 x 0.99999 = 0.99886

which means about 10 hours per year of probable information system failure, i.e. less than 1 hour per month, for an information system active 24 hours a day.
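The whole worked example can be re-run in a few lines of Python. The result comes out slightly higher than the 0.99886 above (about 0.99894, roughly 9 hours per year) because here the factors are multiplied at full precision, whereas the text conservatively rounds each one down to five decimals first:

    def availability(mtbf_h: float, mttr_h: float = 8) -> float:
        return (mtbf_h - mttr_h) / mtbf_h

    def parallel(a1: float, a2: float) -> float:
        return 1 - (1 - a1) * (1 - a2)

    a_wan    = parallel(0.99900, 0.99900)                 # two carriers
    a_link   = 0.99900                                    # P-DR fiber link
    a_utm    = availability(370_000)                      # UTM, per site
    a_switch = parallel(availability(318_572), availability(318_572))
    a_blade  = parallel(availability(382_500), availability(382_500))
    a_san    = parallel(availability(316_444), availability(316_444))
    IMM_E = IMM_D = 0.99999

    # the serial chain of section 13: WAN, primary site, link, DR site,
    # blade and SAN clusters, plus the two immunity corrective factors
    a_tot = (a_wan * a_utm * a_switch * a_link * a_utm * a_switch
             * a_blade * a_switch * a_san * IMM_E * IMM_D)
    print(f"Atot = {a_tot:.5f} -> {(1 - a_tot) * 8760:.1f} h/year of downtime")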