Fault-Tolerant Computer System Design ECE 695/CS 590 Topic 9: Validation Saurabh Bagchi ECE/CS Purdue University ECE 695/CS 590 1 Outline Introduction Validation methods Design phase – Fault simulation Prototype phase – HW or SW implemented fault injection Operational phase – Measurement and analysis of field systems ECE 695/CS 590 2 1 Challenges • Assessing the system dependability for – different technologies – different computers – different network topologies – different communication protocols • Validating the networked system characteristics – dependability: (i) user level, (ii) system level, and (iii) network level – availability validation: (i) detection and (ii) recovery ECE 695/CS 590 3 Experimental Analysis Different Design Phases Early Design Phase Prototype Phase Approach and Goals: Approach and Goals: • CAD environments used to evaluate design via simulation • Simulated fault injection experiments • Evaluate effectiveness of fault-tolerant mechanisms • Provide timely feedback to system designers Information produced fault models, error latency, error detection coverage, recovery time distribution Limitation: Simulations need accurate inputs and validation of results • System run under controlled workload conditions • Controlled fault injections used to evaluate system in presents of faults Information produced error latency, propagation detection distributions, availability Operational Phase Approach and Goal: • Measurement-based approach to study naturally occurring errors • Study systems in the field, under real workloads • Analyze collected error and performance data • Provide information on actual failure characteristics Information produced actual failure characteristics, Limitation: failure rates, time to failure Injected faults should distribution create/induce failure Limitation: scenarios representative of Reported errors are actual system operation representative of the systems studied 2 Evaluation - Experimental Methods Design phase – Fault simulation (simulated fault injection) – Electrical, logic, or functional level – Hierarchical simulation Issues: simulation time, level of simulation, fault conditions, accurate fault models Prototype phase – Fault injection in prototype systems – Hardware fault injection – Software fault injection – Radiation-based fault injection Issues: Fault models, joint HW /SW fault injection, injection into networked applications Operational phase – Study of naturally occurring faults in real environments – Essential for believable analysis of today’s complex systems – What can we say about future systems based on measurements from current systems Issues: HW/SW instrumentation, analysis tools ECE 695/CS 590 5 Key Issues in Fault Injection Effective fault injection mechanisms using hardware, software, and hybrid technology to accurately assess and validate networked systems Practical evaluation methods to accurately quantify fault effect and recovery mechanisms in complex environments Evaluation of error detection, diagnosis, and recovery techniques Quantification of confidence in the fault-injection based validation Usable fault tolerance benchmark for assessing systems and networks Common evaluation/validation framework ECE 695/CS 590 6 3 Metrics of Fault Injection Fault activation refers to the activation or access of injected faults. – For example, fault injected into a register is activated when that register is read before the register is overwritten. Fault activation level refers to the ratio of the number of activated faults (Fa) to the total number of injected faults (Fi). Therefore, fault activation level = Fa/Fi Metrics for a fault tolerant technique – Accuracy/Recall = (Number of activated faults that were dealt with)/(Total number of activated faults) – Precision = (Number of cases out of the denominator that were actual activated faults)/(Number of cases that were dealt with by the fault-tolerant technique) ECE 695/CS 590 7 Design Phase: Simulation at Different Levels Electrical level – transistor Logic level Function level – circuit circuit chip VLSI systems – VLSI system computer and network systems Levels of Simulated Fault Injection Fault Injection Electrical level Logic level Function level Change current Change voltage Stuck-at 0 or 1 Inverted fault Change CPU registers Network Flip memory bit Electrical Circuits ECE 695/CS 590 Physical Process Logic Gates Logic Operation Functional Units 8 4 Questions to Ask In Simulated Fault Injection In simulation-based fault injection, the whole system behavior is modeled and faults are injected to the simulated model of the systems This technique is often applied in the design phases to allow test and validation of error handling techniques before a physical prototype is available Fault models – Fault conditions, fault types, number of faults, fault times, fault locations Workload – Real applications, benchmarks, synthetic programs Simulation time explosion – Mix-mode simulation, importance sampling, concurrent simulation, accelerated fault simulation, hierarchical simulation ECE 695/CS 590 9 Simulated Fault Injection at Electrical Level Why is it needed? – Study the impact of physical causes – Simple stuck-at models do not represent many real types of faults Transistor Level Simulation Ionizing particles hit Vdd Device Physics Level Simulation B A SiO2 B AB n+ p+ p n+ channel stop n Fault model GND - is the angle of incidence ECE 695/CS 590 10 5 Simulated Fault Injection at Logic Level Fault Mode – Basic models • stuck-at (permanent) - forcing logic value for entire simulation duration • inverted fault (transient) - altering logic value momentarily – Fault dictionary approach • Use electrical level simulation to derive logic-level fault models • dictionary entry - input vector, injection time, fault location Transistor Level Description of 4-bit Adder A(3:0) A B Cin Input 0000 0000 0 B(3:0) Logic Level Fault Dictionary Current-Burst Fault Model S Cout ------F --FF-F- Cout Cin 34% 39% 7% 20% : Input 1111 1111 1 ---- F 23% ---F 1% --FF F 9% -F-- 33% -FFF 33% FFFFF 1% S(3:0) For all nodes, for all input combinations ECE 695/CS 590 F F F F 11 Simulated Fault Injection at Chip Level FOCUS & SPLICE! – A chip-level simulation environment – Acceleration: mix-mode simulation, importance sampling FOCUS Experimental Environment Target system description Mixed-mode hierarchical Fault simulation Fault description type of fault transient/ stuck-at location/time Fault tracing facility Automatic fault injection Trace Impact analysis Graphical Analysis Visual identifications Error propagation Manifestation ECE 695/CS 590 Statistical analysis Design feedback 12 6 Simulated Fault Injection at Function Level Diversity of Components – Object-oriented approach Fault Models – Various types - depending on the type of components – Examples • Single bit-flip for a memory or register fault • Message corruption for communication channel fault • Service interrupt for a node fault – More detailed fault models derived from lower-level simulation Impact of Software – Impact of faults is application dependent – Software effect can be studied at this level ECE 695/CS 590 13 Hierarchical Fault Simulation (Reference [1]) Example: LANai Processor of Myrinet Network Switch Hardware Local Network Chip Level Simulation Host 1 System Level Simulation Software Host 2 Myrinet Control Program Switch module i Host 3 module j module j Intrf. Host 4 Host Interface 256K memory Other details: Custom processor DMA, (LANai) Vdd A LANai Reg Ctrl Reg Logic Level Simulation ECE 695/CS 590 Electric Level Simulation AB Fault model ALU ADDER Reg B GND Device Physics Level Simulation Transistor Level Simulation p+ n+ channel stop n+ Alpha particles hit SiO2 p n 14 7 Prototype Phase: Software Implemented Fault Injection Advantages: flexibility, low cost Disadvantages: perturbation to workload, low time resolution Targets for software fault injection Software faults and errors – modify the text/data segment of the program Memory faults CPU faults Bus faults – flip memory bits – modify CPU registers, cache, buffers – use traps before and after an instruction to change the code or data used by the instruction and then restore them after the instruction is executed Network faults – modify or delete transmitted messages – introduce faults in network controllers, drivers, buffers ECE 695/CS 590 15 Characteristics of SWIFI ECE 695/CS 590 16 8 Runtime Characteristics of SWIFI (Continued) ECE 695/CS 590 17 Difficulties in using SWIFI Tools Often fault injection is needed to answer the following questions – Comparative studies – is my app more reliable on Android or iOS? – Dependability benchmarking – rank the various flavors of Linux by their reliability – System-level evaluation of large systems – comprising heterogeneous components – Special-purpose embedded systems – which have resource constraints To answer these questions, the SWIFI tool needs to be ported to different platforms Porting effort is high ECE 695/CS 590 18 9 NFTAPE (Reference [3]) NFTAPE is a tool for conducting automated fault/error injection-based dependability characterization Tool, which enables a user: (1) to specify a fault/error injection plan, (2) to carry on injection experiments, and (3) to collect the experimental results for analysis. Facilitates automated execution of fault/error injection experiments. Targets assessment of a broad set of dependability metrics, e.g., availability, reliability, coverage, mean time to failure. Operates in a distributed environment. Imposes minimal disturbance of target systems ECE 695/CS 590 19 NFTAPE Approach to Fault/Error Injection Lightweight Fault/Error Injectors (LWFIs) – Small (in terms of a code size) entities responsible for injecting faults/errors – Rely on other processes for services such as fault triggering, data logging, etc. – Example LWFIs: driver-based, target-specific API for implementing and invoking Triggers, LWFIs, and support programs – Less work to create components necessary for conducting fault/error injection experiments – Components conform to standard interface – Components can be reconfigured using the standard interface (e.g., swap triggers) Reusability – LWFIs and triggers collected in a library – Same configuration, startup, and logging process for different fault/error injection campaigns – One learning curve ECE 695/CS 590 20 10 NFTAPE Architecture Error Injection Targets Control Host LAN Campaign Script Process Manager Control Host Process Manager Log ECE 695/CS 590 Injector Process Application Process Injector Process Application Process 21 Control Host & Process Manager Control Host Processes a Campaign Script, a file that specifies a state machine or control flow followed by the control host during the fault injection campaign Simple yet general way to customize a fault injection experiment Experiments controlled by the Common Control Mechanism Implemented in Java to ensure portability Process Manager Daemon on each target node to manage processes on the target node(s) including process execution and termination processes include: injectors, workloads, applications, monitors all processes are treated the same as an abstract process object rather than a process of some specific type Facilitates communication between the Control Host and Target Nodes. ECE 695/CS 590 22 11 Fault Injectors - examples Driver-based injector - uses dedicated device driver to inject memory, registers, OS functions, I/O devices (e.g., Linux, Solaris) Ptrace-based injector (e.g., Linux, Lynx) - controlled injection to the target process memory and registers; Proc file based injector (e.g., Solaris) – controlled injection to the memory image of a process Network injector - employs dedicated software for controlling network hardware to inject faults into network cards/controllers; can intercept and corrupt messages Use of performance monitors (built into the CPU) to trigger fault injection Injection of delay faults/errors – controlled, temporary suspending of interrupts ECE 695/CS 590 23 State Machine of an Example Campaign Script Start Application Activate Trigger ST_run_app Initialization (variable, events), Start a logfile ST_init TRUE Condition_1 Condition_2 ST_Trigger_ON ST_start_fi_trig ER_condition_2 Start the Fault Injector & Trigger processes ST_error2 ER_condition_1 ST_error1 ECE 695/CS 590 Inject errors Deactivate the Trigger after a specified time TRUE Condition_3 TRUE ST_finish Terminate processes Exit Control Host 24 12 Hardware Fault Injector for Inducing Network Faults A versatile device for supporting injection of random and specific faults in a network environment. Design based on FPGA as the core structural component, which allows the device to be programmed to: – accept configuration commands generated either internally (i.e., by the device itself) or by an external system – provide other services e.g., data monitoring, statistics gathering, or prototyping new designs Design enables precise synchronization of fault injection hardware with target systems – Myrinet network and Fiber Channel - while running at-speed of the network Fault injector inserted in the data path (i.e., in the network) to decode the data patterns, modify the data if necessary (i.e., inject faults), and then retransmit the data (i.e., send the corrupted/modified data to the network). ECE 695/CS 590 25 Capabilities of HW Fault Injection Techniques Summary Injection method Synchroni zation with the system activity Pin-level at the signal Speculative level; not signal partitioning Injection at the system speed Targeting faults to specific locations Guarantees Example tools and of fault references activation Yes Yes Low speed Messaline [Arl89], Hybrid (signallevel & SWIFI) [Kan95] Full Yes Yes (if the delay induced by the inserted logic does not exceed that allow by the system) Yes Messaline [Arl89], Rifle[Mad94] Heavy Ion Radiation Speculative Yes No No [Kar94] Power supply disturbances Speculative Yes No No [Mir95] Electromagnetic interferences Speculative Yes No No [Kar97] Laser (LFI) Speculative Yes Yes No [Sam97] Built-in logic Full No Yes Yes FIMBUL [Fol98]] Pin-level by signal interception; Signal partition by inserting the injection logic into the path of the signal ECE 695/CS 590 26 13 Assembled Fault Injector ECE 695/CS 590 27 Conclusion Fault injection is a means to effectively test and stress fault tolerance mechanisms before they are deployed in the field on an actual system Fault injection can be done through hardware or software Software-implemented fault injection (SWIFI) is the dominant mode of system verification, especially for complex software systems It is common to compare multiple systems (or versions of the same system) using dependability attributes collected from a fault injection campaign ECE 695/CS 590 28 14 References 1. Stott, D. T., Ries, G., Hsueh, M. C., & Iyer, R. K. (1998). Dependability analysis of a high-speed network using software-implemented fault injection and simulated fault injection. Computers, IEEE Transactions on, 47(1), 108-119. 2. Z. Kalbarczyk, R. K. Iyer, G.L. Ries, J.U. Patel, M.S. Lee, and Y. Xiao “Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation”, IEEE Transactions on Software Engineering, vol. 25, no.5, September/October 1999, pp.619-632. 3. Stott, D.T.; Floering, B.; Burke, D.; Kalbarczpk, Z.; Iyer, R.K., "NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors," Computer Performance and Dependability Symposium, 2000. IPDS 2000. Proceedings. IEEE International , vol., no., pp.91-100, 2000 ECE 695/CS 590 29 15
© Copyright 2026 Paperzz