Validation Fault-Tolerant Computer System Design ECE 695/CS 590

Fault-Tolerant Computer System Design
ECE 695/CS 590
Topic 9: Validation
Saurabh Bagchi
ECE/CS
Purdue University
Outline

Introduction
Validation methods

Design phase

– Fault simulation

Prototype phase
– HW or SW implemented fault injection

Operational phase
– Measurement and analysis of field systems
Challenges
• Assessing the system dependability for
– different technologies
– different computers
– different network topologies
– different communication protocols
• Validating the networked system characteristics
– dependability: (i) user level, (ii) system level, and (iii) network
level
– availability validation: (i) detection and (ii) recovery
Experimental Analysis: Different Design Phases

Early Design Phase
Approach and Goals:
• CAD environments used to evaluate design via simulation
• Simulated fault injection experiments
• Evaluate effectiveness of fault-tolerant mechanisms
• Provide timely feedback to system designers
Information produced: fault models, error latency, error detection coverage, recovery time distribution
Limitation: Simulations need accurate inputs and validation of results

Prototype Phase
Approach and Goals:
• System run under controlled workload conditions
• Controlled fault injections used to evaluate the system in the presence of faults
Information produced: error latency, propagation, detection distributions, availability
Limitation: Injected faults should create/induce failure scenarios representative of actual system operation

Operational Phase
Approach and Goal:
• Measurement-based approach to study naturally occurring errors
• Study systems in the field, under real workloads
• Analyze collected error and performance data
• Provide information on actual failure characteristics
Information produced: actual failure characteristics, failure rates, time to failure distribution
Limitation: Reported errors are representative of the systems studied
Evaluation - Experimental Methods

Design phase
– Fault simulation (simulated fault injection)
– Electrical, logic, or functional level
– Hierarchical simulation
Issues: simulation time, level of simulation, fault conditions, accurate fault models

Prototype phase
– Fault injection in prototype systems
– Hardware fault injection
– Software fault injection
– Radiation-based fault injection
Issues: fault models, joint HW/SW fault injection, injection into networked applications

Operational phase
– Study of naturally occurring faults in real environments
– Essential for believable analysis of today’s complex systems
– What can we say about future systems based on measurements from current systems?
Issues: HW/SW instrumentation, analysis tools
Key Issues in Fault Injection

Effective fault injection mechanisms using hardware,
software, and hybrid technology to accurately assess
and validate networked systems

Practical evaluation methods to accurately quantify fault
effect and recovery mechanisms in complex
environments

Evaluation of error detection, diagnosis, and recovery
techniques

Quantification of confidence in the fault-injection based
validation

Usable fault tolerance benchmark for assessing systems
and networks

Common evaluation/validation framework
Metrics of Fault Injection

Fault activation refers to the activation or access of injected
faults.
– For example, a fault injected into a register is activated when that register is read before it is overwritten.


Fault activation level refers to the ratio of the number of
activated faults (Fa) to the total number of injected faults (Fi).
Therefore, fault activation level = Fa/Fi

Metrics for a fault-tolerant technique
– Accuracy/Recall = (Number of activated faults that were dealt with)/(Total number of activated faults)
– Precision = (Number of cases dealt with that were actual activated faults)/(Number of cases that were dealt with by the fault-tolerant technique)
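
The three ratios above can be computed directly from campaign counts. The following is a minimal sketch; the class name, field names, and the numbers are illustrative assumptions, not results from any real campaign.

```python
from dataclasses import dataclass


@dataclass
class CampaignCounts:
    """Hypothetical tallies from one fault injection campaign."""
    injected: int           # Fi: total number of injected faults
    activated: int          # Fa: injected faults that were activated
    handled_activated: int  # activated faults the technique dealt with
    handled_total: int      # all cases the technique acted on, incl. false alarms


def activation_level(c: CampaignCounts) -> float:
    # Fault activation level = Fa / Fi
    return c.activated / c.injected


def recall(c: CampaignCounts) -> float:
    # Activated faults dealt with / total activated faults
    return c.handled_activated / c.activated


def precision(c: CampaignCounts) -> float:
    # Handled cases that were real activated faults / all handled cases
    return c.handled_activated / c.handled_total


# Made-up numbers, for illustration only.
counts = CampaignCounts(injected=1000, activated=400,
                        handled_activated=360, handled_total=375)
print(activation_level(counts), recall(counts), precision(counts))
```

A low activation level means most injections never exercised the mechanism under test, so recall and precision are then estimated from few samples.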
Design Phase: Simulation at Different Levels

Electrical level
– transistor → circuit

Logic level
– circuit → chip, VLSI systems

Function level
– VLSI system → computer and network systems
Levels of Simulated Fault Injection
[Figure: fault injection at three simulation levels. Electrical level (electrical circuits, physical process): change current, change voltage. Logic level (logic gates, logic operation): stuck-at 0 or 1, inverted fault. Function level (functional units): change CPU registers, flip memory bit, network.]
Questions to Ask In Simulated Fault Injection



In simulation-based fault injection, the whole system behavior is modeled and faults are injected into the simulated model of the system.

This technique is often applied in the design phases to allow test and validation of error handling techniques before a physical prototype is available.

Fault models
– Fault conditions, fault types, number of faults, fault times, fault locations

Workload
– Real applications, benchmarks, synthetic programs

Simulation time explosion
– Mixed-mode simulation, importance sampling, concurrent simulation, accelerated fault simulation, hierarchical simulation
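
A fault model in this sense can be written down as a small record (type, location, time) from which a fault list is sampled. The sketch below assumes uniform random sampling and uses hypothetical names (`Fault`, `sample_fault_list`) and made-up fault types and locations; a real campaign would choose distributions to match the fault conditions being studied.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    kind: str      # fault type, e.g. "stuck-at-0" or "bit-flip"
    location: str  # node or component to inject into
    time: int      # simulation cycle at which the fault is injected


def sample_fault_list(kinds, locations, max_time, n, seed=0):
    """Draw n faults uniformly at random; the seed makes the list reproducible."""
    rng = random.Random(seed)
    return [
        Fault(rng.choice(kinds), rng.choice(locations), rng.randrange(max_time))
        for _ in range(n)
    ]


# Illustrative values only.
kinds = ["stuck-at-0", "stuck-at-1", "bit-flip"]
locations = ["alu.out", "reg.r3", "mem[0x40]"]
fault_list = sample_fault_list(kinds, locations, max_time=10_000, n=5, seed=42)
for fault in fault_list:
    print(fault)
```

Fixing the seed lets the same fault list be replayed against different workloads, which keeps comparisons between runs fair.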
Simulated Fault Injection at Electrical Level

Why is it needed?
– Study the impact of physical causes
– Simple stuck-at models do not represent many real types of
faults
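[Figure: transistor-level simulation of an ionizing particle hit, with the fault model applied to a circuit between Vdd and GND (nodes A, B, AB), alongside a device-physics-level simulation of the struck device (SiO2, n+, p+, p, and n regions, n+ channel stop) showing the angle of incidence of the particle.]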
Simulated Fault Injection at Logic Level

Fault Models
– Basic models
• stuck-at (permanent) - forcing logic value for entire simulation duration
• inverted fault (transient) - altering logic value momentarily
– Fault dictionary approach
• Use electrical level simulation to derive logic-level fault models
• dictionary entry - input vector, injection time, fault location
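
The two basic models can be illustrated on a small logic-level simulation. The sketch below is an assumption-laden toy: a 4-bit ripple-carry adder whose only fault sites are the carry-out nodes, with hypothetical helper names (`apply_fault`, `ripple_add`, `simulate`). A stuck-at fault is forced on every cycle; an inverted fault alters the value on a single cycle only.

```python
def apply_fault(value, kind):
    """Return the node value after applying a logic-level fault."""
    if kind == "sa0":
        return 0
    if kind == "sa1":
        return 1
    if kind == "invert":
        return value ^ 1
    raise ValueError(f"unknown fault kind: {kind}")


def ripple_add(a, b, cin, fault=None):
    """4-bit ripple-carry adder; fault = (carry_node_index, kind) or None."""
    carry, total = cin, 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        total |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
        if fault is not None and fault[0] == i:
            carry = apply_fault(carry, fault[1])  # corrupt carry out of bit i
    return total, carry


def simulate(vectors, node=None, kind=None, at_cycle=None):
    """Apply one input vector per cycle and collect (sum, carry-out) pairs."""
    results = []
    for cycle, (a, b, cin) in enumerate(vectors):
        permanent = kind in ("sa0", "sa1")                # whole simulation
        transient = kind == "invert" and cycle == at_cycle  # one cycle only
        fault = (node, kind) if (permanent or transient) else None
        results.append(ripple_add(a, b, cin, fault))
    return results


vectors = [(0b0011, 0b0001, 0)] * 3
golden = simulate(vectors)
stuck = simulate(vectors, node=0, kind="sa0")
flipped = simulate(vectors, node=0, kind="invert", at_cycle=1)
masked = simulate(vectors, node=0, kind="sa1")
print(golden, stuck, flipped, masked)
```

With these inputs the stuck-at-1 fault on carry node 0 leaves the outputs unchanged, because the fault-free value of that node is already 1.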
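[Figure: a transistor-level description of a 4-bit adder (inputs A(3:0), B(3:0), Cin; outputs S(3:0), Cout) is simulated with a current-burst fault model to build a logic-level fault dictionary, for all nodes and for all input combinations. Each entry lists, for an input vector, the faulty output patterns on S and Cout (F marks a faulty bit) and their relative frequencies. Input 0000 0000 0: four patterns with frequencies 34%, 39%, 7%, 20%. Input 1111 1111 1: ---- F 23%; ---F 1%; --FF F 9%; -F-- 33%; -FFF 33%; FFFF F 1%.]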
Simulated Fault Injection at Chip Level

FOCUS & SPLICE!
– A chip-level simulation environment
– Acceleration: mixed-mode simulation, importance sampling
FOCUS Experimental Environment
[Figure: the target system description and the fault description (type of fault: transient/stuck-at; location/time) drive a mixed-mode hierarchical fault simulation with automatic fault injection and a fault tracing facility. The resulting trace feeds impact analysis: graphical analysis (visual identification of error propagation and manifestation) and statistical analysis, which produce design feedback.]
Simulated Fault Injection at Function Level

Diversity of Components
– Object-oriented approach

Fault Models
– Various types - depending on the type of components
– Examples
• Single bit-flip for a memory or register fault
• Message corruption for communication channel fault
• Service interrupt for a node fault
– More detailed fault models derived from lower-level simulation

Impact of Software
– Impact of faults is application dependent
– Software effect can be studied at this level
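
The three example fault models map naturally onto an object-oriented simulation, one class per component type. This is a minimal sketch with hypothetical classes (`Memory`, `Channel`, `Node`); it is not taken from any particular simulator.

```python
class Memory:
    """Memory or register file: fault model is a single bit-flip."""

    def __init__(self, size):
        self.cells = bytearray(size)

    def inject_bit_flip(self, addr, bit):
        self.cells[addr] ^= 1 << bit


class Channel:
    """Communication channel: fault model is message corruption."""

    def __init__(self):
        self.corrupt_next = False

    def send(self, payload: bytes) -> bytes:
        if self.corrupt_next and payload:
            self.corrupt_next = False
            # Invert the first byte of the next message only.
            return bytes([payload[0] ^ 0xFF]) + payload[1:]
        return payload


class Node:
    """Processing node: fault model is a service interrupt."""

    def __init__(self):
        self.up = True

    def inject_service_interrupt(self):
        self.up = False

    def handle(self, request):
        if not self.up:
            raise ConnectionError("node unavailable")
        return f"ok:{request}"


memory = Memory(4)
memory.inject_bit_flip(addr=2, bit=3)

channel = Channel()
channel.corrupt_next = True
delivered = channel.send(b"\x10hello")

node = Node()
node.inject_service_interrupt()
```

Because each component owns its fault model, the workload running on top of the simulation can be left unchanged while faults are varied.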
Hierarchical Fault Simulation (Reference [1])
Example: LANai Processor of Myrinet Network Switch
[Figure: hierarchical simulation of a Myrinet network. System-level simulation: a local network of Host 1 to Host 4 connected through a switch, with the Myrinet Control Program as software. Chip-level simulation: host interface with 256K memory, DMA, a custom processor (LANai), and other details. Logic-level simulation: LANai registers, control, ALU, and adder. Electrical-level and transistor-level simulation: fault model applied to a circuit between Vdd and GND (nodes A, B, AB). Device-physics-level simulation: alpha particle hits on the device (SiO2, n+, p+, p, and n regions, n+ channel stop).]
Prototype Phase: Software Implemented Fault Injection

Advantages: flexibility, low cost

Disadvantages: perturbation to workload, low time resolution

Targets for software fault injection

Software faults and errors
– modify the text/data segment of the program

Memory faults
– flip memory bits

CPU faults
– modify CPU registers, cache, buffers

Bus faults
– use traps before and after an instruction to change the code or data used by the instruction and then restore them after the instruction is executed

Network faults
– modify or delete transmitted messages
– introduce faults in network controllers, drivers, buffers
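
The trap-based emulation of a bus fault (corrupt the data an instruction uses, then restore it afterwards) can be sketched with a context manager. This is only an analogy under stated assumptions: a `bytearray` stands in for memory, the `with` body stands in for the trapped instruction, and `transient_fault` is a hypothetical helper rather than part of any SWIFI tool.

```python
from contextlib import contextmanager


@contextmanager
def transient_fault(memory, addr, bit):
    """Flip one bit for the duration of the block, then restore it."""
    original = memory[addr]
    memory[addr] = original ^ (1 << bit)  # "trap before": corrupt the operand
    try:
        yield
    finally:
        memory[addr] = original           # "trap after": restore the operand


memory = bytearray([5, 7])
with transient_fault(memory, addr=0, bit=1):
    # The "instruction" sees the corrupted operand, as if the bus had
    # delivered a wrong value.
    result = memory[0] + memory[1]

print(result, list(memory))
```

The restore step unconditionally writes back the saved value, so this sketch is only valid when the trapped operation does not itself write to the faulted location.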
Difficulties in using SWIFI Tools

Often fault injection is needed to answer the following
questions
– Comparative studies – is my app more reliable on Android or
iOS?
– Dependability benchmarking – rank the various flavors of Linux
by their reliability
– System-level evaluation of large systems – comprising
heterogeneous components
– Special-purpose embedded systems – which have resource
constraints


To answer these questions, the SWIFI tool needs to be
ported to different platforms
Porting effort is high
NFTAPE (Reference [3])
NFTAPE is a tool for conducting automated
fault/error injection-based dependability
characterization

A tool that enables a user to (1) specify a fault/error injection plan, (2) carry out injection experiments, and (3) collect the experimental results for analysis.

Facilitates automated execution of fault/error injection
experiments.

Targets assessment of a broad set of dependability metrics, e.g.,
availability, reliability, coverage, mean time to failure.

Operates in a distributed environment.

Imposes minimal disturbance on target systems.
NFTAPE Approach to Fault/Error Injection

Lightweight Fault/Error Injectors (LWFIs)
– Small (in terms of code size) entities responsible for injecting faults/errors
– Rely on other processes for services such as fault triggering, data logging, etc.
– Example LWFIs: driver-based, target-specific

API for implementing and invoking Triggers, LWFIs, and support
programs
– Less work to create components necessary for conducting fault/error injection
experiments
– Components conform to standard interface
– Components can be reconfigured using the standard interface (e.g., swap triggers)

Reusability
– LWFIs and triggers collected in a library
– Same configuration, startup, and logging process for different fault/error injection
campaigns
– One learning curve
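
The idea of triggers and lightweight injectors sharing a standard interface can be sketched as follows. None of these names come from NFTAPE itself; `Trigger`, `Injector`, and `run_experiment` are hypothetical, and the sketch only illustrates how a trigger can be swapped without touching the injector.

```python
from abc import ABC, abstractmethod


class Trigger(ABC):
    """Decides when to inject; the injector relies on it for triggering."""

    @abstractmethod
    def fired(self, step: int) -> bool: ...


class TimeTrigger(Trigger):
    def __init__(self, at):
        self.at = at

    def fired(self, step):
        return step == self.at


class PeriodicTrigger(Trigger):
    def __init__(self, period):
        self.period = period

    def fired(self, step):
        return step > 0 and step % self.period == 0


class Injector(ABC):
    """Lightweight injector: only knows how to inject, nothing else."""

    @abstractmethod
    def inject(self, target: bytearray) -> None: ...


class BitFlipInjector(Injector):
    def __init__(self, addr, bit):
        self.addr, self.bit = addr, bit

    def inject(self, target):
        target[self.addr] ^= 1 << self.bit


def run_experiment(trigger, injector, target, steps, log):
    """Common driver: same startup and logging for any trigger/injector pair."""
    for step in range(steps):
        if trigger.fired(step):
            injector.inject(target)
            log.append((step, bytes(target)))
    return log


one_shot = run_experiment(TimeTrigger(3), BitFlipInjector(0, 0),
                          bytearray(2), steps=5, log=[])
periodic = run_experiment(PeriodicTrigger(2), BitFlipInjector(0, 0),
                          bytearray(2), steps=6, log=[])
print(one_shot, periodic)
```

Only the first argument changes between the two runs, which is the reuse property the slide describes: one driver, one logging path, interchangeable components.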
NFTAPE Architecture
[Figure: a Control Host, driven by a Campaign Script and writing a Log, connects over a LAN to the error injection targets. Each target node runs a Process Manager that manages an Injector Process and an Application Process.]
Control Host & Process Manager

Control Host
– Processes a Campaign Script, a file that specifies a state machine or control flow followed by the control host during the fault injection campaign
– Simple yet general way to customize a fault injection experiment
– Experiments controlled by the Common Control Mechanism
– Implemented in Java to ensure portability

Process Manager
– Daemon on each target node to manage processes on the target node(s), including process execution and termination
• processes include: injectors, workloads, applications, monitors
• all processes are treated the same, as an abstract process object rather than a process of some specific type
– Facilitates communication between the Control Host and Target Nodes
Fault Injectors - examples






Driver-based injector – uses a dedicated device driver to inject faults into memory, registers, OS functions, and I/O devices (e.g., Linux, Solaris)

Ptrace-based injector (e.g., Linux, Lynx) – controlled injection into the target process memory and registers

Proc file based injector (e.g., Solaris) – controlled injection into the memory image of a process

Network injector – employs dedicated software for controlling network hardware to inject faults into network cards/controllers; can intercept and corrupt messages

Use of performance monitors (built into the CPU) to trigger fault injection

Injection of delay faults/errors – controlled, temporary suspension of interrupts
State Machine of an Example Campaign Script
[Figure: state machine of an example campaign script. States and their actions: ST_init (initialization of variables and events; start a logfile), ST_start_fi_trig (start the Fault Injector and Trigger processes), ST_run_app (start application; activate trigger), ST_Trigger_ON (inject errors; deactivate the trigger after a specified time), ST_error1, ST_error2, and ST_finish (terminate processes; exit Control Host). Transitions are labeled TRUE, Condition_1, Condition_2, Condition_3, ER_condition_1, and ER_condition_2.]
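
A campaign script of this kind can be sketched as a table-driven state machine. The state and condition names below are the ones that appear in the figure, but the wiring between them is an assumption made for illustration (the exact edges are not recoverable here), and `run_campaign` is a hypothetical driver, not the NFTAPE control host.

```python
# ASSUMED wiring: which condition leads from which state to which.
TRANSITIONS = {
    "ST_init": [("TRUE", "ST_start_fi_trig")],
    "ST_start_fi_trig": [("Condition_1", "ST_run_app"),
                         ("ER_condition_1", "ST_error1")],
    "ST_run_app": [("Condition_2", "ST_Trigger_ON"),
                   ("ER_condition_2", "ST_error2")],
    "ST_Trigger_ON": [("Condition_3", "ST_finish")],
    "ST_error1": [("TRUE", "ST_finish")],
    "ST_error2": [("TRUE", "ST_finish")],
}


def run_campaign(events):
    """Walk the state machine; `events` is the set of conditions that hold.

    The label TRUE is treated as always enabled.
    """
    state = "ST_init"
    trace = [state]
    while state != "ST_finish":
        for condition, next_state in TRANSITIONS[state]:
            if condition == "TRUE" or condition in events:
                state = next_state
                break
        else:
            raise RuntimeError(f"no enabled transition out of {state}")
        trace.append(state)
    return trace


normal_run = run_campaign({"Condition_1", "Condition_2", "Condition_3"})
failed_start = run_campaign({"ER_condition_1"})
print(normal_run)
print(failed_start)
```

Keeping the transitions in a data table rather than in code is what makes the experiment easy to customize: a different campaign is a different table.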
Hardware Fault Injector for
Inducing Network Faults


A versatile device for supporting injection of random and specific faults in a network environment.

Design based on an FPGA as the core structural component, which allows the device to be programmed to:
– accept configuration commands generated either internally (i.e., by the device itself) or by an external system
– provide other services, e.g., data monitoring, statistics gathering, or prototyping new designs

Design enables precise synchronization of the fault injection hardware with the target systems (Myrinet network and Fibre Channel) while running at the speed of the network.

Fault injector is inserted in the data path (i.e., in the network) to decode the data patterns, modify the data if necessary (i.e., inject faults), and then retransmit the data (i.e., send the corrupted/modified data to the network).
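
The decode, modify, retransmit path can be mimicked in software to show the logic. This is a minimal sketch, assuming frames are byte strings and the injection rule is "replace a matching pattern with an equal-length corruption"; `inline_injector` is a hypothetical name, and the real device does this in FPGA logic at network speed.

```python
def inline_injector(frames, pattern: bytes, replacement: bytes):
    """Sit in the data path: pass frames through, corrupting matching ones."""
    if len(pattern) != len(replacement):
        raise ValueError("replacement must keep the frame length unchanged")
    for frame in frames:
        index = frame.find(pattern)          # decode: look for the data pattern
        if index >= 0:                       # modify: inject the fault
            frame = (frame[:index] + replacement
                     + frame[index + len(pattern):])
        yield frame                          # retransmit toward the network


traffic = [b"HDR\x01data", b"HDR\x02data"]
received = list(inline_injector(traffic, pattern=b"\x01", replacement=b"\xff"))
print(received)
```

Keeping the frame length unchanged mirrors the constraint that an in-line injector must not disturb the timing or framing of the link it sits on.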
Capabilities of HW Fault Injection Techniques
Summary

Injection method | Synchronization with the system activity | Injection at the system speed | Targeting faults to specific locations | Guarantees of fault activation | Example tools and references
Pin-level at the signal level; no signal partitioning | Speculative | Low speed | Yes | Yes | Messaline [Arl89], Hybrid (signal-level & SWIFI) [Kan95]
Pin-level by signal interception; signal partition by inserting the injection logic into the path of the signal | Full | Yes (if the delay induced by the inserted logic does not exceed that allowed by the system) | Yes | Yes | Messaline [Arl89], Rifle [Mad94]
Heavy ion radiation | Speculative | Yes | No | No | [Kar94]
Power supply disturbances | Speculative | Yes | No | No | [Mir95]
Electromagnetic interferences | Speculative | Yes | No | No | [Kar97]
Laser (LFI) | Speculative | Yes | Yes | No | [Sam97]
Built-in logic | Full | No | Yes | Yes | FIMBUL [Fol98]
Conclusion




Fault injection is a means to effectively test and stress fault tolerance mechanisms before they are deployed in the field on an actual system.

Fault injection can be done through hardware or software.

Software-implemented fault injection (SWIFI) is the dominant mode of system verification, especially for complex software systems.

It is common to compare multiple systems (or versions of the same system) using dependability attributes collected from a fault injection campaign.
References
1. D. T. Stott, G. Ries, M. C. Hsueh, and R. K. Iyer, "Dependability analysis of a high-speed network using software-implemented fault injection and simulated fault injection," IEEE Transactions on Computers, vol. 47, no. 1, 1998, pp. 108-119.
2. Z. Kalbarczyk, R. K. Iyer, G. L. Ries, J. U. Patel, M. S. Lee, and Y. Xiao, "Hierarchical Simulation Approach to Accurate Fault Modeling for System Dependability Evaluation," IEEE Transactions on Software Engineering, vol. 25, no. 5, September/October 1999, pp. 619-632.
3. D. T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R. K. Iyer, "NFTAPE: a framework for assessing dependability in distributed systems with lightweight fault injectors," Proceedings of the IEEE International Computer Performance and Dependability Symposium (IPDS 2000), 2000, pp. 91-100.