Class Syllabus for CS 001

Data Diverse Software Fault Tolerance Techniques
5.1 Retry Blocks
• The basic RtB technique is one of the two original data diverse software fault tolerance techniques.
• The hardware fault tolerance architecture related to the RtB technique is stand-by sparing, or passive dynamic redundancy.
• The RtB technique is the data diverse complement of the recovery block (RcB) scheme.
• The RtB technique uses acceptance tests (AT) and backward recovery to accomplish fault tolerance.
• The technique typically uses one data re-expression algorithm (DRA) and one algorithm.
• A watchdog timer (WDT) is also used and triggers execution of a backup algorithm if the original algorithm does not produce an acceptable result within a specified period of time.
• The algorithm is executed using the original system input.
• The primary algorithm's results are examined by an AT that has the same form and purpose as the AT used in the RcB.
• If the algorithm results pass the AT, then the RtB is complete.
• However, if the results are not acceptable, then the input is re-expressed and the same primary algorithm runs again using the new, re-expressed input data.
• This continues until the AT finds an acceptable result or the WDT deadline is violated.
• If the deadline expires, a backup algorithm may be invoked to execute on the original input data.
5.1.1 Retry Block Operation (RtB)
The RtB technique consists of an executive, an AT, a DRA, a WDT, and primary and backup algorithms.
The executive orchestrates the operation of the RtB, which has the general syntax:
    ensure      Acceptance Test
    by          Primary Algorithm(Original Input)
    else by     Primary Algorithm(Re-expressed Input)
    else by     Primary Algorithm(Re-expressed Input)
    …
    [Deadline Expires]
    else by     Backup Algorithm(Original Input)
    else failure exception
Figure 5.1 illustrates the structure and operation of the basic RtB with a WDT. We examine several scenarios to describe RtB operation:
• Failure-free operation;
• Exception in primary algorithm execution;
• Primary's results are on time, but fail AT; successful execution with re-expressed input;
• All DRA options are used without success; successful backup execution;
• All DRA options are used without success; backup executes, but fails the AT.
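The general syntax and the scenarios listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the executive from the source: the names retry_block, RetryBlockFailure, sinc_primary, sinc_at and nudge are hypothetical, the DRA options are passed as a list of functions, and the WDT is approximated by checking a deadline between attempts rather than by preempting a running algorithm.

```python
import math
import time


class RetryBlockFailure(Exception):
    """Raised when no attempt, primary or backup, passes the AT."""


def retry_block(x, primary, acceptance_test, dras, backup=None, deadline_s=1.0):
    """Run primary on x, then on re-expressed x, until the AT accepts a result."""
    start = time.monotonic()
    # The first attempt uses the original input; later ones use DRA output.
    attempts = [lambda v: v, *dras]
    for reexpress in attempts:
        if time.monotonic() - start > deadline_s:
            break  # assumption: simplified stand-in for the WDT deadline
        try:
            result = primary(reexpress(x))
        except Exception:
            continue  # an exception in the primary counts as a failed attempt
        if acceptance_test(result):
            return result
    if backup is not None:
        result = backup(x)  # the backup executes on the ORIGINAL input
        if acceptance_test(result):
            return result
    raise RetryBlockFailure("no acceptable result from primary or backup")


# Hypothetical primary with a failure region at x == 0 (division by zero).
def sinc_primary(x):
    return math.sin(x) / x


def sinc_at(result):
    # Hypothetical AT: sin(x)/x never exceeds 1 in magnitude.
    return abs(result) <= 1.0 + 1e-12


def nudge(x):
    # Hypothetical approximate DRA: move the input slightly off the failure point.
    return x + 1e-9


value = retry_block(0.0, sinc_primary, sinc_at, [nudge])
```

With the original input 0.0 the primary raises an exception, so the executive retries with the re-expressed input and the AT accepts that result. Passing backup=... covers the two scenarios in which all DRA options are exhausted.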
Retry Block Example
2.3.1 Overview of Data Re-Expression
Data re-expression is used to obtain alternate (or diverse) input data by generating logically equivalent input data sets. Given initial data within the
program failure region, the re-expressed input data should exist outside that
failure region. A re-expression algorithm, R, transforms the original input x
to produce the new input, y = R(x). The input y may either approximate x or
contain x's information in a different form. The program, P, and R determine the relationship between P(x) and P(y). Figure 2.3 illustrates basic data
re-expression. The requirements for the DRA can be derived from characteristics of the outputs.
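As a small illustration of y = R(x), the sketch below shows one re-expression that keeps x's information in a different form and one that approximates x. The program P (a mean of a list) and the function names program, reexpress_exact and reexpress_approximate are assumptions made for this example only; they do not come from the source.

```python
import math


def program(values):
    """P: hypothetical program under protection; here, the mean of a list."""
    return math.fsum(values) / len(values)


def reexpress_exact(values):
    """R, same information in a different form: P(y) equals P(x)."""
    return list(reversed(values))


def reexpress_approximate(values, eps=1e-6):
    """R, approximating x: P(y) is close to, but not equal to, P(x)."""
    return [v + eps for v in values]


x = [1.0, 2.5, 4.0]
p_x = program(x)
p_exact = program(reexpress_exact(x))
p_approx = program(reexpress_approximate(x))
```

Here the relationship between P(x) and P(y) follows from P and R together: reordering leaves a mean unchanged, while shifting every element by eps shifts the mean by about eps.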
Other re-expression structures exist. Re-expression with postexecution
adjustment (Figure 2.4) allows the DRA to produce more diverse inputs than
those produced using the basic structure. A correction, A, is performed on
P(y) to undo the distortion produced by the re-expression algorithm, R.
If the distortion induced by R can be removed after execution, then this
approach allows major changes to the inputs and allows copies of the program to operate in widely separated regions of the input space [45].
[Figure 2.3 Basic data re-expression method: x is executed by P to give P(x); x is also re-expressed as y = R(x) and executed by P to give P(y). (Source: [45], © 1988, IEEE. Reprinted with permission.)]
In another approach, data re-expression via decomposition and recombination (Figure 2.5), an input x is decomposed into a related set of inputs
and the program is then run on each of these related inputs. The results are
then recombined. Basic data re-expression and re-expression with postexecution adjustment allow for both exact and approximate DRAs (defined in the
following section). New data re-expression methods may be developed by variation
on the basic method or by entirely new methods and algorithms.
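The two alternative structures described above can be sketched as follows. This is an illustrative example under assumptions: the program P is taken to be a sine routine, and the names program, with_postexecution_adjustment and with_decomposition are hypothetical. The adjustment A relies on sin(x + π) = −sin(x); the recombination relies on sin(a + b) = sin a cos b + cos a sin b, with cos t obtained from the same program as sin(π/2 − t).

```python
import math


def program(x):
    """P: hypothetical program under protection; here, sine."""
    return math.sin(x)


def with_postexecution_adjustment(x):
    """Re-express, execute, then undo the distortion introduced by R."""
    y = x + math.pi          # R: move the input to a distant region
    return -program(y)       # A: sin(x + pi) = -sin(x), so negate P(y)


def with_decomposition(x):
    """Decompose x into related inputs, run P on each, recombine."""
    a = x / 2
    b = x - a                # x = a + b
    half_pi = math.pi / 2
    sin_a, sin_b = program(a), program(b)
    cos_a, cos_b = program(half_pi - a), program(half_pi - b)
    return sin_a * cos_b + cos_a * sin_b


x = 1.0
direct = program(x)
adjusted = with_postexecution_adjustment(x)
recombined = with_decomposition(x)
```

In both cases P never executes on x itself, so a fault confined to a region around x need not be triggered, while the final value still corresponds to P(x) up to rounding.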
5.3 Two-Pass Adjudicators
The two-pass adjudicator (TPA) technique, developed by Pullum [7.9], is a set of combination data and design diverse software fault tolerance techniques.
TPA is also a combination static and dynamic technique based on the recovery technique required.
The hardware fault tolerance architecture related to the technique is N-modular redundancy.
The processes can run concurrently on different computers or sequentially on a single computer, but are designed to run concurrently.
The TPA technique uses a decision mechanism (DM) and both forward and backward recovery to accomplish fault tolerance.
The technique uses one or more DRAs and at least two variants of a program.
The system operates like N-version programming (NVP) unless and until the DM cannot determine a correct result given the variant results.
If this occurs, then the inputs are run through the DRA(s) to be re-expressed.
The variants re-execute using the re-expressed data as input (each input is different, one of which may be the original input value).
A DM examines the results of the variant executions of this second pass and selects the best result, if one exists.
There are a number of alternative detection and selection mechanisms available for use with TPA. These are discussed in Section 5.3.1.
5.3.1 Two-Pass Adjudicator Operation
The basic TPA technique consists of an executive, 1 to n DRAs, n variants of the program or function, and a DM. The executive orchestrates the TPA technique operation, which has the general syntax:
    Pass 1: run Variant 1(original input),
                Variant 2(original input), …,
                Variant n(original input)
            if (Decision Mechanism(Result(Pass 1, Variant 1),
                                   Result(Pass 1, Variant 2), …,
                                   Result(Pass 1, Variant n)))
                return Result
            else
    Pass 2:     run DRA 1, DRA 2, …, DRA n
                run Variant 1(result of DRA 1),
                    Variant 2(result of DRA 2), …,
                    Variant n(result of DRA n)
                if (Decision Mechanism(Result(Pass 2, Variant 1),
                                       Result(Pass 2, Variant 2), …,
                                       Result(Pass 2, Variant n)))
                    return Result
                else failure exception
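The two-pass syntax above can be sketched in Python as shown below. This is a simplified illustration, not the TPA executive from the source: the DM is replaced by a plain strict-majority vote, the variants run sequentially, and every name (two_pass_adjudicator, majority, TPAFailure, the three variants with seeded faults, and the three DRAs) is hypothetical.

```python
from collections import Counter


class TPAFailure(Exception):
    """Raised when the DM finds no result on either pass."""


def majority(results):
    """Stand-in DM: the strict-majority result, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if 2 * count > len(results) else None


def two_pass_adjudicator(x, variants, dras, decide=majority):
    # Pass 1: every variant runs on the original input.
    chosen = decide([variant(x) for variant in variants])
    if chosen is not None:  # assumption: None is never a valid result
        return chosen
    # Pass 2: variant i runs on the output of DRA i.
    chosen = decide([variant(dra(x)) for variant, dra in zip(variants, dras)])
    if chosen is not None:
        return chosen
    raise TPAFailure("no decision on either pass")


# Three hypothetical variants of "sum a sequence", two with seeded faults.
def variant_a(values):
    return sum(values)


def variant_b(values):
    return -1 if values[0] == 0 else sum(values)  # fault: leading zero


def variant_c(values):
    total, prev = 0, None
    for v in values:
        if prev is not None and v < prev:
            break  # fault: stops at the first descending element
        total, prev = total + v, v
    return total


# Exact DRAs for a sum: reordering does not change the correct answer.
dras = [lambda v: tuple(v), lambda v: tuple(reversed(v)), lambda v: tuple(sorted(v))]
variants = [variant_a, variant_b, variant_c]

answer = two_pass_adjudicator((0, 7, 4), variants, dras)
```

On the input (0, 7, 4) the first pass yields three different results, so the DM cannot decide. On the second pass each variant receives a differently re-expressed input (one of which is the original) and all three agree.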
Operations
1 Failure-Free Operation.
2 Partial Failure Scenario: Incorrect Results on First Pass.
3 Failure Scenario: Incorrect Results on Both Passes.
5.3.3 Two-Pass Adjudicator Example
Suppose we have a network of cities, all of which must be visited, starting at city node 1. The network of airline flights and flight durations is shown in Figure 5.8.
No revisiting of cities is allowed. The objective is to determine the route (or routes) that satisfies the requirements in the least amount of time.
Table 5.5 provides the list of all possible routes meeting the requirements.
Suppose a program is implemented using TPA solution category I to solve this type of problem. Let the inputs (the flight network information) defining the specific problem be those shown in Figure 5.8.
The output of the variant operation is the route letter or the list of cities.
If all variants are operating correctly, a majority vote will yield either route A or route B as the correct answer (because, in this example, there are two correct answers and three correctly operating variants for this input domain).
Consider the case in which one of the variants fails for this input domain and the resulting decision vector is (A, B, C). A majority voter would raise an exception in this case, even though two of the answers are correct.
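The voting behavior described in this example can be checked with a short sketch. It assumes a plain strict-majority voter over route letters; the names majority_vote and NoMajority are hypothetical, and the flight network of Figure 5.8 is not modeled.

```python
from collections import Counter
from itertools import product


class NoMajority(Exception):
    """Raised when no single answer holds a strict majority."""


def majority_vote(decision_vector):
    """Return the answer given by more than half of the variants."""
    value, count = Counter(decision_vector).most_common(1)[0]
    if 2 * count > len(decision_vector):
        return value
    raise NoMajority(f"no majority in {decision_vector!r}")


# Three correct variants, each free to answer A or B: a majority always exists,
# because at least two of the three answers must coincide.
all_correct = [majority_vote(v) for v in product("AB", repeat=3)]

# One failed variant answering C can leave the vector with no majority.
try:
    outcome = majority_vote(("A", "B", "C"))
except NoMajority:
    outcome = None
```

The decision vector (A, B, C) contains two correct answers, yet a majority voter cannot pick either of them; this is the case the text attributes to a majority voter raising an exception.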