The Complexity of Adding Failsafe Fault-Tolerance

Sandeep S. Kulkarni
Ali Ebnenasir
Motivation
• Why automatic addition of fault-tolerance?
• Why begin with a fault-intolerant program?
  • Reuse of the fault-intolerant program
  • Separation of concerns (functionality vs. fault-tolerance)
  • Potential to preserve properties such as efficiency
• One obstacle
  • Adding masking fault-tolerance to distributed programs is NP-hard [FTRTFT 2000]
Motivation (Continued)
Approaches for dealing with the complexity:
• Heuristics [SRDS 2001]
• Weaker forms of tolerance
  • Failsafe: only safety is guaranteed in the presence of faults
  • Nonmasking: safety may be temporarily violated
• Restricting the input
  • Programs
  • Specifications
Motivation (Continued)
Why Failsafe Fault-Tolerance?
• Simplify the design of masking fault-tolerance
• Partial automation of masking fault-tolerance (using TSE'98)
[Figure: intolerant program, nonmasking fault-tolerant program, failsafe fault-tolerant program, and masking fault-tolerant program, with two of the transformations labeled "Automate"]
Outline of the Talk
• Problem of adding fault-tolerance
• Difficulties caused by distribution
• Complexity of failsafe fault-tolerance
• Class of programs and specifications for which polynomial synthesis is possible
Basic Concepts: Programs and Faults
• State space Sp
• Program transitions δp, fault transitions δf
• Invariant S, fault-span T
• Specification spec: safety is specified as a set of transitions (sj, sk) that must not be executed
[Figure: invariant S inside fault-span T, with transitions labeled p, f, and p/f]
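To make these definitions concrete, here is a minimal Python sketch, assuming states are hashable values and δp, δf are finite sets of (s, s') pairs. The function name `fault_span` and the toy state space are hypothetical illustrations, not the paper's code. It computes the smallest fault-span T for an invariant S: every state reachable from S using program and fault transitions. Any larger set closed under δp ∪ δf would also serve as a fault-span.

```python
from collections import deque

def fault_span(invariant, delta_p, delta_f):
    """Smallest set T that contains the invariant and is closed under
    program and fault transitions (forward reachability from S)."""
    succ = {}
    for s, t in delta_p | delta_f:
        succ.setdefault(s, set()).add(t)
    span = set(invariant)
    frontier = deque(invariant)
    while frontier:
        s = frontier.popleft()
        for t in succ.get(s, ()):
            if t not in span:
                span.add(t)
                frontier.append(t)
    return span

# Hypothetical toy program: states 0..4, invariant S = {0, 1}.
S = {0, 1}
delta_p = {(0, 1), (1, 0), (2, 0)}   # program cycles inside S and recovers from 2
delta_f = {(1, 2), (2, 3)}           # faults perturb the program out of S

T = fault_span(S, delta_p, delta_f)
print(sorted(T))  # [0, 1, 2, 3]
```

State 4 is outside T because no program or fault transition reaches it from S.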
Problem Statement
• Inputs: program p, invariant S, faults f, specification spec
• Outputs: program p', invariant S'
• Requirements: only fault-tolerance is added; no new functional behavior is added
[Figure: invariant of the fault-intolerant program and invariant of the fault-tolerant program, annotated "No new transition here" and "New transitions may be added here"]
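The "no new functional behavior" requirement can be checked mechanically. This is a minimal sketch, assuming the formalization used in the authors' earlier work (an assumption here, since the slide states it only informally): the new invariant satisfies S' ⊆ S, and every transition of p' that starts in S' is already a transition of p. Transitions that start outside S' are unconstrained. The names `restrict` and `respects_problem_statement` are hypothetical, and the check deliberately omits the remaining requirement that p' actually be failsafe f-tolerant from S'.

```python
def restrict(delta, states):
    """Transitions of delta whose source state lies in `states`."""
    return {(s, t) for (s, t) in delta if s in states}

def respects_problem_statement(S, delta_p, S_new, delta_p_new):
    """Check S' ⊆ S and p'|S' ⊆ p|S' (no new behavior inside the new invariant)."""
    return S_new <= S and restrict(delta_p_new, S_new) <= restrict(delta_p, S_new)

S = {0, 1, 2}
delta_p = {(0, 1), (1, 2), (2, 0)}

# A new transition starting outside the new invariant is allowed.
print(respects_problem_statement(S, delta_p, {0, 1, 2}, delta_p | {(3, 0)}))  # True
# A new transition starting inside the new invariant adds functional behavior.
print(respects_problem_statement(S, delta_p, {0, 1, 2}, delta_p | {(0, 2)}))  # False
```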
Difficulties with Distribution
Read/write restrictions:
• Two Boolean variables a and b
• The process cannot read b
• Can we include the transition (a=0, b=0) → (a=1, b=0)?
• Only if we also include the transition (a=0, b=1) → (a=1, b=1)
Groups of transitions (instead of individual transitions) must be chosen.
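The grouping can be sketched in Python, assuming a process's transition is written only in terms of the variables it can read and that unreadable variables keep their value. The transition then stands for one concrete transition per valuation of the unreadable variables. The helper `transition_group` and the dict-based state encoding are illustrative assumptions.

```python
from itertools import product

def transition_group(src, dst, unreadable):
    """Expand a transition over readable variables into its group.

    src, dst: dicts giving the readable variables before/after.
    unreadable: dict mapping each unreadable variable to its domain.
    Unreadable variables are left unchanged by every transition in the group.
    """
    names = sorted(unreadable)
    group = []
    for values in product(*(unreadable[n] for n in names)):
        fixed = dict(zip(names, values))
        group.append(({**src, **fixed}, {**dst, **fixed}))
    return group

# The slide's example: the process reads a but cannot read b.
for s0, s1 in transition_group({"a": 0}, {"a": 1}, {"b": (0, 1)}):
    print(s0, "->", s1)
# {'a': 0, 'b': 0} -> {'a': 1, 'b': 0}
# {'a': 0, 'b': 1} -> {'a': 1, 'b': 1}
```

Choosing one transition of the group forces the synthesis to accept all of them, which is why the distributed problem is harder than picking transitions individually.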
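Reduction from 3-SAT
[Figure: construction for the reduction. For variable x0 (with states labeled a0 and an = a0), one transition is included iff x0 is false and another iff x0 is true. For clause cj = xj ∨ ¬xk ∨ xl, the corresponding transitions are included iff xj is false, iff xk is true, and iff xl is false, respectively.]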
Dealing with the Complexity of Adding Failsafe Fault-tolerance
For what class of problems can failsafe fault-tolerance be added in polynomial time?
• Restrictions on
  • Fault-tolerant programs
  • Specifications
  • Faults
• Our approach for restrictions:
  • In the absence of faults, preserve all computations of the fault-intolerant program
Restrictions on Programs and Specifications
Monotonicity requirements:
• Capture the notion that safe assumptions can be made about variables that cannot be read
• Focus on specifications and transitions of fault-intolerant programs
Monotonicity of Specifications
Definition: A specification spec is positive monotonic with respect to a Boolean variable x iff, for every s0, s1, s'0, s'1:
If
  • x = false in s0 and s1, and x = true in s'0 and s'1,
  • the values of all other variables in s0 and s'0 are the same,
  • the values of all other variables in s1 and s'1 are the same, and
  • the transition (s0, s1) does not violate safety,
then the transition (s'0, s'1) does not violate safety.
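The definition can be checked by brute force on a small state space. In this sketch the safety specification is a hypothetical set `bad` of forbidden transitions over two Boolean variables x and y; `VARS`, `with_value`, and `spec_monotonic` are illustrative names. The slide defines only positive monotonicity; negative monotonicity is assumed here to be its mirror image (swap the roles of false and true).

```python
from itertools import product

VARS = ("x", "y")  # hypothetical Boolean variables; a state is a tuple (x, y)
STATES = list(product((False, True), repeat=len(VARS)))

def with_value(state, var, value):
    """Copy of `state` with variable `var` set to `value`."""
    i = VARS.index(var)
    return state[:i] + (value,) + state[i + 1:]

def spec_monotonic(bad, var, low, high):
    """True iff every transition (s0, s1) with var == low in both states that
    is not in `bad` stays outside `bad` when var is set to high in both.
    (low, high) = (False, True): positive monotonicity, as defined above.
    (low, high) = (True, False): negative monotonicity, assumed symmetric."""
    i = VARS.index(var)
    for s0, s1 in product(STATES, repeat=2):
        if s0[i] != low or s1[i] != low or (s0, s1) in bad:
            continue
        if (with_value(s0, var, high), with_value(s1, var, high)) in bad:
            return False
    return True

# Hypothetical spec: y must not change from false to true while x is false.
bad = {(s0, s1) for s0, s1 in product(STATES, repeat=2)
       if not s0[0] and not s1[0] and not s0[1] and s1[1]}

print(spec_monotonic(bad, "x", False, True))   # True: positive monotonic in x
print(spec_monotonic(bad, "x", True, False))   # False: not negative monotonic in x
```

Setting x to true only removes constraints in this spec, so a process that cannot read x may safely assume x is false.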
Monotonicity of Programs
Definition: Program p with invariant S is negative monotonic with respect to a Boolean variable x iff, for every s0, s1, s'0, s'1:
If
  • x = true in s0 and s1, and x = false in s'0 and s'1,
  • the values of all other variables in s0 and s'0 are the same,
  • the values of all other variables in s1 and s'1 are the same, and
  • (s0, s1) is a transition of p in the invariant S,
then (s'0, s'1) is also a transition of p.
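A matching brute-force sketch for programs, under the same hypothetical two-variable encoding. We read the figure as placing s0 and s1 inside the invariant, so only transitions with both endpoints in S are examined; positive monotonicity is assumed to be the mirror image. The programs `delta_p` and `delta_q` are made-up examples: p sets y to true regardless of x, while q does so only when x is true.

```python
from itertools import product

VARS = ("x", "y")  # hypothetical Boolean variables; a state is a tuple (x, y)
STATES = list(product((False, True), repeat=len(VARS)))

def with_value(state, var, value):
    """Copy of `state` with variable `var` set to `value`."""
    i = VARS.index(var)
    return state[:i] + (value,) + state[i + 1:]

def program_monotonic(delta_p, invariant, var, high, low):
    """True iff for every transition (s0, s1) of p with both states in the
    invariant and var == high in both, the transition obtained by setting
    var to low in both states is also a transition of p.
    (high, low) = (True, False): negative monotonicity, as defined above.
    (high, low) = (False, True): positive monotonicity, assumed symmetric."""
    i = VARS.index(var)
    for s0, s1 in delta_p:
        if s0 not in invariant or s1 not in invariant:
            continue
        if s0[i] != high or s1[i] != high:
            continue
        if (with_value(s0, var, low), with_value(s1, var, low)) not in delta_p:
            return False
    return True

# p: set y to true whatever x is.  q: set y to true only when x is true.
delta_p = {(s, with_value(s, "y", True)) for s in STATES if not s[1]}
delta_q = {(s, with_value(s, "y", True)) for s in STATES if s[0] and not s[1]}
S = set(STATES)

print(program_monotonic(delta_p, S, "x", True, False))  # True
print(program_monotonic(delta_q, S, "x", True, False))  # False
```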
Theorem
Adding failsafe fault-tolerance can be done in polynomial time if either:
  • the program is negative monotonic and the specification is positive monotonic, or
  • the program is positive monotonic and the specification is negative monotonic.
If only one of these requirements holds (only the program or only the specification is monotonic), adding failsafe fault-tolerance is still NP-hard.
For many problems, these requirements are easily met.
Example: Byzantine Agreement
Processes: general g, and three non-generals j, k, and l
Variables:
  • d.g : {0, 1}
  • d.j, d.k, d.l : {0, 1, ⊥}
  • b.g, b.j, b.k, b.l : {true, false}
  • f.g, f.j, f.k, f.l : {0, 1}
Fault-intolerant program transitions (shown for process j):
  • d.j = ⊥ ∧ f.j = 0 → d.j := d.g
  • d.j ≠ ⊥ ∧ f.j = 0 → f.j := 1
Fault transitions:
  • ¬b.g ∧ ¬b.j ∧ ¬b.k ∧ ¬b.l → b.j := true
  • b.j → d.j, f.j := 0|1, 0|1
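The guarded commands above can be exercised with a small Python simulation. This is a sketch assuming states are dicts keyed by the slide's variable names, with `None` standing for ⊥; the function names are hypothetical. The trace illustrates why the program is fault-intolerant: if the general becomes Byzantine and changes d.g between the reads of j and k, two non-Byzantine non-generals finalize with different decisions.

```python
BOT = None  # the undecided value ⊥

def program_step(s, p):
    """Fault-intolerant actions of non-general p; None if no action is enabled."""
    if s["d." + p] is BOT and s["f." + p] == 0:
        return {**s, "d." + p: s["d.g"]}          # d.p := d.g
    if s["d." + p] is not BOT and s["f." + p] == 0:
        return {**s, "f." + p: 1}                 # f.p := 1 (finalize)
    return None

def fault_become_byzantine(s, q):
    """If no process is Byzantine, process q may become Byzantine."""
    if not any(s["b." + r] for r in "gjkl"):
        return {**s, "b." + q: True}
    return None

def fault_byzantine_move(s, q, d, f):
    """A Byzantine process q may set d.q and f.q to arbitrary values."""
    if s["b." + q]:
        return {**s, "d." + q: d, "f." + q: f}
    return None

def agreement_violated(s):
    """Two non-Byzantine, finalized non-generals hold different decisions."""
    done = {s["d." + p] for p in "jkl" if not s["b." + p] and s["f." + p] == 1}
    return len(done) > 1

s = {"d.g": 0, "d.j": BOT, "d.k": BOT, "d.l": BOT,
     "b.g": False, "b.j": False, "b.k": False, "b.l": False,
     "f.g": 0, "f.j": 0, "f.k": 0, "f.l": 0}
s = fault_become_byzantine(s, "g")
s = program_step(s, "j")                 # j copies d.g = 0
s = program_step(s, "j")                 # j finalizes with 0
s = fault_byzantine_move(s, "g", 1, 0)   # Byzantine g changes d.g to 1
s = program_step(s, "k")                 # k copies d.g = 1
s = program_step(s, "k")                 # k finalizes with 1
print(s["d.j"], s["d.k"], agreement_violated(s))  # 0 1 True
```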
Example: Byzantine Agreement (Continued)
Safety specification:
  • Agreement: no two non-Byzantine non-generals can finalize with different decisions
  • Validity: if g is not Byzantine, no process can finalize with a decision different from that of g
Read/write restrictions:
  • Process j can read: b.j, d.j, f.j, d.g, d.k, d.l
  • Process j can write: d.j, f.j
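Under these restrictions, each transition of j is really a group of transitions, one for every valuation of the variables j cannot read (b.g, b.k, b.l, f.g, f.k, f.l), assuming, as in the earlier two-variable example, that unreadable variables are left unchanged. The sketch below enumerates that group for j's first fault-intolerant action listed earlier and rejects transitions that write variables other than d.j and f.j. `BA_DOMAINS`, `READABLE_J`, `WRITABLE_J`, and `group_for_j` are hypothetical names.

```python
from itertools import product

# Domains of the Byzantine agreement variables; None stands for ⊥.
BA_DOMAINS = {
    "d.g": (0, 1), "d.j": (0, 1, None), "d.k": (0, 1, None), "d.l": (0, 1, None),
    "b.g": (False, True), "b.j": (False, True),
    "b.k": (False, True), "b.l": (False, True),
    "f.g": (0, 1), "f.j": (0, 1), "f.k": (0, 1), "f.l": (0, 1),
}
READABLE_J = {"b.j", "d.j", "f.j", "d.g", "d.k", "d.l"}
WRITABLE_J = {"d.j", "f.j"}

def group_for_j(src, dst):
    """Expand a transition of j, given on its readable variables, into the
    group of concrete transitions it stands for. Unreadable variables take
    every possible value and are left unchanged."""
    if set(src) != READABLE_J or set(dst) != READABLE_J:
        raise ValueError("src and dst must assign exactly j's readable variables")
    if any(src[v] != dst[v] for v in READABLE_J - WRITABLE_J):
        raise ValueError("j may only write d.j and f.j")
    hidden = sorted(set(BA_DOMAINS) - READABLE_J)
    group = []
    for values in product(*(BA_DOMAINS[v] for v in hidden)):
        fixed = dict(zip(hidden, values))
        group.append(({**src, **fixed}, {**dst, **fixed}))
    return group

# j copies d.g = 1 into d.j.
src = {"b.j": False, "d.j": None, "f.j": 0, "d.g": 1, "d.k": None, "d.l": None}
dst = {**src, "d.j": 1}
group = group_for_j(src, dst)
print(len(group))  # 64, one per valuation of b.g, b.k, b.l, f.g, f.k, f.l
```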
Example: Byzantine Agreement (Continued)
Observation 1: The specification is positive monotonic with respect to b.j.
Observation 2: The program consisting of the transitions of j is negative monotonic with respect to b.k.
Observation 3: The specification is negative monotonic with respect to f.j.
Observation 4: The program consisting of the transitions of j is positive monotonic with respect to f.k.
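Observations 1 and 3 can be checked exhaustively on the variables listed earlier, under a simplifying assumption that is ours, not the slides': a transition violates safety exactly when its target state violates Agreement or Validity. With that assumption, monotonicity with respect to x reduces to checking that changing x in a safe state never makes it unsafe. `all_states`, `safe_state`, and `ba_spec_monotonic` are hypothetical names, and `None` stands for ⊥.

```python
from itertools import product

BOT = None
NON_GENERALS = ("j", "k", "l")

def all_states():
    """Every valuation of the slide's variables (None stands for ⊥)."""
    doms = {"d.g": (0, 1), "d.j": (0, 1, BOT), "d.k": (0, 1, BOT), "d.l": (0, 1, BOT),
            "b.g": (False, True), "b.j": (False, True),
            "b.k": (False, True), "b.l": (False, True),
            "f.g": (0, 1), "f.j": (0, 1), "f.k": (0, 1), "f.l": (0, 1)}
    names = list(doms)
    for values in product(*(doms[n] for n in names)):
        yield dict(zip(names, values))

def safe_state(s):
    """Agreement and Validity, read as predicates on a single state."""
    finalized = [p for p in NON_GENERALS if not s["b." + p] and s["f." + p] == 1]
    agreement = len({s["d." + p] for p in finalized}) <= 1
    validity = s["b.g"] or all(s["d." + p] == s["d.g"] for p in finalized)
    return agreement and validity

def ba_spec_monotonic(var, low, high):
    """Assuming a transition is bad iff its target state is unsafe: changing
    var from low to high never turns a safe state into an unsafe one."""
    return all(safe_state({**s, var: high})
               for s in all_states() if s[var] == low and safe_state(s))

print(ba_spec_monotonic("b.j", False, True))   # Observation 1: True
print(ba_spec_monotonic("f.j", 1, 0))          # Observation 3: True
print(ba_spec_monotonic("f.j", 0, 1))          # finalizing can break safety: False
```

Intuitively, marking j as Byzantine or un-finalizing j only removes j from the set of processes that Agreement and Validity constrain.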
Summary
• Complexity analysis for failsafe fault-tolerance
  • Reduction from 3-SAT
• Restrictions on specifications and programs for which polynomial synthesis is possible
  • Several problems fall in this category: Byzantine agreement, consensus, commit, …
  • Necessity of these restrictions
Future Work
• Simplifying the design of masking fault-tolerance using the two-step approach
• Refining the boundary between classes for which polynomial synthesis is possible and those for which exponential complexity is inevitable
• Using the monotonicity requirements to simplify the addition of masking fault-tolerance
Thank You
Questions?
Conclusion
• Specifying the boundary between problems where:
  • fault-tolerance addition can be done in polynomial time
  • exponential complexity is inevitable
  • Goal: what problems can benefit from automation?
• Necessity and sufficiency of monotonicity requirements
Future Work
• How can we change a non-monotonic program into a monotonic one by modifying its invariant?
• How can we strengthen a non-monotonic specification into a monotonic one?
• How can a nonmasking program be designed manually to satisfy the monotonicity requirements?
Basic Concepts: Fault-tolerant Program
Fault-tolerance in the presence of faults:
• Failsafe: satisfies its safety specification
• Nonmasking: satisfies its liveness specification (safety may be violated temporarily)
• Masking: satisfies both its safety and liveness specifications
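The failsafe condition can be checked on a finite model. This sketch assumes, hypothetically, that the safety specification is a set `bad` of forbidden transitions, and that p is failsafe from S iff no program or fault transition that starts in a state reachable from S (the fault-span) is in `bad`. Nonmasking and masking tolerance also involve liveness and are not covered by this check. `reachable` and `is_failsafe` are illustrative names.

```python
from collections import deque

def reachable(start, transitions):
    """States reachable from `start` using the given transitions."""
    succ = {}
    for s, t in transitions:
        succ.setdefault(s, set()).add(t)
    seen, todo = set(start), deque(start)
    while todo:
        s = todo.popleft()
        for t in succ.get(s, ()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

def is_failsafe(S, delta_p, delta_f, bad):
    """No forbidden transition can execute in any computation that starts
    in S, even when faults occur."""
    span = reachable(S, delta_p | delta_f)
    return not any(s in span and (s, t) in bad for s, t in delta_p | delta_f)

S = {0, 1}
delta_f = {(1, 2)}
bad = {(2, 3)}
print(is_failsafe(S, {(0, 1), (1, 0), (2, 3)}, delta_f, bad))  # False: p executes (2, 3)
print(is_failsafe(S, {(0, 1), (1, 0), (2, 0)}, delta_f, bad))  # True: p recovers instead
```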
The Complexity of Adding Failsafe Fault-tolerance
• Adding (failsafe/nonmasking/masking) fault-tolerance in the high atomicity model is in P
• Adding masking fault-tolerance to distributed programs is NP-complete
• How about failsafe?
  • Adding failsafe fault-tolerance to distributed programs is NP-hard (proof in the paper)
  • Reduction of 3-SAT to the problem of adding failsafe fault-tolerance
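In the high atomicity model, the states from which fault transitions alone can cause a safety violation can be found with a simple backward fixpoint; a failsafe program must keep such states unreachable. The sketch below is an illustration under the same finite-model assumptions (safety given as a set `bad` of forbidden transitions), not the algorithm from the paper; `fault_only_unsafe` is a hypothetical name.

```python
def fault_only_unsafe(delta_f, bad):
    """Smallest set ms such that a state is in ms if some fault transition
    from it is forbidden or leads into ms (faults alone can cause a violation)."""
    ms = {s for s, t in delta_f if (s, t) in bad}
    changed = True
    while changed:
        changed = False
        for s, t in delta_f:
            if t in ms and s not in ms:
                ms.add(s)
                changed = True
    return ms

delta_f = {(0, 1), (1, 2), (3, 4)}
bad = {(1, 2), (5, 6)}
print(sorted(fault_only_unsafe(delta_f, bad)))  # [0, 1]
```

Each pass over the fault transitions either adds a state or stops, so the loop runs at most |states| + 1 times, which keeps the computation polynomial.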
Our Approach
Stepwise towards masking fault-tolerance:
• Automating the addition of failsafe fault-tolerance
• How hard is adding failsafe fault-tolerance?
• Polynomial-time boundaries for adding failsafe fault-tolerance?