The Complexity of Adding
Failsafe Fault-tolerance
Sandeep S. Kulkarni
Ali Ebnenasir
Motivations
Why automatic addition of fault-tolerance?
Why begin with a fault-intolerant program?
Reuse of the fault-intolerant program
Separation of concerns (functionality vs. fault-tolerance)
Potential to preserve properties such as efficiency
One obstacle
Adding masking fault-tolerance to distributed
programs is NP-hard [FTRTFT 2000]
Motivation (Continued)
Approach for dealing with complexity
Heuristics [SRDS 2001]
Weaker form of tolerance
Failsafe
Safety only in the presence of faults
Nonmasking
Safety may be temporarily violated
Restricting input
Programs
Specifications
Motivation (Continued)
Why Failsafe Fault-Tolerance?
Simplify the design of masking
Partial automation of masking fault-tolerance
(using TSE’98)
[Figure: path from the intolerant program, through a failsafe or a nonmasking fault-tolerant program, to a masking fault-tolerant program; steps labeled "Automate".]
Outline of the Talk
Problem of adding fault-tolerance
Difficulties caused by distribution
Complexity of failsafe fault-tolerance
Class of programs and specifications for
which polynomial synthesis is possible
Basic Concepts:
Programs and Faults
State space Sp
Program transitions δp, faults δf
Invariant S, fault-span T
Specification spec: safety is specified by a set of transitions
(sj, sk) that must not be executed
[Figure: invariant S contained in fault-span T, with transitions labeled p, f, and p/f.]
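The concepts on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: states are plain integers, the program and faults are sets of (source, target) pairs, and the safety specification is a set of bad transitions. The names `fault_span` and `executes_bad_transition`, and the toy data, are assumptions made for the example.

```python
from collections import deque

def fault_span(invariant, program, faults):
    """States reachable from the invariant via program or fault transitions."""
    successors = {}
    for s, t in program | faults:
        successors.setdefault(s, set()).add(t)
    seen = set(invariant)
    queue = deque(seen)
    while queue:
        s = queue.popleft()
        for t in successors.get(s, ()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def executes_bad_transition(span, program, faults, bad):
    """True if a bad transition can be taken from some state in the span."""
    return any(s in span and (s, t) in bad for s, t in program | faults)

# Hypothetical toy system: the invariant is {0, 1}, a fault leads to state 2,
# and the program transition (2, 3) is forbidden by the safety specification.
invariant = {0, 1}
program = {(0, 1), (1, 0), (2, 3)}
faults = {(1, 2)}
bad = {(2, 3)}

span = fault_span(invariant, program, faults)
unsafe = executes_bad_transition(span, program, faults, bad)

# Removing the offending program transition gives a failsafe variant.
repaired = program - {(2, 3)}
repaired_span = fault_span(invariant, repaired, faults)
safe_after_repair = not executes_bad_transition(repaired_span, repaired, faults, bad)
```

In this toy case the fix is a single removed transition; the rest of the talk is about why the same step is hard once transitions must be chosen in groups.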
Problem Statement
Inputs: program p, invariant S, faults f, specification spec
Outputs: program p’, invariant S’
Requirements: only fault-tolerance is added; no new
functional behavior is added
[Figure: the invariant of the fault-tolerant program shown together with the invariant of the fault-intolerant program. Inside the invariant: "No new transition here". Outside the invariant: "New transitions may be added here".]
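One way to read the requirement "no new functional behavior is added" is as two set inclusions: the new invariant is contained in the old one, and every transition of the new program that stays inside the new invariant already belongs to the original program. The sketch below checks exactly that, assuming this reading; the function name `adds_only_fault_tolerance` and the toy data are illustrative.

```python
def adds_only_fault_tolerance(p, S, p_new, S_new):
    """Check S_new is a subset of S and p_new adds no transition inside S_new."""
    inside_new = {(s, t) for (s, t) in p_new if s in S_new and t in S_new}
    return S_new <= S and inside_new <= p

# Hypothetical example: the original program toggles between states 0 and 1.
p = {(0, 1), (1, 0)}
S = {0, 1}

# A recovery transition from state 2 (outside the invariant) is allowed.
p_recover = p | {(2, 0)}
allowed = adds_only_fault_tolerance(p, S, p_recover, {0, 1})

# A new self-loop inside the invariant would be new functional behavior.
p_extra = p | {(0, 0)}
rejected = not adds_only_fault_tolerance(p, S, p_extra, {0, 1})
```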
Difficulties with Distribution
Read/Write restrictions
Two Boolean variables a and b
Process cannot read b
Can we include the following transition?
(a=0, b=0) → (a=1, b=0)
Only if we also include the transition
(a=0, b=1) → (a=1, b=1)
Groups of transitions (instead of individual transitions)
must be chosen.
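The grouping can be made concrete with a small sketch. It assumes states are dictionaries from variable names to values and that a process never changes a variable it cannot read, so each unreadable variable has the same value before and after the transition. The helpers `freeze` and `group_of` are hypothetical names introduced here.

```python
from itertools import product

def freeze(state):
    """Turn a dict state into a hashable tuple of (variable, value) pairs."""
    return tuple(sorted(state.items()))

def group_of(transition, unreadable, domains):
    """All transitions the process cannot tell apart from `transition`."""
    src, dst = transition
    hidden = sorted(unreadable)
    group = set()
    # Try every combination of values for the variables that cannot be read.
    for values in product(*(domains[v] for v in hidden)):
        override = dict(zip(hidden, values))
        group.add((freeze({**src, **override}), freeze({**dst, **override})))
    return group

domains = {"a": (0, 1), "b": (0, 1)}
t = ({"a": 0, "b": 0}, {"a": 1, "b": 0})

# The process cannot read b, so the transition comes with its b=1 twin.
g = group_of(t, {"b"}, domains)
```

With n unreadable Boolean variables a group has 2^n members, and a synthesis procedure must include or exclude the whole group at once.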
Reduction from 3-SAT
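[Figure: gadget for variable x0 with states a0 and an = a0; one transition is included iff x0 is true, the other iff x0 is false.]
[Figure: gadget for clause cj = xj \/ ¬xk \/ xl; three transitions, included iff xj is false, iff xk is true, and iff xl is false, respectively.]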
Dealing with the Complexity of
Adding Failsafe Fault-tolerance
For what class of problems can failsafe fault-tolerance be added in polynomial time?
Restrictions on
Fault-tolerant programs
Specifications
Faults
Our approach for restrictions:
In the absence of faults, preserve all computations
of the fault-intolerant program
Restrictions on Programs and
Specifications
Monotonicity requirements
Capture the notion that safe assumptions
can be made about variables that cannot
be read
Focus on specifications and transitions
of fault-intolerant programs
Monotonicity of Specifications
Definition: A specification spec is positive
monotonic with respect to variable x iff:
For every s0, s1, s’0, s’1 such that:
x = false in s0 and s1, and x = true in s’0 and s’1
The values of all other variables in s0 and s’0 are the same
The values of all other variables in s1 and s’1 are the same
If (s0, s1) does not violate safety
Then (s’0, s’1) does not violate safety
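For small state spaces the definition can be checked by brute force. The sketch below assumes that every variable is Boolean, that a state is a tuple of Booleans, and that the safety specification is given as a set of bad transitions; `spec_is_positive_monotonic` and `with_value` are names chosen for this example.

```python
from itertools import product

def with_value(state, index, value):
    """Copy of `state` with the variable at `index` set to `value`."""
    return state[:index] + (value,) + state[index + 1:]

def spec_is_positive_monotonic(bad, num_vars, x):
    """Brute-force check of positive monotonicity with respect to variable x."""
    states = product((False, True), repeat=num_vars)
    low = [s for s in states if not s[x]]  # states where x = false
    for s0, s1 in product(low, repeat=2):
        if (s0, s1) in bad:
            continue  # the premise "does not violate safety" fails
        lifted = (with_value(s0, x, True), with_value(s1, x, True))
        if lifted in bad:
            return False  # safe with x = false but unsafe with x = true
    return True

# States are (x, y). Forbidding y: false -> true only while x is false
# is positive monotonic in x; forbidding it only while x is true is not.
bad_when_x_false = {((False, False), (False, True))}
bad_when_x_true = {((True, False), (True, True))}

ok = spec_is_positive_monotonic(bad_when_x_false, 2, 0)
not_ok = spec_is_positive_monotonic(bad_when_x_true, 2, 0)
```

The check enumerates all pairs of states, so it is exponential in the number of variables and only meant to illustrate the definition.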
Monotonicity of Programs
Definition: Program p with invariant S is
negative monotonic with respect to variable x
iff:
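For every s0, s1, s’0, s’1 such that:
x = true in s0 and s1, and x = false in s’0 and s’1
The values of all other variables in s0 and s’0 are the same
The values of all other variables in s1 and s’1 are the same
If (s0, s1) is a transition of p inside invariant S
Then (s’0, s’1) is also a transition of p inside invariant S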
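A matching brute-force check for programs might look as follows. It assumes that states are tuples of Booleans, that p is a set of transitions, and that S is a set of states; the names `lower` and `program_is_negative_monotonic` are illustrative.

```python
def lower(state, x):
    """Copy of `state` with the variable at index x set to false."""
    return state[:x] + (False,) + state[x + 1:]

def program_is_negative_monotonic(p, S, x):
    """Every transition of p inside S with x = true has an x = false twin."""
    for s0, s1 in p:
        if not (s0 in S and s1 in S and s0[x] and s1[x]):
            continue  # only transitions inside S with x = true matter
        twin = (lower(s0, x), lower(s1, x))
        if twin not in p or twin[0] not in S or twin[1] not in S:
            return False
    return True

# States are (x, y); the invariant contains all four states.
S = {(False, False), (False, True), (True, False), (True, True)}

# Setting y is possible regardless of x: negative monotonic in x.
p_both = {((True, False), (True, True)), ((False, False), (False, True))}
# Setting y is possible only when x is true: not negative monotonic in x.
p_only_true = {((True, False), (True, True))}

monotonic = program_is_negative_monotonic(p_both, S, 0)
non_monotonic = not program_is_negative_monotonic(p_only_true, S, 0)
```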
Theorem
Adding failsafe fault-tolerance can be done in
polynomial time if either:
Program is negative monotonic, and
Spec is positive monotonic
Or
Program is positive monotonic, and
Spec is negative monotonic
If only one of the two requirements (program monotonicity or specification monotonicity) is satisfied,
then adding failsafe fault-tolerance is still NP-hard
For many problems, these requirements are easily
met
Example:
Byzantine Agreement
Processes: General, g, and three non-generals j, k, and l
Variables
d.g : {0, 1}
d.j, d.k, d.l : {0, 1, ⊥}
b.g, b.j, b.k, b.l : {true, false}
f.g, f.j, f.k, f.l : {0, 1}
Fault-intolerant program transitions
d.j = ⊥ /\ f.j = 0 → d.j := d.g
d.j ≠ ⊥ /\ f.j = 0 → f.j := 1
Fault transitions
¬b.g /\ ¬b.j /\ ¬b.k /\ ¬b.l → b.j := true
b.j → d.j, f.j := 0|1, 0|1
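The two fault-intolerant actions of a non-general can be written as guarded commands over a dictionary state. This is a sketch: `None` stands in for the undecided value, fault transitions are left out, and the function names and the fixed scheduling order in `run_without_faults` are assumptions of the example.

```python
BOTTOM = None  # stands in for the undecided value of d.j

def initial_state(decision):
    """General g holds `decision`; non-generals are undecided and not finalized."""
    state = {"d.g": decision, "b.g": False, "f.g": 0}
    for j in "jkl":
        state.update({f"d.{j}": BOTTOM, f"b.{j}": False, f"f.{j}": 0})
    return state

def copy_decision(state, j):
    """Undecided and not finalized: d.j := d.g. Returns None if disabled."""
    if state[f"d.{j}"] is BOTTOM and state[f"f.{j}"] == 0:
        return {**state, f"d.{j}": state["d.g"]}
    return None

def finalize(state, j):
    """Decided and not finalized: f.j := 1. Returns None if disabled."""
    if state[f"d.{j}"] is not BOTTOM and state[f"f.{j}"] == 0:
        return {**state, f"f.{j}": 1}
    return None

def run_without_faults(decision):
    """Let each non-general copy the general's decision and then finalize."""
    state = initial_state(decision)
    for j in "jkl":
        state = copy_decision(state, j)
        state = finalize(state, j)
    return state

final = run_without_faults(1)
```

In the absence of faults every non-general ends with the general's decision and a set finalize flag; the Byzantine fault actions are what make agreement and validity nontrivial.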
Example:
Byzantine Agreement (Continued)
Safety Specification:
Agreement: No two non-Byzantine non-generals can
finalize with different decisions
Validity: If g is not Byzantine, no process can finalize with
a decision different from that of g
Read/Write restrictions
Readable variables for process j:
b.j, d.j, f.j
d.g, d.k, d.l
Process j can write
d.j, f.j
Example:
Byzantine Agreement (Continued)
Observation 1:
Positive monotonicity of specification with respect to b.j
Observation 2:
Negative monotonicity of program, consisting of the transitions of j,
with respect to b.k
Observation 3:
Negative monotonicity of specification with respect to f.j
Observation 4:
Positive monotonicity of program, consisting of the transitions of j,
with respect to f.k
Summary
Complexity analysis for failsafe fault-tolerance
Reduction from 3-SAT
Restrictions on specifications and programs
for which polynomial synthesis is possible
Several problems fall in this category
Byzantine agreement, consensus, commit, …
Necessity of these restrictions
Future Work
Simplifying the design of masking fault-tolerance using the two-step approach
Refining boundary between classes for
which polynomial synthesis is possible
and for which exponential complexity is
inevitable
Using monotonicity requirements for
simplifying masking fault-tolerance
Thank You
Questions?
Future Work
Conclusion
Specifying the boundary
Fault-tolerance addition can be done in polynomial time
Exponential complexity is inevitable
Goal: what problems can benefit from automation?
Necessity and sufficiency of monotonicity requirements
Future Work
How can we change a non-monotonic program to a monotonic one
by modifying its invariant?
How can we strengthen a non-monotonic specification to a
monotonic one?
How can a nonmasking program be designed manually to satisfy
monotonicity requirements?
Basic Concepts:
Fault-tolerant Program
Fault-tolerance in the presence of faults:
Failsafe: Satisfies its safety specification
Nonmasking: Satisfies its liveness specification
(safety may be violated temporarily)
Masking: Satisfies safety and liveness specification
The Complexity of Adding
Failsafe Fault-tolerance
Adding (failsafe/nonmasking/masking) fault-tolerance in the high atomicity model is in P
Adding masking fault-tolerance to distributed
programs is in NP
How about failsafe?
Adding Failsafe to distributed programs
is NP-hard!! (proof in the paper)
Reduction of 3-SAT to the problem of failsafe fault-tolerance addition
Our Approach
Stepwise towards masking fault-tolerance:
Automating the addition of failsafe
fault-tolerance
How hard is adding failsafe fault-tolerance?
Polynomial time boundaries for failsafe
tolerance addition?