Failure Mode Assumptions
and Assumption Coverage
David Powell
Fault-Tolerance
Key questions
– How may components fail?
Prevention strategies
– At what rate may they fail?
The amount of redundancy needed
– What are the important types of faults?
Types of redundancy needed
– What is the relation between dependability,
redundancy and faults?
General FT design guidelines
An F-T Paradox/Dilemma
More faults
More redundancy
More possibility of faults
???
Solution: Some Key Steps
Classify, quantify and verify the
assumptions
Type of Failures
Overview
Single-user service
– Service Model
– Potential Errors
Multiple-user service
– Service Model
– Potential Errors
Single-user Service Model
Service items: $s_i$, $i = 1, 2, \ldots$
Value of $s_i$: $vs_i$
Observation time of $s_i$: $ts_i$
Service model: $s_i = \langle vs_i, ts_i \rangle$
An omniscient observer
Correctness Model
Service item $s_i$ is correct iff
$(vs_i \in SV_i) \land (ts_i \in ST_i)$
$SV_i$ and $ST_i$ are respectively the specified
sets of values and times for service item $s_i$
Potential Errors
Arbitrary value error: $s_i : vs_i \notin SV_i$
Noncode value error: $s_i : vs_i \notin CV$ ($CV$ defines a code)
Arbitrary timing error: $s_i : ts_i \notin ST_i$
Early timing error: $s_i : ts_i < \min(ST_i)$
Late timing error: $s_i : ts_i > \max(ST_i)$
Omission error: $s_i : ts_i = \infty$
Impromptu error: si: (vsi = ) (tsi = )
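The single-user error definitions can be sketched as a small classifier. This is only an illustration under stated assumptions: the names `ServiceItem` and `classify` are hypothetical, the specified time set is modeled as a closed interval, an omission is modeled as an observation time of infinity, and value checks are skipped for items that are never delivered. Impromptu errors are left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceItem:
    value: object  # vs_i, the delivered value
    time: float    # ts_i, the observation time (math.inf = never delivered)


def classify(item, sv, st_min, st_max, code):
    """Return the set of error labels that apply to one service item.

    sv              -- specified value set SV_i
    st_min, st_max  -- bounds of the specified time set ST_i (assumed an interval)
    code            -- set CV of valid code words (assumed to contain sv)
    """
    errors = set()
    delivered = item.time != math.inf

    # Value domain: only meaningful if the item was actually delivered.
    if delivered and item.value not in sv:
        errors.add("arbitrary value")
        if item.value not in code:
            errors.add("noncode value")

    # Time domain: early and late errors are special cases of arbitrary timing,
    # and an omission is an infinitely late item.
    if item.time < st_min:
        errors |= {"arbitrary timing", "early timing"}
    elif item.time > st_max:
        errors |= {"arbitrary timing", "late timing"}
        if not delivered:
            errors.add("omission")
    return errors


if __name__ == "__main__":
    spec = dict(sv={3}, st_min=0.5, st_max=2.0, code={1, 2, 3})
    print(classify(ServiceItem(3, 1.0), **spec))
    print(classify(ServiceItem(9, 1.0), **spec))
    print(classify(ServiceItem(None, math.inf), **spec))
```

An item is correct exactly when the returned set is empty, which mirrors the correctness condition on value and time.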
Multi-user Service Model
Service item $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$
Service model: $\langle vs_i(u), ts_i(u) \rangle$ for all $i$, $u$
New issues: “consistency”
Correctness Model
$vs_i(u)$: the value of service item $i$ on process $u$
$vs_i$: the value of service item $i$
$SV_i$: the set of specified values for service item $i$
$ts_i(u)$: the observation time of service item $i$ on process $u$
$ST_i(u)$: the range of specified observation times of
service item $i$ on process $u$
uv -- the time bound of related occurrences
Examples of Potential Errors
Consistent value error
Consistent timing error
Semi-consistent value error
Failure Mode Assumptions
Attempt to formalize the concept of an
assumed failure mode
By assertions on the sequences of service
items delivered by a component
Examples of Value Error Assertions
No value errors occur ($V_{none}$)
$\forall i,\ vs_i \in SV_i$
The only value errors that occur are noncode
value errors ($V_n$)
$\forall i,\ (vs_i \in SV_i) \lor (vs_i \notin CV)$
Arbitrary value errors can occur ($V_{arb}$)
$\forall i,\ (vs_i \in SV_i) \lor (vs_i \notin SV_i)$
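The value error assertions can be read as predicates over a whole sequence of delivered values. The sketch below is a minimal illustration, assuming each item has its own specified value set and that a single code set is shared by all items. The function names are hypothetical.

```python
def v_none(values, specified):
    """Vnone: every delivered value lies in its specified set."""
    return all(v in sv for v, sv in zip(values, specified))


def v_noncode(values, specified, code):
    """Vn: every value is either correct or detectably outside the code."""
    return all(v in sv or v not in code for v, sv in zip(values, specified))


def v_arb(values, specified):
    """Varb: a tautology, so it holds for every sequence."""
    return all(v in sv or v not in sv for v, sv in zip(values, specified))


if __name__ == "__main__":
    specified = [{1}, {2}, {3}]
    code = {1, 2, 3}
    detectable = [1, 2, 99]  # third value is wrong but outside the code
    silent = [1, 2, 1]       # third value is wrong yet a valid code word
    for name, seq in (("detectable", detectable), ("silent", silent)):
        print(name, v_none(seq, specified), v_noncode(seq, specified, code),
              v_arb(seq, specified))
```

Any sequence that satisfies Vnone also satisfies Vn, and every sequence satisfies Varb, so the assertions are ordered from strongest to weakest.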
Examples of Timing Error Assertions
No timing errors occur ($T_{none}$)
The only timing errors are omission errors ($T_O$)
The only timing errors are late timing errors ($T_L$)
The only timing errors are early timing errors ($T_E$)
Arbitrary timing errors can occur ($T_{arb}$)
Permanent omission/crash ($T_p$)
Bounded omission degree ($T_{Bk}$)
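The slides name the timing assertions without giving formulas, so the sketch below fills them in by analogy with the value assertions. All definitions here are assumptions: each specified time set is modeled as a closed interval, an omission is an observation time of infinity, a crash means every item after the first omission is also omitted, and "bounded omission degree" is taken to mean at most k consecutive omissions.

```python
import math


def in_window(t, window):
    lo, hi = window
    return lo <= t <= hi


def t_none(times, windows):
    return all(in_window(t, w) for t, w in zip(times, windows))


def t_omission(times, windows):
    return all(in_window(t, w) or t == math.inf for t, w in zip(times, windows))


def t_late(times, windows):
    return all(in_window(t, w) or t > w[1] for t, w in zip(times, windows))


def t_early(times, windows):
    return all(in_window(t, w) or t < w[0] for t, w in zip(times, windows))


def t_permanent(times, windows):
    """Correct until the first omission, then omitted forever (crash)."""
    crashed = False
    for t, w in zip(times, windows):
        if crashed:
            if t != math.inf:
                return False
        elif t == math.inf:
            crashed = True
        elif not in_window(t, w):
            return False
    return True


def t_bounded(times, windows, k):
    """Only omissions occur, and never more than k in a row."""
    run = 0
    for t, w in zip(times, windows):
        if t == math.inf:
            run += 1
            if run > k:
                return False
        elif in_window(t, w):
            run = 0
        else:
            return False
    return True


if __name__ == "__main__":
    windows = [(0, 1), (1, 2), (2, 3), (3, 4)]
    crash = [0.5, 1.5, math.inf, math.inf]
    print(t_permanent(crash, windows), t_omission(crash, windows),
          t_late(crash, windows), t_bounded(crash, windows, 1))
```

Under these definitions a crash trace also satisfies the omission and late assertions, because an omission is an infinitely late item.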
Timing Error Implications
Failure Mode Assertions (FMA)
A complete FMA entails an assertion on
errors occurring in both the value and time
domains
By taking the Cartesian product of the
two domains, we get a family of FMAs
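The family of FMAs can be enumerated directly as a Cartesian product. This sketch uses only the assertion names listed as examples on the earlier slides, so the resulting count reflects those examples and is not necessarily an exhaustive classification.

```python
from itertools import product

# Assertion names as they appear in the example lists above.
VALUE_ASSERTIONS = ["Vnone", "Vn", "Varb"]
TIMING_ASSERTIONS = ["Tnone", "TBk", "Tp", "TO", "TL", "TE", "Tarb"]

# Each complete FMA pairs one value assertion with one timing assertion.
fmas = list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))

if __name__ == "__main__":
    print(len(fmas))
    for value_assertion, timing_assertion in fmas[:3]:
        print(f"{value_assertion} and {timing_assertion}")
```

The pair ("Vnone", "Tnone") is the strongest assumption in this family and ("Varb", "Tarb") the weakest.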
FMA Implication Graph
So what?
The FMA classification and implication
graph can serve as a guideline for designing
families of FT algorithms that can process
errors of increasing severity
Assumption Coverage
Establishing a link between assumed
component failure mode and system
dependability
(The design of an FT system relies on the
assumptions it makes)
(The dependability of an FT system is related
to the failure mode it assumes)
Motivation
Components may fail
They may fail in a way that violates
the assumptions of the system
The system, in turn, can fail
Question: to what degree can a
component FMA prove to be true in the
real system?
The Coverage of the Assumption
Definition
$P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$
$P(V_{arb} \land T_{arb}) = 1$
$P(V_{none} \land T_{none}) = 0$
Coverage of an FT system
$P_S(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \cdot \Pr\{X = \text{true} \mid \text{component failed}\}$
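The product form of the system coverage shows why a stronger assumption is not automatically better. The sketch below is a worked example only: the function name is hypothetical and every probability in it is an invented illustration, not a figure from the slides or the paper.

```python
def system_coverage(p_processing, p_assumption):
    """Product of the error-processing coverage and the assumption coverage.

    p_processing -- Pr{correct error processing | X = true}
    p_assumption -- Pr{X = true | component failed}
    """
    for p in (p_processing, p_assumption):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing * p_assumption


if __name__ == "__main__":
    # Hypothetical numbers: a strong assumption is easy to process correctly
    # but may not hold, while the arbitrary assumption always holds.
    strong = system_coverage(p_processing=0.9999, p_assumption=0.99)
    weak = system_coverage(p_processing=0.999, p_assumption=1.0)
    print(f"strong assumption: {strong:.6f}")
    print(f"weak assumption:   {weak:.6f}")
```

With these made-up numbers the weaker assumption gives the higher overall coverage, because the strong assumption fails to hold in 1% of component failures.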
Influence of Assumption Coverage on System Dependability
A Case Study
The System
A system of n processors
Connected via unidirectional message-passing bus
Each processor carries out the same computation steps
The result of each processing step is communicated to
all other processors
Each processor has a decision function (DF)
The DF is applied to the results received from other
processors
…
Each processor and its associated bus is viewed as a
single component
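The slides do not say which decision function is used. Purely as an illustration, one common choice is a strict majority vote over the results received for a processing step; the function name `majority` and the choice of voting are assumptions, not the method of the case study.

```python
from collections import Counter


def majority(results):
    """Return the value reported by a strict majority of results, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


if __name__ == "__main__":
    print(majority([42, 42, 7]))  # two of three processors agree
    print(majority([1, 2, 3]))    # no majority
```

Which decision function is adequate depends on the failure mode assumed for the processor-bus components: the weaker the assumption, the more the function has to tolerate.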
Fail-Silent Processor Bus
A fail-silent processor
– Only has semi-consistent value errors
– Always produces messages on time
– Or ceases to produce messages forever
– If a message is delivered to a processor, it is delivered to
all processors with a consistent fixed delay
Fail-Consistent Processor Bus
Only semi-consistent value errors may occur
Faulty processors may send erroneous values
Consistent timing error may occur
Fail-Uncontrolled Processor Bus
Arbitrary timing error
Arbitrary value error
Implications of Assumption Coverage
Failure mode relations
Coverage relations
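One hedged reading of these two relations, assuming the three failure modes are nested as their descriptions suggest (every fail-silent behavior is also fail-consistent, and every behavior is fail-uncontrolled): whenever one assertion implies another, its coverage cannot be larger.

```latex
\begin{align*}
X \Rightarrow Y \quad &\Longrightarrow \quad P(X) \le P(Y) \\
\text{fail-silent} \Rightarrow \text{fail-consistent} &\Rightarrow \text{fail-uncontrolled} \\
P(\text{fail-silent}) \le P(\text{fail-consistent}) &\le P(\text{fail-uncontrolled}) = 1
\end{align*}
```

The final equality follows from the definition of coverage, since the fail-uncontrolled mode allows arbitrary value and timing errors.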
Dependability Expressions From Markov Models
$r = e^{-\lambda t}$
$\lambda$ = failure rate
A Life-critical Application
System reliability objective: $R > 1 - 10^{-9}$ over
10 hours
Single processor reliability:
– $r = e^{-\lambda t}$
– $1/\lambda$ = 5 years
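A quick calculation shows how far a single processor is from the objective. The sketch assumes a 365-day year. The second half relies on a deliberately idealized model that is not taken from the slides: independent processor failures, perfect coverage, and a system that survives as long as one processor works.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: 365-day year


def unreliability(mean_life_hours, mission_hours):
    """Return 1 - r, with r = exp(-lambda * t) and lambda = 1 / mean life."""
    lam = 1.0 / mean_life_hours
    # -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x.
    return -math.expm1(-lam * mission_hours)


q = unreliability(5 * HOURS_PER_YEAR, 10.0)  # single-processor unreliability
objective = 1e-9                             # allowed system unreliability

# Idealized model (assumption): the system fails only if all n processors fail.
n = 1
while q ** n >= objective:
    n += 1

if __name__ == "__main__":
    print(f"single-processor unreliability over 10 h: {q:.4e}")
    print(f"processors needed under perfect coverage: {n}")
```

A single processor misses the objective by more than five orders of magnitude, so redundancy is unavoidable. Once the coverage is below 1, the idealized count above is optimistic, which is the point of the case study.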
A Money-Critical Application
It is about availability of the system rather
than reliability of the system
Please look at the paper for more details
Unavailability vs. Coverage
Conclusion
A formalism for describing component
failure modes
Multiplicity of value and timing errors
The notion of assumption coverage
The relation between dependability,
availability and assumption coverage
Thank you