Rapid Recovery for Systems with Scarce Faults

Farn Wang, Chung-Hao Huang, Sven Schewe, Doron Peled
Fault Tolerance
• Goes beyond checking whether or not a system
satisfies a given property
– Bounded number of failures:
Check the maximum number of failures a system can
tolerate before exhibiting an error
– Unbounded number of failures:
Given a bound on the number of “dense failures”,
check whether the system can be “fully recovered” after
these dense failures
Related Work
• Fault tolerance refers to various basic fault models,
such as a limited number of errors.
[H. Jin, K. Ravi, F. Somenzi]
• Robustness based on Hamming and Levenshtein
distance related to the number of past states.
[L. Doyen, T.A. Henzinger, A. Legay, D. Nickovic]
• Ratio Games: minimize the ratio between failures
induced by the environment and system errors
caused by them. [R. Bloem, K. Greimel, T.A. Henzinger, B.
Jobstmann]
The Game
• Transition system with failures: (S, s_0, c, u, F)
– S: states
– s_0: s_0 ∈ S, initial state
– c: c ⊆ (S\F) × (S\F), controlled transitions
– u: u ⊆ (S\F) × S, uncontrolled transitions
– F: F ⊆ S, error states
• 2 players:
– Protagonist: controls c
– Antagonist (higher priority): controls u
• Let G ⊆ S\F be the “good region” in which the
protagonist wishes to remain
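As a minimal sketch, the tuple above could be represented as follows. The class name FailureSystem, its field names, and the example states are hypothetical and chosen only for illustration; the check mirrors the two typing constraints c ⊆ (S\F) × (S\F) and u ⊆ (S\F) × S.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureSystem:
    """Hypothetical container for a transition system with failures."""
    states: frozenset        # S
    initial: object          # initial state, must be in S
    controlled: frozenset    # c, as (source, target) pairs
    uncontrolled: frozenset  # u, as (source, target) pairs
    errors: frozenset        # F

    def is_well_formed(self) -> bool:
        safe = self.states - self.errors  # S \ F
        return (
            self.initial in self.states
            and self.errors <= self.states
            # c never starts or ends in an error state
            and all(s in safe and t in safe for s, t in self.controlled)
            # u never starts in an error state, but may end in one
            and all(s in safe and t in self.states
                    for s, t in self.uncontrolled)
        )


# Illustrative three-state system (assumed example, not from the slides).
ts = FailureSystem(
    states=frozenset({"ok", "degraded", "crash"}),
    initial="ok",
    controlled=frozenset({("ok", "ok"), ("degraded", "ok")}),
    uncontrolled=frozenset({("ok", "degraded"), ("degraded", "crash")}),
    errors=frozenset({"crash"}),
)
```

Here the protagonist would pick among the controlled pairs and the antagonist among the uncontrolled ones.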
The Game
• The game is divided into 2 parts:
– Safety part: the protagonist wants to stay in G; he loses
immediately when he reaches a state s∉G
– Reachability part: once the antagonist chooses an
uncontrolled transition, the game switches into this
phase. The protagonist wants to go back to G; he wins
immediately when he reaches a state s∈G.
• k is the maximum number of uncontrolled
transitions that the antagonist can choose
Dense Fault
• Two successive failures are in the same group
of dense failures if the sequence of states
separating them was not long enough for
recovery in the respective safety/reachability
game.
Safety and Reachability Objective
• sfrch_k : 2^(S\F) → 2^(S\F) is a function; sfrch_k(G)
denotes the set of states from which the protagonist
wins the above game
• k is the maximum number of uncontrolled
transitions that the antagonist can choose
Construction of sfrch_k (1)
• sfrch_0(G) = ∪{G’ ⊆ G | ∀g∈G’, ∃s∈G’. (g,s)∈c}
• sfrch_0(G) can be constructed as a greatest fixed
point
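The greatest fixed point can be sketched by starting from G and repeatedly discarding states that have no controlled successor inside the current set. This is a minimal illustration; the function name sfrch0 and the representation of c as a set of (source, target) pairs are assumptions.

```python
def sfrch0(G, c):
    """Largest subset of G in which every state has a controlled
    successor that is again inside the subset."""
    current = set(G)
    while True:
        keep = {g for g in current
                if any(src == g and dst in current for src, dst in c)}
        if keep == current:   # nothing was removed: fixed point reached
            return current
        current = keep


# State 3 can only move to 4, which is outside G, so it is discarded.
cycle = sfrch0({1, 2, 3}, {(1, 2), (2, 1), (3, 4)})

# In a dead-end chain every state is eventually discarded.
chain = sfrch0({1, 2, 3}, {(1, 2), (2, 3)})
```

The loop only removes states, so on a finite state space it terminates after at most |G| rounds.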
Construction of sfrch_k (2)
• Given a region L ⊆ S\F, the controlled limited
attractor,
cone_L(G) = ∩{A ⊇ G | ∀s∈L, suc_c(s) ∩ A ≠ ∅
implies s∈A},
is the set of states from which there is a controlled
path to G that does not leave L.
• cone_L(G) can be constructed as a least fixed
point
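One way to sketch the controlled limited attractor is to start from G and keep adding states of L that have a controlled successor already in the set, until nothing changes. The function name cone and the pair-set encoding of c are assumptions made for this illustration.

```python
def cone(L, G, c):
    """States that can reach G along controlled transitions
    while every intermediate state stays inside L."""
    attractor = set(G)
    changed = True
    while changed:
        changed = False
        for src, dst in c:
            if src in L and src not in attractor and dst in attractor:
                attractor.add(src)
                changed = True
    return attractor


c = {(1, 2), (2, 3), (4, 3)}

# 2 and then 1 are pulled in; 4 is not in L, so it stays out.
wide = cone({1, 2}, {3}, c)

# With 2 excluded from L, state 1 loses its only path to 3.
narrow = cone({1}, {3}, c)
```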
Construction of sfrch_k (3)
• Given a set B ⊆ S, the fragile set of B, frag(B), is the
set of states that have at least one
uncontrolled successor in B.
• frag(B) = {s∈S | ∃b∈B, (s,b)∈u}
• frag(B) can be easily constructed in linear time
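Following the set formula, a state is collected as soon as one of its uncontrolled successors lies in B. A minimal sketch, assuming u is stored as a set of (source, target) pairs and B as a Python set:

```python
def frag(B, u):
    """Sources of uncontrolled transitions whose target lies in B.
    A single pass over u, with constant-time membership tests on B."""
    return {src for src, dst in u if dst in B}


u = {(1, 2), (1, 3), (2, 3)}
into_three = frag({3}, u)
into_two = frag({2}, u)
```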
Construction of sfrch_k (4)
• Let L_0 = S\F
• A_i and L_i are defined recursively as follows:
– A_i = cone_{L_i}(G)
– L_{i+1} = L_0 \ frag(S \ A_i)
• sfrch_k(G) = sfrch_0(G ∩ L_k)
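The four construction steps can be combined as in the sketch below. It assumes that the recursion is read as L_{i+1} = L_0 \ frag(S \ A_i) and that the final step applies sfrch_0 to the intersection of G and L_k. All function names and the example system are hypothetical.

```python
def sfrch0(G, c):
    """Largest subset of G where each state has a controlled successor inside."""
    current = set(G)
    while True:
        keep = {g for g in current
                if any(src == g and dst in current for src, dst in c)}
        if keep == current:
            return current
        current = keep


def cone(L, G, c):
    """States that can reach G by controlled moves without leaving L."""
    attractor = set(G)
    changed = True
    while changed:
        changed = False
        for src, dst in c:
            if src in L and src not in attractor and dst in attractor:
                attractor.add(src)
                changed = True
    return attractor


def frag(B, u):
    """States with an uncontrolled successor in B."""
    return {src for src, dst in u if dst in B}


def sfrch_k(k, S, F, G, c, u):
    L0 = set(S) - set(F)
    L = set(L0)                        # L_0
    for _ in range(k):
        A = cone(L, G, c)              # A_i = cone_{L_i}(G)
        L = L0 - frag(set(S) - A, u)   # L_{i+1} (assumed index shift)
    return sfrch0(set(G) & L, c)       # sfrch_0 on G intersected with L_k


# Assumed example: failures push g -> d1 -> d2 -> x (error state),
# controlled moves repair one step at a time.
S = {"g", "d1", "d2", "x"}
F = {"x"}
G = {"g"}
c = {("g", "g"), ("d1", "g"), ("d2", "d1")}
u = {("g", "d1"), ("d1", "d2"), ("d2", "x")}
tolerated = [k for k in range(5) if "g" in sfrch_k(k, S, F, G, c, u)]
```

In this example the state g survives up to two successive failures, because a third failure from d2 leads directly into the error state x.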
Explanation
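[Figure: inside L_0 = S\F, the region A_0 is the controlled limited attractor of G. L_1 is obtained from L_0 by removing Frag(S\A_0), the states with an uncontrolled transition into S\A_0 (which contains F). The step is repeated k times.]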
Resilience to k-dense failures
• Res_k(G) denotes the set of states from which the system can
recover from k-dense failures infinitely often
• Res_k(G) can be constructed by iterating
sfrch_k(sfrch_k(… sfrch_k(G) …)) until a fixed point is
reached
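The iteration can be sketched as a generic loop that reapplies a region transformer until the region stops changing. The name res_k and the parameter step are assumptions. The toy_step function is only a hypothetical stand-in so that the sketch runs on its own; in a real setting, step would be the map G ↦ sfrch_k(G) for a fixed k and a fixed system.

```python
def res_k(G, step):
    """Apply `step` repeatedly, starting from G, until a fixed point
    is reached. Terminates when `step` only shrinks a finite region."""
    current = frozenset(G)
    while True:
        nxt = frozenset(step(current))
        if nxt == current:
            return set(current)
        current = nxt


def toy_step(region):
    """Hypothetical stand-in for sfrch_k: keep a number if its successor
    is still in the region, or if it is at least 5."""
    return {s for s in region if s + 1 in region or s >= 5}


# {1,2,3,7} -> {1,2,7} -> {1,7} -> {7} -> {7}
stable = res_k({1, 2, 3, 7}, toy_step)
```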
Complexity
• If k is considered as an input, sfrch_k(G) and
Res_k(G) are PTIME-complete
Handling Recovery Delay
• The delay of the recovery should not be
modeled as controlled transitions
• Add a new type of transitions, r
• In a state s, the protagonist chooses a state from
suc_c(s); this choice is offered together with all states in suc_r(s). The
antagonist then chooses a state from these
states, or chooses a state from suc_u(s) to
override the protagonist's decision
Handling Recovery Delay (cont.)
• cone_L(G) = ∩{A ⊇ G | ∀s∈L, suc_c(s) ∩ A ≠ ∅
implies s∈A} (original version)
• cone_L(G) = ∩{A ⊇ G | ∀s∈L, (suc_c(s) ∪ suc_r(s)) ∩
A ≠ ∅ and suc_r(s) ⊆ A implies s∈A} (version with recovery delay)
Experiment model example
Experiment Result
Conclusion
• We introduced an approach for developing
controllers for safety-critical systems that
maximize the number of dense failures the
system can tolerate.
• Our technique is guaranteed to find the optimal
parameter k. Optimizing k is computationally
inexpensive, yet provides strong guarantees: the
likelihood that more than k failures appear
in short succession after a failure has occurred is,
for independent errors, exponentially small in k.
Thank you!!