Seminarie Informatica

Fault-tolerant Systems:
The Software Viewpoint
A series of seminars coordinated by
Vincenzo De Florio
http://www.pats.ua.ac.be
The matter
• The exam
• The topics
• This lecture: application-level fault tolerance provisions
25 October 2006
Seminarie Informatica - Lecture 1
Introduction to the exam
• Seminarie informatica
   ▪ 10 seminars on hot topics of computer science
   ▪ Topic of this cycle: software fault-tolerant systems
   ▪ Next 3 seminars: 15, 22 November; 6 December
   ▪ Next year's seminars: to be announced on http://www.win.ua.ac.be/~vincenz/si/0607.html
Introduction to the exam
• Oral discussion of 2 papers
   ▪ A 5–6 page paper based on one or more of the topics of the seminars
   ▪ A paper with the analysis of a case study
      • See later for examples
• Evaluation criteria:
   ▪ Do the papers contain original ideas? Do they follow the seminar «too strictly»?
   ▪ Does the author understand the subject? Is (s)he able to reason independently about the subject?
• Papers must be submitted by May 15, 2007
   ▪ E-mail to [email protected]
The Topics
Dependability = the property of a system such that reliance can justifiably be placed on the service it delivers
Fault tolerance = one of the means of dependability
The Dependability Tree
[Figure: the dependability tree]
Fault tolerance (FT)
A fault-tolerant system is a system that continues to function in spite of faults.
Examples of faults, by origin:
• Hardware: defective IC
• Software: bug in a program
• Operator: operation fault
• I/O: sensor drift
Attributes of dependability
• Availability
   ▪ Readiness for usage
   ▪ A(t) = probability that the system conforms to its specification at time t
• Reliability
   ▪ Continuity of service
   ▪ R(t) = probability that the system conforms to its specification during [t0, t], provided that it does so at t0
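The two definitions can be stated more formally. The following is a sketch in notation assumed here rather than taken from the slides: C(t) denotes the event that the system conforms to its specification at time t. The closed form in the last line additionally assumes a constant failure rate λ, which the slides do not state.

```latex
\begin{align*}
A(t) &= \Pr\bigl[\,C(t)\,\bigr] \\
R(t) &= \Pr\bigl[\,C(\tau)\ \text{for all}\ \tau \in [t_0, t] \;\big|\; C(t_0)\,\bigr] \\
R(t) &= e^{-\lambda (t - t_0)} \qquad \text{(only under the assumed constant failure rate } \lambda\text{)}
\end{align*}
```

For example, under that assumption, λ = 10^-4 failures per hour and t - t0 = 1000 hours give R(t) = e^-0.1 ≈ 0.905.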
Attributes of dependability (2)
• Safety
   ▪ Non-occurrence of catastrophic consequences on the environment
   ▪ S(t) = probability that the system either conforms to its specification or reaches a safe halt at time t
   ▪ Fail-safe systems
Attributes of dependability (3)
• Maintainability
   ▪ Aptitude to undergo repairs and evolution
   ▪ M(t) = probability that the system conforms to its specification again at time t, given that it failed at t0
Attributes of dependability (4)
• Confidentiality
   ▪ Non-occurrence of unauthorised disclosure of information
• Integrity
   ▪ Non-occurrence of improper alterations of information
Related attributes
• Testability
   ▪ Ability to test features of a system
   ▪ Related to maintainability
Related attributes
• Security
   ▪ Integrity + availability + confidentiality
References
• Jean-Claude Laprie, “Dependable Computing and
Fault Tolerance: Concepts and Terminology”, in
Proc. of the 15th Int. Symposium on Fault-Tolerant
Computing (FTCS-15), Ann Arbor, Mich., June 1985,
pp.2-11
• Jean-Claude Laprie, “Dependability---Its Attributes,
Impairments and Means”, in Predictably Dependable
Computing Systems, ESPRIT Basic Research
Series, B. Randell and J.-C. Laprie and H. Kopetz
and B. Littlewood (eds.), Springer Verlag, 1995, pp.
3-18.
The lecture
• We now focus on application-level fault
tolerance
• Why do we need ALFT? Why do we need
software FT in the first place?
• We explain why
• We survey the existing methods and assess
their pros and cons against a set of properties
• Surprising conclusion: still an open problem
Software Fault Tolerance
• Human society increasingly expects, and relies on, good quality of the complex services supplied by computers
Software Fault Tolerance
• Consequences of a failure in the ‘40s (computers as fast solvers of numerical problems):
   ▪ Errors in computations, long downtimes
• Consequences of a failure nowadays (computers controlling nuclear plants, airborne equipment, healthcare…):
   ▪ Incalculable penalty (catastrophes)
[Figure label: performance & ease of use]
Software Fault Tolerance
• Traditional answer: hardware fault tolerance
• This is an important ingredient, but not the only one needed today!
• Complexity is also in the SW layers
   1. Hierarchies of complex abstract machines
[Figure: layer stack, top to bottom: APPLICATION, MW, OS, HW; label: SW]
Software Fault Tolerance
• Complexity is also in the SW layers (cont.’d)
   2. Software is often networked and distributed
   3. Relationships among software components are often complex
   4. Object model → easier SW reuse → hidden + explicit complexity
Software Fault Tolerance
• In conclusion: “No amount of verification, validation and testing can eliminate all faults in an application and give complete confidence in the availability and data consistency of applications”
   ▪ Fault tolerance in SW is key
   ! The consequences of SW failures can be as severe as those of HW failures (Ariane 5!)
Problems of SW FT
[Figure: layer stack in different colors, top to bottom: APPLICATION, HL RUN-TIME, OS, HW]
• The lighter the color, the more general-purpose the (virtual) machine
• The lighter the color, the more complex the problem of expressing fault tolerance
Problems of Application-level
Fault Tolerance
• “The only alternative and effective means for
increasing software reliability is that of
incorporating in the application software
provisions for SFT”
• The application software has to manage
   ▪ Functional aspects
   ▪ Fault tolerance (FT) aspects
  at the same time and in the same space
Problems and properties of
Application-level Fault Tolerance
• Hazard: code intrusion
   ▪ FT provisions are specified side by side with the service
   ▪ Conflicting design concerns
   ▪ Overall design complexity increases
   ▪ Larger development and maintenance costs & times
   ▪ Larger probability of introducing software bugs
Problems and properties of
Application-level Fault Tolerance
• Separation of design concerns (SDC)
   ▪ In what follows, an “ALFT” is a means to express fault tolerance in the application software
   ▪ One criterion to compare ALFTs is their degree of SDC
Problems and properties of
Application-level Fault Tolerance
• Hazard: porting code ≠ porting service
   ▪ FT code assumes a fault model = f(e), where e is the environment
      1. If e changes, or
      2. If the code is moved to another environment e’
     the QoS may degrade
Problems and properties of
Application-level Fault Tolerance
• Hazard: porting code ≠ porting service
• An interesting case: Ariane 5 flight 501
   ▪ Software from the Ariane 4 missions was re-used in Ariane 5
   ▪ The early part of the trajectory of Ariane 5 differed from that of Ariane 4 and resulted in considerably higher horizontal velocity values
   ▪ …370 million euros down the drain
• This could be a case study for the exam
[Figure: IRS, IRS, FCC]
Problems and properties of
Application-level Fault Tolerance
2. Problem: service portability
   ▪ Porting FT does not come for free
   ▪ “Hardwired” fault model = static environment
   ▪ More difficult to adapt / test / maintain
   ▪ More prone to Ariane 5 effects
“What is the most often overlooked risk in sw engineering? That the environment will do something the designer never anticipated” [J. Horning]
Problems and properties of
Application-level Fault Tolerance
• Adaptability (AD)
   ▪ Does the ALFT provide means to adapt, dynamically, to new environmental conditions?
   ▪ One criterion to compare two ALFTs is their degree of AD
Problems and properties of
Application-level Fault Tolerance
3. Problem: adding complexity can decrease dependability
   ▪ The ALFT (the means to express FT) must be based on a simple strategy
   ▪ It must be syntactically adequate to host several mechanisms
Problems and properties of
Application-level Fault Tolerance
• Hazard:
   ▪ “Languages shape the way we think …” [Whorf]
   ▪ “If all you have is a hammer, everything looks like a nail” [/usr/share/fortune]
   ‼ …but is it really a nail?
• Syntactical Adequacy (SA)
   ▪ Does the ALFT provide simple means to host many FT solutions?
   ▪ One criterion to compare two ALFTs is their degree of SA
Summary
• Separation of design concerns (SDC)
• Adaptability (AD)
• Syntactical Adequacy (SA)
   ▪ A “base” of attributes we can use to compare ALFTs with one another
[Figure: three-dimensional chart with axes SDC, AD and SA]
System structures for SFT
• Single-version FT
• Multiple-version FT
• Object model
• Linda model
• FT languages
• Recovery metaprogram
Each of these could be a case study for the exam
Single-version Fault Tolerance
• Single-version SFT = embedding a set of error detection / recovery features in the user application of a simplex system
   ▪ Explicit code intrusion (bad SDC)
   ▪ Increases size and complexity (bad SA)
   ▪ Bad for transparency, maintainability, portability
   ▪ Increases development times and costs
   ▪ No support for dynamic adaptability (bad AD)
• Libraries
   ▪ SwIFT, HATS, EFTOS …
Multiple-version Fault Tolerance
• Multiple-version SFT: N-version programming (NVP) and recovery blocks (RB)
• Idea: redundancy of software, i.e. independently designed versions of the software
   ▪ Randell (1975): “All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design”
• Assumption: random component failures. Correlated failures → sudden exhaustion of the available redundancy
   ▪ Again, Ariane 5 flight 501: two crucial components were operating in parallel with identical hardware and software…
Multiple-version Fault Tolerance
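• Recovery blocks (RB):
#include <ftmacros.h>
...
ENSURE(acceptance-test) {   /* accept the first alternate that passes the test */
    Alternate 1;            /* primary alternate */
} ELSEBY {
    Alternate 2;            /* tried if Alternate 1 fails the acceptance test */
} ... ENSURE;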
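The control flow that such macros stand for can be sketched in plain C. This is only an illustration under stated assumptions: the types `state_t`, `alternate_fn` and `acceptance_fn`, the function `recovery_block`, and the toy square-root alternates are all hypothetical names introduced here, and nothing below is taken from `ftmacros.h`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for this sketch; none of them come from ftmacros.h. */
typedef struct { double input; double result; } state_t;
typedef void (*alternate_fn)(state_t *s);
typedef bool (*acceptance_fn)(const state_t *s);

/* Run the alternates in order until one passes the acceptance test.
 * Returns the index of the accepted alternate, or -1 if all were rejected. */
int recovery_block(state_t *s, const alternate_fn alts[], size_t n,
                   acceptance_fn accept)
{
    assert(s != NULL && alts != NULL && accept != NULL);
    const state_t checkpoint = *s;      /* save the state on entry */
    for (size_t i = 0; i < n; i++) {
        alts[i](s);                     /* try the i-th alternate */
        if (accept(s))
            return (int)i;              /* result accepted */
        *s = checkpoint;                /* roll back before the next attempt */
    }
    return -1;                          /* redundancy exhausted */
}

/* Toy primary alternate, deliberately faulty. */
void faulty_primary(state_t *s) { s->result = -1.0; }

/* Toy secondary alternate: square root by Newton iteration (input >= 0). */
void newton_sqrt(state_t *s)
{
    double x = s->input;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++)
        r = 0.5 * (r + x / r);
    s->result = r;
}

/* Acceptance test: the result must be non-negative and square to the input. */
bool sqrt_ok(const state_t *s)
{
    double err = s->result * s->result - s->input;
    if (err < 0.0)
        err = -err;
    return s->result >= 0.0 && err < 1e-9;
}
```

Calling `recovery_block` with the alternates `{ faulty_primary, newton_sqrt }` and the test `sqrt_ok` rejects the primary, restores the checkpoint and accepts the secondary. Note that the checkpointing, the loop and the acceptance test all sit inside the application code, which is the code intrusion the slides warn about.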
Multiple-version Fault Tolerance
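• N-version programming (NVP):
#include <ftmacros.h>
...
NVP
    VERSION { block 1; SENDVOTE(v-pointer, v-size); }   /* each version submits its result as a vote */
    VERSION { block 2; SENDVOTE(v-pointer, v-size); }
    …
ENDVERSION(timeout, v-size);
if (!agreeon(v-pointer)) error_handler();               /* the votes do not agree */
ENDNVP;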
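The voting step can be sketched in plain C as follows. The sketch makes several assumptions that are not in the slides: the versions are plain functions from `int` to `int`, they run sequentially rather than in parallel, there is no timeout, and agreement means a strict majority. The names `version_fn`, `nvp_vote` and the three toy versions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NVP_MAX_VERSIONS 16

/* Hypothetical signature shared by all versions in this sketch. */
typedef int (*version_fn)(int input);

/* Run all n versions on the same input and look for a strict majority.
 * On success, stores the agreed value in *out and returns true. */
bool nvp_vote(const version_fn versions[], size_t n, int input, int *out)
{
    assert(versions != NULL && out != NULL);
    assert(n > 0 && n <= NVP_MAX_VERSIONS);

    int votes[NVP_MAX_VERSIONS];
    for (size_t i = 0; i < n; i++)
        votes[i] = versions[i](input);  /* sequential here; parallel in a real system */

    for (size_t i = 0; i < n; i++) {
        size_t agree = 0;
        for (size_t j = 0; j < n; j++)
            if (votes[j] == votes[i])
                agree++;
        if (2 * agree > n) {            /* strict majority reached */
            *out = votes[i];
            return true;
        }
    }
    return false;                       /* no majority: the caller handles the error */
}

/* Three toy versions of "square"; the last one is deliberately wrong. */
int square_mul(int x) { return x * x; }

int square_add(int x)
{
    int a = x < 0 ? -x : x;
    int sum = 0;
    for (int i = 0; i < a; i++)
        sum += a;                       /* adds |x| to itself |x| times */
    return sum;
}

int square_buggy(int x) { return x + x; }
```

With the three toy versions and input 3, the two correct versions outvote the buggy one. With only `square_mul` and `square_buggy`, no majority exists and `nvp_vote` returns false, which corresponds to the `error_handler()` branch. The vote only helps if the versions fail independently, which is the assumption the slides question.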
Multiple-version Fault Tolerance
• Multiple-version SFT
   ▪ Implies N-fold design costs and N-fold maintenance costs
   ▪ The risk of correlated failures is not negligible
   ▪ Code intrusion is limited (Acceptable SDC)
   ▪ System structure is fixed (Bad SA)
   ▪ No support for dynamic adaptability (bad AD)
   ▪ Can be combined with other means
Object-centred Strategies
• Strategies based on the object model
   ▪ Metaobject protocols and reflection
      • Open implementation of the run-time executive of an OO language
      • Reflection, reification
   ▪ Composition filters
      • Each object has a set of “filters”. Messages sent to an object are trapped by its filters, which may manipulate the message before passing it to the object.
Object-centred Strategies

   ▪ Active objects
      • Objects that control the synchronisation of incoming requests from other objects. An object can autonomously decide, e.g., to delay a request until it is acceptable, i.e., until a guard is met
• Examples: FRIENDS, SINA, Correlate
   ▪ Full separation of design concerns (Good SDC)
   ▪ No code intrusion
   ▪ Syntactically adequate, at least for a subset of FT strategies (Acceptable SA)
Object-centred Strategies

   ▪ Assumption: the application is written in an extended OO language
   ▪ Adaptability? (Questionable AD)
FT Linda Systems
   ▪ Generative communication: messages are not “sent”, they are stored in a public, distributed shared memory (the tuple space)
   ▪ A shared relational database for storing and withdrawing “tuples”
   ▪ Tuples: lists of objects identified by their contents, cardinality and type
   ▪ A Linda process inserts, reads, and withdraws tuples via blocking or non-blocking primitives
   ▪ Synchronisation: presence / absence of a matching tuple
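A tuple space can be sketched in a few lines of C. The sketch is heavily simplified and rests on assumptions stated here: it is single-process, in-memory and not distributed, a tuple is just a (tag, integer) pair matched by tag, and only the non-blocking primitives are modelled. The names `ts_out`, `ts_rdp` and `ts_inp` are hypothetical, loosely modelled on Linda's `out`, `rdp` and `inp`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TS_CAPACITY 32
#define TS_TAG_LEN  16

/* A tuple is reduced to (tag, value); tags longer than 15 chars are truncated. */
typedef struct { char tag[TS_TAG_LEN]; int value; bool used; } tuple_t;
typedef struct { tuple_t slots[TS_CAPACITY]; } tuple_space_t;

void ts_init(tuple_space_t *ts) { memset(ts, 0, sizeof *ts); }

/* out: insert a tuple; returns false when the space is full. */
bool ts_out(tuple_space_t *ts, const char *tag, int value)
{
    assert(ts != NULL && tag != NULL);
    for (size_t i = 0; i < TS_CAPACITY; i++) {
        tuple_t *t = &ts->slots[i];
        if (!t->used) {
            strncpy(t->tag, tag, TS_TAG_LEN - 1);
            t->tag[TS_TAG_LEN - 1] = '\0';
            t->value = value;
            t->used = true;
            return true;
        }
    }
    return false;
}

/* Find the first stored tuple whose tag matches, or NULL if there is none. */
static tuple_t *ts_find(tuple_space_t *ts, const char *tag)
{
    for (size_t i = 0; i < TS_CAPACITY; i++)
        if (ts->slots[i].used && strcmp(ts->slots[i].tag, tag) == 0)
            return &ts->slots[i];
    return NULL;
}

/* rdp: non-blocking read; the tuple stays in the space. */
bool ts_rdp(tuple_space_t *ts, const char *tag, int *value)
{
    assert(ts != NULL && tag != NULL && value != NULL);
    tuple_t *t = ts_find(ts, tag);
    if (t == NULL)
        return false;
    *value = t->value;
    return true;
}

/* inp: non-blocking withdraw; the tuple is removed from the space. */
bool ts_inp(tuple_space_t *ts, const char *tag, int *value)
{
    assert(ts != NULL && tag != NULL && value != NULL);
    tuple_t *t = ts_find(ts, tag);
    if (t == NULL)
        return false;
    *value = t->value;
    t->used = false;
    return true;
}
```

A master would call `ts_out(&ts, "task", 7)` and a worker `ts_inp(&ts, "task", &v)`. The presence or absence of a matching tuple is the only synchronisation between them. A worker that crashes after a successful `ts_inp` takes the task with it, since each operation is atomic only on its own.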
Linda
• In master-worker applications
   ▪ Dynamic load balancing, also in heterogeneous clusters
   ▪ Inherently tolerates crash failures of workers
- Drawback: atomicity is limited to single operations
• Solutions (possible case study for the exam):
   ▪ Atomic transactions with multiple tuple space (TS) ops
   ▪ Stable tuple space
   ▪ Tuple space checkpointing, etc.
Linda
• FT-Linda, Persistent Linda...
   ▪ Full separation of design concerns (Good SDC)
   ▪ No code intrusion
   ▪ Syntactically adequate, at least for a subset of FT strategies (Acceptable SA)
   ▪ Assumption: the application is written in Linda
   ▪ Adaptability? (Questionable AD)
FT Languages
• FT Languages
   1. Enhanced, pre-existing languages
• Examples:
   ▪ FT-SR
      • Fail-stop modules: “abstract unit of encapsulation”
      • Atomic execution
      • Composability
   ▪ x-Linda (x = C, Fortran, C++, …)
FT Languages
• FT Languages
   2. Novel languages
• Examples:
   ▪ Argus: distributed OO programming language and operating system
      • “Guardians”: objects performing user-definable actions in response to remote requests
      • Atomic transactions
   ▪ FTAG: functional language based on attribute grammars
FT Languages
• FTAG
   ▪ Computation = collection of pure mathematical functions, the modules
   ▪ Each module has a set of input values, called inherited attributes, and a set of output values, called synthesized attributes
FTAG (cont.’d)

   ▪ Primitive modules can be executed directly
   ▪ Non-primitive modules require other modules to be performed first
   ▪ FTAG program = decomposing a “root” module into its basic sub-modules and then recursively applying this decomposition process to each of the sub-modules (computation tree)
FTAG (cont.’d)

   ▪ Natural support for redoing (replacing a portion of the computation tree with a new computation)
   ▪ Natural support for replication (replicated decomposition: a module is decomposed into N identical sub-modules implementing the function to replicate)
FT Languages
• Conclusions for FT languages
   ▪ Adequate separation of design concerns, transparency (good SDC)
   ▪ Special-purpose syntax (potentially good SA)
   - The application must be written in a non-standard language
   - Bad portability
   ▪ Adaptability (AD): unknown
RMP
• Recovery Metaprogram
   ▪ Two cooperating processing contexts
   ▪ User-placed breakpoints in the user context trigger the execution of a meta-program
   ▪ When the meta-program ends, control is returned to the user program
   ▪ The meta-program is to be written in CSP
RMP
• Adequate, e.g., for recovery blocks:
   ▪ A breakpoint can trigger the execution of
      • CHECKPOINT
      • ALTERNATES
      • ACCEPTANCE TESTS...
RMP
• RMP summary:
   ▪ Full separation of design concerns
   ▪ No code intrusion (Good SDC)
   ▪ Syntactically adequate, at least for a subset of FT strategies (Average SA)
   - The meta-program is written in a fixed, pre-existing language (CSP)
   - Inefficient implementation (huge performance overhead for switching execution modes)
   - No adaptability (Bad AD)
Summary
[Figure: bar chart scoring SDC, AD and SA for each approach: Single-version, Multiple-version, Object model, Linda, Languages, RMP]
• No optimal solution exists yet
• Challenging research problem!
Conclusions – in search of optimum
• A dependable service is one that persists, to some agreed-upon extent, even when, for instance, its corresponding program experiences faults
• An F-dependable service (resp. F-dependable program, system…) is one that persists despite the occurrence of faults as described in F
• F is the fault model
Conclusions – in search of optimum
• F is the model of an environment (E)
• An F-dependable service may tolerate the faults in E but not those in E’
• What if F matches an environment E’?
• What if E changes into E’?
• What if an F-service is moved?
→ A failure may occur!
Conclusions – in search of optimum
• Adapting services
• X-dependable services, where X = f(E)
• X changes when
   ▪ The service is moved
   ▪ The environment mutates
• Changes should occur automa[tg]ically (High AD)
• The expression of adaptability and dependability concerns should not increase complexity “too much” (High SA)
Conclusions
• Ideally, the code should be made of two components, (service, FT) (Optimal SDC), and FT should adapt dynamically w.r.t. e’
Conclusions
• Risks: this may call for complexity!
   ▪ But generic architectures can be designed so as to keep complexity limited
   ▪ Optimizations are possible
• In a future seminar: a compliant architecture that is being designed within PATS
Questions?
All quotations are by B. Randell unless another author is specified