Seminarie Informatica
Fault-tolerant Systems: The Software Viewpoint
A series of seminars coordinated by Vincenzo De Florio
http://www.pats.ua.ac.be
Seminarie Informatica - Lecture 1, 25 October 2006

The matter
• The exam
• The topics
• This lecture: application-level fault tolerance provisions

Introduction to the exam
• Seminarie Informatica
  - 10 seminars on hot topics of computer science
  - Topic of this cycle: software fault-tolerant systems
  - Next 3 seminars: 15 and 22 November; 6 December
  - Next year's seminars: to be announced on http://www.win.ua.ac.be/~vincenz/si/0607.html

Introduction to the exam
• Oral discussion of 2 papers
  - A 5–6 page paper based on one or more of the topics of the seminars
  - A paper with the analysis of a case study (see later for examples)
• Evaluation criteria:
  - Do the papers contain original ideas?
  - Do they follow the seminar «too strictly»?
  - Does the author understand the subject?
  - Is (s)he able to reason independently about the subject?
• Papers must be submitted by May 15, 2007
  - E-mail to [email protected]

The Topics
• Dependability = the property of a system such that reliance can justifiably be placed on the service it delivers
• Fault tolerance = one of the means of dependability

The Dependability Tree
[Figure: the dependability tree]

Fault tolerance (FT)
• A fault-tolerant system is a system that continues to function in spite of faults
• Examples of faults, by origin:
  - Hardware: defective IC
  - Software: bug in a program
  - Operator: operation fault
  - I/O: sensor drift

Attributes of dependability
• Availability
  - Readiness for usage
  - A(t) = probability that the system conforms to its specification at time t
• Reliability
  - Continuity of service
  - R(t) = probability that the system conforms to its specification throughout [t0, t], provided that it does so at t0

Attributes of dependability (2)
• Safety
  - Non-occurrence of catastrophic consequences on the environment
  - S(t) = probability that the system either conforms to its specification or reaches a safe halt at time t
  - Fail-safe systems

Attributes of dependability (3)
• Maintainability
  - Aptitude to undergo repairs and evolution
  - M(t) = probability that the system is back to its specification at time t, given that it failed at t0

Attributes of dependability (4)
• Confidentiality
  - Non-occurrence of unauthorised disclosure of information
• Integrity
  - Non-occurrence of improper alterations of information

Related attributes
• Testability
  - Ability to test features of a system
  - Related to maintainability

Related attributes
• Security
  - Integrity + availability + confidentiality

References
• Jean-Claude Laprie, “Dependable Computing and Fault Tolerance: Concepts and Terminology”, in Proc. of the 15th Int. Symposium on Fault-Tolerant Computing (FTCS-15), Ann Arbor, Mich., June 1985, pp. 2-11.
• Jean-Claude Laprie, “Dependability---Its Attributes, Impairments and Means”, in Predictably Dependable Computing Systems, ESPRIT Basic Research Series, B. Randell, J.-C. Laprie, H. Kopetz and B. Littlewood (eds.), Springer Verlag, 1995, pp. 3-18.

The lecture
• We now focus on application-level fault tolerance (ALFT)
• Why do we need ALFT? Why do we need software FT in the first place?
• We explain why
• We survey the existing methods and assess their pros and cons against a set of properties
• Surprising conclusion: still an open problem

Software Fault Tolerance
• Human society increasingly expects, and relies on, good-quality complex services supplied by computers

Software Fault Tolerance
• Consequences of a failure in the ‘40s (computers as fast solvers of numerical problems):
  - Errors in computations, long downtimes
  - Penalty: performance & ease of use
• Consequences of a failure nowadays (computers controlling nuclear plants, airborne equipment, healthcare…):
  - Incalculable penalty (catastrophes)

Software Fault Tolerance
• Traditional answer: hardware fault tolerance
• This is an important ingredient, but not the only one needed today!
• Complexity is also in the SW layers
  1. Hierarchies of complex abstract machines (application SW, MW, OS, HW)

Software Fault Tolerance
• Complexity is also in the SW layers (cont.’d)
  2. Software is often networked and distributed
  3. Relationships among software components are often complex
  4. Object model: easier SW reuse, but hidden + explicit complexity
Software Fault Tolerance
• In conclusion: “No amount of verification, validation and testing can eliminate all faults in an application and give complete confidence in the availability and data consistency of applications”
• Fault tolerance in SW is key!
• SW failures can have consequences as extensive as those of HW failures (Ariane 5!)

Problems of SW FT
[Figure: layered stack of (virtual) machines: APPLICATION, HL RUN-TIME, OS, HW]
• The lighter the color, the more general purpose the (virtual) machine
• The lighter the color, the more complex the problem of expressing fault tolerance

Problems of Application-level Fault Tolerance
• “The only alternative and effective means for increasing software reliability is that of incorporating in the application software provisions for SFT”
• The application software has to manage, at the same time and in the same space:
  - Functional aspects
  - Fault tolerance (FT) aspects

Problems and properties of Application-level Fault Tolerance
• Hazard: code intrusion
  - FT provisions are specified side by side with the service
  - Conflicting design concerns
  - Overall design complexity increases
  - Larger development and maintenance costs and times
  - Larger probability of introducing software bugs

Problems and properties of Application-level Fault Tolerance
• Separation of design concerns (SDC)
  - In what follows, an “ALFT” is a means to express fault tolerance in the application software
  - One criterion to compare ALFTs is their degree of SDC

Problems and properties of Application-level Fault Tolerance
• Hazard: porting code ≠ porting service
  - FT code assumes a fault model = f(e), where e is the environment
  - The QoS may degrade:
    1. If e changes, or
    2. If the code is moved to another environment e’

Problems and properties of Application-level Fault Tolerance
• Hazard: porting code ≠ porting service
• An interesting case: Ariane 5 flight 501
  - Software from the Ariane 4 missions was re-used in Ariane 5
  - The early part of the trajectory of Ariane 5 differed from that of Ariane 4 and resulted in considerably higher horizontal velocity values
  - …370 million euros down the drain
  - This could be a case study for the exam
[Figure: IRS, IRS, FCC]

Problems and properties of Application-level Fault Tolerance
2. Problem: service portability
  - Porting FT does not come for free
  - “Hardwired” fault model = static environment
  - More difficult to adapt / test / maintain
  - More prone to Ariane 5 effects
  - “What is the most often overlooked risk in sw engineering? That the environment will do something the designer never anticipated” [J. Horning]

Problems and properties of Application-level Fault Tolerance
• Adaptability (AD)
  - Does the ALFT provide means to adapt, dynamically, to new environmental conditions?
  - One criterion to compare two ALFTs is their degree of AD

Problems and properties of Application-level Fault Tolerance
3. Problem: adding complexity can decrease dependability
  - The ALFT (the means to express FT) must be based on a simple strategy
  - It must be syntactically adequate to host several mechanisms

Problems and properties of Application-level Fault Tolerance
• Hazard:
  - “Languages shape the way we think…” [Whorf]
  - “If all you have is a hammer, everything looks like a nail” [/usr/share/fortune]
  - …but is it really a nail?
• Syntactical Adequacy (SA)
  - Does the ALFT provide simple means to host many FT solutions?
  - One criterion to compare two ALFTs is their degree of SA

Summary
• Separation of design concerns (SDC)
• Adaptability (AD)
• Syntactical Adequacy (SA)
A “base” of attributes we can use to compare ALFTs with one another
[Figure: example chart with SDC, AD and SA as axes]

System structures for SFT
• Single-version FT
• Multiple-version FT
• Object model
• Linda model
• FT languages
• Recovery metaprogram
Each of these could be a case study for the exam

Single-version Fault Tolerance
• Single-version SFT = embedding a set of error detection / recovery features in the user application of a simplex system
  - Explicit code intrusion (bad SDC)
  - Increases size and complexity (bad SA)
  - Bad for transparency, maintainability, portability
  - Increases development times and costs
  - No support for dynamic adaptability (bad AD)
• Libraries: SwIFT, HATS, EFTOS…

Multiple-version Fault Tolerance
• Multiple-version SFT: N-version programming (NVP) and recovery blocks (RB)
• Idea: redundancy of software, i.e., independently designed versions of the software
  - Randell (1975): “All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design”
• Assumption: random component failures
  - Correlated failures lead to a sudden exhaustion of the available redundancy
  - Again, Ariane 5 flight 501: two crucial components were operating in parallel with identical hardware and software…

Multiple-version Fault Tolerance
Recovery blocks:

    #include <ftmacros.h>
    ...
    ENSURE(acceptance-test) {
        Alternate 1;
    } ELSEBY {
        Alternate 2;
    }
    ...
    ENSURE;

Multiple-version Fault Tolerance
N-version programming:

    #include <ftmacros.h>
    ...
    NVP
    VERSION {
        block 1;
        SENDVOTE(v-pointer, v-size);
    }
    VERSION {
        block 2;
        SENDVOTE(v-pointer, v-size);
    }
    …
    ENDVERSION(timeout, v-size);
    if (!agreeon(v-pointer))
        error_handler();
    ENDNVP;

Multiple-version Fault Tolerance
• Multiple-version SFT
  - Implies N-fold design costs and N-fold maintenance costs
  - The risk of correlated failures is not negligible
  - Code intrusion is limited (acceptable SDC)
  - System structure is fixed (bad SA)
  - No support for dynamic adaptability (bad AD)
  - Can be combined with other means

Object-centred Strategies
• Strategies based on the object model
  - Metaobject protocols and reflection: open implementation of the run-time executive of an OO language; reflection, reification
  - Composition filters: each object has a set of “filters”. Messages sent to an object are trapped by its filters, which may manipulate the message before passing it to the object.

Object-centred Strategies
  - Active objects: objects that have control over the synchronisation of incoming requests from other objects. Objects can autonomously decide, e.g., to delay a request until it is acceptable, i.e., until a guard is met
• FRIENDS, SINA, Correlate
  - Full separation of design concerns, no code intrusion (good SDC)
  - Syntactically adequate, at least for a subset of FT strategies (acceptable SA)

Object-centred Strategies
  - Assumption: the application is written in an extended OO language
  - Adaptability? (questionable AD)

FT Linda Systems
• Generative communication: messages are not “sent”, they are stored in a public, distributed shared memory
• A shared relational database for storing and withdrawing “tuples”
• Tuples: lists of objects identified by their contents, cardinality and type
• A Linda process inserts, reads, and withdraws tuples via blocking or non-blocking primitives
• Synchronisation: presence / absence of a matching tuple

Linda
• In master-worker applications:
  - Dynamic load balancing, also in heterogeneous clusters
  - Inherently tolerates crash failures of workers
  - Limitation: single-op atomicity
• Solutions (possible case study for the exam):
  - Atomic transactions with multiple TS (tuple space) ops
  - Stable tuple space
  - Tuple space checkpointing, etc.

Linda
• FT-Linda, Persistent Linda...
  - Full separation of design concerns, no code intrusion (good SDC)
  - Syntactically adequate, at least for a subset of FT strategies (acceptable SA)
  - Assumption: the application is written in Linda
  - Adaptability? (questionable AD)

FT Languages
• FT languages 1: enhanced, pre-existing languages
• Examples:
  - FT-SR: fail-stop modules (“abstract unit of encapsulation”), atomic execution, composability
  - x-Linda (x = C, Fortran, C++, …)

FT Languages
• FT languages 2: novel languages
• Examples:
  - Argus: distributed OO programming language and operating system; “guardians” (objects performing user-definable actions in response to remote requests); atomic transactions
  - FTAG: functional language based on attribute grammars

FT Languages
• FTAG
  - Computation = collection of pure mathematical functions, the modules.
  - Each module has a set of input values, called inherited attributes, and a set of output variables, called synthesized attributes.

FTAG (cont.’d)
  - Primitive modules can be executed directly
  - Non-primitive modules require other modules to be performed first
  - FTAG program = decomposing a “root” module into its basic sub-modules and then recursively applying this decomposition process to each of the sub-modules (computation tree)

FTAG (cont.’d)
  - Natural support for redoing (replacing a portion of the computation tree with a new computation)
  - Natural support for replication (replicated decomposition: a module is decomposed into N identical sub-modules implementing the function to replicate)

FT Languages
• Conclusions for FT languages
  - Adequate separation of design concerns, transparency (good SDC)
  - Special-purpose syntax (potentially good SA)
  - The application must be written in a non-standard language, hence bad portability
  - Adaptability (AD): unknown

RMP
• Recovery Metaprogram
  - Two cooperating processing contexts
  - User-placed breakpoints in the user context trigger the execution of a meta-program
  - When the meta-program ends, control is returned to the user program
  - The meta-program is to be written in CSP

RMP
• Adequate, e.g., for recovery blocks: a breakpoint can trigger the execution of
  - CHECKPOINT
  - ALTERNATES
  - ACCEPTANCE TESTS...
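
As a minimal sketch of the recovery-block pattern named above (checkpoint, alternates, acceptance test), the following plain C function tries each alternate in turn and rolls back to the checkpoint whenever the acceptance test fails. It is not the RMP itself: there is no meta-program, no breakpoint and no CSP here, and every name (state_t, recovery_block, faulty_double, correct_double, is_even) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application state; a real system would checkpoint much more. */
typedef struct { int value; } state_t;

typedef void (*alternate_fn)(state_t *s);        /* one ALTERNATE        */
typedef int  (*acceptance_fn)(const state_t *s); /* the ACCEPTANCE TEST  */

/* Runs the alternates in order. Returns the index of the first alternate
 * whose result passes the acceptance test, or -1 if all of them fail.
 * On failure the state is left as it was at the checkpoint. */
int recovery_block(state_t *s, const alternate_fn *alts, size_t n,
                   acceptance_fn accept)
{
    assert(s != NULL && alts != NULL && accept != NULL);

    const state_t checkpoint = *s;   /* CHECKPOINT: save the state */

    for (size_t i = 0; i < n; i++) {
        alts[i](s);                  /* run ALTERNATE i */
        if (accept(s))
            return (int)i;           /* result accepted */
        *s = checkpoint;             /* rejected: roll back and try the next one */
    }
    return -1;                       /* redundancy exhausted */
}

/* Illustrative alternates: both are meant to double the value. */
void faulty_double(state_t *s)  { s->value = s->value * 2 + 1; } /* deliberately wrong */
void correct_double(state_t *s) { s->value = s->value + s->value; }

/* Illustrative acceptance test: a doubled integer must be even. */
int is_even(const state_t *s)   { return s->value % 2 == 0; }
```

Starting from the value 21 with the alternates {faulty_double, correct_double}, the first result is rejected by is_even, the state is restored from the checkpoint, and the second alternate is accepted. Note that the acceptance test only checks plausibility (evenness), which is typical of recovery blocks: it does not prove the result correct.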

RMP
• RMP summary:
  - Full separation of design concerns, no code intrusion (good SDC)
  - Syntactically adequate, at least for a subset of FT strategies (average SA)
  - The meta-program is written in a fixed, pre-existing language (CSP)
  - Inefficient implementation (huge performance overhead for switching execution modes)
  - No adaptability (bad AD)

Summary
[Figure: bar chart of SDC, AD and SA scores for single-version, multiple-version, object model, Linda, languages and RMP]
• No optimal solution exists yet
• Challenging research problem!

Conclusions – in search of optimum
• A dependable service is one that persists even when, for instance, its corresponding program experiences faults, to some agreed-upon extent
• An F-dependable service (resp. F-dependable program, system…) is one that persists despite the occurrence of faults as described in F
• F is the fault model

Conclusions – in search of optimum
• F is the model of an environment (E)
• An F-dependable service may tolerate the faults in E but not those in E’
• What if F matches an environment E’?
• What if E changes into E’?
• What if an F-service is moved?
→ A failure may occur!

Conclusions – in search of optimum
• Adapting services
• X-dependable services, where X = f(E)
• X changes when
  - The service is moved
  - The environment mutates
• Changes should occur automa[tg]ically (high AD)
• The expression of adaptability and dependability concerns should not increase complexity “too much” (high SA)

Conclusions
• Ideally, the code should be made of two components, (service, FT) (optimal SDC), and FT should adapt dynamically w.r.t. e’

Conclusions
• Risks: this may call for complexity!
  - But generic architectures can be designed so as to keep complexity limited
  - Optimizations are possible
• In a future seminar: a compliant architecture that is being designed within PATS

Questions?

All citations are by B. Randell unless an author is specified