SQA – SWE 333
Software Quality Assurance
Prof. Dr. Mohamed BATOUHE
Dept. of Software Engineering
CCIS – King Saud University
From Software Quality to Software Reliability Engineering
2
What is quality?
The popular view of quality:
– Something “good” but not quantifiable
– Something luxurious and classy
The professional view of quality:
– Conformance to requirements (Crosby, 1979)
The requirements are clearly stated, and the
product must conform to them
Any deviation from the requirements is regarded
as a defect
A good quality product contains fewer defects
– Fitness for use (Juran, 1970):
Fit to user expectations: meet user’s needs
A good quality product provides better user
satisfaction
3
Definition of Quality
ISO 8402 definition of QUALITY:
The totality of features and characteristics of a product
or a service that bear on its ability to satisfy stated
or implied needs.
ISO 9126 Model – Quality characteristics:
1. Functionality
2. Reliability
3. Usability
4. Efficiency
5. Maintainability
6. Portability
4
ISO 9126
Reliability - Attributes
Reliability
Maturity: The capability of the software to avoid failure as
a result of faults in the software.
Fault Tolerance: The capability of the software to
maintain a specified level of performance in case of
software faults or of infringement of its specified interface.
Recoverability: The capability of the software to
reestablish its level of performance and recover the data
directly affected in the case of a failure.
5
Importance of Reliability and Robust Systems
Correctness-critical applications
Aircraft control, hospital monitoring systems
On-line transaction processing
Internet services
e.g., Google, Yahoo!, Amazon, eBay, etc.
Expectation of 24 x 7 availability, but service outages still
happen!
Sorry....
We apologize for the inconvenience, but the system is
currently unavailable. Please try your request in an hour.
If you require assistance please call Customer Service at
1-866-325-3457.
6
Measuring Reliability
Two ways to measure reliability
Counting failures in periodic intervals
Observe the trend of the cumulative failure count, µ(τ).
Failure intensity
Observe the trend of the number of failures per unit time, λ(τ).
µ(τ)
This denotes the total number of failures observed up to execution
time τ, counted from the beginning of system execution.
λ(τ)
This denotes the number of failures observed per unit time after τ
time units of executing the system from the beginning. This is also
called the failure intensity at time τ.
Relationship between λ(τ) and µ(τ)
λ(τ) = dµ(τ)/dτ
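A minimal sketch (not from the slides; the interval length and failure counts are hypothetical) of how λ(τ) can be approximated by differencing observed values of µ(τ):

// Sketch: approximate failure intensity from cumulative failure counts.
// Assumes failures are counted at the end of equal-length observation intervals.
public class FailureIntensity {
    public static void main(String[] args) {
        double interval = 10.0;                             // execution-time units per interval (hypothetical)
        int[] cumulativeFailures = {0, 5, 9, 12, 14, 15};   // hypothetical values of mu(tau)

        for (int i = 1; i < cumulativeFailures.length; i++) {
            // lambda(tau) ~= (mu(tau_i) - mu(tau_{i-1})) / (tau_i - tau_{i-1})
            double lambda = (cumulativeFailures[i] - cumulativeFailures[i - 1]) / interval;
            System.out.printf("interval %d: lambda = %.2f failures per time unit%n", i, lambda);
        }
    }
}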
7
Reliability Science
• Exploring ways of implementing “reliability” in software
products.
• Reliability Science’s goals:
Developing “models” (regression and aggregation
models) and “techniques” to build reliable software.
Testing such models and techniques for adequacy,
soundness and completeness.
8
What is Reliability?
First definition
Software reliability is defined as the probability of failure-free
operation of a software system for a specified time in a specified
environment.
Key elements of the above definition
Probability of failure-free operation
Length of time of failure-free operation
A given execution environment
Example
The probability that a PC in a store is up and running for eight hours
without a crash is 0.99.
9
What is Reliability?
Second definition
Failure intensity is a measure of the reliability of a software
system operating in a given environment.
Example: An air traffic control system fails once in two years.
Comparing the two
The first emphasizes the duration of failure-free operation (Mean Time
To Failure - MTTF), whereas the second emphasizes the failure count per unit time.
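A common way to connect the two definitions, not stated on the slides: if the failure intensity λ is assumed constant, the reliability over an operating period τ is
R(τ) = e^(−λτ), with MTTF = 1/λ
For the air traffic control example, λ = 0.5 failures per year gives MTTF = 2 years and R(1 year) = e^(−0.5) ≈ 0.61.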
10
Factors influencing Software Reliability
A user’s perception of the reliability of a software system depends upon
two categories of information.
The number of faults present in the software.
The ways users operate the system.
This is known as the operational profile.
The fault count in a system is influenced by the following.
Size and complexity of code
Characteristics of the development process used
Education, experience, and training of development personnel
Operational environment
11
Operational Profiles
Developed at AT&T Bell Labs.
AT&T = American Telephone and Telegraph
An OP describes how actual users operate a
system.
An OP is a quantitative characterization of
how a system will be used.
Two ways to represent operational profiles
Tabular
Graphical
[Figure: example operational profile of a library information system, in tabular form]
[Figure: graphical representation of the operational profile of a library information system]
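The original tables are not in the extracted text. A hypothetical operational profile for such a library system (illustrative values only) could be:
Search catalogue: 0.35
Renew a loan: 0.25
Borrow a book: 0.20
Return a book: 0.15
Register a new member: 0.05
The probabilities of the operations sum to 1.0.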
12
Operational Profiles
Use of operational profiles
For accurate estimation of the reliability of a system, test the
system in the same way it will actually be used in the field (see the sketch at the end of this slide).
Other uses of operational profiles
Use an OP as a guiding document in designing user interfaces.
The more frequently used operations should be easy to use.
Use an OP to design an early version of a software for release.
This contains the more frequently used operations.
Use an OP to determine where to put more resources.
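A minimal sketch (illustrative, not from the slides) of profile-driven test selection: operations are drawn at random with the probabilities of the hypothetical profile on the previous slide, so the test mix mirrors expected field usage.

import java.util.Random;

// Sketch: select test operations according to an operational profile.
public class ProfileDrivenTests {
    static String[] ops   = {"search", "renew", "borrow", "return", "register"};
    static double[] probs = {0.35, 0.25, 0.20, 0.15, 0.05};   // hypothetical profile

    static String nextOperation(Random rng) {
        double r = rng.nextDouble(), cumulative = 0.0;
        for (int i = 0; i < ops.length; i++) {
            cumulative += probs[i];
            if (r < cumulative) return ops[i];
        }
        return ops[ops.length - 1];     // guard against rounding
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < 10; i++)
            System.out.println("test case " + i + ": " + nextOperation(rng));
    }
}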
13
Failure definition and classification
Failure:
A system failure is an event that occurs when the delivered service deviates from correct service.
A failure is thus a transition from correct service to incorrect service, i.e., to not implementing
the system function.
Any departure of system behavior in execution from user needs. A failure is caused by a fault
and the cause of a fault is usually a human error.
Classification of failures:
Transient - only occurs with certain inputs
Permanent - occurs on all inputs
Recoverable - system can recover without operator help
Unrecoverable - operator has to help
Non-corrupting - failure does not corrupt system state or data
Corrupting - system state or data are altered
14
Time Units
Time is a key concept in the formulation of reliability.
Different forms of time are considered.
Raw execution time (τ)
for response-type systems
Calendar time (t)
if the system has regular usage patterns - continuous-use systems
Number of transactions
for demand-type transaction systems
15
Reliability improvement
Reliability is improved when software faults which occur in the
most frequently used parts of the software are removed
Removing x% of software faults will not necessarily lead to an
x% reliability improvement
In a study, removing 60% of software defects actually led to
a 3% reliability improvement
Removing faults with serious consequences is the most
important objective
16
Reliability and Efficiency
As reliability increases, system efficiency tends to
decrease
To make a system more reliable, redundant code
must be included to carry out run-time checks, etc.
This tends to slow it down
17
Software vs. Hardware
Software reliability doesn’t decrease with time, i.e., software doesn’t wear out.
Hardware faults are mostly physical faults, e.g., fatigue.
Software faults are mostly design faults which are harder to measure, model,
detect and correct.
Hardware failure can be “fixed” by replacing a faulty component with an
identical one, therefore no reliability growth.
Software problems can be “fixed” by changing the code in order to have the
failure not happen again, therefore reliability growth is present.
Software does not go through production phase the same way as hardware
does.
Conclusion: hardware reliability models may not be used identically for
software.
18
S/W Reliability Metrics
Hardware metrics not directly applicable to software
Software failures are often transient rather than
permanent
Failures are different for many inputs
If data is undamaged system can continue operating in
most cases
19
Reliability Metrics
Metrics that are used for specifying software reliability
and availability:
POFOD
ROCOF
MTTF, MTTR, MTBF
AVAIL
The appropriate metric depends on the type of system.
20
Reliability Metrics
POFOD (Probability of failure on demand)
Metric most appropriate for systems where services are
demanded at unpredictable or long intervals
POFOD = 0.001
For one in every 1000 requests the service fails
21
Reliability Metrics
ROCOF (Rate of occurrence of failures)
Metric most appropriate for systems that have regular
demands
ROCOF = 0.02
Two failures in every 100 operational time units
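A minimal sketch, with hypothetical counts, showing how POFOD and ROCOF fall out of simple failure counts:

// Sketch: compute POFOD and ROCOF from hypothetical observation data.
public class DemandMetrics {
    public static void main(String[] args) {
        int demands = 5000, failedDemands = 5;            // hypothetical service requests
        double pofod = (double) failedDemands / demands;  // probability of failure on demand
        System.out.printf("POFOD = %.4f%n", pofod);       // 0.0010

        double operationalTime = 400.0;                   // hypothetical time units of operation
        int failures = 8;
        double rocof = failures / operationalTime;        // failures per unit time
        System.out.printf("ROCOF = %.3f%n", rocof);       // 0.020
    }
}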
22
Reliability Metrics
MTTF (Mean time to failure)
Measure of the time between observed failures
This is usually measured against a clock (e.g. CPU cycles)
and assumes:
Constant execution load
Measurement against some error threshold
MTTF of 500 means that the time between failures is 500 time
units
Metric most appropriate for systems that have long transactions
23
Reliability Metrics
AVAIL (Availability)
Measures the fraction of time system is really available for
use
Takes repair and restart times into account
Relevant for non-stop continuously running systems (e.g.
traffic signal, telephone switching systems, ...)
Where users expect the system to deliver a continuous
service
24
Availability
Availability is the probability that the system will still be
operating to requirements at a given time.
Availability is a function not only of how rarely a system
fails (reliability) but also of how quickly it can be repaired
(time to repair)
Availability of 0.998 means software is available for 998 out of
1000 time units
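A common formulation, not given explicitly on this slide, expresses availability in terms of the MTTF and MTTR metrics defined two slides later:
AVAIL = MTTF / (MTTF + MTTR)
For example, MTTF = 998 and MTTR = 2 time units give 998 / (998 + 2) = 0.998, matching the figure above.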
25
Reliability Metrics
Three kinds of measurements when assessing
reliability of a system:
# of failures over # of requests
Time between system failures
Repair or restart time after a failure
26
Reliability Metrics
MTTF: Mean Time To Failure
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures (= MTTF + MTTR)
[Figure: timeline from the start of system operation showing occurrences of failures and repairs performed, illustrating the relationship between MTTR, MTTF, and MTBF]
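A minimal sketch (the timestamps are hypothetical) of how MTTF, MTTR, MTBF, and the availability figure above can be computed from a failure/repair log:

// Sketch: compute MTTF, MTTR, MTBF and availability from a hypothetical failure/repair log.
public class RepairMetrics {
    public static void main(String[] args) {
        // Time units from the start of system operation (hypothetical values):
        double[] failureTimes = {100, 260, 430};   // when each failure occurred
        double[] repairTimes  = {110, 275, 445};   // when each repair completed

        double upTime = failureTimes[0], downTime = 0;
        for (int i = 0; i < failureTimes.length; i++) {
            downTime += repairTimes[i] - failureTimes[i];
            if (i + 1 < failureTimes.length)
                upTime += failureTimes[i + 1] - repairTimes[i];
        }
        int n = failureTimes.length;
        double mttf = upTime / n;                  // mean operating time before a failure
        double mttr = downTime / n;                // mean time spent repairing
        double mtbf = mttf + mttr;                 // mean time between successive failures
        double avail = mttf / mtbf;                // fraction of time the system is usable
        System.out.printf("MTTF=%.1f MTTR=%.1f MTBF=%.1f AVAIL=%.3f%n", mttf, mttr, mtbf, avail);
    }
}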
27
Benefits of Software Reliability
Comparison of software engineering technologies
What is the cost of adopting a technology?
What is the return from the technology -- in terms of cost and quality?
Measuring the progress of system testing
Key question: What percentage of the testing has been done?
The failure intensity measure tells us about the present quality of the system:
high intensity means more tests are to be performed.
Controlling the system in operation
The amount of change made to the software during maintenance affects its
reliability. Thus, how much change to apply in one go is determined by how
much reliability we are prepared to risk losing.
Better insight into software development processes
Quantification of quality gives us a better insight into the development
processes.
28
29
Dependability and its attributes
Dependability:
The ability to deliver service that can justifiably be trusted.
The ability to avoid service failures that are more frequent and more severe
than acceptable.
It encompasses the following attributes:
Availability: readiness for correct service
Reliability: continuity of correct service
Safety: absence of catastrophic consequences
Integrity: absence of improper system alterations
Maintainability: ability to undergo modifications and repairs
30
Means to attain Dependability
Fault prevention: how to prevent the occurrence or introduction
of faults? - Good software design framework, languages, etc
Fault removal: how to reduce the number or severity of faults?
Bug detection, testing, debugging, etc
Fault tolerance: how to deliver correct service in the presence of
faults? - Survive faults, transparent recovery, etc
Manual recovery
Done by operators
Fault forecasting: how to estimate the present number of faults,
the future incidence, and the likely consequences of faults?
Real Life Analogy?
31
Fault removal cost
[Figure: cost per error detected]
32
Fault Tolerance
In critical situations, software systems must be
fault tolerant.
Fault tolerance is required where there are high availability
requirements or where system failure costs are very high.
Fault tolerance means that the system can continue in
operation in spite of software failure.
Even if the system has been proved to conform to its
specification, it must also be fault tolerant as there may be
specification errors or the validation may be incorrect.
33
Fault Tolerance Actions
Fault detection
The system must detect that a fault (an incorrect system state) has
occurred.
Damage assessment
The parts of the system state affected by the fault must be
detected.
Fault recovery
The system must restore its state to a known safe state.
Fault repair
The system may be modified to prevent recurrence of the
fault. As many software faults are transient, this is often
unnecessary.
34
Software failure recovery
Rebooting Techniques
Whole-system rebooting, micro-rebooting, software rejuvenation
General checkpointing and recovery
Fail-over system, progressive retry, recovery block, n-version
programming
Application-specific recovery
Multi-process model, exception handling
Recently proposed non-traditional mechanisms
Failure-oblivious computing, reactive immune systems, Rx
35
Hardware fault tolerance
Depends on triple-modular redundancy (TMR).
There are three replicated identical components that
receive the same input and whose outputs are
compared.
If one output is different, it is ignored and component
failure is assumed.
Based on most faults resulting from component
failures rather than design faults and a low probability
of simultaneous component failure.
36
Hardware fault tolerance with TMR
37
TMR – Output selection
The output comparator is a (relatively) simple
hardware unit.
It compares its input signals and, if one is different
from the others, it rejects it. Essentially, the selection
of the actual output depends on the majority vote.
The output comparator is connected to a fault
management unit that can either try to repair the
faulty unit or take it out of service.
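A minimal sketch (illustrative only, not from the slides) of the majority vote the output comparator performs, here for three integer outputs:

// Sketch: majority voting over three replicated outputs (TMR-style).
public class MajorityVoter {
    // Returns the value produced by at least two of the three units;
    // throws if all three disagree (no majority exists).
    static int vote(int a, int b, int c) {
        if (a == b || a == c) return a;
        if (b == c) return b;
        throw new IllegalStateException("no majority - all three outputs differ");
    }

    public static void main(String[] args) {
        System.out.println(vote(7, 7, 9));   // 7: the third unit is assumed faulty and ignored
    }
}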
38
Software analogies to TMR
N-version programming
The same specification is implemented in a number of
different versions by different teams. All versions compute
simultaneously and the majority output is selected using a voting
system.
This is the most commonly used approach e.g. in many models of
the Airbus commercial aircraft.
Recovery blocks
A number of explicitly different versions of the same specification
are written and executed in sequence.
An acceptance test is used to select the output to be transmitted.
39
N-version programming
[Figures: N-version programming architecture - N independently developed versions execute in parallel on the same input and a voter selects the majority output]
41
N-version programming
As in hardware systems, the output comparator is a simple piece
of software that uses a voting mechanism to select the output.
In real-time systems, there may be a requirement that the results
from the different versions are all produced within a certain time
frame.
The different system versions are designed and implemented by
different teams. It is assumed that there is a low probability that
they will make the same mistakes. Ideally the algorithms used
should be different, but in practice they may not be.
There is some empirical evidence that teams commonly
misinterpret specifications in the same way and choose the same
algorithms in their systems.
42
Design Diversity
Different versions of the system are designed and
implemented in different ways. They therefore ought to
have different failure modes.
Different approaches to design (e.g. object-oriented and
function oriented)
Implementation in different programming languages;
Use of different tools and development environments;
Use of different algorithms in the implementation.
43
Recovery Blocks
[Figures: recovery block structure - the primary version executes first, an acceptance test checks its result, and alternative versions are executed in sequence if the test fails]
45
Recovery Blocks
These force a different algorithm to be used for each
version so they reduce the probability of common
errors.
However, the design of the acceptance test is difficult
as it must be independent of the computation used.
There are problems with this approach for real-time
systems because of the sequential operation of the
redundant versions.
46
Defensive Programming
Example (1) – SafeSort Method
class SafeSort {
  static void sort (int [] intarray, int order) throws SortError
  {
    int [] copy = new int [intarray.length];
    // keep a copy of the input array so it can be restored if the sort fails
    for (int i = 0; i < intarray.length; i++)
      copy[i] = intarray[i];
    try {
      Sort.bubblesort (intarray, intarray.length, order);
47
Defensive Programming
Example (2) - Safe Sort Method
      // acceptance check: verify the array really is ordered as requested
      if (order == Sort.ascending) {
        for (int i = 0; i <= intarray.length - 2; i++)
          if (intarray[i] > intarray[i+1])
            throw new SortError();
      }
      else {
        for (int i = 0; i <= intarray.length - 2; i++)
          if (intarray[i+1] > intarray[i])
            throw new SortError();
      }
    } // try block
    catch (SortError e)
    {
      // restore the original contents before reporting the failure
      for (int i = 0; i < intarray.length; i++)
        intarray[i] = copy[i];
      throw new SortError ("Array not sorted using bubble sort");
    } // catch
  } // sort
} // SafeSort
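The SafeSort example assumes a Sort class and a SortError exception that are not shown on these slides; a minimal sketch (hypothetical, for completeness only) of what they might look like:

// Hypothetical supporting classes assumed by the SafeSort example.
class SortError extends Exception {
    SortError() { super(); }
    SortError(String message) { super(message); }
}

class Sort {
    static final int ascending = 0, descending = 1;

    // Plain bubble sort of the first n elements in the requested order.
    static void bubblesort(int[] a, int n, int order) {
        for (int i = n - 1; i > 0; i--)
            for (int j = 0; j < i; j++) {
                boolean outOfOrder = (order == ascending) ? a[j] > a[j + 1] : a[j] < a[j + 1];
                if (outOfOrder) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
            }
    }
}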
48
Software Fault Tolerance
using Recovery Blocks
Example: Sorting Data Safely
[Class diagram] RobustSort (- main()) uses Vector and BugSort;
Vector provides - BubbleSort, - InsertionSort, - QuickSort, - AcceptanceTest;
BugSort extends Exception.
49
Software Fault Tolerance
using Recovery Blocks
Class RobustSort …
public class RobustSort
{
public static void main(String[] args) throws BugSort {
System.out.println(" \n Testing Dependability \n");
Vector v = new Vector(3);
try { System.out.println(" Starting first sorting algorithm ");
v.bubblesort();
v.displayvector();
v.acceptancetest();
System.out.println(" Bubble sort algorithm OK ...");
} // end first try
50
Software Fault Tolerance
using Recovery Blocks
… Class RobustSort
catch (BugSort e1) {
try { System.out.println(" Starting second sorting algorithm ");
v.insertionsort(); v.displayvector(); v.acceptancetest();
System.out.println(" insertionsort algorithm OK ...");
} // end second try
catch (BugSort e2) {
throw new BugSort(" Impossible to sort this vector ");
}
}
System.out.println(" sorting OK ... ");
}
} // end of class RobustSort
51
Software Fault Tolerance
using Recovery Blocks
Class vector …
import java.util.*;
public class Vector {
int [] A;
Scanner input = new Scanner(System.in);
public Vector(int nbe) {
A = new int [nbe];
System.out.println("enter data ");
for (int i=0; i<nbe; i++) {
System.out.print(" enter element : "); A[i] = input.nextInt(); }
}
52
Software Fault Tolerance
using Recovery Blocks
Class vector …
public void bubblesort() {
  int temp;
  for (int i = A.length-1; i > 0; i--)
    for (int j = 0; j < i-1; j++)   // insertion of bug: should be j < i, so the
                                    // last pair is never compared and the sort can fail
      if (A[j] > A[j+1]) {
        temp = A[j];
        A[j] = A[j+1];
        A[j+1] = temp;
      }
}

public void insertionsort() {
  int value, i, j;
  boolean done;
  for (i = 1; i <= (A.length-1); i++) {
    value = A[i]; j = i-1; done = false;
    do {
      if (A[j] > value) {
        A[j+1] = A[j--];
        if (j < 0)
          done = true;
      }
      else
        done = true;
    } while (!done);
    A[j+1] = value;
  }
}
53
Software Fault Tolerance
using Recovery Blocks
... Class vector
public void displayvector() {
System.out.println(" vector display ");
for (int i=0; i< A.length; i++)
System.out.println(A[i]);
}
public void acceptancetest() throws BugSort {
for (int i=0; i<A.length-1; i++)
if (A[i] > A[i+1]){
System.out.println(" Warning Wrong Sorting !!!!");
throw new BugSort(); // raise exception ...
}
}
} // end of class Vector
54
Software Fault Tolerance
Using Exception Handling
Example: Reading PositiveIntegers (int ≥ 0) Safely
[Class diagram] SafeRead (- main()) uses PositiveInteger and BugRead;
PositiveInteger provides - getpositiveinteger(), - PositiveInteger sum(PositiveInteger);
BugRead extends Exception.
55
Software Fault Tolerance
using Exception Handling
Class SafeRead …
public class SafeRead {
public static void main(String[] args) {
System.out.println(" \n Reading positive integers without failures (safely)!!! \n");
PositiveInteger p1 = new PositiveInteger();
System.out.println(p1.getPositiveInteger());
PositiveInteger p2 = new PositiveInteger();
System.out.println(p2.getPositiveInteger());
p2.sum(p1);
System.out.print("\n\n the sum of the two positive numbers is =");
System.out.println(p2.getPositiveInteger());
}
} // end of class SafeRead
56
Software Fault Tolerance
using Exception Handling
Class PositiveInteger… Constructor …
import java.util.*;

public class PositiveInteger
{
  private int x;

  /* Constructor for objects of class PositiveInteger */
  public PositiveInteger() {
    Scanner input = new Scanner(System.in);
    boolean done = false;
    do {
      try {
        System.out.print(" enter a positive integer: ");
        x = input.nextInt();
        if (x < 0)
          throw new BugRead();
        done = true;
      }
      catch (BugRead e1) {
        System.out.println(" input should be positive ");
        System.out.println("\n please repeat again \n");
      }
      catch (InputMismatchException e2) {
        input.nextLine();   // discard the non-numeric token
        System.out.println("\n please repeat again \n");
      }
    } while (!done);
  } // end of constructor
57
Software Fault Tolerance
using Exception Handling
... Class Positive Integer
public int getPositiveInteger() {
return x;
}
public void sum(PositiveInteger p) {
x = x + p.getPositiveInteger();
}
} // end of class PositiveInteger
Class BugRead
public class BugRead extends Exception {
....
}
58
Fault Forecasting
59
Fault Forecasting
How to determine the number of remaining bugs?
Definition:
Fault forecasting is conducted by performing an
evaluation of the system behavior with respect to fault
occurrence or activation
Fault Injection technique:
The idea is to inject (seed) some faults in the program
and estimate the remaining bugs based on how many of
the seeded faults are detected [Mills 1972], assuming that
the probability of detecting seeded and nonseeded faults
is the same.
60
Fault Forecasting
Under Mills’ fault-seeding technique, the symbols are typically:
Ns = number of faults seeded into the program
ns = number of seeded faults detected during testing
nd = number of nonseeded (indigenous) faults detected during testing
Nd = estimated total number of indigenous faults
Nr = estimated number of indigenous faults still remaining
Since seeded and nonseeded faults are assumed equally detectable,
ns / Ns ≈ nd / Nd, hence Nd ≈ (Ns × nd) / ns and Nr = Nd − nd
61
Fault Forecasting
Example
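The example from the original slide is not in the extracted text; a hypothetical illustration of the estimate above:
Seed Ns = 20 faults into the program. Testing then detects ns = 10 of the seeded faults and nd = 25 indigenous faults.
Nd ≈ (20 × 25) / 10 = 50 indigenous faults in total
Nr = 50 − 25 = 25 indigenous faults estimated to remain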
62
Key points – Dependability
Dependability in a system can be achieved through
fault avoidance, fault detection and fault tolerance.
The use of redundancy and diversity is essential to the
development of dependable systems.
The use of a well-defined repeatable process is
important if faults in a system are to be minimised.
Some programming constructs are inherently error-
prone - their use should be avoided wherever possible.
63
References
William E. Lewis, “Software Testing and Continuous Quality Improvement”, Third Edition, CRC Press, 2009.
K. Naik and P. Tripathy, “Software Testing and Quality Assurance”, Wiley, 2008.
Ian Sommerville, “Software Engineering”, 8th Edition, Addison-Wesley, 2006.
Aditya P. Mathur, “Foundations of Software Testing”, Pearson Education, 2009.
D. Galin, “Software Quality Assurance: From Theory to Implementation”, Pearson Education, 2004.
David Gustafson, “Theory and Problems of Software Engineering”, Schaum’s Outline Series, McGraw-Hill, 2002.
Michael R. Lyu (Ed.), “Handbook of Software Reliability Engineering”, IEEE Computer Society Press / McGraw-Hill, 1996.
64