CS 386C
Understanding Fault-Tolerant Distributed Systems
© A. Mok 2016
Dependable Real-Time System Architecture
[Figure: layered architecture showing the application stack, real-time services, and the real-time scheduler]
Basic Concepts
Terms: service, server, "depends" relation
• Failure semantics
• Failure masking:
– hierarchical
– group
• Hardware architecture issues
• Software architecture issues
Basic architectural building blocks
• Service: a collection of operations that can be invoked by users.
• The operations can only be performed by a server, which hides state
representation and operation implementation details. Servers can be
implemented in hardware or software.
Examples:
– IBM 4381 processor service: all operations in a 4381
manual
– DB2 database service: set of query and update
operations
The “depends on” relation
• We say that server s depends on server t if s relies on
the correctness of t’s behavior to correctly provide its
own service.
• Graphic representation: [Figure: server s (user, client) depends on server t (server, resource)]
• System-level description: [Figure: the "depends on" relation organizes servers into levels of abstraction]
Failure Classification
• Service specification: For all operations, specify state transitions and
timing requirements
• Correct service: meets state transition and timing specs
• Failure: erroneous state transition or timing in response to a request
• Crash failure: no state transition/response to all subsequent
requests.
• Omission failure: no state transition/response in response to a
request, distinguishable from crash failure only by timeout.
• Timing failure: either omission or the correct state transition occurs
too early or too late; also called performance failure.
• Arbitrary failure: either timing failure or bad state transition, also
known as Byzantine failure.
Failure Classification
Failure semantics specifies the behavior the system is forced to assume
when some component(s) fail.
[Figure: nested failure classes: Crash ⊆ Omission ⊆ Timing ⊆ Arbitrary (Byzantine)]
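A small Python sketch of this containment hierarchy, treating the classes as an ordered enumeration so that one can check whether an observed failure lies within a specified failure semantics:

    from enum import IntEnum

    class FailureClass(IntEnum):
        # Ordered from most restrictive to least restrictive behavior.
        CRASH = 1      # no response to this and all subsequent requests
        OMISSION = 2   # no response to a request
        TIMING = 3     # omission, or a correct response too early/too late
        ARBITRARY = 4  # any behavior, including wrong state transitions

    def within_semantics(observed, specified):
        # An observed failure is covered if its class is contained in
        # (i.e., no broader than) the specified failure semantics.
        return observed <= specified

    # A server with omission semantics covers crashes, but a timing
    # failure falls outside its specification.
    assert within_semantics(FailureClass.CRASH, FailureClass.OMISSION)
    assert not within_semantics(FailureClass.TIMING, FailureClass.OMISSION)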
Examples of Failures
• Crash failure
– "Clean" operating system shutdown due to power outage
• Omission failure
– Message loss on a bus
– Inability to read a bad disk sector
• Timing failure
– Early scheduling of an event due to a fast clock
– Late scheduling of an event due to a slow clock
• Arbitrary failure
– A "plus" procedure returns 5 to plus(2,2)
– A search procedure finds a key that was never inserted
– The contents of a message are altered en route
Server Failure Semantics
• When programming the recovery actions to take when a server s fails, it is
important to know what failure behaviors s can exhibit.
Example: a server s (acting as a client) depends on a remote server r,
reachable via a network server n
d - max time to transport a message
p - max time needed to receive and process a message
• If n and r can only suffer omission failures, then if no reply arrives at s within
2(d + p) time units, no reply will ever arrive. Hence s can assume "no
answer" to its message.
• If n, r can exhibit performance failures, then s must keep local data to
discard replies to "old" messages.
• It is the responsibility of a server implementer to ensure that the specified
failure semantics is properly implemented, e.g., a traffic light control
system needs to ensure a "flashing red or no light" failure semantics.
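A sketch of this timeout logic in Python (send and try_receive are hypothetical transport helpers, and the values of d and p are assumptions): under omission-only semantics, a missing reply after 2(d + p) means "no answer", while under performance failure semantics the client also records serial numbers so that late replies to "old" messages can be discarded:

    import time

    D = 0.050   # d: max message transport time (assumed value, in seconds)
    P = 0.020   # p: max time to receive and process a message (assumed value)

    def call(send, try_receive, request, serial, seen_serials):
        # send / try_receive are hypothetical transport helpers.
        send(request, serial)
        deadline = time.monotonic() + 2 * (D + P)
        while time.monotonic() < deadline:
            reply = try_receive(timeout=deadline - time.monotonic())
            if reply is None:
                continue
            if reply.serial != serial or reply.serial in seen_serials:
                # Under performance failure semantics a late reply to an
                # "old" message may still arrive; discard it.
                continue
            seen_serials.add(reply.serial)
            return reply
        # Under omission-only semantics for n and r, no reply by now means
        # no reply will ever arrive: assume "no answer".
        return None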
Failure Semantics Enforcement
• Usually, failure semantics is implemented only to the extent of satisfying
the required degree of plausibility (the probability that actual failures stay
within the specified class).
Examples:
– Use error-detecting codes to implement networks with performance
failure semantics.
– Use error-detecting codes, circuit switching and a real-time executive to
implement networks with omission failure semantics.
– Use lock-step duplication and highly reliable comparator to implement
crash failure semantics for CPUs.
• In general, the stronger the desired failure semantics, the more
expensive it is to implement; e.g., it is cheaper to build a memory
without an error-detecting code (arbitrary failure semantics) than to build
one with an error-detecting code (omission failure semantics).
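The memory example can be sketched as follows: a parity (error-detecting) code cannot correct a corrupted word, but it lets the read operation report "no value" instead of returning a wrong value, i.e., it converts arbitrary read failure semantics into omission semantics:

    def parity(word):
        return bin(word).count("1") % 2

    def write(word):
        # Store the word together with its parity bit.
        return (word, parity(word))

    def read(cell):
        word, stored_parity = cell
        if parity(word) != stored_parity:
            # Corruption detected: report "no value" (omission) rather
            # than silently returning a wrong word (arbitrary failure).
            return None
        return word

    cell = write(0b1011)
    assert read(cell) == 0b1011
    corrupted = (0b1010, cell[1])      # one bit flipped in the stored word
    assert read(corrupted) is None     # omission, not a wrong value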
Failure Masking:
Hierarchical Masking
• Suppose server u depends on server r. If server u can provide its
service by taking advantage of r's failure semantics, then u is said to
mask r's failure to its own (u's) clients.
• A typical sequence of events consists of:
1. A masking attempt.
2. If impossible to mask, then recover consistent state and propagate
failure upwards.
Examples:
– Retry I/O on same disk assuming omission failure on bus (time
redundancy).
– Retry I/O on backup disk assuming crash failure on disk being
addressed (space redundancy).
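A sketch of this masking sequence, with primary_read and backup_read as hypothetical disk-driver functions that return None on an omission failure: retrying the same disk is time redundancy, falling back to the backup disk is space redundancy, and an exception propagates the failure upward when masking fails.

    class IOFailure(Exception):
        # Propagated upward when masking is not possible.
        pass

    def read_block(primary_read, backup_read, block, retries=3):
        # primary_read / backup_read return the block data, or None on an
        # omission failure of the bus or disk.
        for _ in range(retries):           # time redundancy: retry same disk
            data = primary_read(block)
            if data is not None:
                return data
        data = backup_read(block)          # space redundancy: backup disk
        if data is not None:
            return data
        raise IOFailure("block %d unreadable" % block)   # propagate upward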
Failure Masking:
Group Masking
• Implement service S by a group of redundant, independent servers
so as to ensure continuity of service.
• A group is said to mask the failure of a member m if the group
response stays correct despite m’s failure.
• Group response = f(member responses)
Examples of group responses
– Response of fastest member
– Response of "primary" member
– Majority vote of member responses
• A server group able to mask to its clients any k concurrent member
failures is said to be k-fault-tolerant.
k = 1: single-fault tolerant
k > 1: multiple-fault tolerant
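A sketch of a "group response = f(member responses)" combinator: a majority vote masks up to k arbitrary member failures when the group has at least 2k + 1 members, while the response of the highest-ranking (or fastest) live member suffices when members have crash/omission semantics:

    from collections import Counter

    def majority_vote(responses):
        # Group response when members may fail arbitrarily:
        # the value reported by a strict majority, or None if there is none.
        value, count = Counter(responses).most_common(1)[0]
        return value if count > len(responses) // 2 else None

    def primary_response(responses_by_rank):
        # Group response when members have crash/omission semantics:
        # the response of the highest-ranking live member (None = no response).
        for r in responses_by_rank:        # ordered: primary, 1st back-up, ...
            if r is not None:
                return r
        return None

    # 3 members, one arbitrary failure (k = 1, 2k + 1 = 3): still masked.
    assert majority_vote([42, 42, 17]) == 42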
Group Masking Enforcement
• The complexity of the group management mechanism is a function of
the failure semantics assumed for group members.
Example:
Assume memory components with read omission failure semantics
[Figure: duplexed memory modules M and M'; writes go to both, reads are taken from both]
– Simply OR-ing the read outputs is sufficient to mask a member failure
– Cheaper and faster group management
– More expensive members (stores with error-correcting code)
Group Masking Enforcement
• The complexity of the group management mechanism is a function of
the failure semantics assumed for group members.
Example:
Assume memory components with arbitrary read failure semantics
[Figure: triplexed memory modules M, M' and M"; writes go to all three, reads pass through a voter]
– Needs a majority voter to mask the failure of a member
– More expensive and slower group management
– Cheaper members (no e.c.c. circuitry)
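The two designs can be contrasted in a short sketch (the encoding is an assumption: a member that detects its own bad read outputs 0):

    from collections import Counter

    def group_read_omission(m, m_prime):
        # Each member detects its own bad reads (e.c.c. inside the module)
        # and outputs 0; OR-ing the outputs masks one failed member.
        return m | m_prime

    def group_read_arbitrary(m, m_prime, m_dprime):
        # Members may output any value, so a majority voter is required.
        value, count = Counter([m, m_prime, m_dprime]).most_common(1)[0]
        return value if count >= 2 else None

    assert group_read_omission(0b1101, 0) == 0b1101                  # M' failed
    assert group_read_arbitrary(0b1101, 0b1101, 0b0110) == 0b1101    # M" wrong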
Key Architectural Issues
• Strong server failure semantics - expensive
• Weak failure semantics - cheap
• Redundancy management for strong failure semantics - cheap
• Redundancy management for weak failure semantics - expensive
This implies:
Need to balance amount of failure detection, recovery and masking
redundancy mechanisms at various levels of abstraction to obtain best
overall cost/performance/dependability results
Example:
Error detection for memory systems at lower levels usually decreases the
overall system cost; error correction for communication systems at lower
levels may be overkill for non-real-time applications (known as the
end-to-end argument in networking).
Hardware Architectural Issues
• Replaceable hardware unit:
A set of hardware servers packaged together so that the set is a physical
unit of failure, replacement and growth. May be field-replaceable (by field
engineers) or customer-replaceable.
• Goals:
– Allow for physical removal/insertion without disruption to higher level software
servers (removals and insertions are masked)
– If masking is not possible or too expensive, ensure "nice" failure semantics such
as crash or omission
• Coarse granularity architecture:
A replaceable unit includes several elementary servers, e.g., CPU,
memory, I/O controller.
• Fine granularity architecture:
Elementary hardware servers are replaceable units.
Question: How are the replaceable units grouped and connected?
Coarse Granularity Example:
Tandem Non-Stop
Coarse Granularity Example:
DEC VAX cluster
Coarse Granularity Example:
IBM Extended Recovery Facility
IBM XRF Architecture:
Fine Granularity Example:
Stratus
Fine granularity Example:
Sequoia
O.S. Hardware Failure Semantics?
What failure semantics is specified for hardware replaceable
units that is usually assumed by operating system software?
• CPU - crash
• Bus - omission
• Memory - read omission
• Disk - read/write omission
• I/O controller - crash
• Network - omission or performance failure
What failure detection mechanisms are used to implement the
specified hardware replaceable units' failure semantics?
Examples:
– Processors with crash failure semantics implemented by duplication
and comparison in Stratus, Sequoia, Tandem CLX
– Crash failure semantics approximated by using error-detecting codes
in IBM 370, Tandem TXP, VLX.
At what level of abstraction are hardware replaceable units' failures
masked?
• Masking at hardware level (e.g., Stratus)
– Redundancy at the hardware level.
– Duplexing CPU-servers with crash failure semantics provides single-fault
tolerance.
– Increases mean time between failures for the CPU service.
• Masking at operating system level (e.g., Tandem process groups)
– Redundancy at the O.S. level.
– Hierarchical masking hides single CPU failure from higher level software
servers by restarting a process that ran on a failed CPU in a manner
transparent to the server.
• Masking at application server level (e.g., IBM XRF, AAS)
– Redundancy at the application level
– Group masking hides CPU failure from users by using a group of
redundant software servers running on distinct hardware hosts and
maintaining global service state.
Software Architecture Issues
Software servers:
Analogous to hardware replaceable units (units of
failure, replacement, growth)
Goals:
• Allow for removal/insertion without disrupting higher level
users.
• If masking is impossible or not economical, ensure "nice"
failure semantics (which will allow higher level users,
possibly human, to use simple masking techniques, such
as "log in and try again")
What failure semantics is specified for software servers?
• If service state is persistent (e.g. ATM), servers are
typically required to implement omission (atomic
transaction, at-most-once) failure semantics.
• If service state is not persistent (e.g., network topology
management, virtual circuit management, low level I/O
controller), then crash failure semantics is sufficient.
• To implement atomic transaction or crash failure
semantics, the operations implemented by servers are
assumed to be functionally correct; e.g., a deposit of $100
must not credit the customer's account with $1000.
How are software server failures masked?
• Functional redundancy (e.g., N-version
programming, recovery blocks)
• Use of software server groups (e.g., IBM XRF,
Tandem)
The use of software server groups raises a
number of issues that are not well understood.
– How do clients address service requests to server
groups?
– What group-to-group communication protocols are
needed?
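A sketch of functional redundancy in the recovery-block style (helper names are illustrative): alternates are independently written implementations tried in order, and an acceptance test decides whether a result may be released; N-version programming would instead run all versions and vote.

    def recovery_block(alternates, acceptance_test, *args):
        # alternates: independently developed implementations of the same
        # operation, tried in order; acceptance_test validates each result.
        for alt in alternates:
            try:
                result = alt(*args)
            except Exception:
                continue                   # this alternate failed; try the next
            if acceptance_test(result, *args):
                return result
        raise RuntimeError("all alternates rejected by the acceptance test")

    # Example: two independently written "sort" versions.
    def sort_v1(xs):
        return sorted(xs)

    def sort_v2(xs):
        out = []
        for x in xs:
            i = 0
            while i < len(out) and out[i] < x:
                i += 1
            out.insert(i, x)
        return out

    def accept(result, xs):
        return len(result) == len(xs) and all(a <= b for a, b in zip(result, result[1:]))

    print(recovery_block([sort_v1, sort_v2], accept, [3, 1, 2]))   # [1, 2, 3]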
The Tandem Process-Pair
Communication Protocol
Goal:
To achieve at-most-once
semantics in the presence of
process and communication
performance failures.
1. C sends request to S; S
assigns unique serial number.
2. Client-server session number
0 replicated in (C, C'), (S, S').
3. Current serial number SN1
replicated in (S, S').
Current message counter records
the Id of the transaction request.
Since many requests from different
clients may be processed
simultaneously, each request has a
unique Id
Normal Protocol Execution (1)
S: if SN(message) = my session message counter
   then do
     1. Increment current serial number to SN2
     2. Log locally the fact that SN1 was returned for request 0
     3. Increment session message counter
     4. Checkpoint (session message counter, log, new state) to S'
   else
     Return result saved for request 0
Normal Protocol Execution (2)
S' updates the session message
counter, records that SN1
was returned for request 0,
adopts the new current serial
number and sends an ack.
S, after receiving the ack that S' is
now in synch, sends the
response for request 0 to C.
Normal Protocol Execution (3)
C updates its state with SN1,
then checkpoints (session
counter, new state) to C'
C' updates its session counter
and state, then sends an ack
to C.
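A condensed sketch of the server side of this protocol (names and helpers are illustrative): the session message counter identifies retransmitted requests, and the saved result is returned without re-executing the operation, giving at-most-once semantics.

    class PrimaryServer:
        def __init__(self, checkpoint_to_backup):
            self.session_counter = 0       # number of the next new request
            self.serial_number = 1         # SN1, SN2, ... assigned to replies
            self.saved_results = {}        # request number -> saved reply
            self.checkpoint = checkpoint_to_backup   # ships state to S'

        def handle(self, request_no, do_operation):
            if request_no != self.session_counter:
                # Retransmission of an already processed request:
                # return the saved reply, do not re-execute (at-most-once).
                return self.saved_results[request_no]
            result, new_state = do_operation()
            self.saved_results[request_no] = (self.serial_number, result)
            self.serial_number += 1
            self.session_counter += 1
            # Bring S' in synch before the reply is sent to the client.
            self.checkpoint(self.session_counter, self.saved_results, new_state)
            return self.saved_results[request_no]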
Example 1:
S crashes before checkpointing to S'
• C resends requests to (new)
primary S
• S starts new backup S',
initializes its state to (0, SN1)
and interprets requests as
before
Example 2:
S' crashes before sending ack to S
S creates a new backup S',
initializes it and resends the
checkpoint message to S'
Example 3:
C crashes after sending request to S
C' becomes primary, starts a
new backup and resends the
request to S
S performs the check on the
session number:
if SN(message) = my session
message counter
then ...
else return the result saved for
request 0
Issues raised by the use of software server groups:
• State synchronization:
How should group members maintain consistency of
their local states (including time) despite member failures
and joins and communication failures?
• Group membership:
How should group members achieve agreement on
which members are correctly working?
• Service availability:
How is it automatically ensured that the required
number of members is maintained for each server group
despite operating system, server and communication
failures?
Example of
Software Group Masking
If all working processors agree on group state,
then the service S can be made available in spite
of two concurrent processor failures
• a fails: b and c agree that
a failed; b becomes
primary, c back-up.
• b fails: c agrees (trivially)
that b failed, c becomes
primary.
• a and b fail: c agrees that
a and b failed, c becomes
primary.
Problem of Disagreement
on Service State
Disagreement on state can cause unavailability when failure occurs:
If a fails then S becomes
unavailable despite the fact
that enough hardware for
running it exists.
Problem of Disagreement on Time
Disagreement on time may prevent the detection of a performance failure
If clocks synchronized within 10
milliseconds are used to detect
excessive message delay, then
clocks out of synch (by 10
minutes) can lead to a late
system response
Message m arriving 10 minutes
late will still cause B to think
that m took only 30 milliseconds
for the trip, which could be within
network performance bounds.
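The arithmetic of this example as a sketch: B computes the trip time of m from the sender's timestamp, so a clock skew of 10 minutes can hide a 10-minute delay.

    SKEW = 10 * 60                # B's clock runs 10 minutes behind A's (seconds)
    TRUE_DELAY = 10 * 60 + 0.030  # m actually took 10 minutes and 30 ms

    send_timestamp_on_A = 0.0
    arrival_on_B_clock = send_timestamp_on_A + TRUE_DELAY - SKEW

    apparent_delay = arrival_on_B_clock - send_timestamp_on_A
    print(apparent_delay)          # about 0.030 s: looks like a 30 ms trip
    assert apparent_delay < 0.100  # so B detects no performance failure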
Problem of Disagreement on
Group Membership
Disagreement on membership
can create confusion:
Out-of-date membership
information causes unavailability
Server Group Synchronization Strategy
• Close state synchronization
– Each server interprets each request
– The result is sent by voting if members have arbitrary failure semantics;
otherwise it can be sent by all members or by the highest-ranking member
• Loose state synchronization
– Members are ranked: primary, first back-up, second back-up, ...
– The primary maintains the current service state and sends results
– Back-ups log requests (maybe results also) and periodically purge the log
– Applicable only when members have performance failure semantics
One solution is to use atomic broadcast with tight clock
synchronization.
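A sketch of the ranked primary/back-up arrangement used in loose state synchronization (structure and names are illustrative): the highest-ranking member still listed in the agreed membership acts as primary, while the back-ups log requests and purge the log once the primary has replied.

    class LooseGroup:
        def __init__(self, ranked_members):
            self.ranked = ranked_members         # e.g. ["primary", "backup1", "backup2"]
            self.working = set(ranked_members)   # the agreed-upon membership
            self.log = []                        # requests logged by the back-ups

        def primary(self):
            # Highest-ranking member the membership still lists as working.
            for m in self.ranked:
                if m in self.working:
                    return m
            return None

        def handle(self, request, execute):
            self.log.append(request)             # back-ups log the request
            result = execute(self.primary(), request)   # only the primary interprets it
            self.log.clear()                     # purge the log (stand-in for periodic purging)
            return result

        def member_failed(self, m):
            self.working.discard(m)              # the next rank takes over as primary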
Requirements for Atomic Broadcast
and Membership
Safety properties, e.g.,
• All group members agree on the group membership
• All group members agree on what messages were
broadcast and the order in which they were sent
Timeliness properties, e.g.,
• There is a bound on the time required to complete an
atomic broadcast
• There are bounds on the time needed to detect server
failures and server reactivation
Service Availability Issues
What server group availability strategy would
ensure the required availability objective?
– Prescribes how many members a server group
should have.
What mechanism should be used to
automatically enforce a given availability
strategy?
1. Direct mutual surveillance among group members
2. General service availability manager
Q&A