
Replicated Distributed Systems
By Eric C. Cooper
Overview
- Introduction and Background (Queenie)
- A Model of Replicated Distributed Programs
- Implementing Distributed Modules and Threads
- Implementing Replicated Procedure Calls (Alix)
- Performance Analysis
- Concurrency Control (Joanne)
- Binding Agents
- Troupe Reconfiguration
Background
- Presents a new software architecture for fault-tolerant distributed programs
- Designed by Eric C. Cooper
  - A co-founder of FORE Systems, a leading supplier of networks for enterprises and service providers
Introduction
- Goal: address the problem of constructing highly available distributed programs
  - Tolerate crashes of the underlying hardware automatically
  - The program continues to operate despite failures of its components
- First approach: replicate each component of the system
  - Proposed by von Neumann (1955)
  - Drawback: costly - reliable hardware is used everywhere
Introduction (contd)
- Eric C. Cooper's new approach: replication on a per-module basis
  - Flexible and does not burden the programmer
  - Provides location and replication transparency to the programmer
- Fundamental mechanisms
  - Troupe - a replicated module
  - Troupe members - the replicas
  - Replicated procedure call - many-to-many communication between troupes
Introduction (contd)
- Important properties give this mechanism its flexibility and power:
  1. Individual members of a troupe do not communicate among themselves
  2. Troupe members are unaware of one another's existence
  3. Each troupe member behaves as if there were no replicas
A Model of Replicated Distributed Programs (contd)
- A model of a replicated distributed program:
[Diagram: a replicated distributed program is composed of troupes; each troupe replicates a module, which packages procedures and state information.]
A Model of Replicated Distributed Programs (contd)
- Module
  - Packages the procedures and state information needed to implement a particular abstraction
  - Separates the interface to that abstraction from its implementation
  - Expresses the static structure of a program when it is written

A Model of Replicated Distributed Programs (contd)
- Threads
  - A thread ID is a unique identifier
  - A particular thread runs in exactly one module at a given time
  - Multiple threads may be running in the same module concurrently
Implementing Distributed Modules and Threads
- No machine boundaries
  - Provides location transparency - the programmer doesn't need to know the eventual configuration of a program
- Module
  - Implemented by a server whose address space contains the module's procedures and data
- Thread
  - Implemented by using remote procedure calls to transfer control from server to server
Adding Replication
- Processors and networks of a distributed program can fail
  - Partial failures
- Solution: replication
- Introduce replication transparency at the module level
Adding Replication (contd)
- Assumption: troupe members execute on fail-stop processors
  - If not => complex agreement protocols would be required
- Replication transparency in the troupe model is guaranteed by:
  - All troupes being deterministic (same input → same output)
Troupe Consistency
- When all its members are in the same state => a troupe is consistent
- => Its clients don't need to know that it is replicated
  - Replication transparency
Troupe Consistency (contd)
- Execution of a remote procedure call (I)
[Diagram: clients call procedure P; each server troupe member receives the call and executes P.]
Troupe Consistency (contd)
- Execution of a remote procedure call (II)
[Diagram: a second configuration of the same client/server call to procedure P.]
Execution of a Procedure Call
- Viewed as a tree of procedure invocations
- The invocation trees rooted at each troupe member are identical
- The server troupe members make the same procedure calls and returns, with the same arguments and results
- If all troupes are initially consistent
- => All troupes remain consistent
Replicated Procedure Calls
- Goal: allow distributed programs to be written in the same way as conventional programs for centralized computers
- A replicated procedure call is a remote procedure call between troupes
- Exactly-once execution at all troupe members
Circus Paired Message Protocol
- Characteristics (see the message sketch below):
  - Paired messages (e.g. call and return)
  - Reliably delivered
  - Variable length
  - Call sequence numbers
  - Based on RPC
  - Uses UDP, the DARPA User Datagram Protocol
    - Connectionless, but with retransmission
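
As a rough illustration of the paired-message idea, here is a sketch of the kind of fields such a message might carry; the field names and layout are our own, not Circus's actual wire format.

from dataclasses import dataclass

@dataclass
class PairedMessage:
    kind: str        # "call" or "return" - messages travel in call/return pairs
    call_seq: int    # call sequence number; pairs a return with its call
    sender: str      # address of the sending troupe member (illustrative field)
    body: bytes      # variable-length arguments or results

In this sketch, a call would be retransmitted over UDP until its paired return arrives, matching the "connectionless, but with retransmission" characteristic above.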
Implementing Replicated Procedure Calls
- Implemented on top of the paired message layer
- Two subalgorithms in the many-to-many call
  - One-to-many
  - Many-to-one
- Implemented as part of the run-time system that is linked with each user's program
One-to-many calls
- The client half of the RPC performs a one-to-many call
- Its purpose is to guarantee that the procedure is executed at each server troupe member
- The same call message, with the same call number, is sent to each member
- The client waits for return messages
  - In Circus, it waits for all the return messages before proceeding (see the sketch below)
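
A minimal sketch of this one-to-many half, assuming a simple send(addr, msg) / recv() transport supplied by the caller rather than Circus's actual interface:

def one_to_many_call(server_addrs, call_seq, args, send, recv):
    # Send the same call message, with the same call number, to every
    # server troupe member.
    for addr in server_addrs:
        send(addr, {"kind": "call", "call_seq": call_seq, "args": args})
    # Circus default: wait for the whole set of return messages before proceeding.
    returns = {}
    while len(returns) < len(server_addrs):
        msg = recv()
        if msg.get("kind") == "return" and msg.get("call_seq") == call_seq:
            returns[msg["from"]] = msg["result"]
    return returns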
Synchronization Point
- After all the server troupe members have returned:
- Each client troupe member knows that all server troupe members have performed the procedure
- Each server troupe member knows that all client troupe members have received the result
Many-to-one calls
- The server will receive call messages from each client troupe member
- The server executes the procedure only once
- It returns the results to all the client troupe members
- Two problems:
  - Distinguishing between unrelated call messages
  - How many other call messages are expected?
- Circus waits for all clients to send a call message before proceeding (see the sketch below)
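
A matching sketch of the many-to-one half, seen from the server, under the same illustrative transport assumptions as the one-to-many sketch above:

def many_to_one_serve(client_addrs, call_seq, procedure, send, recv):
    # Wait for the related call message from each client troupe member
    # (the call sequence number distinguishes unrelated calls).
    calls = {}
    while len(calls) < len(client_addrs):
        msg = recv()
        if msg.get("kind") == "call" and msg.get("call_seq") == call_seq:
            calls[msg["from"]] = msg["args"]
    # Execute the requested procedure exactly once ...
    result = procedure(next(iter(calls.values())))
    # ... and return the result to every client troupe member.
    for addr in client_addrs:
        send(addr, {"kind": "return", "call_seq": call_seq, "result": result})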
Many-to-many calls
- A replicated procedure call is called a many-to-many call from a client troupe to a server troupe
Many-to-many steps
1. A call message is sent from each client troupe member to each server troupe member.
2. A call message is received by each server troupe member from each client troupe member.
3. The requested procedure is run on each server troupe member.
4. A return message is sent from each server troupe member to each client troupe member.
5. A return message is received by each client troupe member from each server troupe member.
Multicast Implementation
- Dramatic difference in efficiency
- Suppose m client troupe members and n server troupe members
  - Point-to-point: mn messages sent
  - Multicast: m + n messages sent
  - For example, with m = n = 3, that is 9 messages versus 6
Waiting for messages to arrive
- Troupes are assumed to be deterministic, therefore all messages in a set are assumed to be identical
- When should computation proceed?
  - As soon as the first message arrives, or only after the entire set arrives?
Waiting for all messages
- Able to provide error detection and error correction
  - Inconsistencies are detected
- Execution time is determined by the slowest member of each troupe
- The default in the Circus system
First-come approach
- Forfeits error detection
- Computation proceeds as soon as the first message in each set arrives
- Execution time is determined by the fastest member of each troupe
- Requires a simple change to the one-to-many call protocol (see the sketch below)
  - The client can use the call sequence number to discard return messages from slow server troupe members
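
A first-come variant of the earlier one-to-many sketch: proceed on the first return and rely on the call sequence number to discard late returns (again an illustration, not the Circus code):

def one_to_many_call_first_come(server_addrs, call_seq, args, send, recv):
    for addr in server_addrs:
        send(addr, {"kind": "call", "call_seq": call_seq, "args": args})
    # Proceed as soon as the first return for this call number arrives;
    # later returns carrying the same call_seq are simply discarded.
    while True:
        msg = recv()
        if msg.get("kind") == "return" and msg.get("call_seq") == call_seq:
            return msg["result"]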
First-come approach
- More complicated changes are required in the many-to-one call protocol
- When a call message from another member arrives, the server cannot execute the procedure again
  - That would violate exactly-once execution
  - The server must retain the return messages until all other call messages have been received from the client troupe members
  - A return message is sent as each call message is received
  - Execution seems instantaneous to the client
A better first-come approach
- Buffer messages at the client rather than at the server
- The server broadcasts return messages to the entire client troupe after the first call message
- A client troupe member may therefore receive a return message before sending its call message
- The return message is retained until the client troupe member is ready to send the call message (see the sketch below)
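
A sketch of the client-side buffering this implies: a return message that arrives before the member has issued its own call is simply held until that call is made (the names and structure here are assumptions, not Circus's interface):

class ReturnBuffer:
    def __init__(self):
        self.early_returns = {}                 # call_seq -> buffered return message

    def on_return(self, msg):
        # A broadcast return may arrive before this member has sent its call.
        self.early_returns[msg["call_seq"]] = msg

    def pending_result(self, call_seq):
        # When the member is ready to make the call, check the buffer first;
        # if the return is already here, the call is answered immediately.
        msg = self.early_returns.pop(call_seq, None)
        return None if msg is None else msg["result"]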
Advantages of buffering at the client
- The work of buffering return messages and pairing them with call messages is placed on the client rather than on a shared server
- The server can broadcast rather than use point-to-point communication
- No communication is required by a slow client
What about error detection?
- To provide error detection and still allow computation to proceed, a watchdog scheme can be used (sketched below)
- Another thread of control is created after the first message is received
- This thread watches for the remaining messages and compares them with the first
- If there is an inconsistency, the main computation is aborted
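
A minimal sketch of such a watchdog, assuming the first-come client above; the abort parameter stands in for whatever mechanism the caller uses to cancel the main computation:

import threading

def start_watchdog(expected, first_result, call_seq, recv, abort):
    def watch():
        seen = 1                                  # the first message has already arrived
        while seen < expected:
            msg = recv()
            if msg.get("call_seq") != call_seq:
                continue                          # unrelated message
            seen += 1
            if msg["result"] != first_result:     # troupe members disagree
                abort()                           # abort the main computation
                return
    t = threading.Thread(target=watch, daemon=True)
    t.start()
    return t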
Crashes and Partitions
- The underlying message protocol uses probing and timeouts to detect crashes
- It relies on network connectivity and therefore cannot distinguish between crashes and network partitions
- To prevent troupe members from diverging:
  - Require that each troupe member receives a majority of the expected set of messages
Collators
- The determinism requirement can be relaxed by allowing programmers to reduce a set of messages into a single message
- A collator maps a set of messages into a single result
- A collator needs enough messages to make a decision
- Three kinds (sketched below):
  - Unanimous
  - Majority
  - First-come
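
Illustrative collators of each kind; the function names and signatures are ours, not part of the Circus interface. Each one reports None until it has enough messages to decide.

from collections import Counter

def unanimous(values, expected):
    # Decide only when the whole set has arrived and all members agree.
    if len(values) < expected:
        return None
    if len(set(values)) != 1:
        raise ValueError("troupe members disagree")
    return values[0]

def majority(values, expected):
    # Decide as soon as some value has been reported by a majority.
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > expected // 2 else None

def first_come(values, expected):
    # Decide on the first arrival, forfeiting error detection.
    return values[0] if values else None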
Performance Analysis
- Experiments conducted at Berkeley during an inter-semester break
- Measured the cost of replicated procedure calls as a function of the degree of replication
- UDP and TCP echo tests used as a comparison
Performance Analysis
- Performance of UDP, TCP and Circus
  - TCP echo test faster than UDP echo test
    - Cost of TCP connection establishment is ignored
    - The UDP test makes two alarm calls and therefore two setitimer calls
    - The read and write interface to TCP is more streamlined
Performance Analysis
- An unreplicated Circus remote procedure call requires almost twice the time of a simple UDP exchange
  - Due to the extra system calls required to handle Circus
  - Elaborate code to handle multi-homed machines
    - Some Berkeley machines had as many as 4 network addresses
    - A design oversight by Berkeley, not a fundamental problem
Performance Analysis
- The expense of a replicated procedure call increases linearly as the degree of replication increases
- Each additional troupe member adds between 10 and 20 milliseconds
  - Smaller than the time for a UDP datagram exchange
Performance Analysis
- An execution profiling tool was used to analyze the Circus implementation in finer detail
  - 6 Berkeley 4.2BSD system calls account for more than half the total CPU time to perform a replicated call
  - Most of the time required for a Circus replicated procedure call is spent in the simulation of multicasting
Concurrency Control
- A server troupe handles calls from different clients using multiple threads
- Conflicts arise when concurrent calls need to access the same resource
Concurrency Control
- Serialization at each troupe member
  - Local concurrency control algorithms
- Serialization in the same order among members
  - Preserves troupe consistency
  - Needs coordination between the replicated procedure call mechanism and the synchronization mechanism
  - => Replicated Transactions
Replicated Transactions
- Requirements
  - Serializability
  - Atomicity
    - Ensures that aborting a transaction does not affect other concurrently executing transactions
- Two-phase locking with unanimous update
  - Drawback: too strict
- Troupe Commit Protocol
Troupe Commit Protocol
- Before a server troupe member commits (or aborts) a transaction, it invokes the ready_to_commit remote procedure call at the client troupe - a call-back (sketched below)
- The client troupe returns whether it agrees to commit (or abort) the transaction
- If server troupe members serialize transactions in different orders, a deadlock will occur
  - Detecting conflicting transactions is thereby converted to deadlock detection
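
A minimal sketch of the call-back from a server troupe member's point of view; client_troupe is a list of illustrative stubs exposing ready_to_commit, and transaction stands in for whatever transaction object the run-time system provides (only the procedure name comes from the slides):

def try_commit(transaction, client_troupe):
    # Call back to every client troupe member before committing (or aborting).
    agreed = all(member.ready_to_commit(transaction.id) for member in client_troupe)
    if agreed:
        transaction.commit()
    else:
        transaction.abort()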
An example of Troupe Commit Protocol
- Two server troupe members SM1 and SM2
- Two client troupes C1 and C2
- C1 performs transaction T1 and C2 performs transaction T2
An example of Troupe Commit Protocol
- Scenario 1: T1 and T2 are serialized in the same order, say T1 first and T2 second, on SM1 and SM2
[Diagram: (1) SM1 and SM2 call ready_to_commit(true) at C1, (2) C1 returns true, (3) both commit T1; (4) they call ready_to_commit(true) at C2, (5) C2 returns true, (6) both commit T2.]
An example of Troupe Commit Protocol
- Scenario 1 (contd)
  1. SM1 and SM2 call ready_to_commit first at C1, passing true as the argument
  2. C1 returns true to both SM1 and SM2
  3. SM1 and SM2 commit T1
  4. SM1 and SM2 commit T2 by repeating steps (1)-(3)
An example of Troupe Commit Protocol
- Scenario 2: T1 and T2 are serialized in different orders, say SM1 wants to commit T1 and SM2 wants to commit T2. If both transactions are committed, SM1 and SM2 will be inconsistent
An example of Troupe Commit Protocol
- Scenario 2 (contd)
[Diagram: SM1 calls ready_to_commit(true) at C1 while SM2 calls ready_to_commit(true) at C2; no returns follow.]
An example of Troupe Commit Protocol
- Scenario 2 (contd)
  - SM1 calls ready_to_commit at C1 and SM2 calls ready_to_commit at C2
  - C1 will not return any value because it is waiting for the call-back from SM2; the same thing happens to C2
  - Without a return value from C2, SM2 cannot commit T2 or proceed with T1; neither can SM1
  - DEADLOCK! => The different serialization orders are detected
An example of Troupe Commit Protocol
- Scenario 3: T1 and T2 are serialized in different orders; however, committing them will NOT leave SM1 and SM2 in inconsistent states
  - SM1 and SM2 call the two ready_to_commit call-backs at C1 and C2 in parallel
  - Both server troupe members commit T1 and T2 in parallel after C1 and C2 return true
An example of Troupe Commit Protocol
- Scenario 3 (contd)
[Diagram: (1) SM1 and SM2 call ready_to_commit(true) at C1 and C2 in parallel, (2) both clients return true, (3) both members commit T1 and T2.]
Binding
- Binding agents for distributed programs deal with modules and instances: they bind addresses with a server
- Binding agents for replicated programs deal with troupes: they identify a troupe with an ID and bind a set of addresses with a set of server troupe members
Binding for Replicated Programs
- Cache Invalidation Problem
  - A client's binding information becomes stale
  - Causes:
    1. A server troupe member or an entire troupe is no longer available
    2. The specified interface is no longer exported
    3. A new troupe member is added
Binding for Replicated Programs
- Cache Invalidation Detection
  1. The paired message protocol can detect missing troupe members
  2. The remote procedure call level can detect a non-exported interface
  3. Added troupe members CANNOT be detected by clients alone => need help from binding agents
Binding for Replicated Programs
- How is a new troupe member added? (see the sketch below)
  - Assume the new member is already in the same state as the other members
  - The new member calls the add_troupe_member procedure at the binding agent
  - The binding agent invokes the set_troupe_id procedure at each troupe member
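
A sketch of that sequence, with an illustrative binding-agent data structure and member stubs exposing set_troupe_id; only the procedure names add_troupe_member and set_troupe_id come from the slides:

class BindingAgent:
    def __init__(self):
        self.troupes = {}                         # troupe_id -> list of member stubs
        self.next_id = 0

    def add_troupe_member(self, old_troupe_id, new_member):
        members = self.troupes.pop(old_troupe_id) + [new_member]
        self.next_id += 1
        new_troupe_id = self.next_id              # the updated troupe gets a fresh ID
        for member in members:
            member.set_troupe_id(new_troupe_id)   # invoked at each troupe member
        self.troupes[new_troupe_id] = members
        return new_troupe_id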
Binding for Replicated Programs
- Result of adding a new troupe member
  - The updated troupe contains the new member
  - The troupe ID is changed
- Clients will detect this update by finding that the original server troupe ID is no longer valid
Binding for Replicated Programs
- Cache Invalidation Recovery
  - Clients call rebind at the binding agent
    - Clients update their binding information
    - The binding agent may garbage-collect unavailable servers hinted at by this call
Troupe Reconfiguration
- Recovery from partial failure
- Replace a broken troupe member with a new one
- Similar to the problem of adding a new troupe member
Troupe Reconfiguration Steps
- Performed as an atomic transaction (see the sketch below):
  1. Bring the new member into a state consistent with the other members
     - get_state procedure call
  2. Add the new member at the binding agent
     - add_troupe_member
     - set_troupe_id
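
Putting the two steps together, a sketch of replacing a broken member using the illustrative binding agent above; get_state follows the slides, while set_state is an assumed counterpart for installing that state on the new member:

def replace_member(binding_agent, troupe_id, surviving_member, new_member):
    # In Cooper's design both steps run inside one atomic transaction.
    # Step 1: bring the new member into a state consistent with the others
    # (set_state is our illustrative counterpart of get_state).
    new_member.set_state(surviving_member.get_state())
    # Step 2: register the new member with the binding agent, which also
    # reassigns the troupe ID via set_troupe_id at every member.
    return binding_agent.add_troupe_member(troupe_id, new_member)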