Replicated Distributed Systems
By Eric C. Cooper

Overview
- Introduction and Background (Queenie)
- A Model of Replicated Distributed Programs
- Implementing Distributed Modules and Threads
- Implementing Replicated Procedure Calls (Alix)
- Performance Analysis
- Concurrency Control (Joanne)
- Binding Agents
- Troupe Reconfiguration

Background
- Presents a new software architecture for fault-tolerant distributed programs
- Designed by Eric C. Cooper
  - A co-founder of FORE Systems, a leading supplier of networks for enterprises and service providers

Introduction
- Goal: address the problem of constructing highly available distributed programs
  - Tolerate crashes of the underlying hardware automatically
  - Continue to operate despite failure of components
- First approach: replicate each component of the system
  - Proposed by von Neumann (1955)
  - Drawback: costly; it amounts to using reliable hardware everywhere

Introduction (contd)
- Eric C. Cooper's new approach: replication on a per-module basis
  - Flexible and does not burden the programmer
  - Provides location and replication transparency to the programmer
- Fundamental mechanisms
  - Troupes: replicated modules (troupe members are the replicas)
  - Replicated procedure call: many-to-many communication between troupes

Introduction (contd)
- Important properties that give this mechanism its flexibility and power:
  1. Individual members of a troupe do not communicate among themselves
  2. Troupe members are unaware of one another's existence
  3. Each troupe member behaves as if it were not replicated

A Model of Replicated Distributed Programs (contd)
[Figure: a replicated distributed program is built from troupes; each troupe replicates a module, which packages procedures and state information]

A Model of Replicated Distributed Programs (contd)
- Module
  - Packages the procedures and state information needed to implement a particular abstraction
  - Separates the interface to that abstraction from its implementation
  - Expresses the static structure of a program when it is written

A Model of Replicated Distributed Programs (contd)
- Threads
  - Each thread has a unique identifier (thread ID)
  - A particular thread runs in exactly one module at a given time
  - Multiple threads may run in the same module concurrently

Implementing Distributed Modules and Threads
- Machine boundaries are hidden
- Provides location transparency: the programmer does not need to know the eventual configuration of a program
- A module is implemented by a server whose address space contains the module's procedures and data
- A thread is implemented by using remote procedure calls to transfer control from server to server

Adding Replication
- Processor and network failures cause partial failures of the distributed program
- Solution: replication
  - Introduce replication transparency at the module level

Adding Replication (contd)
- Assumption: troupe members execute on fail-stop processors
  - Otherwise, complex agreement protocols would be required
- Replication transparency in the troupe model is guaranteed by requiring all troupes to be deterministic (same input, same output)

Troupe Consistency
- A troupe is consistent when all its members are in the same state
- A consistent troupe's clients do not need to know that it is replicated => replication transparency

Troupe Consistency (contd)
[Figure: execution of a remote procedure call (I): each client troupe member sends "Call P" to each server troupe member holding procedure P]

Troupe Consistency (contd)
[Figure: execution of a remote procedure call (II): each server troupe member runs P and returns to each client troupe member]

Execution of a Procedure Call
- Viewed as a tree of procedure invocations
- The invocation trees rooted at each troupe member are identical
  - The server troupe members make the same procedure calls and returns, with the same arguments and results
- All troupes are initially consistent, so all troupes remain consistent (illustrated by the sketch below)
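The consistency argument hinges on determinism: if every troupe member starts in the same state and sees the same calls with the same arguments in the same order, every member ends in the same state. The following is a minimal sketch of that argument only; the TroupeMember class and the deposit procedure are invented for illustration and are not code from the paper.

```python
# Minimal sketch (not Cooper's code): deterministic replicas stay consistent
# when they all see the same calls, with the same arguments, in the same order.

class TroupeMember:
    """One replica of a module: deterministic procedures plus private state."""

    def __init__(self):
        self.state = 0  # example state information

    def deposit(self, amount):
        # Deterministic: the result depends only on the current state and arguments.
        self.state += amount
        return self.state


def replicated_call(troupe, procedure, *args):
    """Apply the same call to every troupe member; clients see a single result."""
    results = [getattr(member, procedure)(*args) for member in troupe]
    assert len(set(results)) == 1, "deterministic members must agree"
    return results[0]


troupe = [TroupeMember() for _ in range(3)]   # three replicas, initially consistent
replicated_call(troupe, "deposit", 10)
replicated_call(troupe, "deposit", 5)
assert all(m.state == 15 for m in troupe)     # ...and they remain consistent
```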
Replicated Procedure Calls
- Goal: allow distributed programs to be written in the same way as conventional programs for centralized computers
- A replicated procedure call is a remote procedure call with exactly-once execution at all troupe members

Circus Paired Message Protocol
- Characteristics:
  - Paired messages (e.g. call and return)
  - Reliably delivered
  - Variable length
  - Call sequence numbers
- Based on the needs of RPC
- Uses UDP, the DARPA User Datagram Protocol
  - Connectionless, but with retransmission

Implementing Replicated Procedure Calls
- Implemented on top of the paired message layer
- Two sub-algorithms make up the many-to-many call:
  - One-to-many
  - Many-to-one
- Implemented as part of the run-time system that is linked with each user's program

One-to-many Calls
- The client half of RPC performs a one-to-many call
- Its purpose is to guarantee that the procedure is executed at each server troupe member
  - The same call message, with the same call number, is sent to each server troupe member
- The client then waits for return messages
  - In Circus, it waits for all the return messages before proceeding

Synchronization Point
- After all the server troupe members have returned:
  - Each client troupe member knows that all server troupe members have performed the procedure
  - Each server troupe member knows that all client troupe members have received the result

Many-to-one Calls
- A server will receive call messages from each client troupe member
- The server executes the procedure only once
- It returns the results to all the client troupe members
- Two problems:
  - Distinguishing between unrelated call messages
  - Knowing how many other call messages are expected
- Circus waits for all clients to send a call message before proceeding

Many-to-many Calls
- A replicated procedure call from a client troupe to a server troupe is called a many-to-many call (a sketch of the full message flow appears at the end of this part)
- Many-to-many steps:
  1. A call message is sent from each client troupe member to each server troupe member.
  2. A call message is received by each server troupe member from each client troupe member.
  3. The requested procedure is run on each server troupe member.
  4. A return message is sent from each server troupe member to each client troupe member.
  5. A return message is received by each client troupe member from each server troupe member.

Multicast Implementation
- Dramatic difference in efficiency
- Suppose m client troupe members and n server troupe members
  - Point-to-point: m*n messages sent
  - Multicast: m + n messages sent

Waiting for Messages to Arrive
- Troupes are assumed to be deterministic, therefore all messages in a set are assumed to be identical
- When should computation proceed?
  - As soon as the first message arrives, or only after the entire set arrives?
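The in-memory sketch below composes the two sub-algorithms into a many-to-many call. The class and function names are invented, and real UDP, retransmission, and multicasting are not modelled: every client member sends the same call message, tagged with the same call sequence number, to every server member; each server member executes the procedure exactly once after hearing from every client member; and the client side waits for the whole set of return messages, which is the default "wait for all" behaviour discussed next.

```python
# In-memory sketch of a Circus-style many-to-many call (assumed names, no real UDP).
import itertools

CALL_SEQ = itertools.count(1)   # call sequence numbers distinguish unrelated calls

class ServerMember:
    def __init__(self, procedure, n_clients):
        self.procedure = procedure
        self.n_clients = n_clients
        self.calls_seen = {}    # call_seq -> client ids heard from
        self.results = {}       # call_seq -> result (exactly-once execution)

    def receive_call(self, call_seq, client_id, args):
        # Many-to-one half: collect a call message from each client troupe member,
        # run the procedure only once, then hold the result for all clients.
        self.calls_seen.setdefault(call_seq, set()).add(client_id)
        if len(self.calls_seen[call_seq]) == self.n_clients and call_seq not in self.results:
            self.results[call_seq] = self.procedure(*args)

    def return_message(self, call_seq):
        return self.results[call_seq]

def many_to_many_call(client_ids, server_troupe, args):
    call_seq = next(CALL_SEQ)
    # Steps 1-2: every client member sends the same call message to every server member.
    for client_id in client_ids:
        for server in server_troupe:
            server.receive_call(call_seq, client_id, args)
    # Step 3 runs inside receive_call once a server has heard from all client members.
    # Steps 4-5: the client side collects a return message from every server member
    # (the one-to-many half waits for the whole set before proceeding).
    returns = [server.return_message(call_seq) for server in server_troupe]
    assert len(set(returns)) == 1        # deterministic servers agree
    return returns[0]

servers = [ServerMember(lambda x, y: x + y, n_clients=2) for _ in range(3)]
print(many_to_many_call(client_ids=[0, 1], server_troupe=servers, args=(2, 3)))  # 5
```

With 2 client members and 3 server members the sketch delivers 2*3 = 6 call messages point-to-point; with multicast the same call would need only 2 + 3 = 5 messages.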
Waiting for All Messages
- Able to provide error detection and error correction
  - Inconsistencies are detected
- Execution time is determined by the slowest member of each troupe
- This is the default in the Circus system

First-come Approach
- Forfeits error detection
- Computation proceeds as soon as the first message in each set arrives
- Execution time is determined by the fastest member of each troupe
- Requires a simple change to the one-to-many call protocol
  - The client can use the call sequence number to discard return messages from slow server troupe members

First-come Approach (contd)
- More complicated changes are required in the many-to-one call protocol
- When a call message from another member arrives, the server cannot execute the procedure again
  - Doing so would violate exactly-once execution
- The server must retain the return message until all other call messages have been received from the client troupe members
  - The retained return message is sent as soon as such a call message is received
  - Execution seems instantaneous to that client member

A Better First-come Approach
- Buffer messages at the client rather than at the server
- The server broadcasts return messages to the entire client troupe after the first call message
- A client troupe member may receive a return message before sending its call message
  - The return message is retained until the client troupe member is ready to send the call message

Advantages of Buffering at the Client
- The work of buffering return messages and pairing them with call messages is placed on the client rather than on a shared server
- The server can broadcast rather than use point-to-point communication
- No communication is required from a slow client

What About Error Detection?
- To provide error detection and still allow computation to proceed, a watchdog scheme can be used
  - Create another thread of control after the first message is received
  - This thread watches for the remaining messages and compares them
  - If there is an inconsistency, the main computation is aborted

Crashes and Partitions
- The underlying message protocol uses probing and timeouts to detect crashes
- It relies on network connectivity and therefore cannot distinguish between crashes and network partitions
- To prevent troupe members from diverging, require that each troupe member receives a majority of the expected set of messages

Collators
- The determinism requirement can be relaxed by allowing programmers to reduce a set of messages into a single message
- A collator maps a set of messages into a single result
- A collator needs enough messages to make a decision
- Three kinds: unanimous, majority, first-come (sketched after the Performance Analysis slides below)

Performance Analysis
- Experiments were conducted at Berkeley during an inter-semester break
- Measured the cost of replicated procedure calls as a function of the degree of replication
- UDP and TCP echo tests were used as a comparison

Performance Analysis (contd)
- Performance of UDP, TCP and Circus
  - The TCP echo test is faster than the UDP echo test
    - The cost of TCP connection establishment is ignored
    - The UDP test makes two alarm calls and therefore two settimer calls
    - The read and write interface to TCP is more streamlined

Performance Analysis (contd)
- An unreplicated Circus remote procedure call requires almost twice the time of a simple UDP exchange
  - Due to the extra system calls required to handle Circus
  - Elaborate code to handle multi-homed machines
    - Some Berkeley machines had as many as 4 network addresses
    - A design oversight at Berkeley, not a fundamental problem

Performance Analysis (contd)
- The expense of a replicated procedure call increases linearly with the degree of replication
- Each additional troupe member adds between 10 and 20 milliseconds
  - Smaller than the time for a UDP datagram exchange
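The slides above name three kinds of collator: unanimous, majority, and first-come. Below is a minimal sketch of each, under the assumption that messages are simple comparable values; the function signatures are invented for illustration and are not the Circus API.

```python
# Minimal collator sketches (assumed signatures): each maps a set of messages,
# one per troupe member, to a single result for the caller.
from collections import Counter

def unanimous(messages):
    """Succeed only if every troupe member sent the same message."""
    if len(set(messages)) != 1:
        raise ValueError("troupe members disagree")
    return messages[0]

def majority(messages):
    """Return the value reported by more than half of the troupe members."""
    value, count = Counter(messages).most_common(1)[0]
    if count <= len(messages) // 2:
        raise ValueError("no majority value")
    return value

def first_come(messages):
    """Return the first message to arrive; forfeits error detection.
    (List order stands in for arrival order in this sketch.)"""
    return messages[0]

print(unanimous([42, 42, 42]))   # 42
print(majority([42, 42, 7]))     # 42
print(first_come([7, 42, 42]))   # 7
```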
Performance Analysis (contd)
- An execution profiling tool was used to analyze the Circus implementation in finer detail
- Six Berkeley 4.2BSD system calls account for more than half of the total CPU time to perform a replicated call
- Most of the time required for a Circus replicated procedure call is spent in the simulation of multicasting

Concurrency Control
- A server troupe handles calls from different clients using multiple threads
- Conflicts arise when concurrent calls need to access the same resource

Concurrency Control (contd)
- Serialization at each troupe member: local concurrency control algorithms
- Serialization in the same order among members: preserves troupe consistency
- Coordination is needed between the replicated procedure call mechanism and the synchronization mechanism => replicated transactions

Replicated Transactions
- Requirements
  - Serializability
  - Atomicity: ensure that aborting a transaction does not affect other concurrently executing transactions
- Two-phase locking with unanimous update
  - Drawback: too strict => Troupe Commit Protocol

Troupe Commit Protocol
- Before a server troupe member commits (or aborts) a transaction, it invokes the ready_to_commit remote procedure call at the client troupe (a call-back); see the sketch after the examples below
- The client troupe returns whether it agrees to commit (or abort) the transaction
- If server troupe members serialize transactions in different orders, a deadlock will occur
  - Detecting conflicting transactions is thus converted to deadlock detection

An Example of the Troupe Commit Protocol
- Two server troupe members, SM1 and SM2
- Two client troupes, C1 and C2
- C1 performs transaction T1 and C2 performs transaction T2

An Example of the Troupe Commit Protocol
- Scenario 1: T1 and T2 are serialized in the same order, say T1 first and T2 second, on both SM1 and SM2
[Figure: Scenario 1: (1) SM1 and SM2 call ready_to_commit(true) for T1 at C1, (2) C1 returns true, (3) SM1 and SM2 commit T1, then (4)-(6) the same exchange is repeated with C2 for T2]

An Example of the Troupe Commit Protocol
- Scenario 1 (contd)
  1. SM1 and SM2 call ready_to_commit at C1, passing true as the argument
  2. C1 returns true to both SM1 and SM2
  3. SM1 and SM2 commit T1
  4. SM1 and SM2 commit T2 by repeating steps (1)-(3)

An Example of the Troupe Commit Protocol
- Scenario 2: T1 and T2 are serialized in different orders, say SM1 wants to commit T1 first and SM2 wants to commit T2 first
  - If both transactions are committed, SM1 and SM2 will be inconsistent

An Example of the Troupe Commit Protocol
- Scenario 2 (contd)
[Figure: Scenario 2: SM1 calls ready_to_commit(true) at C1 while SM2 calls ready_to_commit(true) at C2; neither call-back returns]

An Example of the Troupe Commit Protocol
- Scenario 2 (contd)
  - SM1 calls ready_to_commit at C1 and SM2 calls ready_to_commit at C2
  - C1 will not return any value because it is waiting for the call-back from SM2; the same thing happens at C2
  - Without a return value from C2, SM2 cannot commit T2 or proceed with T1; neither can SM1. DEADLOCK!
  - => Different serialization orders are detected

An Example of the Troupe Commit Protocol
- Scenario 3: T1 and T2 are serialized in different orders, but committing them will NOT leave SM1 and SM2 in inconsistent states
  - SM1 and SM2 make the two ready_to_commit calls at C1 and C2 in parallel
  - Both server troupe members commit T1 and T2 in parallel after C1 and C2 return true

An Example of the Troupe Commit Protocol
- Scenario 3 (contd)
[Figure: Scenario 3: (1) SM1 and SM2 call ready_to_commit(true) at C1 and C2 in parallel, (2) both clients return true, (3) SM1 and SM2 commit T1 and T2]
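The sketch below shows only the shape of the ready_to_commit call-back in the straightforward case (scenario 1): a server troupe member commits a transaction only after every client troupe member agrees. The classes and method names are invented for illustration; in Circus these are replicated procedure calls between troupes, and the deadlock of scenario 2 arises because the call-backs block, which this sequential sketch does not model.

```python
# Sketch of the ready_to_commit call-back (scenario 1); invented classes,
# not the Circus implementation.

class ClientMember:
    def ready_to_commit(self, txn):
        # The client troupe answers whether it agrees to commit (or abort) txn.
        return True

class ServerMember:
    def __init__(self):
        self.committed = []

    def try_commit(self, txn, client_troupe):
        # Before committing, call back to every client troupe member.
        if all(member.ready_to_commit(txn) for member in client_troupe):
            self.committed.append(txn)      # commit only on unanimous agreement
            return True
        return False                        # otherwise abort

c1 = [ClientMember(), ClientMember()]       # client troupe C1
sm1, sm2 = ServerMember(), ServerMember()   # server troupe members
for server in (sm1, sm2):
    server.try_commit("T1", c1)             # both members commit T1 in the same order
assert sm1.committed == sm2.committed == ["T1"]
```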
Binding
- Binding agents for distributed programs
  - Modules and instances
  - Bind addresses with a server
- Binding agents for replicated programs
  - Troupes
  - Identify a troupe with an ID
  - Bind a set of addresses with the set of server troupe members

Binding for Replicated Programs
- Cache invalidation problem: a client's binding information becomes stale
- Causes:
  1. A server troupe member, or an entire troupe, is no longer available
  2. The specified interface is no longer exported
  3. A new troupe member is added

Binding for Replicated Programs
- Cache invalidation detection:
  1. The paired message protocol can detect missing troupe members
  2. The remote procedure call level can detect a non-exported interface
  3. Added troupe members CANNOT be detected by clients alone => help from binding agents is needed

Binding for Replicated Programs
- How is a new troupe member added? (see the binding-agent sketch after the Troupe Reconfiguration slides)
  - Assume the new member is already in the same state as the other members
  - The new member calls the add_troupe_member procedure at the binding agent
  - The binding agent invokes the set_troupe_id procedure at each troupe member

Binding for Replicated Programs
- Result of adding a new troupe member
  - The updated troupe contains the new member
  - The troupe ID is changed
  - Clients detect this update by finding that the original server troupe ID is no longer valid

Binding for Replicated Programs
- Cache invalidation recovery
  - Clients call rebind at the binding agent
  - Clients update their binding information
  - The binding agent may garbage-collect unavailable servers hinted at by this call

Troupe Reconfiguration
- Recovery from partial failure: replace a broken troupe member with a new one
- Similar to the problem of adding a new troupe member

Troupe Reconfiguration
- Steps:
  1. Atomic transaction: bring the new member into a state consistent with the other members (get_state procedure call)
  2. Add the new member to the binding agent (add_troupe_member, then set_troupe_id)
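Below is a minimal sketch of a binding agent, under the assumption that a troupe is identified by an ID bound to a set of member addresses. The data structures and the export procedure are invented; rebind and add_troupe_member follow the procedure names used in the slides, while set_troupe_id, which the agent would invoke at each member, is not modelled.

```python
# Sketch of a binding agent (invented data structures; rebind and
# add_troupe_member named after the slides).
import itertools

class BindingAgent:
    def __init__(self):
        self.next_id = itertools.count(1)
        self.troupes = {}        # troupe_id -> (interface, set of member addresses)

    def export(self, interface, addresses):
        troupe_id = next(self.next_id)
        self.troupes[troupe_id] = (interface, set(addresses))
        return troupe_id

    def rebind(self, interface):
        # Cache-invalidation recovery: a client asks for fresh binding information.
        for troupe_id, (name, addresses) in self.troupes.items():
            if name == interface:
                return troupe_id, addresses
        raise LookupError("interface no longer exported")

    def add_troupe_member(self, old_id, new_address):
        # Adding a member changes the troupe ID, so stale client caches are
        # detected when the old ID stops being valid; the agent would then
        # invoke set_troupe_id at each member (not modelled here).
        interface, addresses = self.troupes.pop(old_id)
        return self.export(interface, addresses | {new_address})

agent = BindingAgent()
tid = agent.export("mail_server", {"host-a", "host-b"})
tid = agent.add_troupe_member(tid, "host-c")
print(agent.rebind("mail_server"))   # new troupe ID plus all three addresses
```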