Fully Distributed Approach

Chapter 18: Distributed Coordination
XE33OSA
Chapter 18 Distributed Coordination
 Event Ordering
 Mutual Exclusion
 Atomicity
 Concurrency Control
 Deadlock Handling
 Election Algorithms
 Reaching Agreement
Silberschatz, Galvin and Gagne ©2005
Chapter Objectives
 To describe various methods for achieving mutual exclusion in a
distributed system
 To explain how atomic transactions can be implemented in a
distributed system
 To show how some of the concurrency-control schemes discussed
in Chapter 6 can be modified for use in a distributed environment
 To present schemes for handling deadlock prevention, deadlock
avoidance, and deadlock detection in a distributed system
Event Ordering
 Happened-before relation (denoted by →)

If A and B are events in the same process, and A was executed
before B, then A → B

If A is the event of sending a message by one process and B is the
event of receiving that message by another process, then A → B

If A → B and B → C, then A → C
Relative Time for Three Concurrent Processes
Implementation of →

Associate a timestamp with each system event

Require that for every pair of events A and B, if A → B, then the
timestamp of A is less than the timestamp of B

Within each process Pi, a logical clock LCi is associated

The logical clock can be implemented as a simple counter that is
incremented between any two successive events executed within a
process

The logical clock is monotonically increasing

A process advances its logical clock when it receives a message whose
timestamp is greater than the current value of its logical clock

If the timestamps of two events A and B are the same, then the events are
concurrent

We may use the process identity numbers to break ties and to create a
total ordering
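
The logical-clock rules above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the class name LogicalClock and its methods tick and receive are hypothetical, and timestamps are modelled as (counter, process id) pairs so that the process identity number breaks ties.

```python
class LogicalClock:
    """Logical clock LCi for one process (hypothetical helper, for illustration)."""

    def __init__(self, pid):
        self.pid = pid    # process identity number, used only to break ties
        self.time = 0     # the counter LCi

    def tick(self):
        """Count a local event or a send; return its timestamp."""
        self.time += 1
        return (self.time, self.pid)

    def receive(self, msg_time):
        """Advance past the sender's timestamp if needed, then count the receive."""
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.pid)


p1, p2 = LogicalClock(1), LogicalClock(2)
a = p1.tick()                # local event at P1
send = p1.tick()             # P1 sends a message carrying send[0]
b = p2.tick()                # concurrent local event at P2
recv = p2.receive(send[0])   # P2 receives the message
```

Comparing the pairs as tuples gives the total ordering: a and b share the counter value 1, and the smaller process number orders a first.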
Distributed Mutual Exclusion (DME)
 Assumptions

The system consists of n processes; each process Pi resides at a
different processor

Each process has a critical section that requires mutual exclusion
 Requirement

If Pi is executing in its critical section, then no other process Pj is
executing in its critical section
 We present two algorithms to ensure the mutually exclusive execution of
processes in their critical sections
DME: Centralized Approach
 One of the processes in the system is chosen to coordinate the entry
to the critical section
 A process that wants to enter its critical section sends a request
message to the coordinator
 The coordinator decides which process can enter the critical section
next, and it sends that process a reply message
 When the process receives a reply message from the coordinator, it
enters its critical section
 After exiting its critical section, the process sends a release message
to the coordinator and proceeds with its execution
 This scheme requires three messages per critical-section entry:

request

reply

release
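
A sketch of the coordinator side of the centralized approach, assuming requests are served in arrival order (the slides leave the scheduling policy open). The class Coordinator and its message counter are hypothetical names used only for illustration.

```python
from collections import deque


class Coordinator:
    """Grants the critical section to one process at a time (illustrative)."""

    def __init__(self):
        self.holder = None       # process currently in its critical section
        self.waiting = deque()   # queued requests, first come first served (assumed)
        self.messages = 0        # counts request, reply and release messages

    def request(self, pid):
        """Handle a request message; return the pid that is sent a reply, if any."""
        self.messages += 1
        if self.holder is None:
            return self._reply(pid)
        self.waiting.append(pid)
        return None

    def release(self, pid):
        """Handle a release message; return the next pid that is sent a reply, if any."""
        if pid != self.holder:
            raise ValueError("release from a process that does not hold the section")
        self.messages += 1
        self.holder = None
        if self.waiting:
            return self._reply(self.waiting.popleft())
        return None

    def _reply(self, pid):
        self.messages += 1
        self.holder = pid
        return pid
```

Two complete entries cost six messages in this model, matching the three messages per entry stated above.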
DME: Fully Distributed Approach
 When process Pi wants to enter its critical section, it generates a new
timestamp, TS, and sends the message request(Pi, TS) to all other
processes in the system
 When process Pj receives a request message, it may reply immediately
or it may defer sending a reply back
 When process Pi receives a reply message from all other processes in
the system, it can enter its critical section
 After exiting its critical section, the process sends reply messages to all
its deferred requests
DME: Fully Distributed Approach (Cont.)
 The decision whether process Pj replies immediately to a request(Pi, TS)
message or defers its reply is based on three factors:

If Pj is in its critical section, then it defers its reply to Pi

If Pj does not want to enter its critical section, then it sends a reply
immediately to Pi

If Pj wants to enter its critical section but has not yet entered it, then
it compares its own request timestamp with the timestamp TS

If its own request timestamp is greater than TS, then it sends a
reply immediately to Pi (Pi asked first)

Otherwise, the reply is deferred
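
The three factors can be written as a single decision function. This is a sketch under two assumptions that the slide does not state: the state names are hypothetical, and timestamps are (TS, process id) pairs so that equal TS values are ordered by process number.

```python
def should_reply_now(state, own_request, incoming):
    """Decide whether Pj answers request(Pi, TS) immediately.

    state       -- 'in_cs', 'idle' or 'wanting' (hypothetical state names)
    own_request -- Pj's own (TS, pid) pair, or None when Pj is idle
    incoming    -- the (TS, pid) pair carried by Pi's request
    """
    if state == "in_cs":
        return False                 # defer until Pj leaves its critical section
    if state == "idle":
        return True                  # Pj does not want the critical section
    return own_request > incoming    # reply only if Pi asked first


# P1 and P2 both request with TS = 5; the process number breaks the tie.
p1_replies_to_p2 = should_reply_now("wanting", (5, 1), (5, 2))
p2_replies_to_p1 = should_reply_now("wanting", (5, 2), (5, 1))
```

Here P2 replies and P1 defers, so P1 collects all of its replies first and enters its critical section before P2.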
Desirable Behavior of Fully Distributed Approach
 Freedom from deadlock is ensured
 Freedom from starvation is ensured, since entry to the critical section is
scheduled according to the timestamp ordering

The timestamp ordering ensures that processes are served in
first-come, first-served order
 The number of messages per critical-section entry is 2 × (n – 1)

This is the minimum number of required messages per critical-section
entry when processes act independently and concurrently
Three Undesirable Consequences
 The processes need to know the identity of all other processes in the
system, which makes the dynamic addition and removal of processes
more complex
 If one of the processes fails, then the entire scheme collapses

This can be dealt with by continuously monitoring the state of all the
processes in the system
 Processes that have not entered their critical section must pause
frequently to assure other processes that they intend to enter the critical
section

This protocol is therefore suited for small, stable sets of cooperating
processes
Token-Passing Approach
 Circulate a token among the processes in the system

The token is a special type of message

Possession of the token entitles the holder to enter its critical section
 Processes logically organized in a ring structure
 Algorithm similar to Chapter 6 algorithm 1 but token substituted for
shared variable
 Unidirectional ring guarantees freedom from starvation
 Two types of failures

Lost token – election must be called

Failed processes – new logical ring established
Atomicity
 Either all the operations associated with a program unit are executed to
completion, or none are performed
 Ensuring atomicity in a distributed system requires a transaction
coordinator, which is responsible for the following:

Starting the execution of the transaction

Breaking the transaction into a number of subtransactions, and
distributing these subtransactions to the appropriate sites for
execution

Coordinating the termination of the transaction, which may result in
the transaction being committed at all sites or aborted at all sites
Two-Phase Commit Protocol (2PC)
 Assumes fail-stop model
 Execution of the protocol is initiated by the coordinator after the last step
of the transaction has been reached
 When the protocol is initiated, the transaction may still be executing at
some of the local sites
 The protocol involves all the local sites at which the transaction
executed
 Example: Let T be a transaction initiated at site Si and let the
transaction coordinator at Si be Ci
Phase 1: Obtaining a Decision
 Ci adds <prepare T> record to the log
 Ci sends <prepare T> message to all sites
 When a site receives a <prepare T> message, the transaction manager
determines if it can commit the transaction

If no: add <no T> record to the log and respond to Ci with <abort T>

If yes:

add <ready T> record to the log

force all log records for T onto stable storage

transaction manager sends <ready T> message to Ci
Phase 1 (Cont.)
 Coordinator collects responses

All respond “ready”: decision is commit

At least one response is “abort”: decision is abort

At least one participant fails to respond within the time-out period:
decision is abort
Phase 2: Recording Decision in the Database
 Coordinator adds a decision record, <abort T> or <commit T>, to its log
and forces the record onto stable storage
 Once that record reaches stable storage it is irrevocable (even if failures
occur)
 Coordinator sends a message to each participant informing it of the
decision (commit or abort)
 Participants take appropriate action locally
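
The coordinator's side of both phases can be sketched as one function. This is a simplified, single-process model under stated assumptions: the function name two_phase_commit is hypothetical, a missing response before the time-out is represented by None, and the log is a plain list rather than stable storage.

```python
def two_phase_commit(votes, log):
    """Run the coordinator's decision logic for a transaction T.

    votes -- dict mapping site name to 'ready', 'abort', or None
             (None models a participant that did not respond in time)
    log   -- list standing in for the coordinator's log
    Returns the decision message sent to each participant.
    """
    # Phase 1: record the prepare step and collect the responses.
    log.append("<prepare T>")
    if all(vote == "ready" for vote in votes.values()):
        decision = "commit"
    else:
        decision = "abort"   # an explicit abort or a time-out

    # Phase 2: record the decision (it would be forced to stable storage
    # here), then inform every participant.
    log.append(f"<{decision} T>")
    return {site: decision for site in votes}
```

For example, two_phase_commit({"S1": "ready", "S2": None}, []) aborts at both sites because S2 timed out.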
Failure Handling in 2PC – Site Failure
 The log contains a <commit T> record

In this case, the site executes redo(T)
 The log contains an <abort T> record

In this case, the site executes undo(T)
 The log contains a <ready T> record; the site must consult Ci to learn the fate of T

If Ci is down, the site sends a query-status T message to the other sites
 The log contains no control records concerning T

In this case, the site executes undo(T)
Failure Handling in 2PC – Coordinator Ci Failure
 If an active site contains a <commit T> record in its log, then T must be
committed
 If an active site contains an <abort T> record in its log, then T must be
aborted
 If some active site does not contain the record <ready T> in its log, then
the failed coordinator Ci cannot have decided to commit T

Rather than wait for Ci to recover, it is preferable to abort T
 All active sites have a <ready T> record in their logs, but no additional
control records

In this case we must wait for the coordinator to recover

Blocking problem – T is blocked pending the recovery of site Si
Concurrency Control
 Modify the centralized concurrency schemes to accommodate the
distribution of transactions
 Transaction manager coordinates execution of transactions (or
subtransactions) that access data at local sites
 Local transaction only executes at that site
 Global transaction executes at several sites
Locking Protocols
 Can use the two-phase locking protocol in a distributed environment by
changing how the lock manager is implemented
 Nonreplicated scheme – each site maintains a local lock manager which
administers lock and unlock requests for those data items that are
stored in that site

Simple implementation involves two message transfers for handling
lock requests, and one message transfer for handling unlock
requests

Deadlock handling is more complex
Single-Coordinator Approach
 A single lock manager resides in a single chosen site; all lock and
unlock requests are made at that site
 Simple implementation
 Simple deadlock handling
 Possibility of bottleneck
 Vulnerable to loss of concurrency controller if single site fails
 Multiple-coordinator approach distributes lock-manager function over
several sites
Majority Protocol
 Avoids drawbacks of central control by dealing with replicated data in a
decentralized manner
 More complicated to implement
 Deadlock-handling algorithms must be modified; possible for deadlock
to occur in locking only one data item
Biased Protocol
 Similar to majority protocol, but requests for shared locks prioritized over
requests for exclusive locks
 Less overhead on read operations than in majority protocol; but has
additional overhead on writes
 Like majority protocol, deadlock handling is complex
Primary Copy
 One of the sites at which a replica resides is designated as the primary
site

Request to lock a data item is made at the primary site of that data
item
 Concurrency control for replicated data handled in a manner similar to
that of unreplicated data
 Simple implementation, but if primary site fails, the data item is
unavailable, even though other sites may have a replica
Timestamping
 Generate unique timestamps in distributed scheme:

Each site generates a unique local timestamp

The global unique timestamp is obtained by concatenation of the
unique local timestamp with the unique site identifier

Use a logical clock defined within each site to ensure the fair
generation of timestamps
 Timestamp-ordering scheme – combine the centralized concurrency
control timestamp scheme with the 2PC protocol to obtain a protocol
that ensures serializability with no cascading rollbacks
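
Unique global timestamps built by concatenation can be sketched as (local counter, site identifier) pairs. The class SiteTimestamper and its observe method are hypothetical; observe is one assumed way of using the logical clock so that a slow site does not keep issuing small timestamps.

```python
class SiteTimestamper:
    """Issues globally unique timestamps for one site (illustrative)."""

    def __init__(self, site_id):
        self.site_id = site_id   # unique site identifier
        self.counter = 0         # the site's logical clock

    def next_timestamp(self):
        """Return (local timestamp, site id); the site id is the low-order part."""
        self.counter += 1
        return (self.counter, self.site_id)

    def observe(self, timestamp):
        """Catch up when a timestamp from a faster site is seen."""
        local, _site = timestamp
        if local > self.counter:
            self.counter = local


s1, s2 = SiteTimestamper(1), SiteTimestamper(2)
fast = [s1.next_timestamp() for _ in range(3)]   # site 1 is busy
slow = s2.next_timestamp()                       # site 2 is behind
s2.observe(fast[-1])                             # site 2 sees site 1's latest timestamp
caught_up = s2.next_timestamp()
```

Putting the site identifier in the low-order position keeps the ordering driven by the counter, with the site number only separating equal counter values.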
Generation of Unique Timestamps
Deadlock Prevention
 Resource-ordering deadlock-prevention – define a global ordering
among the system resources

Assign a unique number to all system resources

A process may request a resource with unique number i only if it is
not holding a resource with a unique number greater than i

Simple to implement; requires little overhead
 Banker’s algorithm – designate one of the processes in the system as
the process that maintains the information necessary to carry out the
Banker’s algorithm

Also implemented easily, but may require too much overhead
Timestamped Deadlock-Prevention Scheme
 Each process Pi is assigned a unique priority number
 Priority numbers are used to decide whether a process Pi should wait for
a process Pj; otherwise Pi is rolled back
 The scheme prevents deadlocks

For every edge Pi → Pj in the wait-for graph, Pi has a higher priority
than Pj

Thus a cycle cannot exist
Wait-Die Scheme
 Based on a nonpreemptive technique
 If Pi requests a resource currently held by Pj, Pi is allowed to wait only
if it has a smaller timestamp than does Pj (Pi is older than Pj)

Otherwise, Pi is rolled back (dies)
 Example: Suppose that processes P1, P2, and P3 have timestamps 5,
10, and 15 respectively

If P1 requests a resource held by P2, then P1 will wait

If P3 requests a resource held by P2, then P3 will be rolled back
Wound-Wait Scheme
 Based on a preemptive technique; counterpart to the wait-die scheme
 If Pi requests a resource currently held by Pj, Pi is allowed to wait only if
it has a larger timestamp than does Pj (Pi is younger than Pj). Otherwise
Pj is rolled back (Pj is wounded by Pi)
 Example: Suppose that processes P1, P2, and P3 have timestamps 5,
10, and 15 respectively

If P1 requests a resource held by P2, then the resource will be
preempted from P2 and P2 will be rolled back

If P3 requests a resource held by P2, then P3 will wait
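
The wait-die and wound-wait rules differ only in which side of the comparison waits and which process is rolled back. The two functions below are a minimal sketch with hypothetical names and return strings; timestamps are assumed unique, and a smaller timestamp means an older process.

```python
def wait_die(requester_ts, holder_ts):
    """Nonpreemptive: an older requester waits, a younger one dies."""
    if requester_ts < holder_ts:
        return "requester waits"
    return "requester rolled back"


def wound_wait(requester_ts, holder_ts):
    """Preemptive: a younger requester waits, an older one wounds the holder."""
    if requester_ts > holder_ts:
        return "requester waits"
    return "holder rolled back"


# Timestamps from the examples: P1 = 5, P2 = 10, P3 = 15; P2 holds the resource.
ts = {"P1": 5, "P2": 10, "P3": 15}
outcomes = {
    ("wait-die", "P1"): wait_die(ts["P1"], ts["P2"]),
    ("wait-die", "P3"): wait_die(ts["P3"], ts["P2"]),
    ("wound-wait", "P1"): wound_wait(ts["P1"], ts["P2"]),
    ("wound-wait", "P3"): wound_wait(ts["P3"], ts["P2"]),
}
```

In both schemes the older process is never the one rolled back, which is why neither scheme can starve a process that keeps its original timestamp.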
Two Local Wait-For Graphs
Global Wait-For Graph
Deadlock Detection – Centralized Approach

Each site keeps a local wait-for graph

The nodes of the graph correspond to all the processes that are currently
either holding or requesting any of the resources local to that site

A global wait-for graph is maintained in a single coordination process; this
graph is the union of all local wait-for graphs

There are three different options (points in time) when the wait-for graph may
be constructed:
1. Whenever a new edge is inserted in or removed from one of the local wait-for graphs
2. Periodically, when a number of changes have occurred in a wait-for graph
3. Whenever the coordinator needs to invoke the cycle-detection algorithm

Unnecessary rollbacks may occur as a result of false cycles
Detection Algorithm Based on Option 3
 Append unique identifiers (timestamps) to requests from different sites
 When process Pi, at site A, requests a resource from process Pj, at site
B, a request message with timestamp TS is sent
 The edge Pi → Pj with the label TS is inserted in the local wait-for graph of A.
The edge is inserted in the local wait-for graph of B only if B has
received the request message and cannot immediately grant the
requested resource
The Algorithm
1. The coordinator sends an initiating message to each site in the system
2. On receiving this message, a site sends its local wait-for graph to the
coordinator
3. When the coordinator has received a reply from each site, it constructs a
graph as follows:
(a) The constructed graph contains a vertex for every process in the
system
(b) The graph has an edge Pi → Pj if and only if
(1) there is an edge Pi → Pj in one of the wait-for graphs, or
(2) an edge Pi → Pj with some label TS appears in more than
one wait-for graph
If the constructed graph contains a cycle, then the system is deadlocked
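
Step 3 and the cycle test can be sketched as follows. The representation is an assumption made for illustration: each local wait-for graph is a set of (source, destination, label) triples, with label None for an edge between processes at the same site and a timestamp TS for an edge created by a cross-site request. The function names are hypothetical.

```python
from collections import Counter


def build_global_graph(local_graphs):
    """Keep unlabeled edges, and labeled edges reported by more than one site."""
    counts = Counter(edge for graph in local_graphs for edge in graph)
    edges = set()
    for (src, dst, label), seen in counts.items():
        if label is None or seen > 1:
            edges.add((src, dst))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the constructed graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


site_a = {("P1", "P2", None), ("P2", "P3", 7)}
site_b = {("P2", "P3", 7), ("P3", "P1", None)}
deadlocked = has_cycle(build_global_graph([site_a, site_b]))
```

If site B had not yet recorded the request with label 7, the edge P2 → P3 would appear in only one graph and would be left out, so no cycle would be reported.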
Local and Global Wait-For Graphs
Fully Distributed Approach
 All controllers share equally the responsibility for detecting deadlock
 Every site constructs a wait-for graph that represents a part of the total
graph
 We add one additional node Pex to each local wait-for graph
 If a local wait-for graph contains a cycle that does not involve node Pex,
then the system is in a deadlock state
 A cycle involving Pex implies the possibility of a deadlock

To ascertain whether a deadlock does exist, a distributed deadlock-detection
algorithm must be invoked
Augmented Local Wait-For Graphs
Augmented Local Wait-For Graph in Site S2
Election Algorithms
 Determine where a new copy of the coordinator should be restarted
 Assume that a unique priority number is associated with each active
process in the system, and assume that the priority number of process
Pi is i
 Assume a one-to-one correspondence between processes and sites
 The coordinator is always the process with the largest priority number.
When a coordinator fails, the algorithm must elect the active process
with the largest priority number
 Two algorithms, the bully algorithm and a ring algorithm, can be used to
elect a new coordinator in case of failures
Bully Algorithm
 Applicable to systems where every process can send a message to
every other process in the system
 If process Pi sends a request that is not answered by the coordinator
within a time interval T, assume that the coordinator has failed; Pi tries
to elect itself as the new coordinator
 Pi sends an election message to every process with a higher priority
number; Pi then waits for any of these processes to answer within T
Bully Algorithm (Cont.)
 If no response within T, assume that all processes with numbers greater
than i have failed; Pi elects itself the new coordinator
 If answer is received, Pi begins time interval T´, waiting to receive a
message that a process with a higher priority number has been elected
 If no message is sent within T´, assume the process with a higher
number has failed; Pi should restart the algorithm
Bully Algorithm (Cont.)
 If Pi is not the coordinator, then, at any time during execution, Pi may
receive one of the following two messages from process Pj
 Pj is the new coordinator (j > i). Pi, in turn, records this information
 Pj started an election (j < i). Pi sends a response to Pj and begins
its own election algorithm, provided that Pi has not already initiated
such an election
 After a failed process recovers, it immediately begins execution of the
same algorithm
 If there are no active processes with higher numbers, the recovered
process forces all processes with lower numbers to let it become the
coordinator process, even if there is a currently active coordinator with a
lower number
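
The outcome of a bully election can be sketched with a small recursive function. This is a deliberately simplified model, assuming reliable messages and no failures while the election runs, so time-outs are replaced by a lookup in the set of active processes. The name bully_election is hypothetical.

```python
def bully_election(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election.

    alive -- set of priority numbers of the currently active processes
    """
    higher = sorted(p for p in alive if p > initiator)
    if not higher:
        # No higher-numbered process answers within T: Pi elects itself.
        return initiator
    # Every higher-numbered process that answers starts its own election;
    # following any one of them leads to the same result.
    return bully_election(higher[0], alive)


# The old coordinator P7 has failed; P2 notices and starts the election.
alive = {1, 2, 4, 5}
new_coordinator = bully_election(2, alive)
```

The recursion mirrors the rule that a process receiving an election message from a lower-numbered process answers it and then begins its own election.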
Ring Algorithm
 Applicable to systems organized as a ring (logically or physically)
 Assumes that the links are unidirectional, and that processes send
their messages to their right neighbors
 Each process maintains an active list, consisting of all the priority
numbers of all active processes in the system when the algorithm
ends
 If process Pi detects a coordinator failure, it creates a new active list
that is initially empty. It then sends a message elect(i) to its right
neighbor, and adds the number i to its active list
Ring Algorithm (Cont.)

If Pi receives a message elect(j) from the process on the left, it must respond in
one of three ways:
1. If this is the first elect message it has seen or sent, Pi creates a new active
list with the numbers i and j

It then sends the message elect(i), followed by the message elect(j)
2. If i ≠ j, then Pi adds j to its active list and forwards the message to its
right neighbor
3. If i = j, then Pi has received its own message elect(i)

The active list for Pi now contains the numbers of all the active processes
in the system

Pi can now determine the largest number in the active list to identify the
new coordinator process
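
The ring election can be simulated with a single message queue. This sketch assumes reliable, first-in first-out links, no failures during the election, and the standard handling of an elect(j) message with j different from i (add j to the active list and forward it). The function name ring_election and the data layout are hypothetical.

```python
from collections import deque


def ring_election(ring, initiator):
    """Simulate the election; return the coordinator each process computes.

    ring -- priority numbers of the active processes in ring order;
            the right neighbour of ring[k] is ring[k + 1], wrapping around
    """
    right = {p: ring[(k + 1) % len(ring)] for k, p in enumerate(ring)}
    active = {p: None for p in ring}    # None: no elect message seen or sent yet
    active[initiator] = [initiator]
    inbox = deque([(right[initiator], initiator)])   # (receiver i, j) for elect(j)
    coordinator = {}
    while inbox:
        i, j = inbox.popleft()
        if active[i] is None:           # first elect message seen by Pi
            active[i] = [i, j]
            inbox.append((right[i], i))
            inbox.append((right[i], j))
        elif i != j:                    # record j and pass the message on
            active[i].append(j)
            inbox.append((right[i], j))
        else:                           # Pi's own elect(i) has come back around
            coordinator[i] = max(active[i])
    return coordinator


result = ring_election([3, 1, 4, 2], initiator=1)
```

Every process ends up with the full active list and therefore picks the same coordinator, the largest number in the ring.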
Reaching Agreement
 There are applications where a set of processes wish to agree on a
common “value”
 Such agreement may not take place due to:

Faulty communication medium

Faulty processes

Processes may send garbled or incorrect messages to other
processes

A subset of the processes may collaborate with each other in an
attempt to defeat the scheme
Faulty Communications
 Process Pi at site A has sent a message to process Pj at site B; to
proceed, Pi needs to know whether Pj has received the message
 Detect failures using a time-out scheme
 When Pi sends out a message, it also specifies a time interval
during which it is willing to wait for an acknowledgment message
from Pj
 When Pj receives the message, it immediately sends an
acknowledgment to Pi
 If Pi receives the acknowledgment message within the specified time
interval, it concludes that Pj has received its message
 If a time-out occurs, Pi needs to retransmit its message and wait
for an acknowledgment

Continue until Pi either receives an acknowledgment, or is notified
by the system that B is down
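
The sender's retransmission loop can be sketched without real networking. Both parameters are assumptions made for illustration: deliver is a hypothetical callback that returns True when an acknowledgment arrives within the time interval, and max_attempts stands in for the system's notification that site B is down.

```python
def send_with_retransmit(deliver, message, max_attempts):
    """Retransmit until acknowledged; return the attempt number, or None.

    deliver      -- callable(message) -> bool, True if the acknowledgment
                    arrived before the time-out (hypothetical channel model)
    max_attempts -- attempts made before site B is treated as down (assumed)
    """
    for attempt in range(1, max_attempts + 1):
        if deliver(message):
            return attempt   # Pi concludes that Pj received the message
    return None              # give up: B is considered down
```

A return value of None only tells Pi that no acknowledgment arrived; Pi still cannot tell whether the message or the acknowledgment was lost.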
Faulty Communications (Cont.)
 Suppose that Pj also needs to know that Pi has received its
acknowledgment message, in order to decide on how to proceed

In the presence of failure, it is not possible to accomplish this task

It is not possible in a distributed environment for processes Pi and Pj
to agree completely on their respective states
Faulty Processes (Byzantine Generals Problem)
 Communication medium is reliable, but processes can fail in
unpredictable ways
 Consider a system of n processes, of which no more than m are
faulty

Suppose that each process Pi has some private value of Vi
 Devise an algorithm that allows each nonfaulty Pi to construct a
vector Xi = (Ai,1, Ai,2, …, Ai,n) such that:

If Pj is a nonfaulty process, then Ai,j = Vj

If Pi and Pj are both nonfaulty processes, then Xi = Xj
 Solutions share the following properties

A correct algorithm can be devised only if n ≥ 3 × m + 1

The worst-case delay for reaching agreement is proportional to
m + 1 message-passing delays
Faulty Processes (Cont.)
 An algorithm for the case where m = 1 and n = 4 requires two rounds
of information exchange:

Each process sends its private value to the other 3 processes

Each process sends the information it has obtained in the first
round to all other processes
 If a faulty process refuses to send messages, a nonfaulty process can
choose an arbitrary value and pretend that that value was sent by that
process
 After the two rounds are completed, a nonfaulty process Pi can
construct its vector Xi = (Ai,1, Ai,2, Ai,3, Ai,4) as follows:

Ai,i = Vi

For j ≠ i, if at least two of the three values reported for process Pj
agree, then the majority value is used to set the value of Ai,j

Otherwise, a default value (nil) is used
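
The two-round exchange for m = 1 and n = 4 can be simulated as below. This is a sketch under stated assumptions: processes are indexed 0 to 3, at most one process is faulty, the faulty process's behaviour is supplied by a hypothetical lie function, and nil is represented by None.

```python
from collections import Counter


def agree(values, faulty, lie):
    """Return the vector Xi built by each nonfaulty process Pi.

    values -- private values V0..V3
    faulty -- index of the faulty process, or None
    lie    -- lie(sender, receiver, about): value a faulty sender reports
    """
    procs = range(len(values))

    def round1(j, i):          # value Pj sends to Pi in the first round
        return lie(j, i, j) if j == faulty else values[j]

    def round2(k, i, j):       # what Pk tells Pi that Pj sent to Pk
        return lie(k, i, j) if k == faulty else round1(j, k)

    vectors = {}
    for i in procs:
        if i == faulty:
            continue
        x = []
        for j in procs:
            if j == i:
                x.append(values[i])          # Ai,i = Vi
                continue
            reports = [round1(j, i)]
            reports += [round2(k, i, j) for k in procs if k not in (i, j)]
            value, count = Counter(reports).most_common(1)[0]
            x.append(value if count >= 2 else None)
        vectors[i] = x
    return vectors


# P3 is faulty and tells every receiver a different story.
vectors = agree(["a", "b", "c", "d"], faulty=3, lie=lambda s, r, about: f"x{r}")
```

All three nonfaulty processes build the same vector: the true values for each other and nil for the faulty process, whose three reported values never agree.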
End of Chapter 18