Synchronization

Distributed Synchronization
Clock Synchronization
When each machine has its own clock, an event that
occurred after another event may nevertheless be
assigned an earlier time.
Physical Clocks
Clock Synchronization
The maximum resolution desired for global time keeping determines the maximum difference which can be tolerated between "synchronized" clocks
The time keeping of a clock, i.e. its tick rate dC/dt, should satisfy:
  1 − ρ ≤ dC/dt ≤ 1 + ρ
The worst possible divergence δ between two clocks after an interval Δt is thus:
  δ = 2ρΔt
so clocks must be resynchronized at least every δ/(2ρ) to stay within δ.
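As a quick worked example (illustrative numbers, not from the slides): a clock with maximum drift rate ρ = 10^-5 that must stay within δ = 1 ms of a peer needs to resynchronize at least every

  Δt = δ / (2ρ) = 10^-3 s / (2 × 10^-5) = 50 s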
Physical Clocks
Clock Synchronization
Cristian's Algorithm
– Periodically poll the machine with access to the reference time source
– Estimate round-trip delay with a time stamp
– Estimate interrupt processing time
• figure 3-6, page 129 Tanenbaum
– Take a series of measurements to estimate the time it takes for a
timestamp to make it from the reference machine to the synchronization
target
– This allows the synchronization to converge within δ with a certain
degree of confidence
Probabilistic algorithm and guarantee
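A minimal sketch of the polling step, assuming a hypothetical time server that answers the string "TIME?" with its clock packed as a network-order double (the wire format and port are illustrative, not part of the slides):

```python
import socket
import struct
import time

def cristian_sync(server_host: str, server_port: int) -> float:
    """One round of Cristian-style synchronization.

    Returns an estimate of the server's current time, corrected by
    half the measured round-trip delay.
    """
    t0 = time.time()                      # local send time
    with socket.create_connection((server_host, server_port), timeout=1.0) as s:
        s.sendall(b"TIME?")
        data = s.recv(64)                 # server replies with its clock as a double
    t1 = time.time()                      # local receive time
    server_time = struct.unpack("!d", data[:8])[0]
    rtt = t1 - t0
    # Assume the reply took roughly half the round trip to come back.
    return server_time + rtt / 2.0
```

Taking a series of such measurements and trusting the one with the smallest round trip is what lets the estimate converge within δ with some confidence.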
Physical Clocks
Clock Synchronization
Wide availability of hardware and software to keep
clocks synchronized within a few milliseconds
across the Internet is a recent development
– Network Time Protocol (NTP), discussed in papers by David Mills
– GPS receiver in the local network synchronizes other machines
– What if all machines have GPS receivers?
Increasing deployment of distributed system
algorithms depending on synchronized clocks
Supply and demand constantly in flux
Physical Clocks (1)
Computation of the mean solar day.
Physical Clocks (2)
TAI seconds are of constant length, unlike solar
seconds. Leap seconds are introduced when
necessary to keep in phase with the sun.
Clock Synchronization Algorithms
The relation between clock time and UTC when clocks tick at different rates.
Cristian's Algorithm
Getting the current time from a time server.
The Berkeley Algorithm
a) The time daemon asks all the other machines for their clock values
b) The machines answer
c) The time daemon tells everyone how to adjust their clock
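A minimal sketch of the time daemon's averaging step, assuming the clock values have already been collected and ignoring the propagation-delay correction and outlier rejection a real implementation would add:

```python
def berkeley_adjustments(daemon_time: float, reported_times: dict[str, float]) -> dict[str, float]:
    """Compute per-machine clock adjustments, Berkeley-style.

    daemon_time     -- the time daemon's own clock value
    reported_times  -- clock values reported by the other machines
    Returns the offset each machine (key 'daemon' for the daemon itself)
    should add to its clock.
    """
    all_times = [daemon_time] + list(reported_times.values())
    target = sum(all_times) / len(all_times)          # average of all clocks
    adjustments = {"daemon": target - daemon_time}
    for machine, t in reported_times.items():
        adjustments[machine] = target - t
    return adjustments

# Example (seconds past some epoch):
# berkeley_adjustments(3.000, {"A": 3.025, "B": 2.990})
# -> {'daemon': +0.005, 'A': -0.020, 'B': +0.015}
```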
Lamport Timestamps
a) Three processes, each with its own clock. The clocks run at different rates.
b) Lamport's algorithm corrects the clocks.
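A minimal sketch of the clock-correction rule Lamport's algorithm applies (the class and method names are illustrative):

```python
class LamportClock:
    """Lamport logical clock: increment on local events and sends,
    take max(local, received) + 1 on receives."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, msg_timestamp: int) -> int:
        """Message receive: never let the clock fall behind the sender's."""
        self.time = max(self.time, msg_timestamp) + 1
        return self.time
```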
Example: Totally-Ordered Multicasting
Updating a replicated database and leaving it in an inconsistent state.
Global State (1)
a) A consistent cut
b) An inconsistent cut
Global State (2)
a) Organization of a process and channels for a distributed snapshot
Global State (3)
b) Process Q receives a marker for the first time and records its local state
c) Q records all incoming messages
d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel
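A minimal event-handler sketch of the distributed-snapshot marker rules described above (the Chandy-Lamport algorithm), from the point of view of one process Q; the callback names and message representation are assumptions for illustration:

```python
class SnapshotProcess:
    """Marker handling for one process (Q) in a Chandy-Lamport style snapshot."""

    def __init__(self, incoming_channels, outgoing_channels, local_state_fn, send_fn):
        self.incoming = set(incoming_channels)
        self.outgoing = list(outgoing_channels)
        self.local_state_fn = local_state_fn    # returns the current local state
        self.send_fn = send_fn                  # send_fn(channel, message)
        self.recorded_state = None              # local state, once recorded
        self.channel_states = {}                # channel -> recorded in-flight messages
        self.recording = {}                     # channels still being recorded

    def on_marker(self, channel):
        if self.recorded_state is None:
            # First marker: record local state, record this channel as empty,
            # start recording every other incoming channel, forward markers.
            self.recorded_state = self.local_state_fn()
            self.channel_states[channel] = []
            self.recording = {c: [] for c in self.incoming if c != channel}
            for out in self.outgoing:
                self.send_fn(out, "MARKER")
        else:
            # Later marker: this channel's state is whatever was recorded so far.
            self.channel_states[channel] = self.recording.pop(channel, [])

    def on_message(self, channel, msg):
        # Ordinary messages on channels still being recorded belong to the
        # channel state of the snapshot.
        if self.recorded_state is not None and channel in self.recording:
            self.recording[channel].append(msg)

    def done(self):
        return self.recorded_state is not None and not self.recording
```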
The Bully Algorithm (1)
The bully election algorithm
• Process 4 holds an election
• Processes 5 and 6 respond, telling 4 to stop
• Now 5 and 6 each hold an election
Mutual Exclusion
Distributed components still need to coordinate their
actions, including but not limited to access to shared
data
– Mutual exclusion to some limited set of operations and data is thus required
Consider several approaches and compare and contrast
their advantages and disadvantages
Centralized Algorithm
– The single central process is essentially a monitor
– Central server becomes a semaphore server
– Three messages per use: request, grant, release
– Centralized performance constraint and point of failure
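A minimal sketch of the coordinator's request/grant/release logic, with messages modeled as method calls (the class shape is illustrative):

```python
from collections import deque

class Coordinator:
    """Centralized mutual exclusion: request / grant / release."""

    def __init__(self):
        self.holder = None          # process currently in the CS, if any
        self.waiting = deque()      # queued requesters, FIFO

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            return "GRANT"
        self.waiting.append(pid)    # no reply: the requester blocks
        return None

    def release(self, pid):
        assert pid == self.holder
        if self.waiting:
            self.holder = self.waiting.popleft()
            return ("GRANT", self.holder)   # grant goes to the next waiter
        self.holder = None
        return None
```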
Mutual Exclusion
Distributed Algorithm Factors
Functional Requirements
1) Freedom from deadlock
2) Freedom from starvation
3) Fairness
4) Fault tolerance
Performance Evaluation
– Number of messages
– Latency
– Throughput of the semaphore system
Synchronization is always overhead and must be
accounted for as a cost
Mutual Exclusion
Distributed Algorithm Factors
Performance should be evaluated under a variety of
loads
– Cover a reasonable range of operating conditions
We care about several types of performance
– Best case
– Worst case
– Average case
Different aspects of performance are important for
different reasons and in different contexts
Mutual Exclusion
Lamport’s Algorithm
Every site keeps a request queue sorted by logical time
stamp
– Uses Lamport’s logical clocks to impose a total global order on events associated
with synchronization
Algorithm assumes ordered message delivery between
every pair of communicating sites
– Messages sent from site Si to site Sj in a particular order arrive at Sj in the same order
– Note: Since messages arriving at a given site come from many sources the
delivery order of all messages can easily differ from site to site
Lamport’s Algorithm
Request Resource r
1) Site Si sends REQUEST(tsi, i) to every site Sj ∈ Rr and
places the request on its local queue request_queuei
- Rr is the set of all sites using resource r
2) When site Sj receives REQUEST(tsi, i) from site Si, it returns
a timestamped REPLY message to site Si and places the request
on request_queuej
Thus, each site has a request queue containing
resource use requests and replies
Note that the requests and replies for any given pair of
sites must be in the same order in queues at both
Lamport’s Algorithm
Entering CS for Resource r
Site Si enters the CS protecting the resource when:
L1) Site Si has received a message from every other site with a
timestamp larger than (tsi, i)
L2) The request from site Si is at the head
of the queue request_queuei
– This ensures that no message from any site with a smaller timestamp could ever
arrive
Lamport’s Algorithm
Releasing the CS
The site holding the resource is releasing it; call that site Si
1) Site Si removes its request from the front of request_queuei
and sends a RELEASE(r, i) message to every Sj ∈ Rr
2) When site Sj receives a RELEASE(r, i) message, it removes
REQUEST(tsi, i) from request_queuej
– Note that the request for resource r had to be at the head of the request_queue at
the site holding the resource or it would never have entered the CS
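A compact per-site sketch of the REQUEST/REPLY/RELEASE handling and the L1/L2 entry test described above; the message dictionaries and the send callback are illustrative assumptions, and FIFO delivery between each pair of sites is taken for granted:

```python
import heapq

class LamportMutex:
    """Per-site state for Lamport's mutual exclusion (sketch).
    send(dst, msg) delivers a message dict to another site."""

    def __init__(self, site_id, all_sites, send):
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.send = send
        self.clock = 0
        self.queue = []                                  # heap of (timestamp, site_id)
        self.last_seen = {s: 0 for s in self.others}     # latest timestamp from each site
        self.my_request = None

    def _tick(self, ts=0):
        self.clock = max(self.clock, ts) + 1
        return self.clock

    def request_cs(self):
        ts = self._tick()
        self.my_request = (ts, self.id)
        heapq.heappush(self.queue, self.my_request)
        for s in self.others:
            self.send(s, {"type": "REQUEST", "ts": ts, "from": self.id})

    def on_message(self, msg):
        ts = self._tick(msg["ts"])
        self.last_seen[msg["from"]] = msg["ts"]
        if msg["type"] == "REQUEST":
            heapq.heappush(self.queue, (msg["ts"], msg["from"]))
            self.send(msg["from"], {"type": "REPLY", "ts": ts, "from": self.id})
        elif msg["type"] == "RELEASE":
            self.queue = [r for r in self.queue if r[1] != msg["from"]]
            heapq.heapify(self.queue)

    def can_enter(self):
        # L1: a message with a larger timestamp has arrived from every other site
        # L2: our own request is at the head of the queue (ties broken by site id)
        return (self.my_request is not None
                and all(self.last_seen[s] > self.my_request[0] for s in self.others)
                and self.queue and self.queue[0] == self.my_request)

    def release_cs(self):
        self.queue = [r for r in self.queue if r[1] != self.id]
        heapq.heapify(self.queue)
        self.my_request = None
        ts = self._tick()
        for s in self.others:
            self.send(s, {"type": "RELEASE", "ts": ts, "from": self.id})
```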
Lamport ME Example
(Message-sequence figure.) Pi issues request(i5) and Pj issues request(j10); each site queues both requests, with i5 ahead of j10 because of its smaller timestamp. Both sites exchange REPLY messages, and Pi, whose request i5 is at the head of both queues, enters the critical section. When Pi sends release(i5), the queues shrink to (j10) and Pj enters the critical section.
Lamport’s Algorithm
Comments
Performance: 3(N-1) messages per CS
invocation since each requires (N-1)
REQUEST, REPLY, and RELEASE messages
Observation: Some REPLY messages are not required
– If Si sends a REQUEST to Sj and then receives a REQUEST from Sj with a
timestamp smaller than its own REQUEST, Si need not send a REPLY to Sj,
because Sj already has enough information to make a decision (Si's own
REQUEST carries a larger timestamp)
– This reduces the messages to between 2(N-1) and 3(N-1)
As a distributed algorithm there is no single
point of failure but there is increased overhead
Ricart and Agrawala
Refine Lamport’s mutual exclusion by merging the
REPLY and RELEASE messages
– Assumption: total ordering of all events in the system implying the use of
Lamport’s logical clocks with tie breaking
Request CS operation:
1) Site Si requesting the CS creates a REQUEST(tsi, i) message and
sends it to all processes using the CS, including itself
– Messages are assumed to be reliably delivered in order
– Group communication support can play an obvious role
Ricart and Agrawala
Receive a CS Request
If the receiver is not currently in the CS and does not
have a pending request for it in its request_queue
– Send REPLY
If the receiver is already in the CS
– Queue the request, sending no reply
If the receiver desires the CS but has not entered
– Compare the TS of its request to that just received
– REPLY if received is newer
– Queue the request if pending request is newer
Ricart and Agrawala
Enter a CS
– A process enters the CS when it receives a REPLY from every member of the
group that can use the CS
Leave a CS
– When the process leaves the CS it sends a REPLY to the senders of all pending
messages on its queue
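A per-process sketch of the Ricart & Agrawala rules above (reply immediately, defer while in the CS or while holding an older pending request, answer deferred requests on exit); the message format and send callback are illustrative:

```python
class RicartAgrawala:
    """Per-process state for Ricart & Agrawala mutual exclusion (sketch).
    Timestamps are (clock, pid) pairs, so ties break on process id."""

    def __init__(self, pid, peers, send):
        self.id = pid
        self.peers = [p for p in peers if p != pid]
        self.send = send
        self.clock = 0
        self.state = "RELEASED"          # RELEASED, WANTED, or HELD
        self.my_ts = None
        self.replies = set()
        self.deferred = []               # requesters queued while HELD or older-WANTED

    def request_cs(self):
        self.clock += 1
        self.my_ts = (self.clock, self.id)
        self.state = "WANTED"
        self.replies = set()
        for p in self.peers:
            self.send(p, {"type": "REQUEST", "ts": self.my_ts, "from": self.id})

    def on_request(self, msg):
        self.clock = max(self.clock, msg["ts"][0]) + 1
        defer = (self.state == "HELD" or
                 (self.state == "WANTED" and self.my_ts < msg["ts"]))
        if defer:
            self.deferred.append(msg["from"])     # queue it, reply later
        else:
            self.send(msg["from"], {"type": "REPLY", "from": self.id})

    def on_reply(self, msg):
        self.replies.add(msg["from"])
        if self.state == "WANTED" and len(self.replies) == len(self.peers):
            self.state = "HELD"                   # enter the critical section

    def release_cs(self):
        self.state = "RELEASED"
        for p in self.deferred:                   # answer everyone we deferred
            self.send(p, {"type": "REPLY", "from": self.id})
        self.deferred = []
```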
Ricart and Agrawala
Example 1
(Message-sequence figure.) Processes I and K both request the CS: request(i8) and request(k12). I's timestamp is smaller, so J and K send OK to I and I enters the CS, while K's request is deferred at I. When I leaves the CS it sends its OK to K, and K, now holding OKs from both I and J, enters the CS.
Ricart and Agrawala
Example 2
(Message-sequence figure.) I, J, and K all request the CS: request(i7), request(j8), request(k9). I has the smallest timestamp, so J and K reply OK and queue their own requests; I enters the CS. When I leaves it sends OK to the deferred requests; J (timestamp 8) enters next while k9 stays queued, and when J leaves, K finally holds all OKs and enters the CS.
Ricart and Agrawala
Observations
The algorithm works because the global logical clock
ensures a global total ordering on events
– This ensures, in turn, that the decision about who enters the CS is unambiguous
Single point of failure is now N points of failure
– A crashed group member cannot be distinguished from a busy CS
– Distributed and “optimized” version is N times more vulnerable than the
centralized version!
– An explicit message denying entry helps reliability but converts the wait into a busy wait
Ricart and Agrawala
Observations
Either group communication support is used, or
each user of the CS must keep track of all other
potential users correctly
– Powerful motivation for standard group communication primitives
The argument against a centralized server was that having a
single process involved in each CS decision was bad
– Now we have N processes involved in each decision
Improvements: get a majority - Maekawa's
algorithm
Bottom Line: a distributed algorithm is possible
Token Passing Mutex
General structure
– One token per CS → token denotes permission to enter
– Only process with token allowed in CS
– Token passed from process to process → logical ring
Mutex
– Pass token to process i + 1 mod N
– Received token gives permission to enter CS
• hold token while in CS
– Must pass token after exiting CS
– Fairness ensured: each process waits at most N-1 entries to get CS
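A minimal sketch of a ring node's token handling (hold the token while in the CS, otherwise forward it to the successor); the callback names are illustrative:

```python
class TokenRingNode:
    """One node in a token-passing mutex ring (sketch)."""

    def __init__(self, node_id, n_nodes, pass_token):
        self.id = node_id
        self.next_id = (node_id + 1) % n_nodes   # successor on the logical ring
        self.pass_token = pass_token             # pass_token(next_id) hands over the token
        self.wants_cs = False
        self.in_cs = False

    def on_token(self):
        if self.wants_cs:
            self.in_cs = True                    # hold the token while in the CS
        else:
            self.pass_token(self.next_id)        # not interested: forward immediately

    def exit_cs(self):
        self.in_cs = False
        self.wants_cs = False
        self.pass_token(self.next_id)            # must pass the token after exiting
```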
Token Passing Mutex
Correctness is obvious
– No starvation since passing is in strict order
Difficulties with token passing mutex
– Idle case of no process entering CS pays overhead of constantly passing
the token
– Lost tokens: diagnosis and creating a new token
– Duplicate tokens: ensure generation of only one token
– Crashes: require a receipt to detect dead destinations
– Receipts double the message overhead
Design challenge: holding time for unneeded
token
– Too short → high overhead; too long → high CS latency
Mutex Comparison
Centralized
– Simplest and most efficient
– Centralized coordinator crashes create the need to detect crash and choose a new
coordinator
– M/use: 3; Entry Latency: 2
Distributed
– 3(N-1) messages per CS use (Lamport)
– 2(N-1) messages per CS use (Ricart & Agrawala)
– If any process crashes with a non-empty queue, algorithm won't work
– M/use: 2(N-1); Entry Latency: 2(N-1)
Mutex Comparison
Token Ring
– Ensures fairness
– Overhead is subtle → no longer linked to CS use
– M/use: 1 to ∞; Entry Latency: 0 to N-1
– This algorithm pays overhead when idle
– Need methods for re-generating a lost token
Design Principle: building fault handling into
algorithms for distributed systems is hard
– Crash recovery is subtle and introduces overhead in normal operation
Performance Metrics: M/use and Entry Latency
Election Algorithms
Centralized approaches often necessary
– Best choice in mutex, for example
– Need method of electing a new coordinator when it fails
General assumptions
– Give processes unique system/global numbers (e.g. PID)
– Elect process using a total ordering on the set
– All processes know the process numbers of members
– All processes agree on new coordinator
– All do not know if it is up or down → election algorithm is responsible
for determining this
Design challenge: network delay vs. crashed
peer
Bully Algorithm
Suppose the coordinator doesn't respond to P1's request
– P1 holds an election by sending an election message to all processes with higher
numbers
– If P1 receives no responses, P1 is the new coordinator
– If any higher numbered process responds, P1 ends its election
Process receives an election request
– Reply to the sender tells it that it has lost the election
– Holds an election of its own
– Eventually all but highest surviving process give up
Process recovering from a crash takes over if highest
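A deliberately compressed sketch of the bully election's outcome, assuming we can simply ask which higher-numbered processes are alive rather than exchanging real messages:

```python
def bully_election(initiator, alive, processes):
    """Synchronous sketch of the bully election.

    initiator  -- id of the process that noticed the coordinator failure
    alive      -- set of ids of processes that are up
    processes  -- all process ids (higher id = bigger bully)
    Returns the id of the new coordinator.
    """
    higher = [p for p in processes if p > initiator]
    responders = [p for p in higher if p in alive]
    if not responders:
        return initiator          # nobody bigger answered: initiator wins
    # Each responder holds its own election; the highest surviving process
    # wins and announces "I won" to everyone.
    return max(responders)

# bully_election(4, alive={0, 1, 2, 3, 4, 5, 6}, processes=range(8)) -> 6,
# matching the 0-7 example that follows.
```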
Bully Algorithm
Example: Processes 0-7; 4 detects that 7 has crashed
4 holds election and loses
5 holds election and loses
6 holds election and wins
Message overhead variable
Who starts an election matters
Solid lines say “Am I leader?”
Dotted lines say “you lose”
Hollow lines say “I won”
6 becomes the coordinator
When 7 recovers, it is a bully and sends "I won" to all
Ring Algorithm
Processes have a total order known by all
– Each process knows its successor → forming a ring
– Ring: mod N
– So the successor of Pi is P(i+1) mod N
– No token involved
Any process Pi noticing that the coordinator is
not responding
– Sends an election message to its successor P(i+1) mod N
– If successor is down, send to next member → timeout
– Receiving process adds its number to the message and passes it along
Ring Algorithm
When election message gets back to election initiator
– Change message to coordinator
– Circulate to all members
– Coordinator is highest process in the total order
– All processes know the order and thus all will agree no matter how the election
started
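A sketch of one full circulation of the ELECTION message, simulated as a loop rather than real messages; skipping a dead successor stands in for the timeout:

```python
def ring_election(initiator, alive, n):
    """Ring election sketch: the ELECTION message collects the ids of live
    members as it travels around the ring; the highest id becomes coordinator.

    initiator -- id of the process that noticed the dead coordinator
    alive     -- set of live process ids (must include the initiator)
    n         -- ring size; the successor of i is (i + 1) mod n
    """
    members = []
    current = initiator
    while True:
        members.append(current)            # add own id and pass the message along
        nxt = (current + 1) % n
        while nxt not in alive:            # successor down: skip it (timeout)
            nxt = (nxt + 1) % n
        if nxt == initiator:               # message is back at the initiator
            break
        current = nxt
    coordinator = max(members)             # highest process in the total order
    return coordinator, members            # COORDINATOR message circulates the list

# ring_election(2, alive={0, 1, 2, 3, 5}, n=8) -> (5, [2, 3, 5, 0, 1])
```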
Strength
– Only one coordinator chosen
Weakness
– Scalability: latency increases with N because the algorithm is sequential
Ring Algorithm
What if more than one process detects a crashed
coordinator?
– More than one election will be produced: message storm
– All messages will contain the same information: member process numbers and
order of members
– Same coordinator is chosen (highest number)
Refinement might include filtering duplicate messages
Some duplicates will happen
– Consider two elections chasing each other
– Eliminate one initiated by lower numbered process
– The duplicate persists until the lower-numbered election message reaches the initiator of the higher one
The Bully Algorithm (2)
d) Process 6 tells 5 to stop
e) Process 6 wins and tells everyone
A Ring Algorithm
Election algorithm using a ring.
Mutual Exclusion:
A Centralized Algorithm
a) Process 1 asks the coordinator for permission to enter a critical region.
Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The
coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which
then replies to 2.
A Distributed Algorithm
a) Two processes want to enter the same critical region at the same
moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter
the critical region.
A Token Ring Algorithm
a) An unordered group of processes on a network.
b) A logical ring constructed in software.
Comparison

Algorithm    | Messages per entry/exit | Delay before entry (in message times) | Problems
Centralized  | 3                       | 2                                     | Coordinator crash
Distributed  | 2(n–1)                  | 2(n–1)                                | Crash of any process
Token ring   | 1 to ∞                  | 0 to n–1                              | Lost token, process crash

A comparison of three mutual exclusion algorithms.
Deadlocks
Definition: Each process in a set is waiting for a
resource to be released by another process in the set
– The set is some subset of all processes
– Deadlock only involves the processes in the set
Remember the necessary conditions for DL
Remember that methods for handling DL are based on
preventing or detecting and fixing one or more
necessary conditions
Deadlocks
Necessary Conditions
Mutual exclusion
– Process has exclusive use of resource allocated to it
Hold and Wait
– Process can hold one resource while waiting for another
No Preemption
– Resources are released only by explicit action by controlling process
– Requests cannot be withdrawn (i.e. request results in eventual allocation
or deadlock)
Circular Wait
– Every process in the DL set is waiting for another process in the set,
forming a cycle in the SR graph
Deadlock Handling Strategies
No strategy
Prevention
– Make it structurally impossible to have a deadlock
Avoidance
– Allocate resources so deadlock can’t occur
Detection
– Let deadlock occur, detect it, recover from it
No Strategy
The “Ostrich Algorithm”
Assumes deadlock rarely occurs
– Becomes more probable with more processes
Catastrophic consequences when it does occur
– May need to re-boot all or some machines in system
Fairly common and works well when
– DL is rare and
– Other sources of instability are more common
How many reboots of Windows or MacOS are prompted by
DL?
Deadlock Prevention
Ordered resource allocation most common
example
– Consider the link with two-phase locking's grow and shrink phases
Works but requires global view of all resources
– A total order on resources must exist for the system
– Process code must allocate resources in order
Under-utilizes resources when the period of use of a
resource conflicts with the total resource order
– Consider processes Pi and Pk using resources R1 and R2
– Pi uses R1 for 90% of its execution time and R2 for 10%
– Pk uses R2 for 90% of its execution time and R1 for 10%
– One of them holds one resource far too long
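A small sketch of ordered resource allocation using threads and locks; the lock names and the global order are illustrative:

```python
import threading

# Hypothetical global total order on resources: A before B before C.
lock_order = {"A": 0, "B": 1, "C": 2}
locks = {name: threading.Lock() for name in lock_order}

def acquire_in_order(names):
    """Acquire a set of locks respecting the global order, which makes a
    circular wait (and hence deadlock) structurally impossible."""
    for name in sorted(names, key=lambda n: lock_order[n]):
        locks[name].acquire()

def release_all(names):
    for name in names:
        locks[name].release()

# Every process does the same, regardless of the order it "wants" them in:
# acquire_in_order({"C", "A"})    # always takes A before C
# ... critical work ...
# release_all({"C", "A"})
```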
Deadlock Avoidance
General method: Refuse allocations that may
lead to deadlock
– Method for keeping track of states
– Need to know resources required by a process
Banker’s algorithm
– Must know maximum number allocated to Pi
– Keep track of resources available
– For each request, make sure maximum need will not exceed total
available
– Under utilizes resources
Never used
– Advance knowledge not available and CPU-intensive
Deadlock Detection and Resolution
Attractive for two main reasons
– Prevention and avoidance are hard, have significant overhead, and require
information difficult or impossible to obtain
– Deadlock is comparatively rare in most systems so a form of the argument for
optimistic concurrency control applies: detect and fix comparatively rare
situations
Availability of transactions helps
– DL resolution requires us to kill some participant(s)
– Transactions are designed to be rolled back and restarted
Centralized Deadlock Detection
General method: Construct a resource graph and
analyze it
– Analyze through resource reductions
– If cycle exists after analysis, deadlock has occurred
– Processes in cycle are deadlocked
Local graphs on each machine
– Pi requests R1
– R1’s machine places request in local graph
– If cycle exists in local graph, perform reductions to detect deadlock
Need to calculate union of all local graphs
– Deadlock cycle may transcend machine boundaries
Graph Reduction
Cycles don't always mean deadlock!
(Figure: two resource graphs over P1, P2, and P3: one reduces completely, so there is no deadlock; the other cannot be reduced, leaving P2 and P3 deadlocked.)
Waits-For Graphs (WFGs)
Based on Resource Allocation Graph (SR)
An edge from Pi to Pj
– means Pi is waiting for Pj to release a resource
Replaces two edges in SR graph
– Pi → R
– R → Pj
Deadlocked when a cycle is found
(Figure: an SR graph over P1, P2, P3 with resources R1 and R2, and the corresponding waits-for graph.)
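A sketch of deadlock detection by cycle search on a waits-for graph represented as an adjacency map (a depth-first search standing in for the graph reduction above):

```python
def find_cycle(wfg):
    """Detect a cycle in a waits-for graph {process: set of processes it waits for}.
    Returns one deadlocked cycle as a list, or None if the graph is cycle-free."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wfg}
    stack = []

    def dfs(p):
        color[p] = GREY
        stack.append(p)
        for q in wfg.get(p, ()):
            if color.get(q, WHITE) == GREY:          # back edge: cycle found
                return stack[stack.index(q):] + [q]
            if color.get(q, WHITE) == WHITE:
                found = dfs(q)
                if found:
                    return found
        color[p] = BLACK
        stack.pop()
        return None

    for p in list(wfg):
        if color[p] == WHITE:
            cycle = dfs(p)
            if cycle:
                return cycle
    return None

# Example: P2 and P3 wait on each other; P1 waits on P2 but is not in the cycle.
# find_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P2"}})  ->  ["P2", "P3", "P2"]
```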
Centralized Deadlock Detection
– All hosts communicate resource state to coordinator
– Construct global resource graph on coordinator
– Coordinator must be reliable and fast
– When to construct the graph is an important choice:
• Report every resource operation (request, acquire, release): large overhead and significant use latency
• Periodically send a set of operations: lower overhead and use latency, higher detection latency
• Whenever a need for cycle detection is indicated (central or local decision)
– All have drawbacks because of false deadlocks
False Deadlock
Problem: messages may not arrive in a timely fashion
– Inconsistent and out-of-date world view at a particular machine
– In particular, out-of-order arrival
Assume two processes on two machines and two
resources
– P2 releases R2 (message A)
– P1 requests instance of R2 (message B)
(Figure: processes P1 and P2 on machines M1 and M2, with resources R1 and R2.)
False Deadlock
Problem: Coordinator detects false deadlock
– Initial coordinator representation: P1 holds R1; P2 holds R2 and waits for R1
– After receiving message B (P1 requests R2): the new edge P1 → R2 closes the cycle P1 → R2 → P2 → R1 → P1, so the coordinator reports a deadlock that does not really exist
– After receiving message A (P2 releases R2): the edge R2 → P2 disappears and so does the cycle
False Deadlock
Lack of global message delivery order causes
false DL
– Could apply Lamport’s global virtual clock
– Expensive
Coordinator detects potential DL
– Requests all outstanding messages with lower timestamp
Aim is to establish a common global message
order
– Establishes a total order on resource operations
– Establishes a common world view and thus common decision making
Fixes some false deadlocks, but others are harder
Distributed Deadlock Detection
Chandy-Misra-Haas algorithm
– Processes can request more than one resource with a single message → a process
can wait on several resources
– Amortizes message overhead
– Speeds up the growing phase of resource acquisition
Use waits-for graph to represent system state
– Dependencies across machine boundaries make looking for cycles hard
A process sends probe messages when it has to wait
– If message gets back, deadlock has occurred
Distributed Deadlock Detection
When process has to wait
– Send message to process holding resources
– Recipient forwards to all processes it is waiting on
– Creates concurrent probe of wait-for graph for cycles
If message gets back to originator
– Cycle exists in wait-for graph so deadlock has occurred
– Note that first field of message will always be the initiator
Many messages every time a process blocks
Distributed Deadlock Detection
An Example
P0 gets blocked, resource held by P1
– Initial message from P0 to P1 : (0, 0, 1)
P1 waiting on P2
– P1 sends message (0, 1, 2) to P2
P2 waiting on P3: (0, 2, 3)
P3 waiting on P4 and P5: (0, 3, 4) and (0, 3, 5)
P5's chain ends, but P4 → P6 → P8
But P8 is waiting on P0:
– P0 gets message, sees itself as the initiator: (0, 8, 0)
– A cycle thus exists
– P0 knows there is deadlock
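A sketch that simulates the probe propagation just described, with probes as (initiator, sender, receiver) triples over a waits-for map; in the real algorithm the forwarding happens at each blocked process rather than in one central loop:

```python
from collections import deque

def cmh_probe(initiator, waits_for):
    """Chandy-Misra-Haas style probe propagation (sketch).

    waits_for -- {process: set of processes it is blocked on}
    Deadlock is declared if a probe makes it back to the initiator.
    """
    probes = deque((initiator, initiator, dep) for dep in waits_for.get(initiator, ()))
    seen = set()
    while probes:
        init, sender, receiver = probes.popleft()
        if receiver == init:
            return True                    # probe came back: cycle, hence deadlock
        if (sender, receiver) in seen:
            continue                       # damp duplicate probes on the same edge
        seen.add((sender, receiver))
        for dep in waits_for.get(receiver, ()):   # blocked receiver forwards the probe
            probes.append((init, receiver, dep))
    return False

# The slides' example: P0 -> P1 -> P2 -> P3 -> {P4, P5}, P4 -> P6 -> P8 -> P0
# cmh_probe(0, {0: {1}, 1: {2}, 2: {3}, 3: {4, 5}, 4: {6}, 6: {8}, 8: {0}})  ->  True
```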
Distributed Deadlock
Resolution
Some process in the cycle must be killed
– Structuring resource use as transactions makes this better behaved and easier to
understand
Race Condition:
– Two processes block at the same time and send probes
– Both discover the cycle in parallel
– Damping is difficult since it is hard to tell which messages may be killed → the killing
process must know the cycle
Practice should emphasize the simplest and cheapest
– Most cycles are between two processes
– Example of importance of gathering performance data
Distributed Deadlock Prevention
Prevention
– Careful design to make deadlocks structurally impossible
Make sure at least one of the 4 necessary
conditions for deadlock cannot hold
– Process can only hold one resource at a time
– Request all resources initially
– Process releases all resources before requesting a new one
– Resource ordering
All are cumbersome in practice
Distribution opens some new possibilities
– Lamport clocks create total order preventing cycles
Summary
We began with clocks and saw how relaxing the
semantic requirement for real-time made Lamport’s
logical clocks possible
Given global clocks, virtual or real, we considered
mutual exclusion
– Centralized algorithms keep information in one place effectively becoming a
monitor
– Distribution handles mutual exclusion in parallel at the cost of O(N) messages
per CS use
– Token algorithm reduced messages under some circumstances but introduced
heartbeat overhead
Each has strengths and weaknesses
Summary
Many distributed algorithms require a
coordinator
– Creating the need to select, monitor, and replace the coordinator as
required
Election algorithms provide a way to select a
coordinator
– Bully algorithm
– Ring algorithm
Transactions provide a high-level abstraction
with significant power for organizing,
expressing, and implementing distributed computations
Summary
Transactions are useful because they can be aborted
Concurrency control issues were considered
– Locking
– Optimistic
Deadlock
– Detection
– Prevention
Yet again
– Distributed systems have the same problems
– Only more so