
TIME AND SYNCHRONIZATION
Structure
8.0 Objective
8.1 Introduction
8.2 Clock Synchronization
8.2.1 Synchronizing physical clocks
8.2.1.1 Cristian’s method for synchronizing clocks
8.2.1.2 The Berkeley algorithm
8.2.1.3 The Network Time Protocol
8.3 Logical Time and Logical clocks
8.3.1 Logical clocks
8.3.2 Totally Ordered Logical Clocks
8.4 Distributed Coordination
8.4.1 Distributed Mutual Exclusion
8.4.1.1 The central server algorithm
8.4.1.2 A Distributed algorithm using logical clocks
8.4.1.3 A Ring based algorithm
8.5 Elections
8.5.1 The bully algorithm
8.5.2 A Ring based election algorithm
8.6 Summary
8.0 Objective:
In this unit we introduce some topics related to the issue of
coordination in distributed systems. To start with, the notion of time is dealt with. We go
on to explain the notion of logical clocks, which are a tool for ordering events without
knowing precisely when they occurred. The second half briefly examines some
algorithms for achieving distributed coordination. These include algorithms that achieve
mutual exclusion among a collection of processes, so as to coordinate their accesses to
shared resources. It goes on to examine how a group of processes can agree upon a new
coordinator of their activities after their previous coordinator has failed or become
unreachable. This process is called an election.
8.1 Introduction:
This unit introduces some concepts and algorithms related to the timing and
coordination of events occurring in distributed systems. In the first half we describe
the notion of time in a distributed system. We discuss the problem of how to
synchronize clocks in different computers, and so time events occurring at them
consistently, and we even discuss the related problem of determining the order in
which events occurred. In the latter half of the chapter we introduce some algorithms
whose goal is to confer a privilege upon some unique member of a collection of
processes.
In the next section we explain methods whereby computer clocks can
be approximately synchronized, using message passing. However, we shall not, in
many cases, be able to obtain sufficient accuracy to determine the relative ordering of
events; instead, we can order some events by appealing to the flow of data between
processes in a distributed system. We go on to introduce logical clocks, which are
used to define an order on events without measuring the physical time at which they
occurred.
8.2 Clock Synchronization: Time is an important and interesting issue in distributed
systems, for several reasons.
o First, the time is a quantity we often want to measure accurately. In order to
know at what time of day a particular event occurred at a particular computer
– for example, for accountancy purposes – it is necessary to synchronize its
clock with a trustworthy, external source of time. This is external
synchronization. Also, if computer clocks are synchronized with one another
to a known degree of accuracy, then we can, within the bounds of this
accuracy, measure the interval between two events occurring at different
computers, by appealing to their local clocks. This is internal synchronization.
Two or more computers that are internally synchronized are not necessarily
externally synchronized, since they may drift collectively from external time.
o Secondly, algorithms that depend upon clock synchronization have been
developed for several problems in distributed systems. These include
maintaining the consistency of distributed data (for example, the use of
timestamps to serialize transactions), checking the authenticity of a request,
and eliminating the processing of duplicate updates.
o Thirdly, the notion of physical time is also a problem in a distributed system.
This is not due to the effects of special relativity, which are negligible or nonexistent for normal computers, but the problem is based on a similar limitation
concerning our ability to pass information from one computer to another. This
is that the clocks belonging to different computers can only be synchronized,
at least in the majority of cases, by network communication. Message passing
is limited by virtue of the speed at which it can transfer information, but this in
itself would not be a problem if we knew how long message transmission took.
The problem is that sending a message usually takes an unpredictable amount
of time.
Now let us see different aspects of clock synchronization one by one.
8.2.1 Synchronizing physical clocks:
Computers each contain their own physical clock. These clocks are electronic
devices that count oscillations occurring in a crystal at a definite frequency, and which
typically divide this count and store the result in a counter register. Clock devices can
be programmed to generate interrupts at regular intervals in order that, for example,
time slicing can be implemented; however we shall not concern ourselves with this
aspect of clock operation. The clock output can be read by software and scaled into a
suitable time unit. This value can be used to timestamp any event experienced by any
process executing at the host computer. By ‘event’ we mean an action that appears to
occur indivisibly, all at once – such as sending or receiving a message. Note however,
that successive events will correspond to different timestamps only if the clock
resolution – the period between updates of the clock register – is smaller than the
interval between successive events. The minimum interval between events depends on
such factors as the length of the processor instruction cycle.
In order to compare timestamps generated by clocks of the same physical
construction at different computers, one might think that we need only know the
relative offset of one clock’s counter from that of the other – for example, that one
count had the value 1958 when the other was initialized to 0. Unfortunately, this
supposition is based on a false argument; computer clocks in practice are extremely
unlikely to ‘tick’ at the same rate, whether or not they are of the ‘same’ physical
construction.
Clock drift: The crystal-based clocks used in computers are, like any other clocks,
subject to clock drift (Fig. 8.1), which means that they count time at different rates,
and so deviate. The underlying oscillators are subject to physical variations, with the
consequence that their frequencies of oscillation differ. Moreover, even the same
clock’s frequency varies with the temperature. Designs exist that attempt to
compensate for this variation, but they cannot eliminate it.

Fig. 8.1 Drift between computer clocks in a distributed system.

The difference in the oscillation period between two clocks might be extremely small,
but the difference accumulated over many oscillations leads to an observable
difference in the counter registers of the two clocks, no matter how accurately they
were initialized to the same value. A clock’s drift rate is the change in the offset
(difference in reading) between the clock and a nominal perfect reference clock, per
unit of time measured by the reference clock. For clocks based on a quartz crystal,
this is about 10⁻⁶ – giving a difference of one second every 1,000,000 seconds (about
11.6 days).
Coordinated universal time: The most accurate physical clocks known use atomic
oscillators, whose accuracy is about one part in 10¹³. The output of these atomic
clocks is used as the standard for elapsed real time, known as International Atomic
Time.

Coordinated universal time – abbreviated as UTC (sic) – is an international
standard that is based on atomic time, but a so-called leap second is
occasionally inserted or deleted to keep in step with astronomical time. UTC
signals are synchronized and broadcast regularly from land-based radio
stations and satellites covering many parts of the world. For example, in the
US the radio station WWV broadcasts time signals on several short-wave
frequencies. Satellite sources include the Geostationary Operational
Environmental Satellite (GOES) and the Global Positioning System (GPS).

Receivers are available commercially. Compared with ‘perfect’ UTC, the
signals received from land-based stations have accuracy in the order of 0.1-10
milliseconds, depending on the station used. Signals received from GOES are
accurate to about 0.1 ms, and signals received from GPS are accurate to about
one microsecond. Computers with receivers attached can synchronize their
clocks with these timing signals.
Compensating for clock drift: If the time provided by a time service, such as
Coordinated Universal Time signals, is ahead of the time at a computer C, then it
may be possible simply to set C’s clock to the time service’s time. Several clock ‘ticks’
appear to have been missed, but time continues to advance, as expected. Now
consider what should happen if the time service’s time is behind that of C. We cannot
set C’s clock back, because this is liable to confuse applications that rely
on the assumption that time always advances. The solution is not to set C’s clock back,
but to cause it to run slow for a period, until it is in accord with the timeserver’s time. It
is possible to change the rate at which updates are made to the time as given to
applications. This can be achieved in software, without changing the rate at which the
hardware clock ticks (an operation which is not always supported by hardware
clocks).
Let us call the time given to applications (the software clock’s reading) S and
the time given by the hardware clock H. Let the compensating factor be δ, so that

S(t) = H(t) + δ(t)

The simplest form for δ that makes S change continuously is a linear function of the
hardware clock:

δ(t) = aH(t) + b, where a and b are constants to be found.

Substituting for δ in the identity, we have S(t) = (1 + a)H(t) + b.

Let the value of the software clock be Tskew when H = h, and let the actual time
at that point be Treal. We may have that Tskew > Treal or Tskew < Treal. If S is to give the
actual time after N further ticks, we must have

Tskew = (1 + a)h + b, and Treal + N = (1 + a)(h + N) + b

By solving these equations, we find

a = (Treal − Tskew) / N and b = Tskew − (1 + a)h.
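The calculation above can be sketched in code. A minimal sketch in Python; the numeric values for Tskew, Treal, h and N are hypothetical:

```python
# Linear drift compensation: S(t) = (1 + a) * H(t) + b.
# Hypothetical scenario: the software clock reads t_skew when the
# hardware clock reads h, and must agree with the real time t_real
# after n further hardware ticks.

def compensation(t_skew, t_real, h, n):
    """Return (a, b) so that S(t) = (1 + a) * H(t) + b."""
    a = (t_real - t_skew) / n
    b = t_skew - (1 + a) * h
    return a, b

def software_clock(hw, a, b):
    """Software clock reading for a hardware clock reading hw."""
    return (1 + a) * hw + b

# The software clock is 2 seconds fast; amortize over 100 ticks.
a, b = compensation(t_skew=1000.0, t_real=998.0, h=1000.0, n=100)
print(software_clock(1000.0, a, b))   # 1000.0: starts at Tskew
print(software_clock(1100.0, a, b))   # ≈ 1098.0 = Treal + N: caught up
```

Note that the software clock runs slightly slow (a < 0) rather than jumping backwards, as the text requires.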
8.2.1.1 Cristian’s method for synchronizing clocks :One way to achieve
synchronization between computers in a distributed system is for a central timeserver
process S to supply the time according to its clock upon request, as shown in Fig 8.2.
The timeserver computer can be fitted with a suitable receiver so as to be
synchronized with UTC. If a process P requests the time in a message mr, and
receives the time value t in a message mt, then in principle it could set its clock to the
time t + Ttrans, where Ttrans is the time taken to transmit mt from S to P (t is inserted
in mt at the last possible point before transmission from S’s computer).

Fig.8.2 Clock synchronization using a timeserver.
Unfortunately, Ttrans is subject to variation. In general, other processes are
competing with S and P for resources at each computer involved, and other messages
compete with mt for the network. These factors are unpredictable and not practically
avoidable or accurately measurable in most installations. In general, we may say that
Ttrans = min + x, where x ≥ 0. The minimum value min is the value that would be
obtained if no other processes executed and no other network traffic existed; min can
be measured or conservatively estimated. The value of x is not known in a particular
case, although a distribution of values may be measurable for a particular installation.
Cristian suggested the use of such a timeserver, connected to a device that
receives signals from a source of UTC, to synchronize computers. Synchronization
between the timeserver and its UTC receiver can be achieved by a method that is
similar to the following procedure for synchronization between computers.

 A process P wishing to learn the time from S can record the total round-trip
time Tround with reasonable accuracy if its rate of clock drift is small. For example,
the round-trip time should be in the order of 1-10 milliseconds on a LAN, over
which time a clock with a drift rate of 10⁻⁶ varies by at most 10⁻⁵ milliseconds.

 Let the time returned in S’s message mt be t. A simple estimate of the time to
which P should set its clock is t + Tround/2, which assumes that the elapsed time
is split equally before and after S placed t in mt. Let the times between sending
and receipt for mr and mt be min + x and min + y, respectively. If the value
of min is known or can be conservatively estimated, then we can determine the
accuracy of this result as follows.

 The earliest point at which S could have placed the time in mt was min after P
dispatched mr. The latest point at which it could have done this was min before
mt arrived at P. The time by S’s clock when the reply message arrives is
therefore in the range [t + min, t + Tround − min]. The width of this range is
Tround − 2min, so the accuracy is ±(Tround/2 − min).
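The estimate and its accuracy bound follow directly from t, Tround and min. A minimal sketch in Python (the numeric values are hypothetical):

```python
def cristian_estimate(t_server, t_round, t_min):
    """Clock setting and accuracy bound from one request/reply exchange.

    t_server -- the time t carried in the reply message mt
    t_round  -- the round-trip time Tround measured by P
    t_min    -- a conservative estimate of the minimum transmission time
    """
    setting = t_server + t_round / 2    # assume the delay splits evenly
    accuracy = t_round / 2 - t_min      # result is setting +/- accuracy
    return setting, accuracy

# A 20 ms round trip with a 5 ms minimum one-way delay:
setting, accuracy = cristian_estimate(t_server=100.0, t_round=0.020,
                                      t_min=0.005)
# setting ≈ 100.01 s, accuracy ≈ ±0.005 s
```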
Discussion of Cristian’s algorithm: As described Cristian’s method suffers from the
problem associated with all services implemented by a single server, that the single
timeserver might fail and thus render synchronization temporarily impossible. Cristian
suggested, for this reason, that time should be provided by a group of synchronized
timeservers, each with a receiver for UTC time signals. For example, a client could
multicast its request to all servers, and use only the first reply obtained. Note that a
malfunctioning timeserver that replied with spurious time values, or an improper
timeserver that replied with deliberately incorrect times could wreak havoc in a
computer system.
8.2.1.2 The Berkeley algorithm: Gusella and Zatti describe an algorithm for
internal synchronization which they developed for collections of computers running
Berkeley UNIX. In it, a coordinator computer is chosen to act as the master. Unlike in
Cristian’s protocol, this computer periodically polls the other computers whose clocks
are to be synchronized, called slaves. The slaves send back their clock values to it.
The master estimates their local clock times by observing the round-trip times
(similarly to Cristian’s technique), and it averages the values obtained (including its
own clock’s reading). The balance of probabilities is that this average cancels out the
individual clocks’ tendencies to run fast or slow. The accuracy of the protocol
depends upon a nominal maximum round-trip time between the master and the slaves.
The master eliminates any occasional readings associated with larger times than this
maximum.
Instead of sending the updated current time back to the other computers –
which would introduce further uncertainty due to the message transmission time – the
master sends the amount by which each individual slave’s clock requires adjustment.
This can be a positive or negative value.
The algorithm eliminates readings from clocks that have drifted badly or that
have failed and provide spurious readings. Such clocks could have a significant
adverse effect if an ordinary average were taken. The master takes a fault-tolerant
average. That is, a subset of clocks is chosen that do not differ from one another by
more than a specified amount, and the average is taken only of readings from these
clocks.
Gusella and Zatti describe an experiment involving 15 computers whose
clocks were synchronized to within about 20-25 milliseconds using their protocol.
The local clocks’ drift rates were measured to be less than 2 × 10⁻⁵, and the maximum
round-trip time was taken to be 10 milliseconds.
8.2.1.3 The Network Time Protocol: The Network Time Protocol (NTP) defines an
architecture for a time service and a protocol to distribute time information over a
wide variety of interconnected networks. It has been adopted as a standard for clock
synchronization throughout the Internet.
NTP’s chief design aims and features are:

 To provide a service enabling clients across the Internet to be
synchronized accurately to UTC, despite the large and variable
message delays encountered in Internet communication. NTP employs
statistical techniques for filtering timing data, and it discriminates
between the quality of timing data obtained from different servers.

 To provide a reliable service that can survive lengthy losses of
connectivity. There are redundant servers and redundant paths between
the servers, and the servers can reconfigure so as still to provide the
service if one of them becomes unreachable.

 To enable clients to resynchronize sufficiently frequently to offset the
rates of drift found in most computers. The service is designed to scale
to large numbers of clients and servers.

 To provide protection against interference with the time service,
whether malicious or accidental. The time service uses authentication
techniques to check that timing data originate from the claimed trusted
sources. It also validates the return addresses of messages sent to it.
Fig.8.3 An example synchronization subnet in an NTP implementation
Note: Arrows denote synchronization control and Numbers denote strata.
A network of servers located across the Internet provides the NTP service. Primary
servers are directly connected to a time source such as a radio clock receiving UTC;
secondary servers are synchronized, ultimately, to primary servers. The servers are
connected in a logical hierarchy called a synchronization subnet (see Fig 8.3), whose
levels are called strata. Primary servers occupy stratum 1: they are at the root.
Stratum 2 servers are secondary servers that are synchronized directly to the primary
servers; stratum 3 servers are synchronized from stratum 2 servers, and so on. The
lowest-level (leaf) servers execute in users’ workstations.
NTP servers synchronize in one of three modes: multicast, procedure-call and symmetric mode.
 Multicast mode: is intended for use on a high-speed LAN. One or more
servers periodically multicasts the time to the servers running in other
computers connected by the LAN, which set their clocks assuming a small
delay. This mode can achieve only relatively low accuracy, but one which is
nonetheless considered sufficient for many purposes.
 Procedure-call mode: is similar to the operation of Cristian’s algorithm,
described above. In this mode, one server accepts requests from other
computers, which it processes by replying with its timestamp (current clock
reading). This mode is suitable where higher accuracies are required than can
be achieved with multicast, or where multicast is not supported in hardware.
For example, file servers on the same or a neighboring LAN , which need to
keep accurate timing information for file accesses, could contact a local master
server in procedure-call mode.
 Symmetric mode: is intended for use by the master servers that supply time
information in LANs and by the higher levels (lower strata) of the
synchronization subnet, where the highest accuracies are to be achieved. A
pair of servers operating in symmetric mode exchange messages bearing
timing information. Timing data are retained as part of an association between
the servers that is maintained in order to improve the accuracy of their
synchronization over time.
In all modes, messages are delivered unreliably, using the standard UDP
Internet transport protocol. In procedure-call mode and symmetric mode, messages
are exchanged in pairs. Each message bears timestamps of recent message events; the
local times when the previous NTP messages between the pair were sent and received,
and the local time when the current message was transmitted. The recipient of the
NTP message notes the local time when it receives the message. The four times Ti−3,
Ti−2, Ti−1 and Ti are shown in Fig 8.4 for messages m and m’ sent between servers A and
B. Note that in symmetric mode, unlike in Cristian’s algorithm, there can be a
non-negligible delay between the arrival of one message and the dispatch of the next.
Also, messages may be lost, but the three timestamps carried by each message are
nonetheless valid.
For each pair of messages sent between two servers the NTP protocol
calculates an offset oi that is an estimate of the actual offset between the two clocks,
and a delay di that is total transmission time for the two messages. If the true offset of
the clock at B relative to that at A is o, and if the actual transmission times for m and
m’ are t and t’ respectively, then we have:
Ti−2 = Ti−3 + t + o, and Ti = Ti−1 + t’ − o

Defining a = Ti−2 − Ti−3 and b = Ti−1 − Ti, this leads to:

di = t + t’ = a − b ; o = oi + (t’ − t)/2, where oi = (a + b)/2
Fig.8.4 Messages exchanged between a pair of NTP peers: m is sent by server A at
Ti−3 and received by server B at Ti−2; m’ is sent by B at Ti−1 and received by A at Ti.
Using the fact that t, t’ ≥ 0, it can be shown that oi − di/2 ≤ o ≤ oi + di/2. Thus oi is an
estimate of the offset, and di is a measure of the accuracy of this estimate.
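The offset and delay calculation maps directly to code. A sketch in Python, with the four timestamps named per Fig 8.4 (the example values are hypothetical):

```python
def ntp_offset_delay(t0, t1, t2, t3):
    """Offset estimate o_i and delay d_i from the four timestamps.

    t0 = Ti-3 (m sent by A, A's clock)
    t1 = Ti-2 (m received by B, B's clock)
    t2 = Ti-1 (m' sent by B, B's clock)
    t3 = Ti   (m' received by A, A's clock)
    """
    a = t1 - t0
    b = t2 - t3
    delay = a - b            # d_i = t + t': total transmission time
    offset = (a + b) / 2     # o_i: estimate of B's offset relative to A
    return offset, delay

# True offset o = 5, transmission times t = 2 and t' = 3:
# m  leaves A at 10, arrives at B when B's clock reads 10 + 2 + 5 = 17;
# m' leaves B at 20, arrives at A when A's clock reads 20 + 3 - 5 = 18.
offset, delay = ntp_offset_delay(10, 17, 20, 18)
print(offset, delay)   # 4.5 5 -- and indeed o = 5 lies in [oi - di/2, oi + di/2]
```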
8.3 Logical time and logical clocks:
From the point of view of any single process, events are ordered uniquely by times
shown on the local clock. However, as Lamport pointed out, since we cannot perfectly
synchronize clocks across a distributed system, we cannot in general use physical
time to find out the order of any arbitrary pair of events occurring within it.
The order of events occurring at different processes can be critical in a
distributed application. In general, we can use a scheme which is similar to physical
causality, but which applies in distributed systems, to order some of the events that
occur at different processes. This ordering is based on two simple and naturally
obvious points:

 If two events occurred at the same process, then they occurred in the order in
which it observes them.

 Whenever a message is sent between processes, the event of sending the
message occurred before the event of receiving the message.
Lamport called the ordering obtained by generalizing these two relationships the
happened-before relation. It is also sometimes known as the relation of causal
ordering or potential causal ordering.

More formally, we write x →p y if two events x and y occurred at a single
process p, and x occurred before y. Using this restricted order we can define the
happened-before relation, denoted by →, as follows:

HB1: If ∃ process p: x →p y, then x → y.
HB2: For any message m, send(m) → rcv(m),
where send(m) is the event of sending the message, and
rcv(m) is the event of receiving it.
HB3: If x, y and z are events such that x → y and y → z, then x → z.
Fig. 8.5 Events occurring at three processes (a, b at P1; c, d at P2; e, f at P3;
message m1 is sent from P1 to P2, and m2 from P2 to P3).
Thus if x → y, then we can find a series of events e1, e2, …, en occurring at one
or more processes such that x = e1 and y = en, and for i = 1, 2, …, n−1, either HB1 or HB2
applies between ei and ei+1: that is, either they occur in succession at the same
process, or there is a message m such that ei = send(m) and ei+1 = rcv(m). The sequence
of events e1, e2, …, en need not be unique.
The relation → is illustrated for the case of the three processes P1, P2 and P3
in Fig 8.5. It can be seen that a → b, since the events occur in this order at process P1,
and similarly c → d; b → c, since these are the sending and reception of message m1,
and similarly d → f. Combining these relations, we may also say that, for example,
a → f.

It can also be observed from Fig 8.5 that not all events are related by the relation →.
We say that events such as a and e that are not ordered by → are concurrent, and
write this a || e.
The → relation captures a flow of data intervening between two events. Note
however, that in real life data can flow in ways other than by message passing. For
example, if Smith enters a command to his process to send a message, then telephones
Jones who commands her process to issue another message, then the issuing of the
first message clearly happened-before that of the second. Unfortunately, since no
network messages were sent between the issuing processes, we cannot model this type
of relationship in our system.
8.3.1 Logical clocks: Lamport invented a simple mechanism by which the happened-before
ordering can be captured numerically, called a logical clock. A logical clock
is a monotonically increasing software counter, whose value need bear no particular
relationship to any physical clock, in general. Each process p keeps its own logical
clock, Cp, which it uses to timestamp events.
Fig.8.6 Logical timestamps for the events shown in Fig 8.5
(a: 1, b: 2 at P1; c: 3, d: 4 at P2; e: 1, f: 5 at P3).
We denote the timestamp of event a at p by Cp(a), and by C(b) we denote the
timestamp of event b at whatever process it occurred.
To capture the happened-before relation →, processes update their logical
clocks and transmit the values of their logical clocks in messages as follows:

LC1: Cp is incremented before each event is issued at process p: Cp := Cp + 1.
LC2: a) When a process p sends a message m, it piggybacks on m the
value t = Cp.
b) On receiving (m, t), a process q computes Cq := max(Cq, t) and
then applies LC1 before timestamping the event rcv(m).

Although we increment clocks by one, we could have chosen any positive value. It
can easily be shown, by induction on the length of any sequence of events relating
two events a and b, that a → b => C(a) < C(b).

Note that the converse is not true. If C(a) < C(b), then we cannot infer that
a → b. In Fig 8.6 we illustrate the use of logical clocks for the example given in Fig
8.5. Each of the processes P1, P2 and P3 has its logical clock initialized to 0. The clock
values given are those immediately after the event to which they are adjacent. Note
that, for example, C(b) > C(e) but b || e.
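Rules LC1 and LC2 are easy to capture in code. A minimal Python sketch (class and method names are invented for illustration) that reproduces the timestamps of Fig 8.6:

```python
class LamportClock:
    """A logical clock following rules LC1 and LC2."""
    def __init__(self):
        self.c = 0
    def event(self):      # LC1: increment before each event
        self.c += 1
        return self.c
    def send(self):       # LC2(a): the message piggybacks t = Cp
        return self.event()
    def receive(self, t): # LC2(b): Cq := max(Cq, t), then apply LC1
        self.c = max(self.c, t)
        return self.event()

# Recreate Fig 8.6: processes P1, P2, P3 start with clocks at 0.
p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
a = p1.event()      # 1
b = p1.send()       # 2, message m1 carries t = 2
c = p2.receive(2)   # max(0, 2) + 1 = 3
d = p2.send()       # 4, message m2 carries t = 4
e = p3.event()      # 1
f = p3.receive(4)   # max(1, 4) + 1 = 5
print(a, b, c, d, e, f)   # 1 2 3 4 1 5
```

Note that C(b) = 2 > C(e) = 1 even though b || e, illustrating that the converse of a → b => C(a) < C(b) does not hold.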
8.3.2 Totally ordered logical clocks: Logical clocks impose only a partial order on
the set of all events, since some pairs of distinct events, generated by different
processes, have numerically identical timestamps. However, we can extend this to a
total order – that is, one for which all pairs of distinct events are ordered – by taking
into account the identifiers of the processes at which events occur. If a is an event
occurring at pa with local timestamp Ta, and b is an event occurring at pb with local
timestamp Tb, we define the global logical timestamps for these events to be (Ta, pa)
and (Tb, pb) respectively, and we define (Ta, pa) < (Tb, pb) if and only if either Ta < Tb,
or Ta = Tb and pa < pb.
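In Python, tuples compare lexicographically, which is exactly this rule: timestamps are compared first, with process identifiers breaking ties. A short illustration (the event data are hypothetical):

```python
# Global logical timestamps as (local timestamp, process id) pairs.
# Python's built-in tuple comparison implements the total order above.
events = [(4, 2), (1, 3), (4, 1), (1, 1)]
print(sorted(events))   # [(1, 1), (1, 3), (4, 1), (4, 2)]

# Equal local timestamps (1, 1) and (1, 3) are ordered by process id,
# so all pairs of distinct events are now ordered.
assert (1, 3) < (4, 1)  # lower timestamp wins
assert (4, 1) < (4, 2)  # tie broken by process id
```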
8.4 Distributed Coordination:
Distributed processes often need to coordinate their activities. For example, if a
collection of processes shares a resource or collection of resources managed by a
server, then mutual exclusion is often required to prevent interference and ensure
consistency when accessing the resources. This is essentially the critical section
problem, familiar in the domain of operating systems. In the distributed
case, however, neither shared variables nor facilities supplied by a single local kernel
can be used to solve it in general.
A separate, generic mechanism for distributed mutual exclusion is required
in certain cases – one that is independent of the particular resource management
scheme in question. Distributed mutual exclusion involves a single process being
given a privilege - the right to access shared resources – temporarily, before another
process is granted it. In some other cases, however, the requirement is for a set of
processes to choose one of their number to play a privileged, coordinating role over
the long term. A method for choosing a unique process to play a particular role is
called an election algorithm. This section now examines some algorithms for
achieving the goals of distributed mutual exclusion and elections.
8.4.1 Distributed mutual exclusion: Our basic requirements for mutual exclusion
concerning some resource (or collection of resources) are as follows:
ME1 (safety): At most one process may execute in the critical section (CS) at a
time.
ME2 (liveness): A process requesting entry to the CS is eventually granted it (so
long as any process executing in the CS eventually leaves it).

ME2 implies that the implementation is deadlock-free, and that starvation does not
occur. A further requirement, which may be made, is that of causal ordering:

ME3 (ordering): Entry to the CS should be granted in happened-before order.
A process may continue with other processing while waiting to be granted entry to a
critical section. During this time it might send a message to another process, which
consequently also tries to enter the critical section. ME3 specifies that the first process
should be granted access before the second.
We now discuss some algorithms for achieving these requirements.
8.4.1.1 The central server algorithm: The simplest way to achieve mutual exclusion
is to employ a server that grants permission to enter the critical section. For the sake of
simplicity, we shall assume that there is only one critical section managed by the
server. Recall that the protocol for executing a critical section is as follows:
enter()
(*enter critical section – block if necessary *)
……
(*access shared resources in critical section *)
exit()
(*leave critical section – other processes may now enter *)
Fig 8.7 shows the use of this server. To enter a critical section, a process sends a
request message to the server and awaits a reply from it. Conceptually, the reply
constitutes a token signifying permission to enter the critical section. If no other
process has the token at the time of the request, then the server replies immediately,
granting the token. If the token is currently held by another process, then the server
does not reply, but queues the request. On exiting the critical section, a message is
sent to the server, giving it back the token.

Fig.8.7 Server managing a mutual exclusion token for a set of processes.
If the queue of waiting processes is not empty, then the server chooses the
oldest entry in the queue, removes it and replies to the corresponding process. The
chosen process then holds the token. In the figure, we show a situation in which p2’s
request has been appended to the queue, which already contained p4’s request. p3
exists the critical section, and the server removes p4’s entry and grants permission to
enter to p4 by replying to it. Process p1 does not currently require entry to the critical
section.
The server is also a critical point of failure. Any process attempting to
communicate with it, either to enter or exit the critical section, will detect the failure
of the server. To recover from a server failure, a new server can be created (or one of
the processes requiring mutual exclusion could be picked to play a dual role as a
server). Since the server must be unique if it is to guarantee mutual exclusion, an
election must be called to choose one of the clients to create or act as the server, and
to multicast its address to the others. We shall describe some election algorithms
below. When the new server has been chosen, it needs to obtain the state of its clients,
so that it can process their requests, as the previous server would have done. The
ordering of entry requests will be different in the new server from that in the failed
one unless precautions are taken, but we shall not address this problem here.
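The server's behaviour described above can be sketched as follows. This is a minimal single-threaded model with no real message passing; the class and method names are invented for illustration, and the grants list stands in for reply messages:

```python
from collections import deque

class MutexServer:
    """Sketch of the central-server token manager (one critical section)."""
    def __init__(self):
        self.holder = None       # process currently holding the token
        self.queue = deque()     # FIFO queue of waiting requests
        self.grants = []         # stand-in for 'grant token' replies

    def request(self, pid):
        """A process asks to enter the critical section."""
        if self.holder is None:
            self.holder = pid
            self.grants.append(pid)   # reply at once, granting the token
        else:
            self.queue.append(pid)    # token busy: queue the request

    def release(self, pid):
        """The holder exits the critical section, returning the token."""
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        if self.holder is not None:
            self.grants.append(self.holder)  # reply to the oldest request

# The scenario of Fig 8.7: p4 and then p2 queue behind p3; p3 exits,
# so the server grants the token to p4 (the oldest queued request).
s = MutexServer()
s.request(3); s.request(4); s.request(2)
s.release(3)
print(s.grants)   # [3, 4]
```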
8.4.1.2 A distributed algorithm using logical clocks: Ricart and Agrawala
developed an algorithm to implement mutual exclusion that is based upon distributed
agreement, instead of using a central server. The basic idea is that processes that
agreement, instead of using a central server. The basic idea is that processes that
require entry to a critical section multicast a request message, and can enter it only
when all the other processes have replied to this message. The conditions under which
a process replies to a request are designed to ensure that conditions ME1 – ME3 are
met.
Again, we can associate permission to enter a token in this case takes multiple
messages. The assumptions of the algorithm are that the processes p1 … … pn know one
another’s addresses, that all messages sent are eventually delivered, and that each
process pi keeps a logical clock, updated according to the rules LC1 and LC2 of the
previous unit. Messages requesting the token are of the form <T, pi>, where T is the
sender’s timestamp and pi is the sender’s identifier. For simplicity’s sake, we assume
that only one critical section is at issue, and so it does not have to be identified.
Each process records its state of having released the token (RELEASED),
wanting the token (WANTED) or possessing the token (HELD) in a variable. The
protocol is given in Fig 8.8.
Fig.8.8 Ricart and Agrawala’s algorithm
On initialization:
State := RELEASED;
To obtain the token:
State := WANTED;
Multicast request to all processes; (request processing deferred here)
T := request’s timestamp;
Wait until (number of replies received = (n-1));
State := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j):
If (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
Then
Queue request from pi without replying;
Else
Reply immediately to pi;
End if
To release the token:
State := RELEASED;
Reply to any queued requests;
If a process requests the token while it is RELEASED everywhere else – that is, no
other process wants it – then all processes will reply immediately to the request and
the requester will obtain the token. If a token is HELD at some process, then that
process will not reply to requests until it has finished with the token, and so the
requester cannot obtain the token in the meantime. If two or more processes request
the token at the same time, then whichever process’s request bears the lowest timestamp
will be the first to collect (n-1) replies, granting it the token next. If the requests bear
equal timestamps, the process identifiers are compared to order them. Note that, when
a process requests the token, it defers processing requests from other processes until
its own request has been sent and the timestamp T is known. This is so that processes
make consistent decisions when processing requests.
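The deciding condition in Fig 8.8 is simply a lexicographic comparison of (timestamp, process identifier) pairs, which makes the ordering total. A minimal sketch in Python (the function name is hypothetical, not from the text):

```python
def defers_to(own_request, incoming_request, state):
    """Decide whether pj queues (defers) an incoming request <Ti, pi>.

    Requests are (timestamp, process_id) tuples; Python's tuple
    comparison breaks timestamp ties by process identifier, giving
    the total order the algorithm requires.
    """
    if state == "HELD":
        return True                    # finish with the token first
    if state == "WANTED" and own_request < incoming_request:
        return True                    # our request wins: queue theirs
    return False                       # reply immediately
```

In the example below, p2’s own request (34, 2) beats p1’s incoming request (41, 1), so p2 defers it, while p1 replies immediately to p2’s request.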
To illustrate the algorithm, consider a situation involving three processes, p1,
p2 and p3 shown in Fig 8.9. Let us assume that p3 is not interested in the token, and that
p1 and p2 request it concurrently. The timestamp of p1’s request is 41, that of p2’s is 34.
When p3 receives their requests, it replies immediately. When p2 receives p1’s request
it finds its own request has the lower timestamp, and so does not reply, holding p1 off.
However, p1 finds that p2’s request has a lower timestamp than that of its own request,
and so replies immediately. On receiving this second reply, p2 possesses the token.
When p2 releases the token, it will reply to p1’s request, and so grant it the token.
Obtaining the token takes 2(n-1) messages in the algorithm: (n-1) to multicast
the request, followed by (n-1) replies. Or, if there is hardware support for multicast,
only one message is required for the request; the total is then n messages. It is thus a
considerably more expensive algorithm, in general, than the central server algorithm
just described. Note also that,
while it is a fully distributed algorithm, the failure of any process involved would
make progress impossible. And in the distributed algorithm all the processes involved
receive and process every request, so no performance gain has been made over the
single server bottleneck, which does just the same. Finally, note that a process that
wishes to obtain the token and which was the last to possess it still goes through the
protocol as described, even though it could simply decide locally to reallocate it to
itself. Ricart and Agrawala refined this protocol so that it requires n messages to
obtain the token in the worst (and common) case, without hardware support for
multicast.
8.4.1.3 A ring-based algorithm: One of the simplest ways to arrange mutual
exclusion between n processes p1, ... pn is to arrange them in a logical ring. The idea is
that exclusion is conferred by obtaining a token in the form of a message passed from
process to process in a single direction – clockwise, say – round the ring. The ring
topology – which is unrelated to the physical interconnections between the
underlying computers – is created by giving each process the address of its neighbor.
A process that requires the token waits until it receives it, and then retains it. To exit the
critical section, the process sends the token on to its neighbor.
[Figure: three processes p1, p2 and p3. p1 and p2 multicast requests with timestamps 41 and 34 respectively; p3 replies to both, and p1 replies to p2.]
Fig. 8.9 Multicast Synchronization
The arrangement of processes is shown in Fig 8.10. It is straightforward to
verify that the conditions ME1 and ME2 are met by this algorithm, but that the token
is not necessarily obtained in happened-before order. It can take from 1 to (n-1)
messages to obtain the token, from the point at which it becomes required.
However, messages are sent around the ring even when no process requires the token.
If a process fails, then clearly no progress can be made beyond it in transferring the
token, until a reconfiguration is applied to extract the failed process from the ring.
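The token’s clockwise circulation can be simulated directly. The sketch below is a hypothetical helper (not from the text) that counts the messages needed for the token, currently at position `holder`, to reach a requesting process:

```python
def circulate(n, holder, wants):
    """Forward the token clockwise from position `holder` round a ring
    of n processes until some process in `wants` receives it.
    Returns (receiver, messages_sent)."""
    messages, pos = 0, holder
    while pos not in wants:
        pos = (pos + 1) % n   # send the token to the clockwise neighbor
        messages += 1
    return pos, messages
```

For example, with four processes and the token at p0, a request from p3 costs three messages, the (n-1) worst case; a process already holding the token pays nothing.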
[Figure: processes p1, p2, p3, p4, …, pn arranged in a ring, with the token passing from neighbor to neighbor.]
Fig. 8.10 A ring of processes transferring a mutual exclusion token.
If the process holding the token fails, then an election is required to pick a unique
process from the surviving members, which will regenerate the token and transmit it
as before.
Care has to be taken in ensuring that the ‘failed’ process really has
failed, and does not later unexpectedly inject the old token into the ring, so that there
are two tokens. This situation can arise since process failure can only be ascertained
by repeated failure of the process to acknowledge messages sent to it.
8.5 Elections: An election is a procedure carried out to choose a process from a
group, for example to take over the role of a process that has failed. The main
requirement is for the choice of elected process to be unique, even if several processes
call elections concurrently.
8.5.1 The bully algorithm: The bully algorithm can be used when the members of the
group know the identities and addresses of the other members. The algorithm selects
the surviving member with the largest identifier to function as the coordinator. We
assume that communication is reliable, but that processes can fail during an election.
The algorithm proceeds as follows.
There are three types of messages in this algorithm.
 An election message is sent to announce an election
 An answer message is sent in response to an election message
 A coordinator message is sent to announce the identity of the new
coordinator.
A process begins an election when it notices that the coordinator has failed. To begin
an election, a process sends an election message to those processes that have a higher
identifier. It then awaits an answer message in response. If none arrives within a
certain time, the process considers itself the coordinator, and sends a coordinator
message to all processes with lower identifiers announcing this fact. Otherwise, the
process waits a further limited period for a coordinator message to arrive from the new
coordinator. If none arrives, it begins another election.
If a process receives a coordinator message, it records the identifier of the
coordinator contained within it, and treats that process as the coordinator. If a process
receives an election message, it sends back an answer message, and begins another
election – unless it has begun one already.
When a failed process is restarted, it begins an election. If it has the highest
process identifier, then it will decide that it is the coordinator, and announce this to
the other processes. Thus it will become the coordinator, even though the current
coordinator is functioning. It is for this reason that the algorithm is called the ‘bully’
algorithm. The operation of the algorithm is shown in Fig 8.11. There are four
processes p1-p4, and an election is called when p1 detects the failure of the
coordinator, p4, and announces an election (stage 1 in the figure). On receiving an
election message from p1, p2 and p3 send answer messages to p1 and begin their own
elections; p3 sends an answer message to p2, but p3 receives no answer message from the
failed process p4 (stage 2). It therefore decides that it is the coordinator. But before it
can send out the coordinator message, it too fails (stage 3). When p1’s timeout period
expires (which we assume occurs before p2’s timeout expires), it notices the absence
of a coordinator message and begins another election. Eventually, p2 is elected
coordinator (stage 4).
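The election’s outcome can be captured in a much-simplified simulation. The sketch below (function name illustrative, not from the text) assumes all surviving processes answer promptly and none fails mid-election, unlike p3 in the figure:

```python
def bully_election(initiator, alive):
    """Simulate a bully election among the identifiers in `alive`,
    assuming reliable, timely answers throughout.

    The initiator sends election messages to all higher identifiers;
    each live higher process answers and begins its own election, so
    the highest surviving identifier ends up as coordinator.
    """
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator   # no answer arrives: initiator is coordinator
    # Every answering higher process runs its own election in turn;
    # the outcome is the same whichever of them we follow.
    return bully_election(min(higher), alive)
```

With p4 failed, an election started by p1 among the survivors {1, 2, 3} elects p3; if p3 has also failed, the survivors {1, 2} elect p2, matching stage 4 of the figure.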
8.5.2 A ring-based election algorithm: We give the algorithm of Chang and Roberts,
suitable for a collection of processes that are arranged in a logical ring (refer Fig 8.12).
We assume that the processes do not know the identities of the others a priori, and
that each process knows only how to communicate with its neighbor in, say, the
clockwise direction. The goal of this algorithm is to elect a single coordinator, which
is the process with the largest identifier. The algorithm assumes that all the processes
remain functional and reachable during its operation. Initially, every process is
marked as a non-participant in an election. Any process can begin an election. It
proceeds by marking itself as a participant, placing its identifier in an election
message and sending it to its neighbor.
When a process receives an election message, it compares the identifier in the
message with its own. If the arrived identifier is the greater, then it forwards the
message to its neighbor. If the arrived identifier is smaller and the receiver is not a
participant then it substitutes its own identifier in the message and forwards it; but it
does not forward the message if it is already a participant. On forwarding an election
message in any case, the process marks itself as a participant. If however, the
received identifier is that of the receiver itself, then this process’s identifier must be
the greatest, and it becomes the coordinator. The coordinator marks itself as a
non-participant, and sends an elected message to its neighbor announcing its election.
[Figure: four stages of the bully algorithm among processes p1–p4. Stage 1: p1 sends election messages to p2, p3 and p4. Stage 2: p2 and p3 answer p1 and send their own election messages; p3 answers p2, but the failed p4 does not answer. Stage 3: p1 times out, the would-be coordinator p3 having failed, and begins another election. Stage 4: eventually p2 sends a coordinator message and becomes coordinator C.]
Fig. 8.11
The bully algorithm: The election of coordinator p2, after the failure of
p4 and then p3.
[Figure: a ring of processes with identifiers 3, 17, 4, 24, 9, 1, 15 and 28; an election message carrying identifier 24 is in transit.]
NOTE: The election was started by process 17. The highest process identifier
encountered so far is 24. Participant processes are shown darkened
Fig.8.12 A Ring based Election in progress.
The point of marking processes as participant or non-participant is so that
messages arising when another process starts an election at the same time are extinguished
as soon as possible, and always before the ‘winning’ election result has been
announced.
If only a single process starts an election, then the worst case is when its
anticlockwise neighbor has the highest identifier. A total of n-1 messages are then
required to reach this neighbor, which will not announce its election until its identifier
has completed another circuit, taking a further n messages. The elected message is
then sent n times, making 3n-1 messages in all.
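The single-initiator case can be simulated directly. The sketch below (hypothetical function, not from the text; ring identifiers listed clockwise starting from the initiator) counts only election messages, to which the n elected messages must be added, giving 3n-1 in the worst case:

```python
def chang_roberts(ids):
    """Single-initiator Chang-Roberts election on a ring of distinct
    identifiers listed clockwise; ids[0] starts the election.
    Returns (coordinator, election_messages_sent)."""
    n = len(ids)
    participant = [False] * n
    participant[0] = True
    msg, pos, messages = ids[0], 0, 1   # initiator sends to its neighbor
    while True:
        pos = (pos + 1) % n             # message arrives at the next process
        if msg == ids[pos]:
            # Its own identifier has come back: this process has the
            # greatest identifier and becomes coordinator.  (It then
            # sends an elected message round the ring: n more messages.)
            return ids[pos], messages
        if msg < ids[pos]:
            if participant[pos]:
                return None, messages   # extinguished (concurrent elections only)
            msg = ids[pos]              # substitute the larger identifier
        participant[pos] = True
        messages += 1                   # forward to the clockwise neighbor
```

For a ring of three where the initiator’s anticlockwise neighbor holds the highest identifier, the simulation counts 2n-1 = 5 election messages; adding the n elected messages gives the 3n-1 total derived above.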
An example of a ring-based election in progress is shown in Fig 8.12. The
election message currently contains 24, but process 28 will replace this with its
identifier when the message reaches it.
8.6 Summary:
In this unit first we have discussed the importance of accurate time keeping for
distributed systems. It then described algorithms for synchronizing clocks despite the
drift between them and the variability of message delays between computers.
The degree of synchronization accuracy that is practically obtainable fulfils
many requirements, but is nonetheless not sufficient to determine the ordering of an
arbitrary pair of events occurring at different computers. The happened-before
relationship is a partial order on events, which reflects a flow of information – within
a process, or via messages between processes – between them. Some algorithms
require events to be ordered in happened-before order, for example, successive
updates made at separate copies of data. Logical clocks are counters that are updated
so as to reflect the happened-before relationship between events.
The unit then described the need for processes to access shared resources
under conditions of mutual exclusion. Resource servers do not in all cases implement
locks, and a separate distributed mutual exclusion service is then required. Three
algorithms were considered which achieve mutual exclusion: the central server, a
distributed algorithm using logical clocks, and a ring-based algorithm. These are
heavyweight mechanisms that cannot withstand failure, although they can be
modified to be fault-tolerant. On the whole, it seems advisable to integrate locking
with resource management.
Finally, the unit considered the bully algorithm and a ring-based algorithm
whose common aim is to elect a process uniquely from a given set – even if several
elections take place concurrently. These algorithms could be used, for example, to
elect a new master timeserver, or a new lock server, when the previous one fails.