TIME AND SYNCHRONIZATION

Structure
8.0 Objective
8.1 Introduction
8.2 Clock Synchronization
8.2.1 Synchronizing physical clocks
8.2.1.1 Cristian's method for synchronizing clocks
8.2.1.2 The Berkeley algorithm
8.2.1.3 The Network Time Protocol
8.3 Logical Time and Logical Clocks
8.3.1 Logical clocks
8.3.2 Totally Ordered Logical Clocks
8.4 Distributed Coordination
8.4.1 Distributed Mutual Exclusion
8.4.1.1 The central server algorithm
8.4.1.2 A distributed algorithm using logical clocks
8.4.1.3 A ring-based algorithm
8.5 Elections
8.5.1 The bully algorithm
8.5.2 A ring-based election algorithm
8.6 Summary

8.0 Objective: In this unit we introduce some topics related to the issue of coordination in distributed systems. To start with, the notion of time is dealt with. We go on to explain the notion of logical clocks, which are a tool for ordering events without knowing precisely when they occurred. The second half examines briefly some algorithms for achieving distributed coordination. These include algorithms that achieve mutual exclusion among a collection of processes, so as to coordinate their accesses to shared resources. It goes on to examine how a group of processes can agree upon a new coordinator of their activities after their previous coordinator has failed or become unreachable. This process is called an election.

8.1 Introduction: This unit introduces some concepts and algorithms related to the timing and coordination of events occurring in distributed systems. In the first half we describe the notion of time in a distributed system. We discuss the problem of how to synchronize clocks in different computers, and so time events occurring at them consistently, and we discuss the related problem of determining the order in which events occurred. In the latter half of the chapter we introduce some algorithms whose goal is to confer a privilege upon some unique member of a collection of processes.
In the next section we explain methods whereby computer clocks can be approximately synchronized, using message passing. However, we shall not be able to obtain sufficient accuracy to determine, in many cases, the relative ordering of events; instead, we can order some events by appealing to the flow of data between processes in a distributed system. We go on to introduce logical clocks, which are used to define an order on events without measuring the physical time at which they occurred.

8.2 Clock Synchronization: Time is an important and interesting issue in distributed systems, for several reasons.
o First, time is a quantity we often want to measure accurately. In order to know at what time of day a particular event occurred at a particular computer – for example, for accountancy purposes – it is necessary to synchronize its clock with a trustworthy, external source of time. This is external synchronization. Also, if computer clocks are synchronized with one another to a known degree of accuracy, then we can, within the bounds of this accuracy, measure the interval between two events occurring at different computers, by appealing to their local clocks. This is internal synchronization. Two or more computers that are internally synchronized are not necessarily externally synchronized, since they may drift collectively from external time.
o Secondly, algorithms that depend upon clock synchronization have been developed for several problems in distributed systems. These include maintaining the consistency of distributed data (for example, the use of timestamps to serialize transactions), checking the authenticity of a request, and eliminating the processing of duplicate updates.
o Thirdly, the notion of physical time is itself a problem in a distributed system. This is not due to the effects of special relativity, which are negligible or nonexistent for normal computers, but to a similar limitation concerning our ability to pass information from one computer to another.
This is that the clocks belonging to different computers can only be synchronized, at least in the majority of cases, by network communication. Message passing is limited by virtue of the speed at which it can transfer information, but this in itself would not be a problem if we knew how long message transmission took. The problem is that sending a message usually takes an unpredictable amount of time. Now let us see different aspects of clock synchronization one by one.

8.2.1 Synchronizing physical clocks: Computers each contain their own physical clock. These clocks are electronic devices that count oscillations occurring in a crystal at a definite frequency, and which typically divide this count and store the result in a counter register. Clock devices can be programmed to generate interrupts at regular intervals in order that, for example, time slicing can be implemented; however, we shall not concern ourselves with this aspect of clock operation. The clock output can be read by software and scaled into a suitable time unit. This value can be used to timestamp any event experienced by any process executing at the host computer. By 'event' we mean an action that appears to occur indivisibly, all at once – such as sending or receiving a message. Note, however, that successive events will correspond to different timestamps only if the clock resolution – the period between updates of the clock register – is smaller than the time interval between successive events. The rate at which events occur depends on such factors as the length of the processor instruction cycle. In order to compare timestamps generated by clocks of the same physical construction at different computers, one might think that we need only know the relative offset of one clock's counter from that of the other – for example, that one count had the value 1958 when the other was initialized to 0.
Unfortunately, this supposition is based on a false argument: computer clocks in practice are extremely unlikely to 'tick' at the same rate, whether or not they are of the 'same' physical construction.

Clock drift: The crystal-based clocks used in computers are, like any other clocks, subject to clock drift (Figure 8.1), which means that they count time at different rates, and so deviate. The underlying oscillators are subject to physical variations, with the consequence that their frequencies of oscillation differ. Moreover, even the same clock's frequency varies with the temperature. Designs exist that attempt to compensate for this variation, but they cannot eliminate it.

Fig.8.1 Drift between computer clocks in a distributed system.

The difference in the oscillation period between two clocks might be extremely small, but the difference accumulated over many oscillations leads to an observable difference in the counts registered by two clocks, no matter how accurately they were initialized to the same value. A clock's drift rate is the change in the offset (difference in reading) between the clock and a nominal perfect reference clock, per unit of time measured by the reference clock. For clocks based on a quartz crystal, this is about 10^-6 – giving a difference of one second every 1,000,000 seconds, or 11.6 days.

Coordinated universal time: The most accurate physical clocks known use atomic oscillators, whose accuracy is about one part in 10^13. The output of these atomic clocks is used as the standard for elapsed real time, known as International Atomic Time. Coordinated universal time – abbreviated as UTC – is an international standard that is based on atomic time, but a so-called leap second is occasionally inserted or deleted to keep in step with astronomical time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites covering many parts of the world.
For example, in the US the radio station WWV broadcasts time signals on several short-wave frequencies. Satellite sources include the Geostationary Operational Environmental Satellite (GOES) and the Global Positioning System (GPS). Receivers are available commercially. Compared with 'perfect' UTC, the signals received from land-based stations have an accuracy in the order of 0.1-10 milliseconds, depending on the station used. Signals received from GOES are accurate to about 0.1 ms, and signals received from GPS are accurate to about one millisecond. Computers with receivers attached can synchronize their clocks with these timing signals.

Compensating for clock drift: If the time provided by a time service, such as Coordinated Universal Time signals, is greater than the time at a computer C, then it may be possible simply to set C's clock to the time service's time. Several clock 'ticks' appear to have been missed, but time continues to advance, as expected. Now consider what should happen if the time service's time is behind that of C. We cannot set C's time back, because this is liable to confuse applications that rely on the assumption that time always advances. The solution is not to set C's clock back, but to cause it to run slow for a period, until it is in accord with the timeserver's time. It is possible to change the rate at which updates are made to the time as given to applications. This can be achieved in software, without changing the rate at which the hardware clock ticks (an operation which is not always supported by hardware clocks). Let us call the time given to applications (the software clock's reading) S, and the time given by the hardware clock H. Let the compensating factor be δ, so that S(t) = H(t) + δ(t). The simplest form for δ that makes S change continuously is a linear function of the hardware clock: δ(t) = aH(t) + b, where a and b are constants to be found. Substituting for δ in the identity, we have S(t) = (1 + a)H(t) + b.
Let the value of the software clock be Tskew when H = h, and let the actual time at that point be Treal. We may have that Tskew > Treal or Tskew < Treal. If S is to give the actual time after N further ticks, we must have Tskew = (1 + a)h + b, and Treal + N = (1 + a)(h + N) + b. By solving these equations, we find a = (Treal - Tskew)/N and b = Tskew - (1 + a)h.

8.2.1.1 Cristian's method for synchronizing clocks: One way to achieve synchronization between computers in a distributed system is for a central timeserver process S to supply the time according to its clock upon request, as shown in Fig 8.2. The timeserver computer can be fitted with a suitable receiver so as to be synchronized with UTC. If a process P requests the time in a message mr, and receives the time value t in a message mt, then in principle it could set its clock to the time t + Ttrans, where Ttrans is the time taken to transmit mt from S to P (t is inserted in mt at the last possible point before transmission from S's computer).

Fig.8.2 Clock synchronization using a timeserver.

Unfortunately, Ttrans is subject to variation. In general, other processes are competing with S and P for resources at each computer involved, and other messages compete with mt for the network. These factors are unpredictable and not practically avoidable or accurately measurable in most installations. In general, we may say that Ttrans = min + x, where x ≥ 0. The minimum value min is the value that would be obtained if no other processes executed and no other network traffic existed; min can be measured or conservatively estimated. The value of x is not known in a particular case, although a distribution of values may be measurable for a particular installation. Cristian suggested the use of such a timeserver, connected to a device that receives signals from a source of UTC, to synchronize computers.
Synchronization between the timeserver and its UTC receiver can be achieved by a method that is similar to the following procedure for synchronization between computers. A process P wishing to learn the time from S can record the total round-trip time Tround with reasonable accuracy if its rate of clock drift is small. For example, the round-trip time should be in the order of 1-10 milliseconds on a LAN, over which time a clock with a drift rate of 10^-6 varies by at most 10^-5 milliseconds. Let the time returned in S's message mt be t. A simple estimate of the time to which P should set its clock is t + Tround/2, which assumes that the elapsed time is split equally before and after S placed t in mt. Let the times between sending and receipt for mr and mt be min + x and min + y, respectively. If the value of min is known or can be conservatively estimated, then we can determine the accuracy of this result as follows. The earliest point at which S could have placed the time in mt was min after P dispatched mr. The latest point at which it could have done this was min before mt arrived at P. The time by S's clock when the reply message arrives is therefore in the range [t + min, t + Tround - min]. The width of this range is Tround - 2min, so the accuracy is ±(Tround/2 - min).

Discussion of Cristian's algorithm: As described, Cristian's method suffers from the problem associated with all services implemented by a single server: the single timeserver might fail and thus render synchronization temporarily impossible. Cristian suggested, for this reason, that time should be provided by a group of synchronized timeservers, each with a receiver for UTC time signals. For example, a client could multicast its request to all servers, and use only the first reply obtained. Note that a malfunctioning timeserver that replied with spurious time values, or an impostor timeserver that replied with deliberately incorrect times, could wreak havoc in a computer system.
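The estimate t + Tround/2 and its accuracy bound can be sketched in code. This is a minimal illustration of the idea, not a complete implementation: the request_time_from_server helper standing in for the mr/mt message exchange is a hypothetical placeholder.

```python
import time

def cristian_sync(request_time_from_server, min_delay=0.0):
    """Estimate the timeserver's current time, Cristian-style.

    request_time_from_server: hypothetical helper that sends mr and
    returns the server timestamp t carried back in mt.
    min_delay: conservative estimate of the minimum one-way delay (min).
    Returns (estimate, accuracy): the clock setting t + Tround/2 and
    the bound +/-(Tround/2 - min) on its error.
    """
    start = time.monotonic()           # local clock when mr is dispatched
    t = request_time_from_server()     # server's timestamp from mt
    t_round = time.monotonic() - start # total round-trip time Tround
    estimate = t + t_round / 2         # assume the delay splits evenly
    accuracy = t_round / 2 - min_delay
    return estimate, accuracy
```

A caller would then set its software clock to the estimate, gradually if the required adjustment is negative, as discussed under compensating for clock drift.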
8.2.1.2 The Berkeley algorithm: Gusella and Zatti describe an algorithm for internal synchronization which they developed for collections of computers running Berkeley UNIX. In it, a coordinator computer is chosen to act as the master. Unlike in Cristian's protocol, this computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The slaves send back their clock values to it. The master estimates their local clock times by observing the round-trip times (similarly to Cristian's technique), and it averages the values obtained (including its own clock's reading). The balance of probabilities is that this average cancels out the individual clocks' tendencies to run fast or slow. The accuracy of the protocol depends upon a nominal maximum round-trip time between the master and the slaves. The master eliminates any occasional readings associated with larger times than this maximum. Instead of sending the updated current time back to the other computers – which would introduce further uncertainty due to the message transmission time – the master sends the amount by which each individual slave's clock requires adjustment. This can be a positive or negative value. The algorithm eliminates readings from clocks that have drifted badly or that have failed and provide spurious readings. Such clocks could have a significant adverse effect if an ordinary average were taken. The master takes a fault-tolerant average. That is, a subset of clocks is chosen whose readings do not differ from one another by more than a specified amount, and the average is taken only of readings from these clocks. Gusella and Zatti describe an experiment involving 15 computers whose clocks were synchronized to within about 20-25 milliseconds using their protocol. The local clocks' drift rate was measured to be less than 2 x 10^-5, and the maximum round-trip time was taken to be 10 milliseconds.
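The fault-tolerant averaging step can be sketched as follows. This is an illustrative approximation – it keeps the largest group of readings that lie within max_diff of some anchor reading – rather than Gusella and Zatti's exact procedure.

```python
def fault_tolerant_average(readings, max_diff):
    """Average the largest subset of clock readings lying within
    max_diff of one of the readings, excluding badly drifted clocks."""
    best = []
    for anchor in readings:
        subset = [r for r in readings if abs(r - anchor) <= max_diff]
        if len(subset) > len(best):
            best = subset
    return sum(best) / len(best)

def adjustments(readings, max_diff):
    """Amount by which each slave must adjust its clock; the master
    sends these deltas (positive or negative), not the time itself."""
    avg = fault_tolerant_average(readings, max_diff)
    return [avg - r for r in readings]
```

For readings [3.0, 3.1, 2.9, 25.0] with max_diff 0.5, the badly drifted reading 25.0 is excluded and the remaining readings are averaged; its adjustment is then large and negative.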
8.2.1.3 The Network Time Protocol: The Network Time Protocol (NTP) defines an architecture for a time service and a protocol to distribute time information over a wide variety of interconnected networks. It has been adopted as a standard for clock synchronization throughout the Internet. NTP's chief design aims and features are:
o To provide a service enabling clients across the Internet to be synchronized accurately to UTC, despite the large and variable message delays encountered in Internet communication. NTP employs statistical techniques for the filtering of timing data, and it discriminates between the qualities of timing data from different servers.
o To provide a reliable service that can survive lengthy losses of connectivity. There are redundant servers and redundant paths between the servers. The servers can reconfigure so as still to provide the service if one of them becomes unreachable.
o To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers. The service is designed to scale to large numbers of clients and servers.
o To provide protection against interference with the time service, whether malicious or accidental. The time service uses authentication techniques to check that timing data originate from the claimed trusted sources. It also validates the return addresses of messages sent to it.

Fig.8.3 An example synchronization subnet in an NTP implementation. Note: Arrows denote synchronization control and numbers denote strata.

A network of servers located across the Internet provides the NTP service. Primary servers are directly connected to a time source such as a radio clock receiving UTC; secondary servers are synchronized, ultimately, to primary servers. The servers are connected in a logical hierarchy called a synchronization subnet (see Fig 8.3), whose levels are called strata. Primary servers occupy stratum 1: they are at the root.
Stratum 2 servers are secondary servers that are synchronized directly to the primary servers; stratum 3 servers are synchronized from stratum 2 servers, and so on. The lowest-level (leaf) servers execute in users' workstations. NTP servers synchronize in one of three modes: multicast, procedure-call and symmetric mode.

Multicast mode: is intended for use on a high-speed LAN. One or more servers periodically multicasts the time to the servers running in other computers connected by the LAN, which set their clocks assuming a small delay. This mode can only achieve relatively low accuracy, but one which is nonetheless considered sufficient for many purposes.

Procedure-call mode: is similar to the operation of Cristian's algorithm, described above. In this mode, one server accepts requests from other computers, which it processes by replying with its timestamp (current clock reading). This mode is suitable where higher accuracies are required than can be achieved with multicast, or where multicast is not supported in hardware. For example, file servers on the same or a neighboring LAN, which need to keep accurate timing information for file accesses, could contact a local master server in procedure-call mode.

Symmetric mode: is intended for use by the master servers that supply time information in LANs and by the higher levels (lower strata) of the synchronization subnet, where the highest accuracies are to be achieved. A pair of servers operating in symmetric mode exchange messages bearing timing information. Timing data are retained as part of an association between the servers that is maintained in order to improve the accuracy of their synchronization over time. In all modes, messages are delivered unreliably, using the standard UDP Internet transport protocol. In procedure-call mode and symmetric mode, messages are exchanged in pairs.
Each message bears timestamps of recent message events: the local times when the previous NTP message between the pair was sent and received, and the local time when the current message was transmitted. The recipient of the NTP message notes the local time when it receives the message. The four times Ti-3, Ti-2, Ti-1 and Ti are shown in Fig 8.4 for messages m and m' sent between servers A and B. Note that in symmetric mode, unlike in Cristian's algorithm, there can be a nonnegligible delay between the arrival of one message and the dispatch of the next. Also, messages may be lost, but the three timestamps carried by each message are nonetheless valid.

Fig.8.4 Messages exchanged between a pair of NTP peers.

For each pair of messages sent between two servers the NTP protocol calculates an offset oi, which is an estimate of the actual offset between the two clocks, and a delay di, which is the total transmission time for the two messages. If the true offset of the clock at B relative to that at A is o, and if the actual transmission times for m and m' are t and t' respectively, then we have:

Ti-2 = Ti-3 + t + o, and Ti = Ti-1 + t' - o

Defining a = Ti-2 - Ti-3 and b = Ti-1 - Ti, this leads to:

di = t + t' = a - b; o = oi + (t' - t)/2, where oi = (a + b)/2

Using the fact that t, t' ≥ 0, it can be shown that oi - di/2 ≤ o ≤ oi + di/2. Thus oi is an estimate of the offset, and di is a measure of the accuracy of this estimate.

8.3 Logical time and logical clocks: From the point of view of any single process, events are ordered uniquely by the times shown on the local clock. However, as Lamport pointed out, since we cannot perfectly synchronize clocks across a distributed system, we cannot in general use physical time to find out the order of any arbitrary pair of events occurring within it. The order of events occurring at different processes can be critical in a distributed application.
In general, we can use a scheme which is similar to physical causality, but which applies in distributed systems, to order some of the events that occur at different processes. This ordering is based on two simple and intuitively obvious points:
o If two events occurred at the same process, then they occurred in the order in which it observes them.
o Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving the message.
Lamport called the ordering obtained by generalizing these two relationships the happened-before relation. It is also sometimes known as the relation of causal ordering or potential causal ordering. More formally, we write x →p y if two events x and y occurred at a single process p and x occurred before y. Using this restricted order we can define the happened-before relation, denoted by →, as follows:
HB1: If ∃ process p: x →p y, then x → y.
HB2: For any message m, send(m) → rcv(m), where send(m) is the event of sending the message and rcv(m) is the event of receiving it.
HB3: If x, y and z are events such that x → y and y → z, then x → z.

Fig.8.5 Events occurring at three processes.

Thus if x → y, then we can find a series of events e1, e2, ..., en occurring at one or more processes such that x = e1 and y = en, and for i = 1, 2, ..., n-1 either HB1 or HB2 applies between ei and ei+1: that is, either they occur in succession at the same process, or there is a message m such that ei = send(m) and ei+1 = rcv(m). The sequence of events e1, e2, ..., en need not be unique. The relation → is illustrated for the case of the three processes p1, p2 and p3 in Fig 8.5. It can be seen that a → b, since the events occur in this order at process p1, and similarly c → d; b → c, since these are the sending and reception of message m1, and similarly d → f. Combining these relations, we may also say that, for example, a → f. It can also be observed from Fig 8.5 that not all events are related by the relation →.
We say that events such as a and e that are not ordered by → are concurrent, and write this a || e. The relation → captures a flow of data intervening between two events. Note, however, that in real life data can flow in ways other than by message passing. For example, if Smith enters a command to his process to send a message, then telephones Jones, who commands her process to issue another message, then the issuing of the first message clearly happened before that of the second. Unfortunately, since no network messages were sent between the issuing processes, we cannot model this type of relationship in our system.

8.3.1 Logical clocks: Lamport invented a simple mechanism by which the happened-before ordering can be captured numerically, called a logical clock. A logical clock is a monotonically increasing software counter, whose value need bear no particular relationship to any physical clock, in general. Each process p keeps its own logical clock, Cp, which it uses to timestamp events.

Fig.8.6 Logical timestamps for the events shown in Fig 8.5.

We denote the timestamp of event a at p by Cp(a), and by C(b) we denote the timestamp of event b at whatever process it occurred. To capture the happened-before relation →, processes update their logical clocks and transmit the values of their logical clocks in messages as follows:
LC1: Cp is incremented before each event is issued at process p: Cp := Cp + 1.
LC2: a) When a process p sends a message m, it piggybacks on m the value t = Cp.
b) On receiving (m, t), a process q computes Cq := max(Cq, t) and then applies LC1 before timestamping the event rcv(m).
Although we increment clocks by one, we could have chosen any positive value. It can easily be shown, by induction on the length of any sequence of events relating two events a and b, that a → b => C(a) < C(b). Note that the converse is not true. If C(a) < C(b), then we cannot infer that a → b.
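The rules LC1 and LC2 can be sketched as a small class. This is a minimal illustration: message transport and process identifiers are left out.

```python
class LamportClock:
    """A logical clock following Lamport's rules LC1 and LC2."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # LC1: increment before each event issued at this process.
        self.time += 1
        return self.time

    def send(self):
        # LC2(a): the returned value t = Cp is piggybacked on the
        # outgoing message.
        return self.tick()

    def receive(self, t):
        # LC2(b): take the max of the local clock and the piggybacked
        # value, then apply LC1 before timestamping the event rcv(m).
        self.time = max(self.time, t)
        return self.tick()
```

For example, if process p sends a message timestamped 1 and process q (clock still 0) receives it, q's clock jumps to max(0, 1) and then ticks to 2, so the receive event is ordered after the send.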
In Fig 8.6 we illustrate the use of logical clocks for the example given in Fig 8.5. Each of the processes p1, p2 and p3 has its logical clock initialized to 0. The clock values given are those immediately after the event to which they are adjacent. Note that, for example, C(b) > C(e) but b || e.

8.3.2 Totally ordered logical clocks: Logical clocks impose only a partial order on the set of all events, since some pairs of distinct events, generated by different processes, have numerically identical timestamps. However, we can extend this to a total order – that is, one for which all pairs of distinct events are ordered – by taking into account the identifiers of the processes at which events occur. If a is an event occurring at pa with local timestamp Ta, and b is an event occurring at pb with local timestamp Tb, we define the global logical timestamps for these events to be (Ta, pa) and (Tb, pb) respectively, and we define (Ta, pa) < (Tb, pb) if and only if either Ta < Tb, or Ta = Tb and pa < pb.

8.4 Distributed Coordination: Distributed processes often need to coordinate their activities. For example, if a collection of processes shares a resource or collection of resources managed by a server, then mutual exclusion is often required to prevent interference and ensure consistency when accessing the resources. This is essentially the critical section problem, familiar in the domain of operating systems. In the distributed case, however, neither shared variables nor facilities supplied by a single local kernel can be used to solve it in general. A separate, generic mechanism for distributed mutual exclusion is required in certain cases – one that is independent of the particular resource management scheme in question. Distributed mutual exclusion involves a single process being given a privilege – the right to access shared resources – temporarily, before another process is granted it.
In some other cases, however, the requirement is for a set of processes to choose one of their number to play a privileged, coordinating role over the long term. A method for choosing a unique process to play a particular role is called an election algorithm. This section now examines some algorithms for achieving the goals of distributed mutual exclusion and elections.

8.4.1 Distributed mutual exclusion: Our basic requirements for mutual exclusion concerning some resource (or collection of resources) are as follows:
ME1: (safety) At most one process may execute in the critical section (CS) at a time.
ME2: (liveness) A process requesting entry to the CS is eventually granted it (so long as any process executing in the CS eventually leaves it).
ME2 implies that the implementation is deadlock-free, and that starvation does not occur. A further requirement, which may be made, is that of causal ordering:
ME3: (ordering) Entry to the CS should be granted in happened-before order.
A process may continue with other processing while waiting to be granted entry to a critical section. During this time it might send a message to another process, which consequently also tries to enter the critical section. ME3 specifies that the first process should be granted access before the second. We now discuss some algorithms for achieving these requirements.

8.4.1.1 The central server algorithm: The simplest way to achieve mutual exclusion is to employ a server that grants permission to enter the critical section. For the sake of simplicity, we shall assume that there is only one critical section managed by the server. Recall that the protocol for executing a critical section is as follows:
enter() (* enter critical section - block if necessary *)
...... (* access shared resources in critical section *)
exit() (* leave critical section - other processes may now enter *)
Fig 8.7 shows the use of this server.
To enter a critical section, a process sends a request message to the server and awaits a reply from it. Conceptually, the reply constitutes a token signifying permission to enter the critical section. If no other process has the token at the time of the request, then the server replies immediately, granting the token. If the token is currently held by another process, then the server does not reply but queues the request.

Fig.8.7 Server managing a mutual exclusion token for a set of processes.

On exiting the critical section, a message is sent to the server, giving it back the token. If the queue of waiting processes is not empty, then the server chooses the oldest entry in the queue, removes it and replies to the corresponding process. The chosen process then holds the token. In the figure, we show a situation in which p2's request has been appended to the queue, which already contained p4's request. p3 exits the critical section, and the server removes p4's entry and grants permission to enter to p4 by replying to it. Process p1 does not currently require entry to the critical section. The server is also a critical point of failure. Any process attempting to communicate with it, either to enter or exit the critical section, will detect the failure of the server. To recover from a server failure, a new server can be created (or one of the processes requiring mutual exclusion could be picked to play a dual role as a server). Since the server must be unique if it is to guarantee mutual exclusion, an election must be called to choose one of the clients to create or act as the server, and to multicast its address to the others. We shall describe some election algorithms below. When the new server has been chosen, it needs to obtain the state of its clients, so that it can process their requests as the previous server would have done.
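The server's grant/queue/release bookkeeping described above can be sketched as follows. Only the server side is shown, and the message transport is assumed; this is an illustration, not a fault-tolerant implementation.

```python
from collections import deque

class TokenServer:
    """Central server managing a single mutual-exclusion token."""

    def __init__(self):
        self.holder = None    # process currently holding the token
        self.queue = deque()  # queued requesters, oldest first

    def request(self, pid):
        """Handle a request message: grant the token now (True) or
        queue the requester without replying (False)."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """Handle a release message: pass the token to the oldest
        queued requester, if any; return the new holder (or None)."""
        assert pid == self.holder, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder
```

Replaying the scenario of Fig 8.7: p4 and then p2 queue behind the holder p3; when p3 exits, the server grants the token to p4, and later to p2.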
The ordering of entry requests will be different in the new server from that in the failed one unless precautions are taken, but we shall not address this problem here.

8.4.1.2 A distributed algorithm using logical clocks: Ricart and Agrawala developed an algorithm to implement mutual exclusion that is based upon distributed agreement, instead of using a central server. The basic idea is that processes that require entry to a critical section multicast a request message, and can enter it only when all the other processes have replied to this message. The conditions under which a process replies to a request are designed to ensure that conditions ME1 - ME3 are met. Again, we can associate permission to enter with a token; granting the token in this case takes multiple messages. The assumptions of the algorithm are that the processes p1 ... pn know one another's addresses, that all messages sent are eventually delivered, and that each process pi keeps a logical clock, updated according to the rules LC1 and LC2 given earlier in this unit. Messages requesting the token are of the form <T, pi>, where T is the sender's timestamp and pi is the sender's identifier. For simplicity's sake, we assume that only one critical section is at issue, and so it does not have to be identified. Each process records its state of having released the token (RELEASED), wanting the token (WANTED) or possessing the token (HELD) in a variable. The protocol is given in Fig 8.8.
Fig. 8.8 Ricart and Agrawala's algorithm

On initialization:
    state := RELEASED;

To obtain the token:
    state := WANTED;
    Multicast request to all processes;    /* request processing deferred here */
    T := request's timestamp;
    Wait until (number of replies received = (n - 1));
    state := HELD;

On receipt of a request <Ti, pi> at pj (i ≠ j):
    If (state = HELD) or (state = WANTED and (T, pj) < (Ti, pi))
    Then queue request from pi without replying;
    Else reply immediately to pi;
    End if

To release the token:
    state := RELEASED;
    Reply to any queued requests;

If a process requests the token while it is RELEASED everywhere else – that is, no other process wants it – then all processes will reply immediately to the request and the requester will obtain the token. If the token is HELD at some process, then that process will not reply to requests until it has finished with the token, and so the requester cannot obtain the token in the meantime. If two or more processes request the token at the same time, then whichever process's request bears the lowest timestamp will be the first to collect (n-1) replies, granting it the token next. If the requests bear equal timestamps, the process identifiers are compared to order them. Note that, when a process requests the token, it defers processing requests from other processes until its own request has been sent and the timestamp T is known. This is so that processes make consistent decisions when processing requests. To illustrate the algorithm, consider a situation involving three processes, p1, p2 and p3, shown in Fig 8.9. Let us assume that p3 is not interested in the token, and that p1 and p2 request it concurrently. The timestamp of p1's request is 41, and that of p2 is 34. When p3 receives their requests, it replies immediately. When p2 receives p1's request, it finds that its own request has the lower timestamp, and so does not reply, holding p1 off. However, p1 finds that p2's request has a lower timestamp than that of its own request, and so replies immediately.
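The reply decision at the heart of Fig 8.8 can be sketched in Python as follows. This is an illustrative fragment, not from the original text: message passing is omitted, and Python's tuple comparison supplies exactly the ordering the algorithm needs – timestamps first, with process identifiers breaking ties.

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def on_request(state, own_ts, own_pid, req_ts, req_pid):
    """Decide how a process in the given state handles a request
    <req_ts, req_pid>.  own_ts is the timestamp of the process's own
    pending request (meaningful only when state == WANTED).
    Returns 'queue' (defer the reply) or 'reply' (reply immediately)."""
    if state == HELD or (state == WANTED and (own_ts, own_pid) < (req_ts, req_pid)):
        return "queue"
    return "reply"
```

Replaying the worked example: p3 (RELEASED) replies immediately to both requests; p2 (WANTED with timestamp 34) queues p1's request, which bears timestamp 41; and p1 (WANTED with timestamp 41) replies immediately to p2's request.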
On receiving this second reply, p2 possesses the token. When p2 releases the token, it will reply to p1's request, and so grant the token to it. Obtaining the token takes 2(n-1) messages in this algorithm: (n-1) to multicast the request, followed by (n-1) replies. Or, if there is hardware support for multicast, only one message is required for the request; the total is then n messages. It is thus, in general, a considerably more expensive algorithm than the central server algorithm just described. Note also that, while it is a fully distributed algorithm, the failure of any process involved would make progress impossible. And in the distributed algorithm all the processes involved receive and process every request, so no performance gain has been made over the single-server bottleneck, which does just the same. Finally, note that a process that wishes to obtain the token and which was the last to hold it still goes through the protocol as described, even though it could simply decide locally to reallocate it to itself. Ricart and Agrawala refined this protocol so that it requires n messages to obtain the token in the worst (and common) case, without hardware support for multicast.

8.4.1.3 A ring-based algorithm: One of the simplest ways to arrange mutual exclusion between n processes p1, ..., pn is to arrange them in a logical ring. The idea is that exclusion is conferred by obtaining a token in the form of a message passed from process to process in a single direction – clockwise, say – round the ring. The ring topology – which is unrelated to the physical interconnections between the underlying computers – is created by giving each process the address of its neighbor. A process that does not require entry to the critical section forwards the token to its neighbor as soon as it receives it; a process that requires entry waits until it receives the token, and then retains it. To exit the critical section, the process sends the token on to its neighbor.
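The token circulation just described can be sketched as a small Python simulation. This is an illustrative sketch, not part of the original unit: the ring is modelled as integer process identifiers 0 .. n-1, and message passing is replaced by counting forwarding steps.

```python
def pass_token(n, token_at, wants):
    """Circulate the mutual-exclusion token clockwise round a ring of n
    processes (numbered 0 .. n-1) until it reaches a process that
    requires entry to the critical section.

    wants is the (assumed non-empty) set of processes currently
    requiring entry.  Returns (new_holder, messages_sent)."""
    msgs = 0
    while token_at not in wants:
        token_at = (token_at + 1) % n   # forward token to clockwise neighbour
        msgs += 1
    return token_at, msgs
```

For example, with n = 4 and the token at process 0, a request by process 3 costs n-1 = 3 messages; if the current holder itself requires entry, no messages are needed at all.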
Fig. 8.9 Multicast synchronization: p1 and p2 request the token concurrently, with timestamps 41 and 34 respectively; p3 replies immediately to both.

The arrangement of processes is shown in Fig 8.10. It is straightforward to verify that the conditions ME1 and ME2 are met by this algorithm, but the token is not necessarily obtained in happened-before order. It can take from 1 to (n-1) messages to obtain the token, from the point at which it becomes required. However, messages are sent around the ring even when no process requires the token. If a process fails, then clearly no progress can be made beyond it in transferring the token, until a reconfiguration is applied to extract the failed process from the ring.

Fig. 8.10 A ring of processes transferring a mutual exclusion token.

If the process holding the token fails, then an election is required to pick a unique process from the surviving members, which will regenerate the token and transmit it as before. Care has to be taken to ensure that the 'failed' process really has failed, and does not later unexpectedly inject the old token into the ring, so that there are two tokens. This situation can arise because process failure can only be ascertained by repeated failure of the process to acknowledge messages sent to it.

8.5 Elections: An election is a procedure carried out to choose a process from a group, for example to take over the role of a process that has failed. The main requirement is for the choice of elected process to be unique, even if several processes call elections concurrently.

8.5.1 The bully algorithm: The bully algorithm can be used when the members of the group know the identities and addresses of the other members. The algorithm selects the surviving member with the largest identifier to function as the coordinator. We assume that communication is reliable, but that processes may fail during an election. The algorithm proceeds as follows. There are three types of messages in this algorithm.
An election message is sent to announce an election
An answer message is sent in response to an election message
A coordinator message is sent to announce the identity of the new coordinator

A process begins an election when it notices that the coordinator has failed. To begin an election, a process sends an election message to those processes that have a higher identifier. It then awaits an answer message in response. If none arrives within a certain time, the process considers itself the coordinator, and sends a coordinator message to all processes with lower identifiers announcing this fact. Otherwise, the process waits a further limited period for a coordinator message to arrive from the new coordinator. If none arrives, it begins another election. If a process receives a coordinator message, it records the identifier of the coordinator contained within it, and treats that process as the coordinator. If a process receives an election message, it sends back an answer message and begins another election – unless it has begun one already. When a failed process is restarted, it begins an election. If it has the highest process identifier, then it will decide that it is the coordinator, and announce this to the other processes. Thus it will become the coordinator, even though the current coordinator is functioning. It is for this reason that the algorithm is called the 'bully' algorithm.

The operation of the algorithm is shown in Fig 8.11. There are four processes, p1-p4, and an election is called when p1 detects the failure of the coordinator, p4, and announces an election (stage 1 in the figure). On receiving an election message from p1, p2 and p3 send answer messages to p1 and begin their own elections; p3 sends an answer message to p2, but receives no answer message from the failed process p4 (stage 2). It therefore decides that it is the coordinator. But before it can send out the coordinator message, it too fails (stage 3).
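The message-handling rules above determine the eventual outcome of an election, which can be sketched in Python. This is an illustrative sketch, not from the original text: it is synchronous, ignores timeouts and message loss, and assumes no further failures occur while the election runs.

```python
def bully_election(alive, starter):
    """Outcome of a bully election under the stated assumptions.

    alive   -- set of identifiers of processes that are up
    starter -- the process that noticed the coordinator's failure

    The starter sends election messages to all higher-numbered
    processes; any live higher process answers and runs its own
    election, so the eventual coordinator is the largest live
    identifier."""
    higher = {p for p in alive if p > starter}
    if not higher:
        return starter              # no answers arrive: starter wins
    return bully_election(alive, min(higher))   # an answerer carries on
```

Replaying the scenario of Fig 8.11: with p4 failed, bully_election({1, 2, 3}, 1) yields p3; after p3 also fails, bully_election({1, 2}, 1) yields p2, the result reached in stage 4.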
When p1's timeout period expires (which we assume occurs before p2's timeout expires), it notices the absence of a coordinator message and begins another election. Eventually, p2 is elected coordinator (stage 4).

Fig. 8.11 The bully algorithm: the election of coordinator p2, after the failure of p4 and then p3.

8.5.2 A ring-based election algorithm: We give the algorithm of Chang and Roberts, suitable for a collection of processes arranged in a logical ring (refer Fig 8.12). We assume that the processes do not know the identities of the others a priori, and that each process knows only how to communicate with its neighbor in, say, the clockwise direction. The goal of this algorithm is to elect a single coordinator, which is the process with the largest identifier. The algorithm assumes that all the processes remain functional and reachable during its operation. Initially, every process is marked as a non-participant in an election. Any process can begin an election. It proceeds by marking itself as a participant, placing its identifier in an election message and sending it to its neighbor. When a process receives an election message, it compares the identifier in the message with its own. If the arrived identifier is the greater, then it forwards the message to its neighbor. If the arrived identifier is the smaller and the receiver is not a participant, then it substitutes its own identifier in the message and forwards it; but it does not forward the message if it is already a participant. On forwarding an election message in any case, the process marks itself as a participant. If, however, the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator. The coordinator marks itself as a non-participant and sends an elected message to its neighbor, announcing its identity.

Fig. 8.12 A ring-based election in progress. The election was started by process 17; the highest process identifier encountered so far is 24; participant processes are shown darkened.

The point of marking processes as participant or non-participant is so that messages arising when another process starts an election at the same time are extinguished as soon as possible, and always before the 'winning' election result has been announced. If only a single process starts an election, then the worst case is when its anticlockwise neighbor has the highest identifier. A total of n-1 messages are then required to reach this neighbor, which will not announce its election until its identifier has completed another circuit, taking a further n messages. The elected message is then sent n times, making 3n-1 messages in all. An example of a ring-based election in progress is shown in Fig 8.12. The election message currently contains 24, but process 28 will replace this with its identifier when the message reaches it.

8.6 Summary: In this unit we first discussed the importance of accurate timekeeping for distributed systems. We then described algorithms for synchronizing clocks despite the drift between them and the variability of message delays between computers. The degree of synchronization accuracy that is practically obtainable fulfils many requirements, but is nonetheless not sufficient to determine the ordering of an arbitrary pair of events occurring at different computers. The happened-before relationship is a partial order on events, which reflects a flow of information between them – within a process, or via messages between processes. Some algorithms require events to be ordered in happened-before order, for example, successive updates made at separate copies of data.
Logical clocks are counters that are updated so as to reflect the happened-before relationship between events. The unit then described the need for processes to access shared resources under conditions of mutual exclusion. Resource servers do not in all cases implement locks, and a separate distributed mutual exclusion service is then required. Three algorithms that achieve mutual exclusion were considered: the central server algorithm, a distributed algorithm using logical clocks, and a ring-based algorithm. These are heavyweight mechanisms that cannot withstand failure as they stand, although they can be modified to be fault-tolerant. On the whole, it seems advisable to integrate locking with resource management. Finally, the unit considered the bully algorithm and a ring-based algorithm, whose common aim is to elect a process uniquely from a given set – even if several elections take place concurrently. These algorithms could be used, for example, to elect a new master time server, or a new lock server, when the previous one fails.