Fall 2012
Distributed Systems (5DV020)
Coordination and agreement
Fall 2012
1
A set of processes need to
coordinate their actions or agree
on one or more values
2
Processes often need to coordinate their actions
and/or reach an agreement
Which process gets to access a shared resource?
Has the master crashed? Elect a new one!
Failure detection – how can we know that a node has
failed (e.g., crashed)?
3
1
Fall 2012
Why not use a master-slave relationship?
Because we want our systems to keep
working correctly even if failures occur
We need to avoid single points of failure
4
Failure detection
5
Keep-alive messages every X time units
No message? the process has failed
Crashing immediately after sending?
Message delays, re-routing, etc.
6
2
Fall 2012
Unreliable failure detection
Unsuspected vs. Suspected
Reliable failure detection
Unsuspected vs. Failed
Synchronous systems
7
Distributed mutual exclusion
8
Is needed when a collection of
processes share a resource and need
to coordinate their access to the
resource
… but the solution need to be based on message passing
9
3
Fall 2012
Assumptions
10
N processes pi, i=1, 2, …, N that do not share variables
pi access shared resources in a critical section
The system is asynchronous, process do not fail, and message
delivery is reliable
pi’s are well behaved, finite time on the critical section
enter()
resourceAccesses()
exit()
Application level protocol for
executing a critical section
11
Essential requirements
12
4
Fall 2012
ME1 (safety): at most 1 process may enter the
critical section at a time
ME2 (liveness): requests to enter and exit the
critical section eventually succeed
ME3 (→ ordering): if a request to enter the critical
section happened-before another, then access is
granted according to that order
13
Fairness
if all requests are related by the
happened-before relation,
it is not possible for a process to enter the
critical section more than once while
another waits to enter
14
Some metrics to evaluate & compare Mutex algorithms
Bandwidth consumed
Client delay
Effect upon throughput
Synchronization delay
15
5
Fall 2012
Mutex algorithms
16
Central server
17
Send request to server, oldest process in queue
gets access (a token), return token when done
No process has token reply (enter) immediately
Otherwise queue request
Oldest process in the queue gets token after released
18
6
Fall 2012
Queue of
requests
Server
4
2
1. Request
token
p1
p2
3. Grant
token
2. Release
token
p4
p3
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.2
19
Safety? Yes!
Liveness? Yes (as long as server does not crash)
→ ordering? No! Why not?
Performance
Entering – 2 messages (request + grant)
Exiting – 1 message
20
Ring-based
21
7
Fall 2012
Token is passed around a ring of processes
Want access? Wait until token comes, and claim it
(then pass the token along)
22
p
1
p
2
p
n
p
3
p
4
Token
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.3
23
Safety? Liveness? Yes (assuming no crashes)
→ ordering? Not even close!
Performance
Continuously uses network bandwidth
Client delay – between 0 – N messages
Exiting – 1 message
Synchronization delay – between 1 – N messages
24
8
Fall 2012
Ricart and Agrawala
25
The idea is to multicast a request
message an enter the critical
section only when all other
processes have replied to the
message
26
Each process
– Has unique process ID
– Has communication channels to the other processes
– Maintains a logical (Lamport) clock
– Has state ∈ {wanted, held, released}
Requests are multicasted to group
(process ID and clock value) <id, value>
Lowest clock value gets access first
Equal values? Check process ID!
27
9
Fall 2012
On initialization
state := RELEASED;
To enter the section
state := WANTED;
Multicast request to all processes;
request processing deferred here
T := request’s timestamp;
Wait until (number of replies received = (N – 1));
state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j)
if (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
then
queue request from pi without replying;
else
reply immediately to pi;
end if
To exit the critical section
state := RELEASED;
reply to any queued requests;
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.5
41
p
28
p
41
3
Reply
1
34
Reply
41
34
p
Reply
34
2
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.5
29
Have access or want it yourself, and <id, value> is
lower than incoming request?
Queue the request! Else, reply to it
30
10
Fall 2012
Safety? Liveness? → ordering? Yes!
…but every node is a point of failure
Performance
Entering – 2(N-1) messages
(n-1) multicast request + (n-1) replies
Synchronization delay – 1 message transmission
Improve performance
If process wants to re-enter critical section, and no new
requests have been made, just do it!
31
Maekawa’s voting
32
Optimization: only ask subset of processes for entry
Works as long as subsets overlap
Vote only in one election
at a time!
Safety? Yes
Liveness? → ordering? No, deadlocks can happen!
but if enhanced using vector clocks we can get both!
33
11
Fall 2012
Comparison of Mutex algorithms
34
Central server
simple and error-prone!
Ring-based algorithm
also simple, but not single point of failure
Ricart and Agrawala
completely distributed and decentralized, multicast request,
access when all have replied (ordered using logical clocks)
Maekawa's voting algorithm
ask only a subset for access: works if subsets are overlapping
35
Study the algorithms yourselves, including their
performance
Message loss?
This is an excellent question for exams
None of the algorithms handle this
Crashing processes?
Ring? No! others? depends (central- not server nor
holding or having requested token, Maekawa’s only if
crashed process is not in voting set)
36
12
Fall 2012
Summary
37
Failure detection, hard to get right in asynchronous systems
Unreliable failure detection, unsuspected vs. suspected
Reliable failure detection, unsuspected vs. failed
Mutual Exclusion
central server
ring-based
Ricart and Agrawala
Maekawa's voting algorithm
38
Election algorithms
39
13
Fall 2012
How to choose a process to play
a particular role in the system
40
Any process can call an election but can only call
one election at a time
The election must always produce a unique
winner
The elected process is the one with the largest
identifier, identifiers should be unique and
totally ordered
41
Requirements
42
14
Fall 2012
E1 (safety): A participant has electedi = False or
electedi = P, where P is chosen as the non-crashed
process with the highest identifier
E2 (liveness): All processes participate and
eventually set electedi to not =False or crash
43
Ring-based
You may be familiar with this one!
Bully
Very specific requirements, not generally applicable
44
Ring-based
45
15
Fall 2012
Goal is to elect a single process – the
coordinator
process with the largest identifier
During election, pass max(own ID, incoming ID) to next
process
If a process receives own ID, it must have been
highest and may send that it has been elected
46
Safety? Liveness? Yes!
3
Tolerates no failures (limited use)
17
4
Worst case, N-1 messages until
reaching peer with largest identifier
24
9
+
N messages to complete another
circuit
1
15
28
+
24
N messages advertising the election
The election was started by process 17.
The highest process identifier encountered so far is 24.
Participant processes are shown in a darker color
3N-1 messages
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.7
47
Bully algorithm
48
16
Fall 2012
Requires:
–
–
–
–
Synchronous system
All processes know of each other (which ones have higher ids)
Reliable failure detectors
Reliable message delivery
Allows
– Crashing processes
Safety? Not if process IDs can be reused!
Liveness? Yes (if message delivery is reliable)
49
Summary
50
Election algorithms
Seems like a simple problem, but non-trivial
solutions are... non-trivial
51
17
Fall 2012
Study these algorithms on your own, and their
properties
– Again, excellent exam material...
Want to read more about non-trivial election
algorithms?
– http://www.sics.se/~ali/teaching/dalg/l06.ppt
52
Consensus and related problems
53
The problem is for processes to
agree on a value after one or more
of the processes has proposed what
that value should be
… even in the presence of faults
54
18
Fall 2012
Agreement
55
Consensus
Any process can suggest a value,
all have to reach consensus on which is correct
Byzantine generals problem
One process can suggest a value,
others have to agree on what that value was
56
Consensus (in general)
Every process begins in an undecided state and
proposes a single value
Processes communicate with one another
exchanging values
Each process enters a decided state once it has
decided on a value
A process can’t change from a decided state
57
19
Fall 2012
P1
d 1 :=proceed
d 2 :=proceed
v 1 =proceed
1
P
2
v 2 =proceed
Consensus algorithm
v 3 =abort
P 3 (crashes)
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.16
58
Requirements
Termination: eventually each correct process
selects a value
Agreement: only a single value is chosen
Integrity: if the correct processes all propose the
same value, then any correct process in the
decided state has chosen that value
59
Simple if processes can’t fail
Collect all processes in a group
Each process multicast its proposed value to the members of the
group
Each process waits for N messages (including own)
Evaluates majority(v1, v2, …, vN)
If no majority exists, majority returns a special value
60
20
Fall 2012
Byzantine generals problem
61
3 or more generals, one or more can be faulty
Commander: Issues order to attack or retreat
Lieutenants:
Decide what to do
Coordinate their actions by repeating the order from the
commander to each other
Integrity – if the commander is correct, then all
processes decide on the value that the commander
proposed
62
p 1 (Commander)
p 1 (Commander)
1:v
1:v
p2
2:1:v
1:x
1:w
p3
p2
3:1:u
2:1:w
p3
3:1:x
Faulty processes are shown colored
Impossible with three processes
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.18
63
21
Fall 2012
p1
p 1 (Commander)
(Commander)
1:v
1:v
1:u
1:w
1:v
1:v
2:1:v
p2
2:1:u
p3
3:1:u
4:1:v
p2
4:1:v
3:1:w
4:1:v
2:1:u
p4
p3
3:1:w
4:1:v
2:1:v
3:1:w
p4
Faulty processes are shown colored
Possible with N ≥ 3f + 1 processes,
where f is amount of treacherous ones
Figure adapted from Instructor’s Guide for Coulouris, Dollimore, Kindberg and Blair, Distributed Systems: Concepts and Design Edn. 5 © Pearson Education 2012 – Figure 15.19
64
Solutions rely on system being synchronous
Asynchronous system – bad!
No timing constraints
Fischer's impossibility result
Even with just one crashing process, we cannot reach
consensus in an asynchronous system
This also applies to total ordering of messages and reliable
multicast...
Still, we manage to do quite well in practice, how can that be?
65
Impossibility workarounds
66
22
Fall 2012
Mask the faults
Use persistent storage and allow process restarts
Use failure detectors
No reliable detectors: but good enough (agree that
process is crashed if too long time has passed)
Eventually weak failure detector, reaches consensus
while allowing suspected processes to behave
correctly instead of excluding them.
67
Randomization
Introduce an element of chance to affect the
adversary’s strategy, enables to reach consensus in a
finite expected time, but still there are some cases
where consensus can’t be reached
68
Further reading:
Leslie Lamport Paxos Made Simple
ACM SIGACT News (Distributed Computing
Column) 32, 4 (Whole Number 121, December
2001) 51-58.
The article is well worth your time…
http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
69
23
Fall 2012
Summary
70
The problem of agreement is for processes to
agree on a value after one or more of the
processes has proposed what that value should
be, even in the presence of faults
Byzantine generals problem
71
Fisher's impossibility result (asynchronous systems)
it is impossible to reach consensus even with a single faulty
process
Synchronous systems
Impossible for three generals
Possible when N ≥ 3f + 1 processes, and f faulty processes
A number of techniques for avoiding Fisher’s result
Masking faults
Failure detectors
Randomization
72
24
Fall 2012
Next Lecture
Group communication
73
25
© Copyright 2026 Paperzz