Lecture IX: Coordination And Agreement
CMPT 431
A Replicated Service
[Figure: clients connect over a network to a group of servers (a master and slaves) that hold the data. Arrow legend: W (write), W (replication), R (read).]
CMPT 431 © A. Fedorova
A Need For Coordination And Agreement
[Figure: clients, a network, and servers (a master and slaves). Annotations: the servers must coordinate the election of a new master and must agree on a new master.]
Roadmap
• Today we will discuss protocols for coordination and
agreement
• This is a difficult problem because of failures and lack of
bound on message delay
• We will begin with a strong set of assumptions (assume
few failures), and then we will relax those assumptions
• We will look at several problems requiring communication
and agreement: distributed mutual exclusion, election
• We will finally learn that in an asynchronous distributed
system consensus cannot be guaranteed if processes may fail
Distributed Mutual Exclusion (DMTX)
• Similar to a local mutual exclusion problem
• Processes in a distributed system share a resource
• Only one process can access a resource at a time
• Examples:
– File sharing
– Sharing a bank account
– Updating a shared database
Assumptions and Requirements
• A synchronous system
• Processes do not fail
• Message delivery is reliable (exactly once)
• Protocol requirements:
– Safety: At most one process may execute in the critical section at a time
– Liveness: Requests to enter and exit the critical section eventually succeed
– Fairness: Requests to enter the critical section are granted in the order in which they were received
Evaluation Criteria of DMTX Algorithms
• Bandwidth consumed
– proportional to the number of messages sent in each entry and
exit operation
• Client delay
– delay incurred by a process at each entry and exit operation
• System throughput
– the rate at which processes can access the critical section
(number of accesses per unit of time)
DMTX Algorithms
• We will consider the following algorithms:
– Central server algorithm
– Ring-based algorithm
– An algorithm based on voting
The Central Server Algorithm
• Performance:
– Entering a critical section takes two messages (a request message followed by a grant message)
– System throughput is limited by the synchronization delay at the server (the time between the release message to the server and the grant message to the next client)
• Fault tolerance
– Does not tolerate failures: the server is a single point of failure
– What if the client holding the token fails?
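The message counts above can be checked with a minimal in-memory sketch. The class name `CentralLockServer`, its methods, and the message counter are hypothetical names chosen for illustration; real message passing is replaced by direct method calls, and no failures are modeled.

```python
from collections import deque


class CentralLockServer:
    """Holds one token; queues requests that arrive while the token is held."""

    def __init__(self):
        self.holder = None      # process currently holding the token
        self.waiting = deque()  # FIFO queue of pending requests
        self.messages = 0       # total messages exchanged so far

    def request(self, pid):
        """Handle a request message; return True if a grant is sent at once."""
        self.messages += 1      # request: client -> server
        if self.holder is None:
            self.holder = pid
            self.messages += 1  # grant: server -> client
            return True
        self.waiting.append(pid)
        return False

    def release(self, pid):
        """Handle a release message; return the next token holder, if any."""
        if pid != self.holder:
            raise ValueError("only the token holder may release")
        self.messages += 1      # release: client -> server
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.messages += 1  # grant: server -> next client
        return self.holder


server = CentralLockServer()
granted_p1 = server.request("p1")   # token is free, so p1 is granted at once
granted_p2 = server.request("p2")   # token is held, so p2 is queued
next_holder = server.release("p1")  # the server passes the token to p2
```

In this run p1 enters with two messages (request and grant), and the release plus the grant to p2 are the two messages that make up the synchronization delay at the server.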
A Ring-Based Algorithm
• Processes are arranged in a ring
• There is a communication channel from process pi to process p((i+1) mod N)
• Processes continuously pass the mutual exclusion token around the ring
• A process that does not need to enter the critical section (CS) passes the token along
• A process that needs to enter the CS retains the token; once it exits the CS, it passes the token on
• No fault tolerance
• Excessive bandwidth consumption: the token keeps circulating even when no process needs the CS
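The token passing described above can be sketched as a small simulation. The function name `ring_token_pass` and its parameters are hypothetical; the sketch assumes no failures and stops once every requesting process has entered the CS, whereas a real ring would keep circulating the token.

```python
def ring_token_pass(n, wants_cs, start=0):
    """Pass a token around a ring of n processes numbered 0..n-1.

    wants_cs is the set of processes that need the critical section.
    Returns (order in which processes entered the CS, token hops used).
    """
    pending = set(wants_cs)
    if not pending <= set(range(n)):
        raise ValueError("wants_cs must only contain ids in range(n)")
    order = []
    hops = 0
    holder = start
    while pending:
        if holder in pending:
            order.append(holder)       # holder retains the token and enters the CS
            pending.discard(holder)    # ... then exits the CS
        if pending:
            holder = (holder + 1) % n  # pass the token to the next process
            hops += 1
    return order, hops


entry_order, token_hops = ring_token_pass(5, {1, 3}, start=2)
```

With the token starting at process 2, process 3 enters after one hop, and process 1 has to wait for the token to travel through 4 and 0 first, which illustrates why client delay grows with the ring size.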
Maekawa’s Voting Algorithm
• To enter a critical section a process must receive a
permission from a subset of its peers
• Processes are organized in voting sets
• A process is a member of M voting sets
• All voting sets are of equal size (for fairness)
Maekawa’s Voting Algorithm
[Figure: processes p1, p2, p3 and p4 and their voting sets]
• Intersection of voting sets guarantees mutual exclusion
• To avoid deadlock, requests to enter critical section must be ordered
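One way to see the intersection property is to build voting sets and check every pair. The sketch below assumes a simple grid construction (each process votes with its row and its column), which is not taken from the slides and gives sets of size 2√N − 1 rather than the smaller sets Maekawa derives; the function names are hypothetical.

```python
from itertools import combinations
from math import isqrt


def grid_voting_sets(n):
    """Place n = k*k processes in a k x k grid; the voting set of a process
    is the union of its row and its column."""
    k = isqrt(n)
    if k * k != n:
        raise ValueError("this sketch needs a perfect-square number of processes")
    sets = {}
    for p in range(n):
        row, col = divmod(p, k)
        row_members = {row * k + c for c in range(k)}
        col_members = {r * k + col for r in range(k)}
        sets[p] = row_members | col_members
    return sets


def all_pairs_intersect(sets):
    """True if every two voting sets share at least one process."""
    return all(sets[a] & sets[b] for a, b in combinations(sets, 2))


voting_sets = grid_voting_sets(9)
```

Any two sets overlap because the row of one process always crosses the column of the other, so two processes can never both collect every vote in their sets at the same time. All sets also have the same size, matching the fairness requirement.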
Elections
• Election algorithms are used when a unique process must
be chosen to play a particular role:
– Master in a master-slave replication system
– Central server in the DMTX protocol
• We will look at the bully election algorithm
• The bully algorithm tolerates failstop failures
• But it works only in a synchronous system with reliable
messaging
The Bully Election Algorithm
• All processes are assigned identifiers
• The system always elects a coordinator with the highest
identifier:
– Each process must know all processes with higher
identifiers than its own
• Three types of messages:
– election – a process begins an election
– answer – a process acknowledges the election message
– coordinator – an announcement of the identity of the
elected process
The Bully Election Algorithm (cont.)
• Initiation of election:
– Process p1 detects that the existing coordinator p4 has crashed and initiates the election
– p1 sends an election message to all processes with a higher identifier than itself
[Figure: p1 sends election messages to the higher-numbered processes p2, p3 and p4]
The Bully Election Algorithm (cont.)
• What happens if there are no further crashes:
– p2 and p3 receive the election message from p1, send back the answer message to p1, and begin their own elections
– p3 sends answer to p2
– p3 receives no answer message from p4, so after a timeout it elects itself as the leader (knowing it has the highest ID among the live processes)
[Figure: election messages from p1, p2 and p3 to higher-numbered processes; answer messages from p2 and p3 to p1 and from p3 to p2; coordinator messages from p3 to p1 and p2]
The Bully Election Algorithm (cont.)
• What happens if p3 also crashes after sending the answer message but before sending the coordinator message?
• In that case, p2 will time out while waiting for the coordinator message and will start a new election
[Figure: the same election and answer messages as before, but p3 crashes before sending any coordinator message]
The Bully Election Algorithm (summary)
• The algorithm does not require a central server
• Does not require knowing identities of all the processes
• Requires knowing identities of processes with higher IDs
• Survives crashes
• Assumes a synchronous system (relies on timeouts)
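The message flow of the bully algorithm can be traced with a short simulation. This is a sketch under simplifying assumptions: timeouts are modeled as "no live process answered", no process crashes while the election is running, and the winner is assumed to be able to reach the lower-numbered live processes to announce itself. The function name `bully_election` and the log format are hypothetical.

```python
from collections import deque


def bully_election(initiator, alive, all_ids):
    """Simulate one bully election; return (coordinator, message log).

    Log entries are (message type, sender, receiver) tuples.
    """
    if initiator not in alive:
        raise ValueError("the initiator must be a live process")
    log = []
    coordinator = None
    started = {initiator}        # processes that have begun an election
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        got_answer = False
        for q in sorted(i for i in all_ids if i > p):
            log.append(("election", p, q))
            if q in alive:       # a crashed process never answers
                log.append(("answer", q, p))
                got_answer = True
                if q not in started:
                    started.add(q)
                    queue.append(q)   # q begins its own election
        if not got_answer:       # timeout: no higher process is alive
            coordinator = p
            for q in sorted(i for i in all_ids if i < p and i in alive):
                log.append(("coordinator", p, q))
    return coordinator, log


# p4 has crashed and p1 notices: p3 should win.
winner, messages = bully_election(1, alive={1, 2, 3}, all_ids=[1, 2, 3, 4])
# If p3 is down as well, a new election started by p1 elects p2.
fallback_winner, _ = bully_election(1, alive={1, 2}, all_ids=[1, 2, 3, 4])
```

The log for the first run contains the same message types as the slides: election messages to higher IDs, answers from p2 and p3, and coordinator announcements from p3.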
Consensus With General Failures
• The algorithms we’ve covered so far tolerated only
failstop failures
• Let’s look at reaching consensus in the presence of more
general failures
– Omission
– Byzantine
Consensus
• All processes agree on the same value (or set of values)
• When do you need consensus?
– Leader (master) election
– Mutual exclusion
– Transaction involving multiple parties (banking)
• We will look at several variants of the consensus problem
– Consensus
– Byzantine generals
System Model
• There is a set of processes Pi
• There is a set of values {v0, …, vN-1} proposed by processes
• Each process Pi decides on di
• di belongs to the set {v0, …, vN-1}
• Assumptions:
– Synchronous system (for now)
– Failstop failures
– Byzantine failures
– Reliable channels
Consensus
[Figure: in Step 1 (Propose), processes P1, P2 and P3 submit values v1, v2 and v3 to the consensus algorithm; in Step 2 (Decide), they obtain decisions d1, d2 and d3.]
Courtesy of Jeff Chase, Duke University
Consensus (C)
di = vk
Pi selects di from {v0, …, vN-1}.
All Pi select the same vk (make the same decision)
Courtesy of Jeff Chase, Duke University
Conditions for Consensus
• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
• Integrity: If all correct processes propose the same v, then
every correct process decides di = v
Consensus in a Synchronous System Without Failures
• Each process pi proposes a decision value vi
• All proposed vi are sent around, such that each process
knows all proposed vi
• Once all processes receive all proposed v’s, they apply to
them the same function, such as: minimum(v1, v2, …, vN)
• Each process pi sets di = minimum(v1, v2, …, vN)
• The consensus is reached
• What if processes fail? Can other processes still reach an
agreement?
Consensus in a Synchronous System With Failstop & Omission Failures
• We assume that at most f out of N processes fail
• To reach a consensus despite f failures, we must extend
the algorithm to take f+1 rounds
• At round 1: each process pi sends its proposed vi to all
other processes and receives v’s from other processes
• At each subsequent round process pi sends v’s that it has
not sent before and receives new v’s
• The algorithm terminates after f+1 rounds
• Let’s see why it works…
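The effect of running f+1 rounds can be seen in a small simulation. This is a sketch, not the lecture's implementation: the function name `sync_consensus` and the `crashes` format (process id mapped to the round in which it crashes and the recipients it still reaches in that round) are assumptions made for illustration, and every process decides on the minimum of the values it knows.

```python
def sync_consensus(proposals, f, crashes=None):
    """Run f+1 synchronous rounds; return {pid: decision} for surviving processes.

    proposals: {pid: proposed value}
    crashes:   {pid: (round, recipients reached before crashing in that round)}
    """
    crashes = crashes or {}
    known = {p: {v} for p, v in proposals.items()}  # values each process knows
    sent = {p: set() for p in proposals}            # values each process has sent
    alive = set(proposals)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in proposals}
        for p in sorted(alive):
            new_values = known[p] - sent[p]         # only send what is new
            crash = crashes.get(p)
            if crash is not None and crash[0] == rnd:
                recipients = crash[1]               # partial send, then crash
            else:
                recipients = alive - {p}
            for q in recipients:
                inbox[q] |= new_values
            sent[p] |= new_values
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}


proposals = {1: 1, 2: 5, 3: 7}
crashes = {1: (1, {2})}  # p1 reaches only p2 in round 1, then crashes
two_rounds = sync_consensus(proposals, f=1, crashes=crashes)
one_round = sync_consensus(proposals, f=0, crashes=crashes)
```

With f = 1 the extra round lets p2 forward p1's value to p3, so both survivors decide 1. With a single round, p2 decides 1 but p3 decides 5, which is exactly the disagreement the additional round prevents.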
Proof that Consensus is Reached
• Will prove by contradiction
• Suppose some correct process pi possesses a value that another correct process pj does not possess
• This must have happened because some other process pk sent that value to pi but crashed before sending it to pj (or the message to pj was lost)
• The crash must have happened in round f+1 (the last round). Otherwise, pi would have sent that value to pj in a later round
• But why did pj not receive that value in any of the previous rounds?
• There must have been a crash in every previous round: some process sent the value to some other processes, but crashed before sending it to pj
• But this implies that there must have been f+1 failures
• This is a contradiction: we assumed at most f failures
A Take-Away Point
• If you cannot build a fully failproof algorithm...
• Build an algorithm that is guaranteed to tolerate some
number f of failures
• Then build a system that has no more than f failures with
high probability
Byzantine Generals Problem (BG)
[Figure: the leader (commander) sends vleader to each subordinate (lieutenant); the subordinates decide di = vleader and dj = vleader]
• Two types of generals: commander and subordinates
• A commander proposes an action (vi).
• Subordinates must agree
Courtesy of Jeff Chase, Duke University
Conditions for Consensus
• Termination: All correct processes eventually decide.
• Agreement: All correct processes select the same di.
• Integrity: If the commander is correct then all correct
processes decide on the value that the commander
proposed
Consensus in a Synchronous System With Byzantine Failures
• Byzantine failure: a process can forward to another
process an arbitrary value v
• Byzantine generals: the commander...
– says to one lieutenant that v = A
– says to another lieutenant that v = B
• We will show that consensus is impossible with only 3
generals if one of them may be faulty
• Pease et al. generalized this: consensus is impossible when
N ≤ 3f, where N is the number of generals and f is the
number of faulty generals
BG: Impossibility With Three Generals
[Figure: two scenarios with commander p1 and lieutenants p2 and p3; faulty processes are shown shaded. Scenario 1: p1 sends 1:v to p2 and p3; p2 relays 2:1:v and faulty p3 relays 3:1:u. Scenario 2: faulty p1 sends 1:v to p2 and 1:u to p3; p2 relays 2:1:v and p3 relays 3:1:u. “3:1:u” means “3 says 1 says u”.]
• Scenario 1: p2 must decide v (by integrity condition)
• But p2 cannot distinguish between Scenario 1 and Scenario 2
• If it decides to believe the commander, it will decide v in Scenario 2
• By symmetry, p3 will decide u in Scenario 2
• p2 and p3 will have reached different decisions
Solution With Four Byzantine Generals
• We can reach consensus if there are 4 generals and at
most 1 is faulty
• Intuition: use the majority rule
[Figure: a correct process receives conflicting reports and asks “Who is telling the truth?” The answer: “Majority rules!”]
Solution With Four Byzantine Generals
[Figure: two scenarios with commander p1 and lieutenants p2, p3 and p4; faulty processes are shown shaded. Left: p1 sends 1:v to every lieutenant; p2 and p4 relay 2:1:v and 4:1:v, while faulty p3 relays 3:1:u to p2 and 3:1:w to p4. Right: faulty p1 sends 1:u to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:u, 3:1:w and 4:1:v.]
Round 1: The commander sends a value to each of the other generals
Round 2: The lieutenants exchange the values that they received from the commander
The decision is made based on majority
Solution With Four Byzantine Generals
[Figure: commander p1 sends 1:v to p2, p3 and p4; p2 and p4 relay 2:1:v and 4:1:v; faulty p3 relays 3:1:u to p2 and 3:1:w to p4]
p2 receives: {v, v, u}. Decides v
p4 receives: {v, v, w}. Decides v
Solution With Four Byzantine Generals
[Figure: faulty commander p1 sends 1:u to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:u, 3:1:w and 4:1:v]
p2 receives: {u, w, v}. Decides NULL
p4 receives: {u, v, w}. Decides NULL
p3 receives: {w, u, v}. Decides NULL
The result generalizes to systems with N ≥ 3f + 1
(N is the number of processes, f is the number of
faulty processes)
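Both four-general scenarios can be replayed with a few lines of code. This is a sketch of the two-round majority rule for N = 4 and f = 1 only; the function names and the `relay_lies` parameter (which overrides what a faulty lieutenant relays to a given receiver) are hypothetical, and `None` stands for the NULL decision.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def byzantine_decisions(commander_says, relay_lies=None):
    """Two-round exchange among the lieutenants.

    commander_says: {lieutenant: value received from the commander} (round 1)
    relay_lies:     {(sender, receiver): value} sent instead of an honest relay
    Returns {lieutenant: decision}.
    """
    relay_lies = relay_lies or {}
    decisions = {}
    for me in commander_says:
        received = [commander_says[me]]  # what the commander told me
        for other in commander_says:
            if other != me:              # round 2: what the others relay to me
                received.append(
                    relay_lies.get((other, me), commander_says[other])
                )
        decisions[me] = majority(received)
    return decisions


# Correct commander, faulty p3 relays u to p2 and w to p4.
faulty_lieutenant = byzantine_decisions(
    {"p2": "v", "p3": "v", "p4": "v"},
    relay_lies={("p3", "p2"): "u", ("p3", "p4"): "w"},
)
# Faulty commander sends a different value to each honest lieutenant.
faulty_commander = byzantine_decisions({"p2": "u", "p3": "w", "p4": "v"})
```

In the first run the correct lieutenants p2 and p4 both decide v despite p3's lies (p3's own entry is irrelevant because it is faulty). In the second run no value reaches a majority, so every lieutenant falls back to the same NULL decision, which still satisfies agreement.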
Consensus in an Asynchronous System
• In the algorithms we’ve looked at, consensus has been reached
by using several rounds of communication
• The systems were synchronous, so each round always
terminated
• If a process has not received a message from another process
in a given round, it could assume that the process is faulty
• In an asynchronous system this assumption cannot be made!
• Fischer-Lynch-Paterson (1985): No consensus can be
guaranteed in an asynchronous communication system if even
one process may fail.
• Intuition: a “failed” process may just be slow, and can rise from
the dead at exactly the wrong time.
Consensus in Practice
• Real distributed systems are by and large asynchronous
• How do they operate if consensus cannot be reached?
• Assume a synchronous system: use manual fault resolution if
something goes wrong
• Fault masking: assume that failed processes always recover, and
define a way to reintegrate them into the group.
– If you haven’t heard from a process, just keep waiting…
– A round terminates when every expected message is received.
• Failure detectors: construct a failure detector that can determine if a
process has failed.
– A round terminates when every expected message is received, or the
failure detector reports that its sender has failed.
Failure Detectors
• First problem: how to detect that a member has failed?
– pings, timeouts, beacons, heartbeats
– recovery notifications
• Is the failure detector accurate? – Does it accurately
detect failures?
• Is the failure detector live? – Are there bounds on failure
detection time?
• In an asynchronous system, it is impossible for a failure
detector to be both accurate and live
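The trade-off between accuracy and liveness shows up even in a very small heartbeat-based detector. The sketch below is an illustration only: the class name `HeartbeatDetector` and its methods are hypothetical, and timestamps are passed in explicitly so that the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects a process if no heartbeat arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # pid -> time of the most recent heartbeat

    def heartbeat(self, pid, now):
        """Record a heartbeat received from pid at time `now`."""
        self.last_seen[pid] = now

    def suspected(self, now):
        """Return the set of processes whose heartbeats are overdue."""
        return {
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        }


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=2.0)
suspects_at_4 = detector.suspected(now=4.0)  # p2 has been silent for too long
detector.heartbeat("p2", now=5.0)            # p2 was only slow, not dead
suspects_at_5 = detector.suspected(now=5.0)
```

The fixed timeout bounds the detection time, but p2 is wrongly suspected at time 4 and has to be rehabilitated at time 5. Raising the timeout reduces such false suspicions at the cost of slower detection.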
Summary
• Coordination and agreement are essential in real distributed
systems
• Real distributed systems are asynchronous
• Consensus cannot be guaranteed in an asynchronous distributed
system where processes may fail
• Nevertheless, people still build useful distributed systems that
rely on consensus
• Fault recovery and masking are used as mechanisms for helping
processes reach consensus
• Popular fault masking and recovery techniques are transactions
and replication – the topics of the next few lectures