Consensus

CS 3700
Networks and Distributed Systems
Distributed Consensus and Fault Tolerance
(or, why can’t we all just get along?)
Black Box Online Services
Black Box Service
• Storing and retrieving data from online services is commonplace
• We tend to treat these services as black boxes
• Data goes in, we assume outputs are correct
• We have no idea how the service is implemented
• Example exchanges:
• debit_transaction(-$75) → OK; get_recent_transactions() → […, “-$75”, …]
• add_item_to_cart(“Cheerios”) → OK; get_cart() → [“Lucky Charms”, “Cheerios”]
• post_update(“I LOLed”) → OK; get_newsfeed() → […, {“txt”: “I LOLed”, “likes”: 87}]
Peeling Back the Curtain
• How are large services implemented?
• Different types of services may have different requirements
• Leads to different design decisions
Centralization
• Example: a single server stores all account data; debit_transaction(-$75) updates Bob’s balance from $300 to $225 and returns OK, and get_account_balance(Bob) → $225
• Advantages of centralization
• Easy to set up and deploy
• Consistency is guaranteed (assuming correct software implementation)
• Shortcomings
• No load balancing
• Single point of failure
Sharding
• Example: accounts are partitioned across servers, e.g. one shard for names <A-M> and another for <N-Z>; debit_account(-$75) is routed to Bob’s shard, which updates Bob from $300 to $225 and returns OK, and get_account_balance(Bob) → $225
• Advantages of sharding
• Better load balancing
• If done intelligently, may allow incremental scalability
• Shortcomings
• Failures are still devastating
Replication
• Example: every replica stores a full copy of the data; debit_account(-$75) requires 100% agreement, so every replica updates Bob from $300 to $225 before the client sees OK, and get_account_balance(Bob) → $225
• Advantages of replication
• Better load balancing of reads (potentially)
• Resilience against failure; high availability (with some caveats)
• Shortcomings
• How do we maintain consistency?
Consistency Failures
• The leader cannot disambiguate cases where requests and responses are lost
• No ACK from a replica may mean its copy was never updated (still Bob: $300) or that only the reply was lost (already Bob: $225): either way, there is no agreement
• Asynchronous networks are problematic: a timeout may fire even though a replica is merely slow, so the leader cannot tell whether too few replicas applied the update or the ACKs are simply delayed
Byzantine Failures
• In some cases, replicas may be buggy or malicious: one replica may report Bob: $1000 while the others hold Bob: $225, so there is no agreement
• When discussing Distributed Systems, failures due to malice are
known as Byzantine Failures
• Name comes from the Byzantine generals problem
• More on this later…
Problem and Definitions
• Build a distributed system that meets the following goals:
• The system should be able to reach consensus
• Consensus [n]: general agreement
• The system should be consistent
• Data should be correct; no integrity violations
• The system should be highly available
• Data should be accessible even in the face of arbitrary failures
• Challenges:
• Many, many different failure modes
• Theory tells us that these goals are impossible to achieve (more on this later)
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Forcing Consistency
• Example: debit_account(-$75) is applied by every replica (Bob: $300 → $225) and returns OK; a later debit_account(-$50) fails on one replica and returns Error, so no replica may keep the partially applied $175 balance
• One approach to building distributed systems is to force them to be consistent
• Guarantee that all replicas receive an update…
• …Or none of them do
• If consistency is guaranteed, then reaching consensus is trivial
Distributed Commit Problem
• Application that performs operations on multiple replicas or databases
• We want to guarantee that all replicas get updated, or none do
• Distributed commit problem:
1. An operation is committed only when all participants agree that they can perform the action
2. Once a commit decision is reached, all participants must perform the action
• These two steps give rise to the Two Phase Commit protocol
Motivating Transactions
• Example: transfer_money(Alice, Bob, $100) is implemented as debit_account(Alice, -$100) followed by debit_account(Bob, $100); if both succeed, Alice: $600 → $500 and Bob: $300 → $400
• Either call may independently return OK or Error
• System becomes inconsistent if any individual action fails
Simple Transactions
• Example: transfer_money(Alice, Bob, $100) executes begin_transaction(), debit_account(Alice, -$100) (Alice: $600 → $500), debit_account(Bob, $100) (Bob: $300 → $400), and finally end_transaction(), which returns OK
• At this point, if there haven’t been any errors, we say the transaction is committed
• Actions inside a transaction behave as a single action
Simple Transactions
• Example: transfer_money(Alice, Bob, $100) begins a transaction and runs debit_account(Alice, -$100) (tentatively Alice: $600 → $500), but debit_account(Bob, $100) returns Error, so the transaction fails and Alice’s visible balance remains $600
• If any individual action fails, the whole transaction fails
• Failed transactions have no side effects
• Incomplete results during transactions are hidden
ACID Properties
• Traditional transactional databases support the following:
1. Atomicity: all or none; if transaction fails then no changes are applied to the
database
2. Consistency: there are no violations of database integrity
3. Isolation: partial results from incomplete transactions are hidden
4. Durability: the effects of committed transactions are permanent
Two Phase Commits (2PC)
• There are well-known techniques for implementing transactions in centralized databases
• E.g. journaling (append-only logs)
• Out of scope for this class (take a database class, or CS 5600)
• Two Phase Commit (2PC) is a protocol for implementing transactions
in a distributed setting
• Protocol operates in rounds
• Assume we have a leader (coordinator) that manages transactions
• Each replica states that it is ready to commit
• The leader decides the outcome and instructs replicas to commit or abort (see the sketch below)
• Assume no byzantine faults (i.e. nobody is malicious)
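For illustration, the whole exchange can be condensed into a small in-memory Python sketch; the class, function, and message names here are invented for the example, not taken from any real system:

# Each replica stages the write, votes "ready", and only applies it on "commit"
class Replica:
    def __init__(self):
        self.committed = {}   # txid -> value (permanent state)
        self.pending = {}     # txid -> value (temporary storage)

    def handle(self, msg):
        if msg["type"] == "write":            # phase 1: stage the update
            self.pending[msg["txid"]] = msg["value"]
            return "ready"
        if msg["type"] == "commit":           # phase 2: make it permanent
            self.committed[msg["txid"]] = self.pending.pop(msg["txid"])
            return "committed"
        if msg["type"] == "abort":            # discard the staged update
            self.pending.pop(msg["txid"], None)
            return "aborted"

def two_phase_commit(replicas, txid, value):
    # Phase 1: distribute the update and wait for every replica to vote "ready"
    votes = [r.handle({"type": "write", "txid": txid, "value": value}) for r in replicas]
    if any(v != "ready" for v in votes):
        for r in replicas:                    # any failure means nobody commits
            r.handle({"type": "abort", "txid": txid})
        return "aborted"
    # Phase 2: everyone is ready, so the decision is commit
    for r in replicas:
        r.handle({"type": "commit", "txid": txid})
    return "committed"

# Usage: two_phase_commit([Replica(), Replica(), Replica()], txid=678, value="y")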
2PC Example
• Begin by distributing the update to all replicas: txid = 678; value = y (the txid is a logical clock)
• Wait to receive “ready to commit” (ready txid = 678) from all replicas
• Tell replicas to commit: commit txid = 678
• Each replica applies the update (x → y) and responds committed txid = 678
• At this point, all replicas are guaranteed to be up-to-date
Failure Modes
• Replica Failure
• Before or during the initial promise phase
• Before or during the commit
• Leader Failure
• Before receiving all promises
• Before or during sending commits
• Before receiving all committed messages
Replica Failure (1)
• The leader distributes txid = 678; value = y, but Replica 3 never becomes “ready” (it keeps only x)
• The same thing happens if a write or a “ready” is dropped, a replica times out, or a replica returns an error
• Error: not all replicas are “ready”, so the leader sends abort txid = 678 and the replicas respond aborted txid = 678; every replica keeps the old value x
Replica Failure (2)
• All replicas respond ready txid = 678 and the leader sends commit txid = 678; two replicas apply the update (x → y) and respond committed txid = 678, but one replica fails before committing
• This is a known inconsistent state
• The leader must keep retrying (commit txid = 678) until all commits succeed
Replica Failure (2)
• Replicas attempt to resume unfinished transactions when they reboot
• The rebooted replica asks for the status of the transaction (stat txid = 678), learns that the decision was commit, applies the update (x → y), and responds committed txid = 678
• Finally, the system is consistent and may proceed
Leader Failure
• What happens if the leader crashes?
• Leader must constantly be writing its state to permanent storage
• It must pick up where it left off once it reboots
• If there are unconfirmed transactions
• Send new write messages, wait for “ready to commit” replies
• If there are uncommitted transactions
• Send new commit messages, wait for “committed” replies
• Replicas may see duplicate messages during this process
• Thus, it’s important that every transaction have a unique txid
Allowing Progress
• Key problem: what if the leader crashes and never recovers?
• By default, replicas block until contacted by the leader
• Can the system make progress?
• Yes, under limited circumstances (the decision rule is sketched below):
• After sending a “ready to commit” message, each replica starts a timer
• The first replica whose timer expires elects itself as the new leader
• It queries the other replicas for their status
• It sends “commit” to all replicas if they are all “ready”
• However, this only works if all the replicas are alive and reachable
• If a replica crashes or is unreachable, deadlock is unavoidable
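The recovery rule above can be written as a small sketch, under a simplified model where peer_states holds the status each reachable peer reported for this transaction:

def recovery_decision(peer_states, total_replicas):
    # peer_states: {"replica1": "ready", ...} for every peer that answered the stat query
    # A decision is only safe if every other replica is alive and reachable
    if len(peer_states) < total_replicas - 1:
        return "blocked"          # someone is unreachable: cannot proceed, cannot abort
    if all(state == "ready" for state in peer_states.values()):
        return "commit"           # everyone voted ready, so committing is safe
    return "blocked"

# Example: recovery_decision({"replica1": "ready", "replica3": "ready"}, total_replicas=3)
# returns "commit"; with only one reachable peer it returns "blocked" (the deadlock case)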
New Leader
• All replicas have sent ready txid = 678, but the leader fails before committing
• Replica 2’s timeout expires and it begins the recovery procedure: it sends stat txid = 678 to the other replicas
• Both replicas respond ready txid = 678, so Replica 2 sends commit txid = 678 and collects committed txid = 678
• The system is consistent again
Deadlock
• All replicas have sent ready txid = 678, then both the leader and Replica 3 become unreachable
• Replica 2’s timeout expires and it begins the recovery procedure, sending stat txid = 678
• Only Replica 1 answers (ready txid = 678); Replica 2 cannot proceed without hearing from Replica 3, but it also cannot abort, because the old leader may already have told Replica 3 to commit: deadlock
Garbage Collection
• 2PC is somewhat of a misnomer: there is actually a third phase
• Garbage collection
• Replicas must retain records of past transactions, just in case the
leader fails
• For example, suppose the leader crashes, reboots, and attempts to commit a
transaction that has already been committed
• Replicas must remember that this past transaction was already committed,
since committing a second time may lead to inconsistencies
• In practice, leader periodically tells replicas to garbage collect
• Transactions <= some txid in the past may be deleted
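A minimal sketch of why the transaction log matters and how garbage collection trims it (the function names here are illustrative, not from any real system):

def handle_commit(log, txid, apply_update):
    # A rebooted leader may re-send a commit; applying it twice could corrupt state
    if log.get(txid) == "committed":
        return "committed"        # duplicate commit: already applied, do nothing
    apply_update()                # first time we see this commit: apply it
    log[txid] = "committed"
    return "committed"

def garbage_collect(log, gc_txid):
    # The leader periodically announces that records with txid <= gc_txid may be deleted
    return {txid: state for txid, state in log.items() if txid > gc_txid}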
2PC Summary
• Message complexity: O(2n)
• The good: guarantees consistency
• The bad:
• Write performance suffers if there are failures during the commit phase
• Does not scale gracefully (possible, but difficult to do)
• A pure 2PC system blocks all writes if the leader fails
• Smarter 2PC systems still block all writes if the leader + 1 replica fail
• 2PC sacrifices availability in favor of consistency
Can 2PC be Fixed?
• The issue with 2PC is its reliance on the centralized leader
• Only the leader knows if a transaction is 100% ready to commit or not
• Thus, if the leader + 1 replica fail, recovery is impossible
• Potential solution: Three Phase Commit
• Add an additional round of communication
• Tell all replicas to prepare to commit before actually committing
• State of the system can always be deduced by a subset of alive
replicas that can communicate with each other
• … unless there are partitions (more on this later)
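A sketch of where the extra phase slots in, reusing the style of the 2PC sketch earlier; it assumes the replicas also understand a "prepare" message and answer "prepared":

def three_phase_commit(replicas, txid, value):
    # Phase 1: distribute the update and collect "ready" votes
    if any(r.handle({"type": "write", "txid": txid, "value": value}) != "ready"
           for r in replicas):
        return broadcast(replicas, txid, "abort")
    # Phase 2: tell everyone that everyone is ready; collect "prepared" acknowledgements
    if any(r.handle({"type": "prepare", "txid": txid}) != "prepared" for r in replicas):
        return broadcast(replicas, txid, "abort")
    # Phase 3: commit; any surviving replica that saw a "prepare" knows all were ready,
    # so a recovering subset can deduce whether to commit or abort
    return broadcast(replicas, txid, "commit")

def broadcast(replicas, txid, decision):
    for r in replicas:
        r.handle({"type": decision, "txid": txid})
    return decision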
3PC Example
• Begin by distributing the update: txid = 678; value = y
• Wait to receive “ready to commit” (ready txid = 678) from all replicas
• Tell all replicas that everyone is “ready to commit”: prepare txid = 678; replicas respond prepared txid = 678
• Tell replicas to commit: commit txid = 678; each replica applies the update (x → y) and responds committed txid = 678
• At this point, all replicas are guaranteed to be up-to-date
Leader Failures
• The leader distributes txid = 678; value = y and waits for ready txid = 678 from all replicas, but fails before sending prepare
• Replica 2’s timeout expires and it begins the recovery procedure: stat txid = 678
• The replies are only “ready”; since no prepare was sent, Replica 3 cannot be in the committed state, thus it is okay to abort
• abort txid = 678 / aborted txid = 678: every replica keeps x and the system is consistent again
Leader Failures
• The leader sends prepare txid = 678 and some replicas respond prepared txid = 678, then the leader fails
• Replica 2’s timeout expires and it begins the recovery procedure: stat txid = 678
• A reply of prepared txid = 678 means all replicas must have been ready to commit
• Replica 2 sends commit txid = 678, collects committed txid = 678, and the system is consistent again
Oh Great, I Fixed Everything!
• Wrong
• 3PC is not robust against network partitions
• What is a network partition?
• A split in the network, such that full n-to-n connectivity is broken
• i.e. not all servers can contact each other
• Partitions split the network into two or more disjoint subnetworks
• How can a network partition occur?
• A switch or a router may fail, or it may receive an incorrect routing rule
• A cable connecting two racks of servers may develop a fault
• Network partitions are very real; they happen all the time
Partitioning
• The leader distributes txid = 678; value = y and all replicas respond ready txid = 678
• The network partitions into two subnets: the leader and Replica 1 on one side, Replicas 2 and 3 on the other
• The leader assumes Replicas 2 and 3 have failed and moves on: prepare, prepared, commit, committed with Replica 1, which applies y
• Meanwhile, leader recovery is initiated on the other side of the partition; seeing no prepared replicas, it aborts, and Replicas 2 and 3 keep x
• The system is inconsistent
3PC Summary
• Adds an additional phase vs. 2PC
• Message complexity: O(3n)
• Really four phases with garbage collection
• The good: allows the system to make progress under more failure
conditions
• The bad:
• Extra round of communication makes 3PC even slower than 2PC
• Does not work if the network partitions
• 2PC will simply deadlock if there is a partition, rather than become inconsistent
• In practice, nobody uses 3PC
• Additional complexity and performance penalty just isn’t worth it
• Loss of consistency during partitions is a deal breaker
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
A Moment of Reflection
• Goals, revisited:
• The system should be able to reach consensus
• Consensus [n]: general agreement
• The system should be consistent
• Data should be correct; no integrity violations
• The system should be highly available
• Data should be accessible even in the face of arbitrary failures
• Achieving these goals may be harder than we thought :(
• Huge number of failure modes
• Network partitions are difficult to cope with
• We haven’t even considered byzantine failures
What Can Theory Tell Us?
• Let’s assume the network is synchronous and reliable
• The algorithm can be divided into discrete rounds
• If a message from r is not received in a round, then r must be faulty
• Since we’re assuming synchrony, packets cannot be delayed arbitrarily
• During each round, r may send m <= n messages
• n is the total number of replicas
• You might crash before sending all n messages
• If we are willing to tolerate f total failures (f < n), how many rounds of
communication do we need to guarantee consensus?
Consensus in a Synchronous System
• Initialization:
• All replicas choose a value 0 or 1 (can generalize to more values if you want)
• Properties:
• Agreement: all non-faulty processes ultimately choose the same value
• Either 0 or 1 in this case
• Validity: if a replica decides on a value, then at least one replica must have
started with that value
• This prevents the trivial solution of all replicas always choosing 0, which is technically
perfect consensus but is practically useless
• Termination: the system must converge in finite time
Algorithm Sketch
• Each replica maintains a map M of all known values
• Initially, the map only contains the replica’s own value
• e.g. M = {‘replica1’: 0}
• Each round, broadcast M to all other replicas
• On receipt, construct the union of received M and local M
• Algorithm terminates when all non-faulty replicas have the values
from all other non-faulty replicas
• Example with three non-faulty replicas (1, 3, and 5)
• M = {‘replica1’: 0, ‘replica3’: 1, ‘replica5’: 0}
• Final value is min(M.values())
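The sketch above, written out as a small round-by-round simulation; this is a simplified model in which replicas listed in crashed never participate, and crashes in the middle of a round are not modeled:

def synchronous_consensus(initial_values, f, crashed=()):
    # initial_values: {"replica1": 0, "replica3": 1, ...};  f: failures tolerated
    known = {r: {r: v} for r, v in initial_values.items()}   # each replica's map M
    alive = [r for r in initial_values if r not in crashed]

    for _ in range(f + 1):                                    # f + 1 rounds suffice
        snapshots = {r: dict(known[r]) for r in alive}        # each replica broadcasts its M
        for receiver in alive:
            for sender in alive:
                known[receiver].update(snapshots[sender])     # union of received M and local M

    # Every non-faulty replica now holds the same map, so min() picks the same value everywhere
    return {r: min(known[r].values()) for r in alive}

# Example: synchronous_consensus({"replica1": 0, "replica3": 1, "replica5": 0}, f=1)
# returns {"replica1": 0, "replica3": 0, "replica5": 0}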
Bounding Convergence Time
• How many rounds will it take if we are willing to tolerate f failures?
• f + 1 rounds
• Key insight: all replicas must be sure that all replicas that did not crash
have the same information (so they can make the same decision)
• Proof sketch, assuming f = 2
• Worst case scenario is that replicas crash during rounds 1 and 2
• During round 1, replica x crashes
• All other replicas don’t know if x is alive or dead
• During round 2, replica y crashes
• Clear that x is not alive, but unknown if y is alive or dead
• During round 3, no more replicas may crash
• All replicas are guaranteed to receive updated info from all other replicas
• Final decision can be made
A More Realistic Model
• The previous result is interesting, but unrealistic
• We assume that the network is synchronous and reliable
• Of course, neither of these things is true in reality
• What if the network is asynchronous and reliable?
• Replicas may take an arbitrarily long time to respond to messages
• Let’s also assume that all faults are crash faults
• i.e. if a replica has a problem it crashes and never wakes up
• No byzantine faults
The FLP Result
There is no asynchronous algorithm that achieves consensus on a 1-bit
value in the presence of crash faults. The result is true even if no crash
actually occurs!
• This is known as the FLP result
• Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985
• Extremely powerful result because:
• If you can’t agree on 1-bit, generalizing to larger values isn’t going to help you
• If you can’t converge with crash faults, no way you can converge with
byzantine faults
• If you can’t converge on a reliable network, no way you can on an unreliable
network
FLP Proof Sketch
• In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is merely slow
• What can x do?
• If it waits, it will block since it might never receive the message from y
• If it decides, it may find out later that y made a different decision
• Proof constructs a scenario where each attempt to decide is
overruled by a delayed, asynchronous message
• Thus, the system oscillates between 0 and 1 and never converges
Impact of FLP
• FLP proves that any fault-tolerant distributed algorithm attempting to
reach consensus has runs that never terminate
• These runs are extremely unlikely (“probability zero”)
• Yet they imply that we can’t find a totally correct solution
• And so “consensus is impossible” (“not always possible”)
• So what can we do?
• Use randomization, probabilistic guarantees (gossip protocols)
• Avoid consensus, use quorum systems (Paxos or Raft)
• In other words, trade off consistency in favor of availability
Consistency vs. Availability
• FLP states that perfect consistency is impossible
• Practically, we can get close to perfect consistency, but at significant cost
• e.g. using 3PC
• Availability begins to suffer dramatically under failure conditions
• Is there a way to formalize the tradeoff between consistency and
availability?
Eric Brewer’s CAP Theorem
• CAP theorem for distributed data replication
• Consistency: updates to data are applied to all or none
• Availability: must be able to access all data
• Network Partition Tolerance: failures can partition the network into disjoint subnetworks
• The Brewer Theorem
• No system can simultaneously achieve C and A and P
• Typical interpretation: “C, A, and P: choose 2”
• In practice, all networks may partition, thus you must choose P
• So a better interpretation might be “C or A: choose 1”
• Never formally proved or published
• Yet widely accepted as a rule of thumb
CAP Examples
• C+P (consistency): writes are replicated, and reads must always return accurate results
• Impact of partitions: no availability; clients on the wrong side of the partition get “Error: Service Unavailable”
• A+P (availability): writes are replicated, and the client can always read
• Impact of partitions: not consistent; a client may still read the old (key, 1) after another client has written (key, 2)
“C or A: Choose 1”
• Taken to the extreme, CAP suggests a binary division in distributed
systems
• Your system is consistent or available
• In practice, it’s more like a spectrum of possibilities
• Perfect consistency: e.g. financial information must be correct
• In between: attempt to balance correctness with availability
• Always available: e.g. serve content to all visitors, regardless of correctness
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Strong Consistency, Revisited
• 2PC and 3PC achieve strong consistency, but they have significant
shortcomings
• 2PC cannot make progress in the face of leader + 1 replica failures
• 3PC loses consistency guarantees in the face of network partitions
• Where do we go from here?
• Observation: 2PC and 3PC attempt to reach 100% agreement
• What if 51% of the replicas agree?
Quorum Systems
• In law, a quorum is the minimum number of members of a
deliberative body necessary to conduct the business of that group
• When quorum is not met, a legislative body cannot hold a vote, and cannot
change the status quo
• E.g. Imagine if only 1 senator showed up to vote in the Senate
• Distributed Systems can also use a quorum approach to consensus
• Essentially a voting approach
• If 𝑁/2 + 1 replicas agree (out of N), then an update can be committed
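The quorum rule is simple majority arithmetic; a tiny sketch for illustration:

def quorum_size(n):
    return n // 2 + 1        # N/2 + 1: e.g. 3 of 5 replicas, 4 of 7

def can_commit(acks, n):
    # An update may be committed once at least N/2 + 1 replicas have agreed to it
    return acks >= quorum_size(n)

# Example: can_commit(acks=3, n=5) -> True;  can_commit(acks=2, n=5) -> False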
Advantages of Quorums
• Availability: quorum systems are more resilient in the face of failures
• Quorum systems can be designed to tolerate both benign and byzantine
failures
• Efficiency: can significantly reduce communication complexity
• Operations do not require contacting all of the servers
• Only a subset of the servers is needed for each operation
High-Level Quorum Example
• Example: each replica stores Bob’s balance together with a logical timestamp
• Write: the client writes (ts: 3, Bob: $375); replicas that accept it move from (ts: 2, Bob: $400) to (ts: 3, Bob: $375), while a lagging replica may still hold (ts: 1, Bob: $300)
• Read: the client reads from a quorum and uses the timestamps to pick the most recent balance; a lagging replica can then catch up (e.g. ts: 1 → ts: 2)
• Challenges:
1. Ensuring that at least 𝑁/2 + 1 replicas commit each update
2. Ensuring that updates have the correct logical ordering
• Update history for Bob:
  Timestamp   Balance
  1           $300
  2           $400
  3           $375
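To make the timestamp idea concrete, here is a minimal sketch (illustrative only; real quorum protocols such as Paxos, below, add considerably more machinery). A write counts once a majority of replicas stores the (timestamp, value) pair, and a read returns the highest-timestamped value seen in a majority:

def quorum(n):
    return n // 2 + 1

def quorum_write(replicas, key, value, ts):
    # Count a replica only if our timestamp is newer than what it already holds
    acks = 0
    for r in replicas:
        if ts > r.get(key, (0, None))[0]:
            r[key] = (ts, value)
            acks += 1
    return acks >= quorum(len(replicas))

def quorum_read(replicas, key):
    # Ask a quorum of replicas and return the value with the highest timestamp
    replies = [r[key] for r in replicas[:quorum(len(replicas))] if key in r]
    return max(replies)[1] if replies else None

# Example: a read that includes one stale replica still returns the newest balance
replicas = [{"Bob": (1, "$300")}, {"Bob": (3, "$375")}, {"Bob": (3, "$375")}]
print(quorum_read(replicas, "Bob"))     # "$375": ts 3 outvotes the stale ts 1 copy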
Paxos
• Replication protocol that ensures a global ordering of updates
• All writes into the system are ordered in logical time
• Replicas agree on the order of committed writes
• Uses a quorum approach to consensus
• One replica is always chosen as the leader
• The protocol moves forward as long as 𝑁/2 + 1 replicas agree
• The “Paxos protocol” is actually a theoretical proof
• We’ll be discussing a concrete implementation of the protocol
• Paxos for System Builders, Jonathan Kirsch and Yair Amir.
http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf
History of Paxos
• Developed by Turing award winner Leslie Lamport
• First published as a tech report in 1989
• The journal refused to publish it; nobody understood the protocol
• Formally published in 1998
• Again, nobody understood it
• Leslie Lamport publishes “Paxos Made Simple” in 2001
• People start to get the protocol
• Reaches widespread fame in 2006-2007
• Used by Google in their Chubby distributed mutex system
• ZooKeeper is an open-source analogue of Chubby
Paxos at a High-Level
1. Replicas elect a leader and agree on the view number
• The view is a logical clock that divides time into epochs
• During each view, there is a single leader
2. The leader collects promises from the replicas
• Replicas promise to only accept proposals from the current or future views
• Prevents replicas from going “back in time”
• Leader learns about proposed updates from the previous view that haven’t
yet been accepted
3. The leader proposes updates and replicas accept them
• Start by completing unfinished updates from the previous view
• Then move on to new writes from clients
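A compressed sketch of the replica-side rules these steps imply; the class and method names are illustrative assumptions, not taken from the Paxos papers:

class PaxosReplica:
    def __init__(self, replica_id, n):
        self.id, self.n = replica_id, n
        self.view = 0               # current view (epoch); the leader is replica view % n
        self.clock = 0              # logical clock that orders committed writes
        self.unfinished = []        # proposals accepted but not yet known to be committed

    def leader_id(self):
        return self.view % self.n

    def on_prepare(self, view):
        # Step 2: promise never to go back to an older view, and report unfinished
        # updates so the new leader can complete them
        if view < self.view:
            return None                                   # stale leader: refuse
        self.view = view
        return {"view": view, "clock": self.clock,
                "unfinished": list(self.unfinished)}      # the "promise"

    def on_propose(self, view, clock, update):
        # Step 3: accept proposals only from the current or a newer view
        if view < self.view:
            return False
        self.unfinished.append((clock, update))
        return True                                       # the "accept"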
View Selection
• All replicas have a view number
• Goal is to have all replicas agree on
the view
• Leader is replica with ID = view % N
• Example (from the figure): five replicas start in view 0; as broadcasts propagate, some adopt view 2 and eventually all converge on view 3
• Broadcast view to all other replicas
• If a replica receives 𝑁/2 + 1
broadcasts with the same view,
assume the view is correct
• If a replica receives a broadcast with a
larger view, adopt it and rebroadcast
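The view-agreement rule can be sketched as follows (a minimal illustration; a real implementation also needs timeouts and retransmission):

def on_view_broadcast(my_view, received_view, votes, n):
    # votes counts how many replicas have announced each view number
    if received_view > my_view:
        my_view = received_view                       # adopt the larger view (and rebroadcast)
    votes[received_view] = votes.get(received_view, 0) + 1
    agreed = votes.get(my_view, 0) >= n // 2 + 1      # N/2 + 1 matching broadcasts
    return my_view, agreed, my_view % n               # new view, agreement flag, leader id

# Example: with n = 5, once a replica has counted three broadcasts announcing view 3,
# it treats view 3 as agreed and expects replica 3 (3 % 5) to act as the leader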
Prepare/Promise
• All replicas are in view 5 with logical clock 13
• The leader must ensure that a quorum of 𝑁/2 + 1 replicas exists
• The leader broadcasts prepare view=5 clock=13
• Replicas respond with promise view=5 clock=13
• Replicas promise to not accept any messages with view < v
• Replicas won’t elect a new leader until the current one fails (measured using a timeout)
Commit/Accept
• All replicas remain in view 5 at clock 13
• All client requests are serialized through the leader
• A client sends a write to the leader, which broadcasts commit clock=14 with the new value
• Replicas write the new value to temporary storage and respond accept clock=14
• The leader increments the clock and commits the new value after receiving 𝑁/2 + 1 “accept” messages, then replies OK to the client; all replicas advance to clock 14
Paxos Review
• Three phases: elect leader, collect promises, commit/accept
• Message complexity: roughly O(n² + n)
• However, more like O(n²) in steady state (repeated commit/accept)
• Two logical clocks:
1. The view increments each time the leader changes
• Replicas promise not to accept updates from prior views
2. The clock increments after each update/write
• Maintains the global ordering of updates
• Replicas set a timeout every time they hear from the leader
• Increment the view and elect a new leader if the timeout expires
• Protocol moves forward once 𝑁/2 + 1 replicas are in agreement
Failure Modes
1. What happens if a commit fails?
2. What happens during a partition?
3. What happens after the leader fails?
Bad Commit
• All replicas are at clock 13; what happens if a quorum does not accept a commit?
• The leader broadcasts commit clock=14 but receives fewer than 𝑁/2 + 1 accept clock=14 responses
• The leader must retry until quorum is reached, or broadcast an abort
• On the retry, a quorum responds accept clock=14 and the update is committed; the replicas advance to clock 14
• Replicas that fall behind can reconcile by downloading missed updates from a peer
Partitions (1)
• All replicas start in view 0; what happens during a partition?
• The group with a quorum (if one exists) keeps making progress; this may require a new leader election (here, the majority side moves on to view 1)
• The group with < 𝑁/2 + 1 replicas cannot form a quorum or accept updates
• Once the partition is fixed, either hold a new leader election and move forward, or reconcile with up-to-date peers
Partitions (2)
• As before, the group with a quorum keeps making progress (in view 1, possibly after a new leader election), while the group with < 𝑁/2 + 1 replicas cannot form a quorum or accept updates
• What happens when the view = 0 group attempts to rejoin?
• Promises for view = 1 prevent the old leader from interfering with the new quorum
Leader Failure (1)
• What happens if there is an uncommitted update (commit clock=14) that reached no quorum before the leader failed?
• The view is incremented and a new leader is elected; it sends prepare clock=13 and collects promise clock=13 messages
• The new leader is unaware of the uncommitted update
• The new leader announces a new update with clock=14’, which is rejected by Replica 3 (the replica that saw the original clock=14 update)
• Replica 3 is desynchronized and must reconcile with another replica
Leader Failure (2)
• Again there is an uncommitted update (commit clock=14) with no quorum when the leader fails
• The view is incremented and a new leader is elected; it sends prepare clock=13 and collects promise clock=13 messages
• This time the new leader is aware of the uncommitted update (one of the promises reports it)
• The new leader must recommit the original clock=14 update
Leader Failure (3)
• What happens if there is an uncommitted update (commit clock=14) that did reach a quorum before the leader failed?
• The view is incremented and a new leader is elected, which sends prepares and collects promises
• By definition, the new leader must become aware of the uncommitted update: it must collect 𝑁/2 + 1 promises, and a quorum of replicas saw the update, so the two sets overlap
• The new leader must recommit the original clock=14 update
The Devil in the Details
• Clearly, Paxos is complicated
• Things we haven’t covered:
• Reconciliation – how to bring a replica up-to-date
• Managing the queue of updates from clients
• Updates may be sent to any replica
• Replicas are responsible for responding to clients who contact them
• Replicas may need to re-forward updates if the leader changes
• Garbage collection
• Replicas need to remember the exact history of updates, in case the leader changes
• Periodically, the lists need to be garbage collected
Odds and Ends
Byzantine Generals
Gossip
Byzantine Generals Problem
• Name coined by Leslie Lamport
• Several Byzantine Generals are
laying siege to an enemy city
• They have to agree on a common
strategy: attack or retreat
• They can only communicate by
messenger
• Some generals may be traitors
(their identity is unknown)
Do you see the connection with the consensus problem?
Byzantine Distributed Systems
• Goals
1. All loyal lieutenants obey the same order
2. If the commanding general is loyal, then every loyal lieutenant obeys the
order he sends
• Can the problem be solved?
• Yes, iff there are at least 3m+1 generals in the presence of m traitors
• E.g. if there are 3 generals, even 1 traitor makes the problem unsolvable
• Bazillion variations on the basic problem
• What if messages are cryptographically signed (e.g. they are unforgeable)?
• What if communication is not g x g (i.e. some pairs of generals cannot
communicate)?
• Most algos have byzantine versions (e.g. Byzantine Paxos)
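For reference, the 3m+1 bound as arithmetic:

def max_traitors_tolerated(n_generals):
    return (n_generals - 1) // 3     # agreement needs n >= 3m + 1

# Example: 3 generals tolerate 0 traitors, 4 tolerate 1, 10 tolerate 3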
Alternatives to Quorums
• Quorums favor consistency over availability
• If no quorum exists, then the system stops accepting writes
• Significant overhead maintaining consistent replicated state
• What if eventual consistency is okay?
• Favor availability over consistency
• Results may be stale or incorrect sometimes (hopefully only in rare cases)
• Gossip protocols
• Replicas periodically and randomly exchange state with each other (sketched below)
• No strong consistency guarantees, but…
• Surprisingly fast and reliable convergence to up-to-date state
• Requires vector clocks or better in order to causally order events
• Extreme cases of divergence may be irreconcilable
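A minimal sketch of one gossip exchange, using plain per-key timestamps instead of the vector clocks a real system would need:

import random

def merge(dst, src):
    # Adopt any entry from the peer that carries a newer timestamp
    for key, (ts, value) in src.items():
        if key not in dst or dst[key][0] < ts:
            dst[key] = (ts, value)

def gossip_round(replicas):
    a, b = random.sample(replicas, 2)      # pick a random pair of replicas
    merge(a, b)                            # exchange state in both directions
    merge(b, a)

# Example: after enough random rounds, every replica converges on (2, "$400") for "Bob"
replicas = [{"Bob": (1, "$300")}, {"Bob": (2, "$400")}, {}]
for _ in range(10):
    gossip_round(replicas)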
Sources
1. Some slides courtesy of Cristina Nita-Rotaru (http://cnitarot.github.io/courses/ds_Fall_2016/)
2. The Part-Time Parliament, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamportpaxos.pdf
3. Paxos Made Simple, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
4. Paxos for System Builders, Jonathan Kirsch and Yair Amir.
http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf
5. The Chubby Lock Service for Loosely-Coupled Distributed Systems, Mike Burrows.
http://research.google.com/archive/chubby-osdi06.pdf
6. Paxos Made Live – An Engineering Perspective, Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone.
http://research.google.com/archive/paxos_made_live.pdf