Paxos
http://net.pku.edu.cn/~course/cs501/2012
Hongfei Yan
School of EECS, Peking University
5/2/2012
Some slides borrowed from Idit Keidar
Terminology
• Consensus: the problem of a number of
processes trying to agree on a common
decision
– closely related to leader election, state-machine
replication, and atomic broadcast.
• Failure detectors: a mechanism for
solving consensus in an asynchronous
message-passing system
2
Material
• Paxos Made Simple
Leslie Lamport
ACM SIGACT News (Distributed
Computing Column) 32, 4 (Whole Number
121, December 2001) 18-25.
3
Paxos's notation
• Classes of agents:
– Proposers
– Acceptors
– Learners
• A single node can act as more than one
class of agent (often all three).
Paxos algorithm
• Phase 1 (prepare):
– A proposer selects a proposal number n and sends
a prepare request with number n to a majority of
acceptors.
– If an acceptor receives a prepare request with
number n greater than that of any prepare request
it has seen, it responds YES to that request with a
promise not to accept any more proposals
numbered less than n, and includes the highest-numbered proposal (if any) that it has accepted.
Paxos algorithm
• Phase 2 (accept):
– If the proposer receives a YES response to its
prepare request from a majority of acceptors,
then it sends an accept request to each of those
acceptors for a proposal numbered n with a
value v, where v is the value of the highest-numbered
proposal among the responses (or any value of the
proposer's choosing, if no response reported an
accepted proposal).
– If an acceptor receives an accept request for a
proposal numbered n, it accepts the proposal
unless it has already responded to a prepare
request having a number greater than n.
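The acceptor's rules in the two phases above can be sketched as follows. This is a minimal illustrative sketch, not code from the lecture; the class and method names are assumed:

```python
# Minimal sketch of a Paxos acceptor's two handlers, following the
# prepare/accept rules on this slide. Names are illustrative.

class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest prepare number promised so far
        self.accepted_n = -1    # number of the highest accepted proposal
        self.accepted_v = None  # value of the highest accepted proposal

    def on_prepare(self, n):
        """Phase 1: promise not to accept proposals numbered < n."""
        if n > self.promised_n:
            self.promised_n = n
            # reply YES, piggybacking the highest-numbered accepted proposal
            return ("YES", self.accepted_n, self.accepted_v)
        return ("NO", None, None)

    def on_accept(self, n, v):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return "ACCEPTED"
        return "REJECTED"
```

Note that a promise to a later prepare request (higher n) causes the acceptor to reject accept requests from earlier proposals, which is exactly the second rule above.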
Definition of chosen
• A value is chosen at proposal number n iff
a majority of acceptors accept that value in
phase 2 of proposal number n.
Paxos’s properties
• P1: Any proposal number is unique.
• P2: Any two majority sets of acceptors have
at least one acceptor in common.
• P3: The value sent out in phase 2 is the value
of the highest-numbered proposal among all
the responses in phase 1.
Proof of safety
• Claim: if a value v is chosen at proposal
number n, then any value sent out in phase 2
of any later proposal number must also be v.
• Proof (by contradiction): Let m be the smallest
proposal number greater than n whose phase-2
value is not v. The majority that answered m's
prepare requests intersects (P2) the majority that
accepted v at n, so by P3 the value sent at m is
the value of some proposal numbered in [n, m);
by minimality of m that value is v, a contradiction.
Learning a chosen value
• There are some options:
– Each acceptor, whenever it accepts a proposal,
informs all the learners.
– Acceptors inform a distinguished learner
(usually the proposer), which then broadcasts
the result.
Tunable knobs
• Acceptors have several options when responding:
– Prepare request: No/Yes
– Accept request: No/Yes if it didn't promise not
to do so
• Back-off time after abandoning a proposal:
exponential back-off/pre-assigned values
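The exponential back-off option above can be sketched as follows. This is an illustrative sketch only; the function name, parameters, and jitter choice are assumptions, not part of the lecture:

```python
import random

# Illustrative sketch: exponential back-off with random jitter for a
# proposer that abandons a proposal after losing to a higher-numbered one.
def backoff_delays(base=0.05, cap=2.0, attempts=5, seed=42):
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        # double the back-off window each attempt, capped at `cap`
        window = min(cap, base * (2 ** attempt))
        # jitter within the window to avoid dueling proposers colliding again
        delays.append(rng.uniform(0, window))
    return delays
```

Jitter matters here: without it, two dueling proposers that back off deterministically can keep retrying in lockstep.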
Error cases in basic Paxos
• Failure of an Acceptor
• Failure of a redundant Learner
• Failure of a Proposer
• Dueling Proposers (basic Paxos)
12
The Part-Time Parliament
[Lamport 88,98,01]
Recent archaeological discoveries on the island of
Paxos reveal that the parliament functioned
despite the peripatetic propensity of its part-time
legislators.
The legislators maintained consistent copies of the
parliamentary record, despite their frequent forays
from the chamber and the forgetfulness of their
messengers.
The Paxon parliament’s protocol provides a new way
of implementing the state-machine approach to the
design of distributed systems.
13
Annotation of TOCS 98 Paper
• This submission was recently discovered behind a filing
cabinet in the TOCS editorial office.
• …the author is currently doing field work in the Greek
isles and cannot be reached …
• The author appears to be an archeologist with only a
passing interest in computer science.
• This is unfortunate; even though the obscure ancient Paxon
civilization he describes is of little interest to most
computer scientists, its legislative system is an excellent
model for how to implement a distributed computer system
in an asynchronous environment.
14
The Setting
• The data (ledger) is replicated at n processes
(legislators)
• Operations (decrees) should be invoked (recorded)
at each replica (ledger) in the same order
• Processes (legislators) can fail (leave the
parliament)
• At least a majority of processes (legislators) must
be up (present in the parliament) in order to make
progress (pass decrees)
– Why majority?
15
The Practicality
• Overcomes message loss without
retransmitting entire history
• Tolerates crash and recovery
• Used in replicated file systems
– Frangipani
16
Failure Detector
• Ω – Leader
– Outputs one trusted process
– From some point on, all correct processes trust
the same correct process
• Can easily implement ◊S
• Is the weakest failure detector for consensus
[Chandra, Hadzilacos, Toueg 96]
17
A Natural Ω Implementation
• Use ◊P implementation
– in eventual synchrony model
• Output lowest id non-suspected process
• Ω is implementable also in some situations
where ◊P isn't
18
Crash-Recovery Model
• Processes that crash may later recover
• Processes can store information on stable
storage (disk) and retrieve it upon recovery
• Can be modeled as message loss
• Problem?
19
Eventually Reliable Links
• There is a time after which every message
sent by a correct process to a correct
process eventually arrives
• Usual failure-detector-based algorithms do
not work
20
The Paxos
Atomic Broadcast Algorithm
• Leader based: each process has an estimate
of who is the current leader
• To order an operation, a process sends it to
its current leader
• The leader sequences the operation and
launches a Consensus algorithm (Synod) to
fix the agreement
21
The (Synod) Consensus Algorithm
• Solves non-terminating consensus in
asynchronous system
– or consensus in a partial synchrony system
– or consensus using an Ω failure detector
• Overcomes crashes, recoveries, and
message loss
– can be modeled as message loss
22
The Consensus Algorithm Structure
• Two phases
• Leader contacts a majority in each phase
• There may be multiple concurrent leaders
• Ballots distinguish among values
proposed by different leaders
– unique, locally monotonically increasing
– correspond to rounds of the ◊S-based algorithm [MR]
– processes respond only to the leader with the highest
ballot seen so far
23
Ballot Numbers
• Pairs ⟨num, process id⟩
• ⟨n1, p1⟩ > ⟨n2, p2⟩
– if n1 > n2
– or n1 = n2 and p1 > p2
• Leader p chooses a unique, locally
monotonically increasing ballot number
– if the latest known ballot is ⟨n, q⟩
– p chooses ⟨n+1, p⟩
24
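The comparison rule above is exactly lexicographic ordering on (num, process id) pairs, which Python tuples implement directly. A minimal illustrative sketch (the function name is assumed):

```python
# Ballot numbers as (num, process_id) pairs compare lexicographically,
# exactly as Python tuples do, matching the rule on this slide.

def next_ballot(latest, my_pid):
    """Given the latest known ballot (n, q), choose (n+1, my_pid)."""
    n, _ = latest
    return (n + 1, my_pid)
```

Because process ids are unique, two distinct leaders can never produce equal ballots, which gives property P1 (unique proposal numbers).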
The Two Phases of Paxos
• Phase 1: prepare
– If you trust yourself by Ω (believe you are the leader)
• Choose new unique (using ids) ballot number
• Learn outcome of all smaller ballots from majority
• Phase 2: accept
– Leader proposes a value with his ballot number
– Leader gets majority to accept his proposal
– A value accepted by a majority can be decided
25
Paxos - Variables
• Type Rank
– pair ⟨num, proc⟩
• Variables:
– BallotNum: Rank, initially ⟨0,0⟩
– AcceptNum: Rank, initially ⟨0,0⟩
– AcceptVal: Value ∪ {⊥}, initially ⊥
26
Paxos Phase I: Prepare
• Periodically, until a decision is reached, do:
    if leader (by Ω) then
      BallotNum ← ⟨BallotNum.num+1, my proc id⟩
      send (“prepare”, BallotNum) to all
• Upon receive (“prepare”, rank) from i:
    if rank > BallotNum then
      BallotNum ← rank
      send (“ack”, rank, AcceptNum, AcceptVal) to i
27
Paxos Phase II: Accept
Upon receive (“ack”, BallotNum, b, val) from n-t processes:
  if all vals = ⊥ then myVal ← initial value
  else myVal ← received val with highest b
  send (“accept”, BallotNum, myVal) to all  /* proposal */
Upon receive (“accept”, b, v) with b ≥ BallotNum:
  AcceptNum ← b; AcceptVal ← v  /* accept proposal */
  send (“accept”, b, v) to all (first time only)
28
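The leader's value choice upon receiving n-t acks can be sketched as follows, with ⊥ modeled as `None`. This is an illustrative sketch; the function name and ack representation are assumptions:

```python
# Sketch of the leader's value choice in Phase II: if every ack carries
# no accepted value (all vals = ⊥), propose the leader's own initial
# value; otherwise propose the value with the highest accept ballot b
# among the acks. BOTTOM stands in for ⊥.

BOTTOM = None

def choose_value(acks, initial_value):
    """acks: list of (accept_num, accept_val) pairs from n-t processes."""
    accepted = [(b, v) for (b, v) in acks if v is not BOTTOM]
    if not accepted:
        return initial_value      # all vals = ⊥: free to propose our own
    # ballots are (num, pid) tuples, so max() compares them lexicographically
    return max(accepted, key=lambda bv: bv[0])[1]
```

This rule is what makes Lemma 2 hold: any value the leader sends either faces no smaller accepted proposal, or equals the highest-ranked one reported by a majority.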
Paxos – Deciding
Upon receive (“accept”, b, v) from n-t processes:
  decide v
  periodically send (“decide”, v) to all
Upon receive (“decide”, v):
  decide v
29
In Failure-Free Synchronous Runs
[Message-flow diagram: our Ω implementation always trusts process 1.
Process 1 sends (“prepare”, ⟨1,1⟩) to processes 1…n, collects
(“ack”, ⟨1,1⟩, r0, ⊥) from all, sends (“accept”, ⟨1,1⟩, v1) to all,
and every process decides v1.]
30
Correctness: Agreement
• Follows from the following Lemma 1:
If a proposal (b,v) is accepted by a majority
of the processes, then for every proposal (b’,
v’) with b’>b, v’=v.
• Because every decided value is sent in some
proposal that is accepted by a majority.
• Prove Agreement by induction, starting
from 1st proposal accepted by a majority.
31
To Prove Lemma 1
• Use Lemma 2 (invariant):
If a proposal (b,v) is sent, then there is a set
S consisting of a majority such that either
– no p in S accepts a proposal ranked less than b
(all vals = ⊥); or
– v is the value of the highest-ranked proposal
among proposals ranked less than b accepted by
processes in S (myVal = received val with
highest b).
32
To Prove Lemma 2
• A process can accept a proposal numbered b
if and only if it has not responded to a
prepare request having a number greater
than b.
33
Termination
• Assume no loss for a moment.
• Once there is one correct leader –
– It chooses the highest rank.
– No other process becomes a leader with a
higher rank.
– All correct processes “ack” its prepare message
and “accept” its accept message and decide.
34
What About Message Loss?
• Does not block in case of a lost message
– phase I can start with new rank even if previous
attempts never ended
• Conditional liveness:
If n-t correct processes including the leader can
communicate with each other
then they eventually decide
• Holds with eventually reliable links
35
Optimization: Fast Paxos
• Allow process 1 (only!) to skip Phase 1
– initialize BallotNum to ⟨1,1⟩
– propose its own initial value
• 2 steps in failure-free synchronous runs
• 2 steps for repeated invocations with the
same leader
– Common case
36
Atomic Broadcast
by Running A Sequence of
Consensus Instances
37
The Setting
• Data is replicated at n servers.
• Operations are initiated by clients.
• Operations need to be performed at all
correct servers in the same order (aka
state-machine replication).
38
Client-Server Interaction
(Benign Version)
• Leader-based: each process (client/server)
has an estimate of who is the current leader.
• A client sends a request to its current leader.
• The leader launches the Paxos consensus
algorithm to agree upon the order of the
request.
• The leader sends the response to the client.
39
Failure-Free Message Flow
[Message-flow diagram: client C sends a request to server S1; S1 runs
Phase 1 (“prepare” to S1…Sn, “ack” back to S1) and Phase 2 (“accept”
among S1…Sn), then S1 sends the response to C.]
40
Observation
• No consensus values are sent in Phase 1.
– Leader chooses largest unique ballot number.
– Gets a majority to “vote” for this ballot number.
– Learns the outcome of all smaller ballots from
this majority.
• In Phase 2, leader proposes either its own
initial value or latest value it learned in
Phase 1.
41
Message Flow: Take 2
[Message-flow diagram, take 2: client C sends a request to S1; Phase 1
(“prepare”/“ack” among S1…Sn) is off the request's critical path, so
only Phase 2 (“accept”) runs between the request and S1's response to C.]
42
Optimization: Fast Paxos
• Run Phase 1 only when the leader changes.
– Phase 1 is called “view change” or “recovery mode”.
– Phase 2 is the “normal mode”.
• Each message includes BallotNum (from the last
Phase 1) and ReqNum.
– E.g., ReqNum = 7 when we’re trying to agree what the
7th operation to invoke on the state machine should be.
• Respond only to messages with the “right”
BallotNum.
43
Fast Paxos: Normal Mode
Upon receive (“request”, v) from client:
  if I am not the leader then forward to leader
  else  /* propose v as request number n */
    ReqNum ← ReqNum + 1
    send (“accept”, BallotNum, ReqNum, v) to all
Upon receive (“accept”, b, n, v) with b = BallotNum:
  /* accept proposal for request number n */
  AcceptNum[n] ← b; AcceptVal[n] ← v
  send (“accept”, b, n, v) to all (first time only)
44
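The leader's side of normal mode above can be sketched as follows. The class name and message shape are illustrative assumptions, not from the lecture:

```python
# Minimal sketch of the normal-mode leader from this slide: Phase 1 is
# skipped, and each client request gets the next consensus slot under
# the ballot from the last completed Phase 1 (view change).

class Leader:
    def __init__(self, ballot):
        self.ballot = ballot   # BallotNum from the last Phase 1
        self.req_num = 0       # ReqNum: next state-machine slot

    def on_request(self, v):
        """Assign v the next request number and build the accept message."""
        self.req_num += 1
        # in a real system this message would be broadcast to all servers;
        # here it is returned so the shape is visible
        return ("accept", self.ballot, self.req_num, v)
```

Each message carries both the ballot (so followers can reject stale leaders) and the request number (so slots for different operations stay independent).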
Recovery Mode
• The new leader must learn the outcome of all the
pending requests that have smaller BallotNums.
– The “ack” messages include AcceptNums and
AcceptVals of all pending requests.
• For all pending requests, the leader sends “accept”
messages.
• What if there are holes?
– E.g., leader learns of request number 13 and not of 12.
– Fill in the gaps with dummy “do nothing” requests.
45
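The gap-filling step above can be sketched as follows; the function name and the `"no-op"` sentinel are illustrative assumptions:

```python
# Sketch of gap filling during recovery: the new leader learns the
# pending request numbers from the "ack" messages and fills any missing
# slots with dummy "do nothing" requests, so the sequence has no holes.

NOOP = "no-op"

def fill_gaps(known_slots):
    """known_slots: dict mapping req_num -> value learned from acks."""
    if not known_slots:
        return {}
    top = max(known_slots)
    # every slot from 1 up to the highest known one must be decided;
    # unknown slots get a no-op so later slots can safely execute
    return {n: known_slots.get(n, NOOP) for n in range(1, top + 1)}
```

For example, if the leader learns of request 13 but not 12, slot 12 is decided as a no-op so the state machine can still execute operations in order.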
Leslie Lamport’s Reflections
• Inspired by my success at popularizing the
consensus problem by describing it with
Byzantine generals, I decided to cast the algorithm
in terms of a parliament on an ancient Greek
island.
• To carry the image further, I gave a few lectures in
the persona of an Indiana-Jones-style
archaeologist.
• My attempt at inserting some humor into the
subject was a dismal failure.
46
The History of the Paper
by Lamport
• I submitted the paper to TOCS in 1990. All three referees
said that the paper was mildly interesting, though not very
important, but that all the Paxos stuff had to be removed. I
was quite annoyed at how humorless everyone working in
the field seemed to be, so I did nothing with the paper.
• A number of years later, a couple of people at SRC needed
algorithms for distributed systems they were building, and
Paxos provided just what they needed. I gave them the
paper to read and they had no problem with it. So, I
thought that maybe the time had come to try publishing it
again.
47