Raft consensus
Landon Cox
February 15, 2017
Paxos
• ACM TOCS: Transactions on Computer Systems
• Submitted: 1990. Accepted: 1998
• Introduced:
Butler W. Lampson
http://research.microsoft.com/en-us/um/people/blampson/
Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at
MIT… He was one of the designers of the SDS 940 time-sharing system, the Alto personal
distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the
Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the
Microsoft Palladium high-assurance stack, and several programming languages. He received
the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer
Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the
NAE’s Draper Prize in 2004.
Barbara Liskov
• MIT professor
• 2008 Turing Award
• “View-stamped replication”
• PODC ’88
• Very similar to Raft
• Different election process
State machines
At any moment, a machine exists in a “state”.
What is a state? Think of it as a set of named variables and their values.
State machines
Clients can ask a machine about its current state.
Client: “What is your state?”
Machine: “My state is ‘2’”
(Diagram: a machine with states numbered 1 through 6.)
State machines
“Actions” change the machine’s state.
What is an action? A command that updates the values of named variables.
Is an action’s effect deterministic? For our purposes, yes. Given a state and an action, we can determine the next state with 100% certainty.
Is the effect of a sequence of actions deterministic? Yes. Given a state and a sequence of actions, we can be 100% certain of the end state.
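A minimal sketch of the idea above, assuming a state is a dictionary of named variables and an action is a hypothetical `(name, value)` pair that overwrites one variable. The names `apply_action` and `apply_all` are illustrative and do not come from the slides.

```python
from typing import Dict, List, Tuple

State = Dict[str, int]
Action = Tuple[str, int]  # assumed shape: (variable name, new value)


def apply_action(state: State, action: Action) -> State:
    """Return the next state; the input state is left unmodified."""
    name, value = action
    next_state = dict(state)
    next_state[name] = value
    return next_state


def apply_all(state: State, actions: List[Action]) -> State:
    """Apply a sequence of actions in order and return the end state."""
    for action in actions:
        state = apply_action(state, action)
    return state


actions = [("x", 3), ("y", 1), ("y", 5)]
end_state = apply_all({}, actions)
print(end_state)
```

Because `apply_action` depends only on its arguments, replaying the same actions from the same starting state always yields the same end state.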
Replicated state machines
Each state machine should compute the same state, even if some fail.
(Diagram: three clients each ask a different replica “What is the state?”)
Replicated state machines
What has to be true of the actions that clients submit? They must be applied in the same order on every machine.
(Diagram: three clients submit “Apply action a”, “Apply action b”, and “Apply action c” to different replicas.)
State machines
How should a machine make sure it applies actions in the same order across reboots?
Store them in a log!
(Diagram: a log holding a sequence of actions.)
Replicated state machines
Can reduce problem of consistent, replicated states to consistent, replicated logs
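To illustrate the reduction, here is a small sketch in which each replica rebuilds its state by replaying a log. It assumes actions are hypothetical `(name, value)` assignments; `replay` is an illustrative helper, not something defined in the slides.

```python
def replay(log):
    """Rebuild a state by applying logged (name, value) actions in order."""
    state = {}
    for name, value in log:
        state[name] = value
    return state


log = [("x", 3), ("y", 1), ("y", 5)]

# Three replicas holding identical logs compute identical states.
replica_states = [replay(list(log)) for _ in range(3)]

# A replica that applied the same actions in a different order may diverge.
reordered_state = replay([("y", 5), ("y", 1), ("x", 3)])
```

If the logs agree entry by entry, the states agree; if only the set of actions agrees but the order differs, the states can differ (here `y` ends as 5 in one case and 1 in the other).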
Replicated state machines
How to make sure that logs are consistent? Two-phase commit? …
Replicated state machines
What is the heart of the matter?
Have to agree on the leader, outside of the logs.
(Diagram: every server records Leader=L; a client sends “Apply action a” to the leader.)
Key elements of consensus
• Leader election
• Who is in charge?
• Log replication
• What are the actions and what is their order?
• Safety
• What is true for all states, in all executions (including failures)?
• e.g., either we haven’t agreed or we all agree on the same value
Server states
• Three states: leader, follower, candidate
(Diagram: three server states, C = candidate, F = follower, L = leader.)
Server state: follower
• Passive state: respond to candidates and leaders
Server state: leader
• Server handles all client requests
What should happen if a client sends a request to a follower?
Follower forwards request to leader.
Server state: candidate
• An intermediate state, used during elections
Time divided into terms
(Timeline: Term 1, Term 2, T3, Term 4. A term begins with an election, during which the leader is unknown, followed by normal operation, during which the leader is known. T3 contains only an election.)
What happened here?
Election failed
Terms as a logical clock
• All servers maintain the current term
  • Terms increase monotonically
  • Maintained as a logical clock
  • Terms are exchanged whenever servers communicate
• What if A’s term is bigger than B’s?
  • B updates its term to A’s
• What if A’s term is bigger and B thinks of itself as the leader?
  • B reverts to a follower state
• What if A’s term is bigger, and it receives a request from B?
  • A rejects B’s request
  • B must be up to date to issue requests
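The three rules above can be sketched as a single function that runs whenever a message arrives. This is an illustration under stated assumptions: the `Server` class and the `receive` function are hypothetical names, and a real implementation would also carry the message payload.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    current_term: int = 0
    role: str = "follower"  # "follower", "candidate", or "leader"


def receive(receiver: Server, sender_term: int) -> bool:
    """Process the term attached to an incoming message.

    Returns True if the message may be processed, False if it is rejected
    because the sender's term is stale.
    """
    if sender_term > receiver.current_term:
        # The sender is ahead: adopt its term and step down if necessary.
        receiver.current_term = sender_term
        receiver.role = "follower"
        return True
    # Equal terms are fine; a smaller term means the sender is out of date.
    return sender_term == receiver.current_term


b = Server("B", current_term=2, role="leader")
accepted = receive(b, sender_term=3)   # A (term 3) contacts B

a = Server("A", current_term=5)
rejected = receive(a, sender_term=3)   # B (term 3) sends a request to A
```

In the first call B adopts term 3 and reverts to follower; in the second call A keeps term 5 and rejects the request.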
Server state: follower
(Diagram: servers S1, S2, and S3, each a follower with current term = 0.)
All servers start as followers.
All servers have local timers.
Note: no bootstrapping problem!
A server remains a follower as long as it receives periodic valid messages from a leader or candidate.
Such a message is called a “heartbeat”.
What should a server assume if it receives no heartbeat?
Assume there is no viable leader and start an election.
Who should the server nominate?
How about itself? At least it knows that it’s running.
Server state: candidate
(Diagram: S1 is a candidate with current term = 1; S2 and S3 are followers with current term = 0.)
To start an election: increment the current term and set state to candidate.
Need to collect votes. For whom should the server vote?
Itself, of course! Major qualification: it’s running.
(Votes so far: S1=S1, S2=?, S3=?)
How should S2 and S3 respond to the vote request?
Increment their term and vote for S1. Why vote for S1? Our goal is consensus, and we know the server collecting votes voted for itself.
(Votes: S1=S1, S2=S1, S3=S1; all three servers now have current term = 1.)
What should S1 do next?
Count votes (majority wins), make itself the leader, and start sending heartbeats.
(Diagram: S1 is the leader; S1, S2, and S3 all record Leader = S1 with current term = 1.)
How many faults can we tolerate?
One. Two of the three servers must vote for the same new leader.
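The “two out of three” answer generalizes to any cluster size. The arithmetic can be sketched as follows; the function names `majority` and `tolerated_failures` are illustrative only.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that is more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Servers that can fail while a majority can still be formed."""
    return cluster_size - majority(cluster_size)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```

Note that going from three servers to four raises the majority size without raising the number of tolerated failures, which is one reason odd cluster sizes are common.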
Server state: candidate
(Diagram: S1 and S3 both call elections; all three servers have current term = 1. S1’s tally: S1=S1, S2=?, S3=?. S3’s tally: S1=?, S2=?, S3=S3.)
Who votes for whom if S1 and S3 both call elections?
S2 votes for the server that asked first.
(Votes: S1=S1, S2=S3, S3=S3.)
What does S1 do if it loses the election?
Moves back to a follower state.
(Diagram: S3 is the leader; S1, S2, and S3 all record Leader = S3 with current term = 1.)
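A sketch of the “vote for whoever asked first” rule, assuming each server remembers at most one vote per term. The `Voter` class, its `voted_for` field, and `request_vote` are hypothetical names; the log-based restriction on voting that the slides introduce later is deliberately left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None  # candidate voted for in current_term


def request_vote(voter: Voter, candidate: str, candidate_term: int) -> bool:
    """Grant at most one vote per term, first come, first served."""
    if candidate_term < voter.current_term:
        return False                      # stale candidate
    if candidate_term > voter.current_term:
        voter.current_term = candidate_term
        voter.voted_for = None            # new term, no vote cast yet
    if voter.voted_for in (None, candidate):
        voter.voted_for = candidate
        return True
    return False                          # already voted for someone else


s2 = Voter()
s3_granted = request_vote(s2, "S3", candidate_term=1)  # S3 asks first
s1_granted = request_vote(s2, "S1", candidate_term=1)  # S1 asks second

votes_for_s3 = 1 + int(s3_granted)  # S3's own vote plus S2's
votes_for_s1 = 1 + int(s1_granted)  # S1's own vote only
```

With three servers a majority is two votes, so S3 wins and S1 returns to the follower state, matching the tally on the slide.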
Server state: follower
(Diagram: S1, S2, and S3 are followers with current term = 0.)
Can all three servers nominate themselves?
Sure!
(Diagram: all three servers become candidates with current term = 1, and each votes for itself: S1=S1, S2=S2, S3=S3.)
Could this happen indefinitely?
Yes, there is no way to prevent this from happening.
The worst possible thing is still possible.
How do we make this less likely to occur?
Randomize election timeouts, i.e., wait a random period before starting a new election.
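A minimal sketch of randomized election timeouts. The 150 to 300 ms range is an assumption chosen for illustration (the slides give no numbers), and `election_timeout` is a hypothetical helper.

```python
import random


def election_timeout(rng: random.Random, low_ms: float = 150.0, high_ms: float = 300.0) -> float:
    """Pick a fresh timeout uniformly at random from [low_ms, high_ms]."""
    return rng.uniform(low_ms, high_ms)


rng = random.Random(0)  # seeded only to make the example repeatable
timeouts = {name: election_timeout(rng) for name in ("S1", "S2", "S3")}
first_to_time_out = min(timeouts, key=timeouts.get)
```

Because the servers draw different timeouts, one of them usually times out first, asks for votes before the others become candidates, and wins. Split votes remain possible, only less likely, and each retry draws a new timeout.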
Server state: candidate
(Diagram: S1, S2, and S3 with current term = 1.)
Can we ever have two leaders?
Not in the same term. Each server casts at most one vote per term, so with three nodes the votes either split or converge on one candidate.
Key elements of consensus
• Leader election
• Who is in charge?
• Log replication
• What are the actions and what is their order?
• Safety
• What is true for all states, in all executions (including failures)?
• e.g., either we haven’t agreed or we all agree on the same value
Each node maintains an action log.
Each entry contains an action and a term.
The term indicates when the leader
received the action.
(Diagram: all three servers record Leader=L and hold the same log: (term 1, x←3), (term 1, y←1).)
When a request comes in, the leader appends the action to its log.
(Diagram: a client sends y←5; the leader’s log becomes (term 1, x←3), (term 1, y←1), (term 1, y←5).)
Next, the leader tells the other servers to append the action.
The leader waits for confirmation that the entry was appended.
(Diagram: both followers append (term 1, y←5) and reply “ack”.)
As with leader election, majority rules.
An entry is committed once the leader that received it has replicated the entry on a majority of servers (for now).
(Diagram: the entry y←5 is on 3/3 servers, so it commits.)
If the action commits, the leader updates its state machine.
The leader reports success to the client and to the other servers if the action commits.
(Diagram: all three logs hold (term 1, x←3), (term 1, y←1), (term 1, y←5); the leader notifies the client and both followers.)
What happens if one follower fails?
The action will still commit, since 2/3 are alive.
What happens if both followers fail (1) after adding the entry to their logs and (2) before acking?
The leader will keep trying to append. If one server comes back, the action will eventually commit.
How long might a client have to wait in this case?
A long time, i.e., until two machines append the action.
What happens if the leader fails (1) after adding the entry to its log and (2) before contacting followers?
The action will not commit. New entries will occur under the next term.
As with leader election, majority rules for commit.
For now: entry is committed once the leader that received it has
replicated the entry on a majority of servers
Log index   1      2      3      4      5      6      7      8
Leader      1:x←3  1:y←1  1:y←9  2:x←2  3:x←0  3:y←7  3:x←5  3:x←4
Follower    1:x←3  1:y←1  1:y←9  2:x←2  3:x←0
Follower    1:x←3  1:y←1  1:y←9  2:x←2  3:x←0  3:y←7  3:x←5  3:x←4
Follower    1:x←3  1:y←1
Follower    1:x←3  1:y←1  1:y←9  2:x←2  3:x←0  3:y←7  3:x←5
(Each entry is shown as term:action.)
In this example, which committed entry has the highest index?
Entry 7
Committing a log entry also commits all entries in leader’s log with lower index.
Note: the current leader’s index is the one that counts!
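For the five logs in this example, the highest committed index can be computed from how far each server’s log matches the leader’s. The sketch below assumes every follower’s log is a prefix of the leader’s, as it is in the figure, and it ignores the extra term condition that the slides add later. `highest_majority_index` is an illustrative name.

```python
from typing import List


def highest_majority_index(match_index: List[int]) -> int:
    """Largest log index that is stored on a majority of servers.

    match_index[i] is the last index at which server i's log matches the
    leader's log (the leader's own entry is included in the list).
    """
    ranked = sorted(match_index, reverse=True)
    # With the values sorted in descending order, the element at position
    # len // 2 is reached by at least a majority of the servers.
    return ranked[len(ranked) // 2]


# Leader holds 8 entries; the four followers hold 5, 8, 2, and 7 entries.
committed = highest_majority_index([8, 5, 8, 2, 7])
print(committed)  # 7
```

Entry 7 is on three of five servers, and so is every entry before it, which is why committing one entry also commits all entries with lower indexes.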
<sigh>This picture is far too orderly and easy to understand.
No guarantee the world will look like this.</sigh>
This can be the state of the logs when a leader comes to power.
Each server has assigned each entry (1) a term, and (2) an index.
Log index   1  2  3  4  5  6  7  8  9  10 11 12
Leader      1  1  1  4  4  5  5  6  6  6
(a)         1  1  1  4  4  5  5  6  6
(b)         1  1  1  4
(c)         1  1  1  4  4  5  5  6  6  6  6
(d)         1  1  1  4  4  5  5  6  6  6  7  7
(e)         1  1  1  4  4  4  4
(f)         1  1  1  2  2  2  3  3  3  3  3
(Each cell is the term of the entry at that index; (a) through (f) are followers.)
What is the current term?
We are in term 8.
Why aren’t there any entries for term 8?
Because the leader hasn’t received any requests yet.
What’s wrong with the logs of (a) and (b)?
They are missing log entries.
How might this have happened?
They could have gone offline and come back: (a) during term 6, (b) during term 4.
What’s wrong with the logs of (c) and (d)?
They have extra log entries.
How might this have happened?
(c) was leader for term 6, added an entry, and crashed; (d) was leader for term 7, added entries, and crashed.
What’s wrong with the logs of (e) and (f)?
They have extra log entries and missing log entries.
How could this have happened to (f)?
(f) was leader for term 2, added several entries, and crashed. (f) quickly restarted and became leader for term 3, added more entries, and crashed before any entries from terms 2 or 3 could commit.
Goal: (eventually) converge on a sane state from logs like this.
Servers have to keep track of the committed log entry with the highest index. What are those here?
2 for all three servers.
(Diagram: a client sends y←5 to the leader. The leader’s log holds (term 1, x←3), (term 1, y←1), (term 1, y←5) at indexes 1 to 3; each follower’s log holds (term 1, x←3), (term 1, y←1) at indexes 1 and 2.)
The leader reports the highest index of its committed actions when forwarding a request to followers.
This is how followers update their state machines.
(Diagram: the leader’s message to each follower carries Term=1, y←5, High=2.)
The leader also reports the index of the entry immediately preceding the current append request.
(Diagram: the message now carries Term=1, y←5, High=2, Pred=2.)
(Diagram: one follower’s log holds only (term 1, x←3) at index 1.)
Could this happen?
Yes, if the follower failed before it received the action with index 2 from term 1.
Should the recovered follower append the new action?
If it did, then it would have a different action in index 2 during term 1. Better to reject the new action and append the missing committed actions first.
Can the new action commit while the recovered server receives the actions it missed while down?
Yes, since we can still achieve a majority without it.
This gives us a very important system property: two entries in different logs with the same index and term always have the same action.
Why?
In a term, the leader creates at most one entry with a given index. Followers catch up first when behind.
Log Matching Property
• Two entries in different logs with same index and term
  → The entries store the same action
  → The logs are identical in all preceding entries
• Does the matching property hold below?
Log index   1  2  3  4  5  6  7  8  9  10 11 12
Leader      1  1  1  4  4  5  5  6  6  6
Follower    1  1  1  4  4  5  5  6  6  6  7  7
Follower    1  1  1  4  4  4  4
Follower    1  1  1  2  2  2  3  3  3  3  3
(Each cell is the term of the entry at that index.)
Log Matching Property
• Two entries in different logs with same index and term
  → The entries store the same action
  → The logs are identical in all preceding entries
• Why is the first part always true?
  • One leader/term
  • If leader fails, term changes
  • Leader inserts a new entry once at a given index
  • NOTE: this is not enough to ensure the second part
Log Matching Property
• Two entries in different logs with same index and term
  → The entries store the same action
  → The logs are identical in all preceding entries
• Ensuring the second part requires extra check by follower
  • Leader sends followers append(term, last_index, action)
  • Follower must check that last entry has same term and last_index
  • If not, the follower refuses the new entry
  • Otherwise, follower may append new entry containing action
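The follower-side check can be sketched as below. This is an interpretation rather than the slides’ exact message format: it assumes the leader sends the index and term of the entry that precedes the new one (called `prev_index` and `prev_term` here), and that entries are hypothetical `(term, action)` pairs. The function name `follower_append` is illustrative.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # assumed shape: (term, action)


def follower_append(log: List[Entry], prev_index: int, prev_term: int, entry: Entry) -> bool:
    """Append entry only if the follower's last entry is (prev_index, prev_term).

    Log indexes start at 1, so an empty log has last index 0 (and term 0).
    """
    last_index = len(log)
    last_term = log[-1][0] if log else 0
    if (last_index, last_term) != (prev_index, prev_term):
        return False  # refuse: this follower must catch up first
    log.append(entry)
    return True


up_to_date = [(1, "x<-3"), (1, "y<-1")]
recovered = [(1, "x<-3")]           # missed the entry at index 2

ok_1 = follower_append(up_to_date, prev_index=2, prev_term=1, entry=(1, "y<-5"))
ok_2 = follower_append(recovered, prev_index=2, prev_term=1, entry=(1, "y<-5"))
```

The up-to-date follower accepts the entry, while the recovered follower refuses it, so it never places y←5 at index 2.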
Log Matching Property
• Two entries in different logs with same index and term
  → The entries store the same action
  → The logs are identical in all preceding entries
• Do logs always agree on indexes, terms?
  • No, the same index in different logs may hold different actions and terms
  • Goal of repairing logs is to increase entries w/ same index, term
(Example logs, shown as the term of each entry:)
  1  1  1  4  4  5  5  6  6  6  7  7
  1  1  1  4  4  4  4
  1  1  1  2  2  2  3  3  3  3  3
How do we make the logs look more like one another?
Raft: How do we make the logs look like the leader’s?
(Figure: the leader’s log and the logs of followers (a) through (f), with the term of each entry at log indexes 1 through 12.)
Repairing logs
• Leader Append-only Property
  • Leader’s log is treated as ground truth
  • Leader never overwrites or deletes log entries
  • Leader only appends to its log
  • Basic idea: make the leader’s life simple
• How would a leader know that follower is missing entries?
  • Follower refuses an append request (based on last_index)
  • Leader must monitor each follower’s progress (next_index)
  • Send log entries until followers are caught up
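The repair loop can be sketched as follows. To keep it short, logs hold only the term of each entry, and the follower check follows the usual Raft formulation (compare the entry at the preceding index, and discard a conflicting suffix), which is an assumption that goes slightly beyond the “last entry” wording above. `follower_append` and `repair` are illustrative names.

```python
from typing import List


def follower_append(log: List[int], prev_index: int, prev_term: int, entry_term: int) -> bool:
    """Logs hold entry terms only; list position i stores log index i + 1."""
    if prev_index > len(log):
        return False                       # the preceding entry is missing
    if prev_index > 0 and log[prev_index - 1] != prev_term:
        return False                       # the preceding entry has another term
    if len(log) > prev_index and log[prev_index] != entry_term:
        del log[prev_index:]               # conflicting suffix: discard it
    if len(log) == prev_index:
        log.append(entry_term)
    return True


def repair(leader_log: List[int], follower_log: List[int], next_index: int) -> int:
    """Send entries until the follower is caught up; return the refusal count."""
    refusals = 0
    while next_index <= len(leader_log):
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1] if prev_index > 0 else 0
        if follower_append(follower_log, prev_index, prev_term, leader_log[next_index - 1]):
            next_index += 1                # success: move on to the next entry
        else:
            next_index -= 1                # refusal: back up one entry and retry
            refusals += 1
    return refusals


leader = [1, 1, 1, 4, 4, 5, 5, 6, 6, 6, 8]   # a term-8 entry was added at index 11
follower_b = [1, 1, 1, 4]
follower_f = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]

refusals_b = repair(leader, follower_b, next_index=11)
refusals_f = repair(leader, follower_f, next_index=11)
```

Follower (b) refuses six requests before accepting the term-4 entry at index 5, and follower (f) loses its uncommitted entries from terms 2 and 3. Both logs end up identical to the leader’s, and the leader’s own log is never modified.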
Leader initially assumes followers’ logs look like hers
next_index[a]=11
next_index[b]=11
next_index[c]=11
Log index   1  2  3  4  5  6  7  8  9  10 11 12
Leader      1  1  1  4  4  5  5  6  6  6
(a)         1  1  1  4  4  5  5  6  6
(b)         1  1  1  4
(c)         1  1  1  4  4  5  5  6  6  6  6
(Each cell is the term of the entry at that index.)
next_index[a]=11
next_index[b]=11
next_index[c]=11
(Diagram: the leader appends a term-8 entry at index 11 and sends it to (a), (b), and (c), each with Prev={term 6, index 10}.)
(Diagram: all three followers refuse the request, marked ✗.)
next_index[a]=10
next_index[b]=10
next_index[c]=10
(Diagram: with next_index at 10 for (a), (b), and (c), the leader sends the term-6 entry at index 10 to each follower with Prev={term 6, index 9}.)
(Diagram: (a) accepts ✓, (b) refuses ✗, (c) accepts ✓.)
What should happen to this log entry? (The slide highlights a follower entry that conflicts with the leader’s log.)
It should be deleted
next_index[a]=11
next_index[b]=9
next_index[c]=11
What entry does the leader send to (a) next?
Entry at index 11 (term 8); this succeeds.
What entry does the leader send to (b) next?
Entry at index 9 (term 6); this fails and next_index[b] ← 8.
Eventually, (b) will accept the entry from term 4 at index 5, and it will catch up with everyone else.
What entry does the leader send to (c) next?
Entry at index 11 (term 8); this succeeds.
As described, this should make you uncomfortable. Why?
We could choose a bad leader (e.g., one w/ an empty log).
This could delete committed entries!
How do we prevent this from happening?
We have to be smarter about electing a leader…
Refined leader election
• Goal
  • Leader must have all committed actions
• How actions are committed
  • Leader accepts action from client
  • Leader forwards action to followers
  • Majority append must occur before next term
  • If new term starts before majority append, no commit
  • Commit cannot occur until later action commits
Refined leader election
• Initial election criterion
• Any candidate can be elected
• Normally the candidate who starts the election wins
• Problem: leader can force stale log on followers
• What must be true of an elected leader?
• Leader must have all prior committed actions
• If leader is missing a committed action, it may be lost
Refined leader election
• New election criterion
• Vote for candidate with most “up to date” log
• If voter’s log is more recent than candidate’s, vote self
• Which log is more up to date?
• (a) because its last entry is from term 4 (> term 3)
• If last log entry is from later term, log is more up to date
(a)  1  1  1  4  4  4  4
(b)  1  1  1  2  2  2  3  3  3  3  3
Refined leader election
• New election criterion
• Vote for candidate with most “up to date” log
• If voter’s log is more recent than candidate’s, vote self
• Which log is more up to date?
• (b) because its last term-4 entry has higher index
• When last log entries are from same term, higher index is more up to date
(a)  1  1  1  4  4  4  4
(b)  1  1  1  4  4  4  4  4
Refined leader election
• New election criterion
• Vote for candidate with most “up to date” log
• If voter’s log is more recent than candidate’s, vote self
• How to decide which log is more up to date
• If last log entry is from later term, log is more up to date
• When last log entries are from same term, higher index is more up to date
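The two comparison rules amount to comparing `(last term, last index)` pairs. A minimal sketch, with logs written as lists of entry terms; `log_key`, `more_up_to_date`, and `grants_vote` are illustrative names, and the vote rule shown (grant when the candidate’s log is at least as up to date as the voter’s) is the usual Raft formulation.

```python
from typing import List, Tuple


def log_key(log: List[int]) -> Tuple[int, int]:
    """(term of last entry, index of last entry); an empty log is (0, 0)."""
    last_term = log[-1] if log else 0
    return (last_term, len(log))


def more_up_to_date(x: List[int], y: List[int]) -> bool:
    """True if log x is strictly more up to date than log y."""
    return log_key(x) > log_key(y)   # tuples compare term first, then index


def grants_vote(voter_log: List[int], candidate_log: List[int]) -> bool:
    """A voter supports a candidate whose log is at least as up to date."""
    return log_key(candidate_log) >= log_key(voter_log)


# First example: a later last term wins even though the log is shorter.
a1 = [1, 1, 1, 4, 4, 4, 4]
b1 = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]

# Second example: same last term, so the longer log wins.
a2 = [1, 1, 1, 4, 4, 4, 4]
b2 = [1, 1, 1, 4, 4, 4, 4, 4]
```

Python compares tuples element by element, which gives exactly the “later term first, then higher index” ordering.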
Refined leader election
• Goal
  • Leader must have all committed actions
• Recall how actions had been committed
  • Leader accepts action from client
  • Leader forwards action to followers
  • If a majority of followers append action, it’s committed
  • If action commits, leader returns to client
  • If leader fails, future leaders can commit
• Unfortunately, there’s a problem …
Committing actions
Logs (term of each entry, starting at log index 1): (a) 1 2; (b) 1 2; (c) 1; (d) 1; (e) 1
How was (e) elected if (b) has a more up-to-date log?
Votes from (c) and (d)
Logs: (a) 1 2; (b) 1 2; (c) 1; (d) 1; (e) 1 3
Logs: (a) 1 2 4; (b) 1 2; (c) 1; (d) 1; (e) 1 3
What is the first thing that (a) will do?
Forward entry at index 3 to followers, but this will fail at (c), because (c) does not have entry at index 2 yet
Which entry will (a) send to (c) first?
Entry at index 2
Logs: (a) 1 2 4; (b) 1 2; (c) 1 2; (d) 1; (e) 1 3
Has the entry at index 2 been committed?
By our previous definition, yes. But note that it reached a majority of nodes in term 4, not term 2 (i.e., after it was created)
How could (e) have been elected?
Votes from (b), (c), and (d)
What is the first thing that (e) will do?
Forward entry at index 2 to followers
What will happen to other logs’ entries at index 2?
They will be replaced by the leader’s entry from term 3
What will happen to (a)’s entry at index 3?
It will be deleted
Logs: (a) 1 3; (b) 1 3; (c) 1 3; (d) 1 3; (e) 1 3
So, are actions stored on a majority really committed?
No, only if they reach a majority in the term they’re created.
Logs: (a) 1 2; (b) 1 2; (c) 1 2; (d) 1; (e) 1
Can (e) win the election?
No, only (d) will vote for it
Can (b) or (c) win the election?
Yes, either can win.
What will (b) send to (d) and (e) first?
Entry at index 2
Logs: (a) 1 2; (b) 1 2 3; (c) 1 2 3; (d) 1 2 3; (e) 1 2 3
• (say 2 reached majority in term 4)
Logs: (a) 1 2 4; (b) 1 2; (c) 1 2; (d) 1; (e) 1 3
Can entry 2 ever commit?
Yes, if a subsequent entry (e.g., index 3, term 4) commits, all prior entries are also committed
Logs: (a) 1 2 4; (b) 1 2 4; (c) 1 2 4; (d) 1; (e) 1 3
Entry 3 is now committed, so entry 2 is as well
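The refined commit rule can be sketched as a leader-side check: an index becomes committed only when it is stored on a majority and its entry was created in the leader’s current term. Logs are lists of entry terms, and `advance_commit` and `match_index` are illustrative names rather than anything defined in the slides.

```python
from typing import List


def advance_commit(leader_log: List[int], match_index: List[int],
                   current_term: int, commit_index: int) -> int:
    """Return the new commit index for a leader.

    match_index lists, for every server including the leader, the last
    index at which that server's log matches the leader's log.
    """
    ranked = sorted(match_index, reverse=True)
    candidate = ranked[len(ranked) // 2]      # highest index on a majority
    if candidate > commit_index and leader_log[candidate - 1] == current_term:
        return candidate                      # also commits all earlier entries
    return commit_index


leader_log = [1, 2, 4]                        # server (a), leader in term 4

# Index 2 (term 2) is on (a), (b), (c): a majority, but from an old term.
before = advance_commit(leader_log, [3, 2, 2, 1, 1], current_term=4, commit_index=1)

# Index 3 (term 4) reaches (b) and (c): committed, and index 2 with it.
after = advance_commit(leader_log, [3, 3, 3, 1, 1], current_term=4, commit_index=1)
```

Checking only the highest majority index is enough in this sketch because terms never decrease along a log: if that entry is from an older term, so is every entry before it.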
Intertwining elections and log repair
Log index   1  2  3  4  5  6  7  8  9  10 11 12
(a)         1  1  1  4  4  5  5  6  6  6
(b)         1  1  1  4  4  4  4
(c)         1  1  1  2  2  2  3  3  3  7
Is it possible for servers’ logs to reach this state?
Why or why not?
Intertwining elections and log repair
Log index   1  2  3  4  5  6  7  8  9  10 11 12
(a)         1  1  1  5  5  6  6  6  6
(b)         1  1  1  5  5  5  5
(c)         1  1  1  2  4
Is it possible for servers’ logs to reach this state?
Why or why not?
Intertwining elections and log repair
Log index   1  2  3  4  5  6  7  8  9  10 11 12
(a)         1  1  1  2  6
(b)         1  1  1  2  4
(c)         1  1  1  2  5
Is it possible for servers’ logs to reach this state?
Why or why not?
Key elements of consensus
• Leader election
• Who is in charge?
• Log replication
• What are the actions and what is their order?
• Safety
• What is true for all states, in all executions?
• e.g., either we haven’t agreed or we all agree on the same value
Review: Log Matching Property
• Two entries in different logs with same index and term
  → The entries store the same action
  → The logs are identical in all preceding entries
• Critical for Safety proof that follows
• Why is the first part always true?
  • One leader/term
  • If leader fails, term changes
  • Leader inserts a new entry once at a given index
  → every action is assigned a unique term and index
  • NOTE: this is not enough to ensure the second part
• Ensuring the second part requires extra check by follower
  • Leader sends followers append(term, last_index, action)
  • Follower must check that last entry has same term and last_index
  • If not, the follower refuses the new entry
  • Otherwise, follower may append new entry containing action
• Do logs always agree on indexes, terms?
  • No, the same index in different logs may hold different actions and terms
  • Goal of repairing logs is to increase entries w/ same index, term
Safety
• Leader Completeness Property
  If a log entry is committed in a given term
  → entry will be in leaders’ logs in all future terms
• Proof setup
  • Majorities required to commit and elect
  • What must be true of any two majorities?
  • They must overlap
  • This will be the linchpin of the proof
• Question: can a node that is missing a committed entry be elected?
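The overlap claim can be checked by brute force for small clusters. The sketch below enumerates every majority of a cluster and verifies that each pair shares at least one server; `majorities` and `all_overlap` are illustrative helpers.

```python
from itertools import combinations
from typing import List, Sequence, Set


def majorities(servers: Sequence[str]) -> List[Set[str]]:
    """Every subset containing more than half of the servers."""
    need = len(servers) // 2 + 1
    return [set(group)
            for size in range(need, len(servers) + 1)
            for group in combinations(servers, size)]


def all_overlap(servers: Sequence[str]) -> bool:
    """True if every pair of majorities has at least one server in common."""
    quorums = majorities(servers)
    return all(q1 & q2 for q1 in quorums for q2 in quorums)


five = ["a", "b", "c", "d", "e"]
overlap_holds = all_overlap(five)

# Two groups that are not majorities can miss each other entirely.
disjoint_minorities = not ({"a", "b"} & {"c", "d"})
```

The general argument is a counting one: two groups that each contain more than half of the servers have more members in total than the cluster has servers, so at least one server belongs to both. In the proof, that shared server is the one that both accepted the committed entry and voted for the new leader.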
Proof by contradiction
• Assume that node missing a committed entry e is elected
• Entry e was committed in term T
• Leader missing committed entry e is elected in term U
• U is earliest term after T whose leader is missing e
Proof by contradiction
• Assume that node missing a committed entry e is elected
• Entry e was committed in term T
• Leader missing committed entry e is elected in term U
• U is earliest term after T whose leader is missing e
• Majorities are needed to commit and elect
 At least one node accepted e in term T and voted for leader in U
 Call this node “the voter”
Proof by contradiction
• Assume that node missing a committed entry e is elected
• Entry e was committed in term T
• Leader missing committed entry e is elected in term U
• U is earliest term after T whose leader is missing e
• Majorities are needed to commit and elect
 At least one node accepted e in term T and voted for leader in U
 Call this node “the voter”
• Could the voter have obtained entry e after voting in U?
• If so, the voter (already in term U > T) would have rejected e from the leader in term T
• And we know that the voter accepted e in term T
• So, we know that the voter obtained e before voting in term U
• We know that the voter voted for leader in U
→ Leader in U’s log must have been at least as up to date as the voter’s log
• This will give us our contradictions
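The cases examined in the rest of the proof (same log, same last term but longer, later last term) correspond to a comparison like the one below. This is a minimal sketch: representing a log as a list of entry terms and the name `at_least_as_up_to_date` are assumptions for illustration.

```python
from typing import List


def at_least_as_up_to_date(candidate_terms: List[int],
                           voter_terms: List[int]) -> bool:
    """Each argument lists the term of every entry in a log, in index order.
    Return True if the candidate's log is at least as up to date as the
    voter's, i.e. the voter may grant its vote on this criterion."""
    cand_last = candidate_terms[-1] if candidate_terms else 0
    voter_last = voter_terms[-1] if voter_terms else 0
    if cand_last != voter_last:
        return cand_last > voter_last          # later last term wins
    # Same last term: the log that is at least as long wins.
    return len(candidate_terms) >= len(voter_terms)
```

Note that a shorter log can still win if its last term is later, which is why the proof has to treat that case separately.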
Proof by contradiction
• Leader in U’s log must have been as up to date as the voter’s log
• If leader’s log was the same as the voter’s, then what must be true?
• Leader’s log must have contained committed entry e
• This contradicts our assumption, so leader’s log must ≠ voter’s
• What if leader’s log has same last term as voter’s?
• Then leader’s log must have been longer than the voter’s
• If longer with same last term, then there must be an entry from last term in common
• Log Matching Property: entries with same index and term → all preceding are same
• Thus, the leader’s log must contain entry e
• This contradicts our assumption, so leader’s last log term > voter’s last log term
What if leader’s last log term is later than voter’s last log term?
[Figure: logs of Voter, Leader(L), and Leader(U) over log indexes 1 to 12. Entry e (term T) is at index 6 in Voter’s log.]
Last term in Leader(U)’s log is L, and L > the last term in Voter’s log
What does this tell us about the relationship of L and T?
L > T, since Voter has entry e from T
Do we know whether the leader for L, Leader(L), has e?
Yes, it must: T < L < U, and we assumed that Leader(U) was the first leader w/o e
Do we know whether Leader(L) has the last entry in Leader(U)’s log?
Yes, it must, since updates only flow from leaders to followers
Entry e: term T at index 6 in Voter’s and Leader(L)’s log
Finally, apply the Log Matching Property to infer that Leader(U) has e.
Since Leader(L) and Leader(U) both have the last entry in Leader(U)’s log
at the same index and term, all preceding entries are identical.
And since Leader(L) has e and (T < L), Leader(U) must have e.
Entry e: term T at index 6 in Voter’s, Leader(L)’s, and Leader(U)’s log ✓
• What if leader’s last log term is later than voter’s last log term?
• Last term in leader’s log is L, and L > T since voter’s log contains entry e
• Leader in L must have had e, since leader in U is first to not have it
• By Log Matching Property, leader in U’s log must also contain e
→ Nodes missing committed entries cannot be elected
→ Committed entries are always preserved across terms/elections
One final issue
• Cluster membership changes
• Membership is part of the cluster’s configuration
• What is the easy way to handle this?
• Take the cluster down
• Change configuration on all machines
• Restart the cluster
• Why isn’t this ideal?
• Would rather not take the availability hit
• Manually editing configuration can be dangerous
Should never have two leaders
• Problem: cannot guarantee when change happens
What makes this period dangerous?
[Figure: timeline of servers (a) through (e) switching from the old to the new configuration at different times; during the switch, disjoint majorities are possible]
How can you have disjoint majorities?
• Old majority (a,b) and new majority (c,d,e) share no server
• Each group could elect its own leader
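The danger can be made concrete with a small check. The slide does not state the two memberships, so this sketch assumes the old configuration is {a, b, c} and the new one is {a, b, c, d, e}; the helper name `is_majority` is also hypothetical.

```python
def is_majority(voters, config) -> bool:
    """True if voters contains more than half of the servers in config."""
    return len(set(voters) & set(config)) > len(config) // 2


old_config = {"a", "b", "c"}                 # assumed old membership
new_config = {"a", "b", "c", "d", "e"}       # assumed new membership

old_group = {"a", "b"}        # servers still using the old configuration
new_group = {"c", "d", "e"}   # servers already using the new configuration

# Each group is a majority under the configuration it believes in...
both_can_elect = (is_majority(old_group, old_config)
                  and is_majority(new_group, new_config))
# ...yet the two groups have no server in common.
disjoint = old_group.isdisjoint(new_group)
```

Under these assumptions both `both_can_elect` and `disjoint` hold, so nothing forces the two elections to agree.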
Raft’s approach
• Change configuration in two phases
• First, “joint consensus”
• Second, transition to new configuration
• Joint consensus rules
• Log entries replicated to all servers in both configs
• Any server, from either config, may serve as leader
• Agreement requires majorities from both configs
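The agreement rule for joint consensus can be sketched as a quorum check over both configurations. The function names and the example memberships used afterwards are assumptions for illustration.

```python
def is_majority(voters, config) -> bool:
    """True if voters contains more than half of the servers in config."""
    return len(set(voters) & set(config)) > len(config) // 2


def joint_agreement(voters, old_config, new_config) -> bool:
    """Under joint consensus, agreement (for commits and for elections)
    needs a majority of the old configuration AND a majority of the new."""
    return is_majority(voters, old_config) and is_majority(voters, new_config)
```

With a hypothetical old configuration {a, b, c} and new configuration {a, b, c, d, e}, the group {c, d, e} is rejected because it lacks a majority of the old configuration, while {a, b, d} is accepted.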
Configuration change
• Leaders initiate configuration change from old to new
• They send out a special log entry, C_old,new, describing both
• As soon as a server logs C_old,new, it plays by joint consensus rules
• For C_old,new to commit, it must reach a majority of C_old and a majority of C_new
• If the leader fails, the new leader could be using C_old or C_old,new
• Important thing is that the members of C_new cannot act unilaterally
• In other words, a majority of C_old is still required for commits and elections
• Eventually, C_old,new will commit (we may have to try several times)
• If C_old,new commits, what must be true of future leaders?
• By Leader Completeness, they must be running the C_old,new config
Configuration change
• Once C_old,new commits, we’re in good shape
• Leader can now try to commit C_new
• Leader logs and uses C_new immediately
• i.e., only requires a majority of C_new to commit entries
• Nasty corner case: what if the leader isn’t in C_new?
(i.e., it is managing a cluster of which it is not a member)
• It’s fine! Majorities don’t have to include the leader
• The leader continues to count votes and push updates
• When and how should the leader exit?
• Wait until C_new commits, then shut down
• C_new will elect a new leader
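The corner case of a leader outside the new configuration can be sketched as a commit check that counts only members of that configuration. The server names and the `committed_under` helper are hypothetical.

```python
def committed_under(acks, config) -> bool:
    """True if acks (servers that have stored the entry) include a majority
    of config. Servers outside config, such as a departing leader, are
    ignored when counting."""
    return len(set(acks) & set(config)) > len(config) // 2


c_new = {"b", "c", "d"}     # hypothetical new configuration
leader = "a"                # hypothetical leader that is not in c_new

# The leader's own copy does not help: one member of c_new is not enough.
only_leader_and_b = committed_under({leader, "b"}, c_new)
# Two of the three members of c_new form a majority without the leader.
b_and_c = committed_under({"b", "c"}, c_new)
```

Here `only_leader_and_b` is false and `b_and_c` is true, matching the point that majorities do not have to include the leader.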
Midterm exam
• Logistics
• Wednesday, February 22nd during lecture
• In-class, closed-note
• Ali will proctor
• Topics
• Concurrency and virtual memory
• Single-host: THE, Multics, and UNIX
• Distributed: RPC/MAUI, Grapevine, VAXClusters, and Raft
• Anything from reading/lecture is fair game
• Format
• Multi-part, short answer