
Timeliness, Failure Detectors, and Consensus Performance
Alex Shraer
Joint work with Dr. Idit Keidar
Technion – Israel Institute of Technology
In PODC 2006
How do you survive failures and
achieve high availability?
Replication
State Machine Replication
• Replicas are identical deterministic state
machines
• Process operations in the same order, so replicas
remain consistent
[Figure: replicas processing operations a, b, c]
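A minimal sketch of why order matters, assuming a hypothetical `Replica` class whose state is simply the concatenation of the operations it has applied (not taken from the talk):

```python
class Replica:
    """A deterministic state machine: its state depends only on the
    operations applied and the order in which they were applied."""

    def __init__(self):
        self.state = ""  # toy state: concatenation of applied operations

    def apply(self, op):
        self.state += op


def run(ops):
    """Apply ops in the given order to a fresh replica and return its state."""
    replica = Replica()
    for op in ops:
        replica.apply(op)
    return replica.state


same_order_1 = run(["a", "b", "c"])
same_order_2 = run(["a", "b", "c"])
different_order = run(["b", "a", "c"])

print(same_order_1 == same_order_2)     # True: same order, same state
print(same_order_1 == different_order)  # False: order differs, states diverge
```

Agreeing on that common order is the job of the consensus building block.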
Consensus
• Building block for state machine replication
• Each process has an input, should decide on an
output so that–
Agreement: decisions are the same
Validity: decision is input of one process
Termination: eventually all correct processes decide
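The three properties can be phrased as checks over a finished run. This is an illustrative sketch only: `check_consensus` and its arguments are hypothetical names, and Termination is an eventual property, so a finite snapshot can only approximate it.

```python
def check_consensus(inputs, decisions, correct):
    """inputs:    dict process -> input value
    decisions: dict process -> decided value (absent if it has not decided)
    correct:   set of processes that never fail
    """
    decided = list(decisions.values())
    return {
        # Agreement: decisions are the same
        "agreement": len(set(decided)) <= 1,
        # Validity: every decision is the input of some process
        "validity": all(v in inputs.values() for v in decided),
        # Termination: every correct process has decided (by the end of the run)
        "termination": all(p in decisions for p in correct),
    }


inputs = {"p1": 3, "p2": 7, "p3": 7}
everyone = {"p1", "p2", "p3"}

good = check_consensus(inputs, {"p1": 7, "p2": 7, "p3": 7}, everyone)
bad = check_consensus(inputs, {"p1": 3, "p2": 7}, everyone)
print(good)
print(bad)
```

In the second run p1 and p2 decide differently and p3 never decides, so only Validity holds.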
Basic Model
• Message passing
• Links between every pair of processes
– do not create, duplicate or alter messages (integrity)
• Process and link failures
Keidar & Shraer, Technion, Israel
PODC 2006
Synchronous Model
• Known bound Δ on message delay, processing
• Very convenient for algorithms
• Requires very conservative timeouts
– in practice: avg. latency < max. latency / 100
[Cardwell, Savage, Anderson 2000], [Bakr-Keidar 2002]
– Computation might be too sloooow!
Asynchronous Model
• Unbounded message delay
• Much more practical
• Fault-tolerant consensus is impossible [FLP85]
Eventually Stable (Indulgent) Models
• Initially asynchronous
– for unbounded period of time
• Eventually reach stabilization
– GST (Global Stabilization Time)
– following GST certain assumptions hold
• Examples
– ES (Eventual Synchrony) – starting from GST all links have a
bound on message delay
[Dwork, Lynch, Stockmeyer 88]
– failure detectors
• Example: Ω (leader) failure detector
– Outputs one trusted process
– From some point, all correct processes trust the same correct process
[Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]
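A small sketch of what the Ω guarantee looks like on a recorded trace. `omega_stabilization` and the trace layout are assumptions made for illustration; a finite trace can only approximate "from some point".

```python
def omega_stabilization(trace, correct):
    """trace:   list of dicts, one per time step, mapping each correct
                process to the process it currently trusts
    correct: set of correct processes

    Returns the earliest step from which all correct processes trust the
    same correct process until the end of the trace, or None.
    """
    if not trace:
        return None
    last = {trace[-1][p] for p in correct}
    # At the end of the trace there must be exactly one trusted process,
    # and it must itself be correct.
    if len(last) != 1 or not last <= correct:
        return None
    start = len(trace) - 1
    # Walk backwards while every correct process already trusted that leader.
    while start > 0 and {trace[start - 1][p] for p in correct} == last:
        start -= 1
    return start


trace = [
    {"p1": "p1", "p2": "p3"},  # step 0: no common leader yet
    {"p1": "p2", "p2": "p2"},  # step 1: both trust p2
    {"p1": "p2", "p2": "p2"},  # step 2: still p2
]
stable_from = omega_stabilization(trace, {"p1", "p2"})
print(stable_from)  # 1
```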
Indulgent Models: Research Trend
• Weaken post-GST assumptions as much as possible
[Guerraoui, Schiper 96], [Aguilera et al. 03, 04], [Malkhi et al. 05]
Weaker = better?
Indulgent Models: Research Trend
You only need ONE machine with eventually ONE timely link.
Buy the hardware to ensure it, set the timeout accordingly,
and EVERYTHING WILL WORK.
Consensus with Weak Assumptions
Why isn’t anything happening ???
Network
Don’t worry!
It will eventually happen!
What’s Going On?
• In practice, bounds just need to hold “long
enough” for the algorithm to finish (time T_A)
• But T_A depends on our synchrony
assumptions
– with weak assumptions, T_A might be unbounded
• For practical systems, eventual completion
of the job is not enough!
Our Goal
• Understand the relationship between:
– assumptions (1 timely link, failure detectors,
etc.) that eventually hold
– performance of algorithms that exploit these
assumptions, and only them
• Challenge: How do we understand the
performance of asynchronous algorithms
that make very different assumptions?
Typical Metric: Count “Rounds”
• Algorithms normally progress in rounds,
though rounds are not synchronized among
processes
at process p_i:
  forever do
    send messages
    receive messages while (!some conditions)
    compute…
• Previous work:
– look at synchronous runs (every message takes
exactly Δ time)
– count rounds or “Δs”
[Keidar, Rajsbaum 01], [Dutta, Guerraoui 02],
[Guerraoui, Raynal 04], [Dutta et al. 03], etc.
Are All “Rounds” the Same?
• Algorithm 1 waits for messages from a
majority that includes a pre-defined leader
in each round
– takes 3 rounds
• Algorithm 2 waits for messages from all
(unsuspected) processes in each round
– E.g., group membership
– takes 2 rounds
Do All Rounds Cost the Same?
LAN Market
Oranges $1.00
Apples $1.00
Do All “Rounds” Cost the Same?
• On the Internet, n² timely links can be a rarity [Bakr, Keidar 02]
• Timely communication with a leader or with a majority
requires timeouts that are orders of magnitude smaller
WAN Market
Oranges $100.00
Apples $1.00
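A back-of-the-envelope sketch of why counting rounds alone can mislead. The round counts (3 and 2) come from the two algorithms described earlier in the talk; the timeout values are hypothetical placeholders chosen only to be two orders of magnitude apart, not measurements.

```python
def time_to_decide_ms(rounds, round_timeout_ms):
    """Worst-case time if every round lasts a full timeout."""
    return rounds * round_timeout_ms


# Hypothetical timeouts (assumptions, not measured values):
LEADER_MAJORITY_TIMEOUT_MS = 100   # wait for a majority that includes the leader
ALL_LINKS_TIMEOUT_MS = 10_000      # wait for all (unsuspected) processes

algorithm_1_ms = time_to_decide_ms(3, LEADER_MAJORITY_TIMEOUT_MS)
algorithm_2_ms = time_to_decide_ms(2, ALL_LINKS_TIMEOUT_MS)

print(algorithm_1_ms)  # 300
print(algorithm_2_ms)  # 20000
```

Under these assumed numbers the algorithm with more rounds finishes far sooner, which is the point of pricing rounds rather than merely counting them.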
GIRAF
General Round-based Algorithm Framework
• Inspired by Gafni’s RRFD, generalizes it
• Organize algorithms into rounds
• Separate algorithm logic from
waiting condition
• Waiting condition defines model
• Allows reasoning about lower and upper
bounds for rounds of different types
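A rough sketch of the separation described above, not GIRAF's actual definition: a generic round loop takes the algorithm logic and a delivery predicate as independent arguments. The names `run_rounds`, `MinFlood`, and `delivered` are hypothetical, and the delivery predicate stands in for the waiting condition that defines the model.

```python
def run_rounds(processes, algorithm, delivered, num_rounds):
    """Generic round loop.

    algorithm: object with initial_state(p), message(state, k) and
               compute(state, k, inbox) -> new state
    delivered: function (k, sender, receiver) -> bool; decides which
               round-k messages arrive within round k
    """
    states = {p: algorithm.initial_state(p) for p in processes}
    for k in range(1, num_rounds + 1):
        outgoing = {p: algorithm.message(states[p], k) for p in processes}
        states = {
            p: algorithm.compute(
                states[p], k,
                {q: outgoing[q] for q in processes if delivered(k, q, p)},
            )
            for p in processes
        }
    return states


class MinFlood:
    """Toy algorithm logic: every process keeps the minimum value it has seen."""

    def __init__(self, inputs):
        self.inputs = inputs

    def initial_state(self, p):
        return self.inputs[p]

    def message(self, state, k):
        return state

    def compute(self, state, k, inbox):
        return min([state, *inbox.values()])


inputs = {"p1": 3, "p2": 1, "p3": 2}
procs = list(inputs)

# Same algorithm, two different "models":
timely = run_rounds(procs, MinFlood(inputs), lambda k, q, p: True, 1)
isolated = run_rounds(procs, MinFlood(inputs), lambda k, q, p: q == p, 1)
print(timely)    # {'p1': 1, 'p2': 1, 'p3': 1}
print(isolated)  # {'p1': 3, 'p2': 1, 'p3': 2}
```

Swapping only the delivery predicate changes what the same algorithm achieves in a round, which is what makes rounds of different types comparable.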
Defining Properties in GIRAF
• Environment can have
– perpetual properties
– eventual properties
• In every run r, there exists a round GSR(r)
• GSR(r) – the first round from which:
– no process fails
– all eventual properties hold in each round
Defining Properties
• Timeliness of incoming, outgoing and
bidirectional links.
• Some known failure detector properties
• Use properties to clearly define models
Some Results: Context
• Consensus problem
• Global decision time metric
– Time until all correct processes decide
• Message passing
• Crash failures
– t < n/2 potential failures out of n>1 processes
◊LM Model: Leader and Majority
• Nothing required before GSR
• In every round k ≥ GSR
– Every correct process receives a round k
message from a majority of processes, one of
which is the Ω-leader.
• Practically requires much shorter timeouts
than Eventual Synchrony [Bakr, Keidar]
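The round property can be written as a predicate over who heard from whom. This is a sketch under assumed names (`satisfies_lm_round`, `received`); it checks a single round, whereas the model requires the property in every round k ≥ GSR.

```python
def satisfies_lm_round(received, correct, n, leader):
    """received: dict process -> set of senders whose round-k message
                 it received in round k
    correct:  set of correct processes
    n:        total number of processes
    leader:   the process output by Ω
    """
    return all(
        leader in received[p] and len(received[p]) > n // 2
        for p in correct
    )


correct = {"p1", "p2", "p3"}

ok_round = {
    "p1": {"p1", "p2", "p3"},
    "p2": {"p1", "p2", "p4"},
    "p3": {"p1", "p3", "p5"},
}
bad_round = {
    "p1": {"p1", "p2", "p3"},
    "p2": {"p1", "p2", "p4"},
    "p3": {"p2", "p3", "p5"},  # a majority, but the leader's message is missing
}

lm_ok = satisfies_lm_round(ok_round, correct, n=5, leader="p1")
lm_bad = satisfies_lm_round(bad_round, correct, n=5, leader="p1")
print(lm_ok, lm_bad)  # True False
```

Note that different processes may hear from different majorities; only the leader has to be common to all of them.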
◊LM: Previous Work
• Most Ω-based algorithms wait for
majority in each round (not ◊LM)
• Paxos [Lamport 98] works for ◊LM
– Takes constant number of rounds in
Eventual Synchrony (ES)
– But how many rounds without ES?
Paxos Run in ES
[Figure: Paxos run in ES. The Ω leader sends (“prepare”, 2) and is answered “no” by processes holding higher ballot numbers (up to 20). It then sends (“prepare”, 21), is answered “yes”, sends (Commit, 21, v1), is answered “yes” again, and v1 is decided. BallotNum: number of attempts to decide initiated by leaders.]
Paxos in ◊LM (w/out ES)
[Figure: Paxos run in ◊LM without ES, rounds GSR to GSR+3. The Ω leader sends (“prepare”, 2), (“prepare”, 9) and (“prepare”, 14); the replies shown are “ok” together with “no (5)”, “no (8)” and “no (13)” from processes holding higher BallotNum values, with ballots up to 20 in the system. Commit takes O(n) rounds!]
What Can We Hope For?
• Tight lower bound for ES: 3 rounds from
GSR [DGK05]
• ◊LM weaker than ES
• One might expect it to take a longer time
in ◊LM than in ES
Result 1: Don't Need ES
• Leader and majority can give you the
same performance!
• Algorithm that matches lower bound for
ES!
Our ◊LM Algorithm in a Nutshell
• Commit with increasing ballot numbers, decide on value
committed by majority
– like Paxos, etc.
• Challenge: Don’t know all ballots; how to choose a new
one that is the highest?
• Solution: Choose it to be the round number
• Challenge: rounds are wasted if a prepare/commit fails.
• Solution: pipeline prepares and commits: try in each round
• Challenge: do processes really need to say no?
• Solution: support the leader’s prepare even if you have a higher
ballot number
– challenge: a higher number may reflect a later decision! Won’t
agreement be compromised?
– solution: new field “trustMe” ensures the supported leader doesn't miss
real decisions
Example Run: GSR=100
[Figure: example run over rounds GSR, GSR+1, GSR+2. The Ω leader sends <PREPARE, …, trustMe>. Legend: all PREPARE with !trustMe; all COMMIT; all DECIDE; did not lead to decision.
Ballot numbers per process, left to right:
1, 8, 101, 101
5, 8, 101, 101
8, 13, 101, 101
13, 13, 101, 101
20, 20, 101, 101]
Question 2: ◊S and Ω Equivalent?
• ◊S and Ω equivalent in the “classical”
sense [Chandra, Hadzilacos, Toueg 96]
– Weakest for consensus
• ◊S: eventually (from GSR onward),
– all faulty processes are suspected by every
correct process
– there exists one correct process that is not
suspected by any correct process.
• Can we substitute Ω with ◊S in ◊LM?
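The two ◊S clauses can be checked on a snapshot of suspicion lists taken after GSR. `satisfies_diamond_s` and the data layout are assumptions made for this sketch.

```python
def satisfies_diamond_s(suspects, correct, faulty):
    """suspects: dict correct process -> set of processes it suspects
    correct:  set of correct processes
    faulty:   set of faulty processes
    """
    # All faulty processes are suspected by every correct process.
    completeness = all(faulty <= suspects[p] for p in correct)
    # Some correct process is not suspected by any correct process.
    accuracy = any(
        all(q not in suspects[p] for p in correct)
        for q in correct
    )
    return completeness and accuracy


correct = {"p1", "p2", "p3"}
faulty = {"p4"}

# p2 and p3 are wrongly suspected by someone, but nobody suspects p1.
holds = satisfies_diamond_s(
    {"p1": {"p4", "p2"}, "p2": {"p4", "p3"}, "p3": {"p4"}}, correct, faulty
)
# Every correct process is suspected by some correct process.
fails = satisfies_diamond_s(
    {"p1": {"p4", "p2"}, "p2": {"p4", "p3"}, "p3": {"p4", "p1"}}, correct, faulty
)
print(holds, fails)  # True False
```

Unlike Ω, the first snapshot does not tell any process which correct process is the unsuspected one: p3 suspects neither p1 nor p2, and only p1 is unsuspected by all.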
Result 2: ◊S and Ω not that Equivalent
• Consensus takes linear time from
GSR
• By reduction to mobile failure model
[Santoro, Widmayer 89]
Result 3: Do We Need Oracles?
• Timely communication with majority
suffices!
• ◊AFM (All-From-Majority) simplified:
– In every round k ≥ GSR, every correct
process p receives round k message from a
majority of processes, and p’s message
reaches a majority of processes.
• Decision in 5 rounds from GSR
– 1st constant time algorithm w/out oracle or ES
– idea: information passes to all nodes in 2
rounds
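One plausible reading of the two-round idea is that any two majorities intersect: if p's round-k message reaches a majority, and in round k+1 every process hears from a majority, then every process hears from someone who already has p's information. The brute-force check below is a sketch of that reading only, assuming informed processes relay what they learned in their next message; it is not the algorithm from the talk.

```python
from itertools import combinations


def majorities(processes):
    """Yield every subset of processes that forms a strict majority."""
    n = len(processes)
    for size in range(n // 2 + 1, n + 1):
        for subset in combinations(processes, size):
            yield set(subset)


def info_reaches_everyone_in_two_rounds(processes):
    """Check every combination of (who got the message in round k,
    whom a process hears from in round k+1)."""
    for informed in majorities(processes):
        for heard_from in majorities(processes):
            if not (informed & heard_from):
                return False  # some process could miss the information
    return True


five = info_reaches_everyone_in_two_rounds(list(range(5)))
four = info_reaches_everyone_in_two_rounds(list(range(4)))
print(five, four)  # True True
```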
Result 4: Can We Assume Less?
• ◊MFM: Majority from Majority
– The rest receive a message from a minority
• Only a little missing for ◊AFM
• Stronger than models in literature
[Aguilera et al. 03, 04], [Malkhi et al. 05]
• Bounded time from GSR impossible!
Conclusions
• Which guarantees should one implement?
– weaker ≠ better
• some previously suggested assumptions are too
weak
– sometimes a little stronger = much better
• worth longer timeouts / better hardware
– ES is not essential
• not worth longer timeouts / better hardware
– future: more models, bounds to explore
• GIRAF