Timeliness, Failure Detectors, and Consensus Performance
Alex Shraer
Joint work with Dr. Idit Keidar
Technion – Israel Institute of Technology
In PODC 2006

How do you survive failures and achieve high availability?
Replication

State Machine Replication
• Replicas are identical deterministic state machines
• Replicas that process operations in the same order remain consistent

Consensus
• Building block for state machine replication
• Each process has an input, should decide on an output so that:
  – Agreement: decisions are the same
  – Validity: decision is input of one process
  – Termination: eventually all correct processes decide

Basic Model
• Message passing
• Links between every pair of processes
  – do not create, duplicate or alter messages (integrity)
• Process and link failures

Keidar & Shraer, Technion, Israel PODC 2006

Synchronous Model
• Known bound Δ on message delay, processing
• Very convenient for algorithms
• Requires very conservative timeouts
  – in practice: avg. latency < max. latency / 100 [Cardwell, Savage, Anderson 2000], [Bakr, Keidar 2002]
  – Computation might be too sloooow!
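To make the three properties on the Consensus slide concrete, here is a minimal Python sketch that checks them over a finished run. The function names and the sample run are hypothetical and not from the talk. Termination is a liveness property, so on a finite run it is only approximated by checking that every correct process has decided.

```python
from typing import Dict, Hashable, Iterable


def agreement(decisions: Dict[str, Hashable]) -> bool:
    """Agreement: all decisions made so far are the same."""
    return len(set(decisions.values())) <= 1


def validity(inputs: Dict[str, Hashable], decisions: Dict[str, Hashable]) -> bool:
    """Validity: every decision is the input of some process."""
    proposed = set(inputs.values())
    return all(value in proposed for value in decisions.values())


def all_correct_decided(correct: Iterable[str], decisions: Dict[str, Hashable]) -> bool:
    """Finite-run stand-in for Termination: every correct process has decided."""
    return all(p in decisions for p in correct)


# Hypothetical run: p3 crashed before deciding, p1 and p2 decided "b".
inputs = {"p1": "a", "p2": "b", "p3": "c"}
decisions = {"p1": "b", "p2": "b"}
correct = ["p1", "p2"]

print(agreement(decisions))                      # True
print(validity(inputs, decisions))               # True
print(all_correct_decided(correct, decisions))   # True
```

A run in which p1 decides "a" and p2 decides "b" would fail `agreement`, and a decision on a value nobody proposed would fail `validity`.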

Asynchronous Model
• Unbounded message delay
• Much more practical
Fault-tolerant consensus impossible [FLP85]

Eventually Stable (Indulgent) Models
• Initially asynchronous
  – for unbounded period of time
• Eventually reach stabilization
  – GST (Global Stabilization Time)
  – following GST certain assumptions hold
• Examples
  – ES (Eventual Synchrony): starting from GST all links have a bound on message delay [Dwork, Lynch, Stockmeyer 88]
  – failure detectors
• Example: Ω (leader) failure detector
  – Outputs one trusted process
  – From some point, all correct processes trust the same correct process
  [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]

Indulgent Models: Research Trend
• Weaken post-GST assumptions as much as possible [Guerraoui, Schiper 96], [Aguilera et al. 03, 04], [Malkhi et al. 05]
Weaker = better?
“You only need ONE machine with eventually ONE timely link. Buy the hardware to ensure it, set the timeout accordingly, and EVERYTHING WILL WORK.”

Consensus with Weak Assumptions
[Figure: network cartoon. “Why isn’t anything happening???” “Don’t worry! It will eventually happen!”]

What’s Going On?
• In practice, bounds just need to hold “long enough” for the algorithm to finish (time T_A)
• But T_A depends on our synchrony assumptions
  – with weak assumptions, T_A might be unbounded
• For practical systems, eventual completion of the job is not enough!

Our Goal
• Understand the relationship between:
  – assumptions (1 timely link, failure detectors, etc.) that eventually hold
  – performance of algorithms that exploit these assumptions, and only them
• Challenge: How do we understand the performance of asynchronous algorithms that make very different assumptions?

Typical Metric: Count “Rounds”
• Algorithms normally progress in rounds, though rounds are not synchronized among processes
    at process pi:
      forever do
        send messages
        receive messages while (!some conditions)
        compute…
• Previous work:
  – look at synchronous runs (every message takes exactly Δ time)
  – count rounds or “Δs”
  [Keidar, Rajsbaum 01], [Dutta, Guerraoui 02], [Guerraoui, Raynal 04], [Dutta et al. 03], etc.

Are All “Rounds” the Same?
• Algorithm 1 waits for messages from a majority that includes a pre-defined leader in each round
  – takes 3 rounds
• Algorithm 2 waits for messages from all (unsuspected) processes in each round
  – E.g., group membership
  – takes 2 rounds

Do All Rounds Cost the Same?
[Figure: LAN Market. Oranges $1.00, Apples $1.00]

Do All “Rounds” Cost the Same?
• On the Internet, n² timely links can be a rarity [Bakr, Keidar 02]
• Timely communication with a leader or with a majority requires timeouts that are orders of magnitude smaller
[Figure: WAN Market. Oranges $100.00, Apples $1.00]

GIRAF: General Round-based Algorithm Framework
• Inspired by Gafni’s RRFD, generalizes it
• Organize algorithms into rounds
• Separate algorithm logic from waiting condition
• Waiting condition defines model
• Allows reasoning about lower and upper bounds for rounds of different types

Defining Properties in GIRAF
• Environment can have
  – perpetual properties
  – eventual properties
• In every run r, there exists a round GSR(r)
• GSR(r): the first round from which:
  – no process fails
  – all eventual properties hold in each round

Defining Properties
• Timeliness of incoming, outgoing and bidirectional links
• Some known failure detector properties
• Use properties to clearly define models

Some Results: Context
• Consensus problem
• Global decision time metric
  – Time until all correct processes decide
• Message passing
• Crash failures
  – t < n/2 potential failures out of n > 1 processes

◊LM Model: Leader and Majority
• Nothing required before GSR
• In every round k ≥ GSR
  – Every correct process receives a round k message from a majority of processes, one of which is the Ω-leader.
• Practically requires much shorter timeouts than Eventual Synchrony [Bakr, Keidar]

◊LM: Previous Work
• Most Ω-based algorithms wait for majority in each round (not ◊LM)
• Paxos [Lamport 98] works for ◊LM
  – Takes constant number of rounds in Eventual Synchrony (ES)
  – But how many rounds without ES?

Paxos Run in ES
[Figure: Paxos run in ES. Labels: Ω leader; messages (“prepare”,2), (“prepare”,21), (Commit, 21, v1); replies “no”, “yes”; outcome: decide v1. BallotNum: number of attempts to decide initiated by leaders.]

Paxos in ◊LM (w/out ES)
[Figure: Paxos run in ◊LM without ES, rounds GSR through GSR+3. Labels: Ω leader; messages (“prepare”,2), (“prepare”,9), (“prepare”,14); replies “ok”, “no (5)”, “no (8)”, “no (13)”; BallotNum values shown per process. Callout: Commit takes O(n) rounds!]

What Can We Hope For?
• Tight lower bound for ES: 3 rounds from GSR [DGK05]
• ◊LM weaker than ES
• One might expect it to take a longer time in ◊LM than in ES

Result 1: Don't Need ES
• Leader and majority can give you the same performance!
• Algorithm that matches lower bound for ES!

Our ◊LM Algorithm in a Nutshell
• Commit with increasing ballot numbers, decide on value committed by majority
  – like Paxos, etc.
• Challenge: Don’t know all ballots, how to choose the new one to be the highest one?
• Solution: Choose it to be the round number
• Challenge: rounds are wasted if a prepare/commit fails.
• Solution: pipeline prepares and commits: try in each round
• Challenge: do they really need to say no?
• Solution: support leader’s prepare even if they have a higher ballot number
  – challenge: higher number may reflect later decision! Won’t agreement be compromised?
  – solution: new field “trustMe” ensures supported leader doesn't miss real decisions

Example Run: GSR=100
[Figure: example run over rounds GSR, GSR+1, GSR+2, with the Ω leader marked. Legend: <PREPARE, …, trustMe>; All PREPARE with !trustMe; All COMMIT; All DECIDE; “Did not lead to decision”. Per-process values as extracted: 1 8 101 101; 5 8 101 101; 8 13 101 101; 13 13 101 101; 20 20 101 101.]

Question 2: ◊S and Ω Equivalent?
• ◊S and Ω equivalent in the “classical” sense [Chandra, Hadzilacos, Toueg 96]
  – Weakest for consensus
• ◊S: eventually (from GSR onward),
  – all faulty processes are suspected by every correct process
  – there exists one correct process that is not suspected by any correct process.
• Can we substitute Ω with ◊S in ◊LM?

Result 2: ◊S and Ω not that Equivalent
• Consensus takes linear time from GSR when ◊S replaces Ω in ◊LM
• By reduction to mobile failure model [Santoro, Widmayer 89]

Result 3: Do We Need Oracles?
• Timely communication with majority suffices!
• ◊AFM (All-From-Majority) simplified:
  – In every round k ≥ GSR, every correct process p receives a round k message from a majority of processes, and p’s message reaches a majority of processes.
• Decision in 5 rounds from GSR
  – 1st constant time algorithm w/out oracle or ES
  – idea: information passes to all nodes in 2 rounds

Result 4: Can We Assume Less?
• ◊MFM: Majority from Majority
  – The rest receive a message from a minority
• Only a little is missing compared to ◊AFM
• Stronger than models in literature [Aguilera et al. 03, 04], [Malkhi et al. 05]
• Bounded time from GSR impossible!

Conclusions
• Which guarantees should one implement?
  – weaker ≠ better
    • some previously suggested assumptions are too weak
  – sometimes a little stronger = much better
    • worth longer timeouts / better hardware
  – ES is not essential
    • not worth longer timeouts / better hardware
  – future: more models, bounds to explore
    • GIRAF
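
The per-round conditions of ◊LM and ◊AFM stated in the slides can be written as predicates over who received whose message in a single round. The following Python sketch is an illustration only, not the GIRAF formalism: the names `lm_round_holds` and `afm_round_holds`, the `received` map (process to the set of senders it heard from in round k), and the choice to count a process's own message are all assumptions. The models require the condition in every round k ≥ GSR; the sketch checks one round.

```python
from typing import Dict, Set


def is_majority(group: Set[int], n: int) -> bool:
    """True if the group is a strict majority of n processes."""
    return 2 * len(group) > n


def lm_round_holds(received: Dict[int, Set[int]], correct: Set[int],
                   leader: int, n: int) -> bool:
    """◊LM round condition: every correct process hears from a majority
    of processes, one of which is the leader."""
    return all(
        is_majority(received.get(p, set()), n) and leader in received.get(p, set())
        for p in correct
    )


def afm_round_holds(received: Dict[int, Set[int]], correct: Set[int], n: int) -> bool:
    """◊AFM round condition (simplified): every correct process hears from a
    majority, and its own message reaches a majority."""
    for p in correct:
        heard_from = received.get(p, set())
        reached = {q for q, senders in received.items() if p in senders}
        if not (is_majority(heard_from, n) and is_majority(reached, n)):
            return False
    return True


# Hypothetical rounds with n = 3 processes, all correct, leader = 0.
n, correct, leader = 3, {0, 1, 2}, 0
ring = {0: {0, 1}, 1: {1, 2}, 2: {2, 0}}   # process 1 does not hear the leader
star = {0: {0, 1}, 1: {0, 1}, 2: {0, 2}}   # process 2's message reaches only itself

print(lm_round_holds(ring, correct, leader, n), afm_round_holds(ring, correct, n))  # False True
print(lm_round_holds(star, correct, leader, n), afm_round_holds(star, correct, n))  # True False
```

Keeping the waiting condition as a separate predicate mirrors the GIRAF idea that the condition, not the algorithm logic, is what defines the model.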