TaxDC @ ASPLOS '16 (4/5/16)
Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi

Why distributed concurrency bugs?
- More and more people develop distributed systems, and distributed systems are hard largely because of concurrency.
- Concurrency leads to unexpected timings: X should arrive before Y, but X can arrive after Y.
- Unexpected timings lead to distributed concurrency (DC) bugs.

"... be able to reason about the correctness of increasingly more complex distributed systems that are used in production" - Azure engineers & managers, in Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!) [FAST '16]. Understanding distributed system bugs is important!

What is a DC bug?
- A bug caused by the non-deterministic timing of concurrent events involving more than one node: messages, crashes, reboots, timeouts, computations.
- Contrast with LC bugs: local concurrency bugs in multi-threaded single-machine software (the subject of a top-10 most cited ASPLOS paper).

The study
- 104 DC bugs from 4 varied distributed systems, reported in 2011-2014.
- For each bug, we studied the description, the source code, and the patches.

The TaxDC taxonomy
- Trigger
  - Timing: order violation, atomicity violation, fault timing, reboot timing
  - Input: fault, reboot, workload, protocols
- Error & Failure
  - Scope: nodes
  - Error: local memory, local semantic, local hang, local silence; global wrong, global missing, global silence
  - Failure: downtime, data loss, operation failure, performance
- Fix
  - Timing: global synchronization, local synchronization
  - Handling: retry, ignore, accept, others

Example: ZooKeeper-1264
1. Follower F crashes, reboots, and joins the cluster.
2. Leader L syncs a snapshot with F.
3. A client requests a new update; F applies it only in memory.
4. The sync finishes.
5. A client requests another update; F writes it to disk correctly.
6. F crashes, reboots, and joins the cluster again.
7. This time L sends only the diff after the update in step 5.
8. F loses the update from step 3.
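The ZooKeeper-1264 scenario above can be sketched as a toy simulation. This is not the actual ZooKeeper code; the `Follower` class and its methods are invented here to show why an update applied only in memory during a snapshot sync is lost when the follower crashes and recovers from durable state:

```python
# Toy model (not real ZooKeeper code) of the ZooKeeper-1264 timing:
# an update applied only in memory during a snapshot sync is lost
# when the follower crashes and reboots from disk.

class Follower:
    def __init__(self):
        self.mem = []    # volatile, in-memory state
        self.disk = []   # durable state that survives crashes

    def apply_update(self, value, sync_in_progress):
        self.mem.append(value)
        if not sync_in_progress:
            # Buggy behavior: durability is skipped during the sync window.
            self.disk.append(value)

    def crash_and_reboot(self):
        self.mem = list(self.disk)   # volatile state is lost on crash

f = Follower()
f.apply_update("u1", sync_in_progress=True)   # step 3: memory only
f.apply_update("u2", sync_in_progress=False)  # step 5: written to disk
f.crash_and_reboot()                          # step 6: crash + reboot
print(f.mem)  # ['u2'] -- the step-3 update "u1" is gone
```

The same run without the crash (or without the sync window) is correct, which is exactly why this bug hides until the fault lands at that specific moment.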
ZooKeeper-1264 in the taxonomy
- Input: 4 protocols, 2 faults, 2 reboots
- Timing: atomicity violation + fault timing
- Error: global
- Failure: data inconsistency
- Fix: delay the message

Trigger: the conditions that make bugs happen.

Trigger > Timing
- What: the untimely moment that makes the bug happen.
- Why: help design bug detection tools.

Trigger > Timing > Message ("Does the timing involve messages?")
- Order violation (44%): two events X and Y; Y must happen after X, but Y happens before X.
  - Message-message race: two messages race with each other. Three flavors:
    - Receive-receive race (e.g., MapReduce-3274): two incoming messages, such as a "kill" and a "submit", race to be received by the same node.
    - Receive-send race (e.g., HBase-5780): an incoming message races with an outgoing send, e.g. a new key arrives late while the node still sends with the old, already-expired key.
    - Send-send race (e.g., MapReduce-5358): two outgoing messages race, e.g. a "kill" arrives before the job's "end report", leaving the receiver to ask: kill what job?
  - Message-compute race (e.g., MapReduce-4157): a message races with a local computation.
- Atomicity violation (20%): a message arrives in the middle of an atomic operation. Ex: Cassandra-1011, HBase-4729, MapReduce-5009, ZooKeeper-1496.
- Fault timing (21%): a fault (e.g., a crash) at a specific moment. Fault timing has no analogue in LC bugs; it appears only in DC bugs. Ex: Cassandra-6415, HBase-5806, MapReduce-3858, ZooKeeper-1653.
- Reboot timing (11%): a reboot at a specific moment. Ex: Cassandra-2083, Hadoop-3186, MapReduce-5489, ZooKeeper-975.
- Mix (4%): e.g., ZooKeeper-1264 combines an atomicity violation (the step-3 update arrives in the middle of the snapshot sync) with fault timing (the step-6 crash), producing the failure.

Implication: these simple timing patterns (message timing, fault timing, reboot timing) can inform pattern-based bug detection tools.

Trigger > Input
- What: the input needed to exercise the buggy code.
- Why: improve testing coverage.
- In ZooKeeper-1264, the input includes the faults and reboots: 2 crashes and 2 reboots.

Trigger > Input > Fault
- "How many bugs require fault injection?" 37% = no fault; 63% = yes.
- "What kind of fault, and how many times?" 88% = no timeout, 12% = timeout; 53% = no crash, 35% = 1 crash, 12% = 2+ crashes.
- Real-world DC bugs are NOT just about message re-ordering, but about faults as well.

Trigger > Input > Reboot
- "How many reboots?" 73% = no reboot; 20% = 1 reboot; 7% = 2+ reboots.

Trigger > Input > Workload
- Example: a Cassandra Paxos bug (Cassandra-6023) requires 3 concurrent user requests!
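The fault statistics above imply that a testing tool must control not just whether a crash happens, but when. Here is a minimal sketch of a timing-aware fault-injection harness; the two-step `Store` protocol and all names are invented for illustration:

```python
# Sketch of timing-aware fault injection: rerun a small two-step
# protocol, injecting a crash after each step, and check durability
# after recovery. The crash *timing* is the variable under test.

class Store:
    def __init__(self):
        self.log = []    # write-ahead log (durable)
        self.data = {}   # applied state

    def put(self, k, v, crash_after=None):
        steps = [
            lambda: self.log.append((k, v)),      # step 0: log the update
            lambda: self.data.__setitem__(k, v),  # step 1: apply the update
        ]
        for i, step in enumerate(steps):
            step()
            if crash_after == i:
                raise RuntimeError("injected crash")

    def recover(self):
        # Replay the log so a crash between steps is harmless.
        for k, v in self.log:
            self.data[k] = v

def check(crash_after):
    s = Store()
    try:
        s.put("x", 1, crash_after=crash_after)
    except RuntimeError:
        s.recover()
    return s.data.get("x") == 1

# Systematically explore every crash timing, as a DC testing tool would:
results = [check(i) for i in (None, 0, 1)]
print(results)  # [True, True, True] when recovery handles every timing
```

Dropping the write-ahead log from this sketch makes `check(0)` fail, which is the kind of timing-specific bug that ordinary (fault-free) tests never reach.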
Trigger > Input > Protocols
- "How many protocols to run as input?" 20% = 1 protocol; 80% = 2+ protocols.
- Implication: DC testing should run multiple protocols together.

Error & Failure
- What: the first effect of the untimely ordering.
- Why: help failure diagnosis and bug detection.

Error > Local (46%)
- The error can be observed in the one triggering node: null pointer, false assertion, etc.
- Implication: identifies opportunities for failure diagnosis and bug detection.

Error > Global (54%)
- The error cannot be observed in any single node.
- Many are silent errors and hard to diagnose: hidden errors, no error messages, long debugging.

Fix
- What: how developers fix the bugs.
- Why: help design runtime prevention and automatic patch generation.
- Are patches complicated? Are patches adding synchronization?

Fix > Complex
- Add new states & transitions.
- Add global synchronization, similar to fixing LC bugs by adding synchronization, e.g. lock().

Fix > Simple
- Delay the untimely message.
- Ignore/discard the untimely message.
- Retry it later.
- Accept it, e.g. by reordering the handlers: f(msg); g(msg);
- 40% of the bugs are easy to fix (no new computation logic).
- Implication: many fixes can inform automatic runtime prevention.

DC bugs vs. LC bugs: research directions
- Distributed system model checking, formal verification, DC bug detection, runtime failure prevention.

Distributed system model checking
- State of the art: Modist [NSDI '11], Demeter [SOSP '11], MaceMC [NSDI '07], SAMC [OSDI '14].
- Events in reality: messages, crashes (multiple), reboots (multiple), timeouts, computations, disk faults.
- Let's find out how to re-order all these events without exploding the state space!

Formal verification
- State of the art: Verdi [PLDI '15] (Raft update, ~6,000 lines of proof); IronFleet [SOSP '15] (Paxos update, lease-based read/write, ~5,000-10,000 lines of proof).
- Challenge: these verify only foreground protocols, but in our study bugs involve foreground protocols (19%), background protocols (52%), or a mix (29%), and 80% of bugs involve 2+ protocols (20% = 1).
- Let's find out how to better verify more protocol interactions!

DC bug detection
- State of the art in LC bug detection: pattern-based detection, error-based detection, statistical bug detection.
- Opportunities: pattern-based detection on message, fault, and reboot timing patterns; error-based detection covers the 53% of bugs with explicit errors (47% = silent).
- Let's leverage these timing patterns to do DC bug detection!

Runtime prevention
- State of the art in LC bug prevention: Deadlock Immunity [OSDI '08], Aviso [ASPLOS '13], ConAir [ASPLOS '13], etc.
- Opportunity: 60% of DC fixes are complex, but 40% are simple.
- Let's build runtime prevention techniques that leverage this simplicity!

Conclusion
- "Why seriously address DC bugs now?" Everything is distributed and large-scale, and DC bugs are not uncommon.
- "Why is tackling DC bugs possible now?" 1. Open access to source code. 2. Pervasive documentation. 3. Detailed bug descriptions.

http://ucare.cs.uchicago.edu
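As a closing illustration, the "delay" fix pattern from the Fix section can be sketched in a few lines. The `TaskManager` class and message kinds are hypothetical, not taken from any of the studied systems; the point is only the pattern: when a message arrives before its precondition holds, requeue it instead of failing.

```python
# Sketch of the "delay" fix pattern: a STATUS message that races ahead
# of its REGISTER message is requeued until registration has happened.

from collections import deque

class TaskManager:
    def __init__(self):
        self.tasks = {}
        self.delayed = deque()

    def on_message(self, kind, task_id):
        if kind == "REGISTER":
            self.tasks[task_id] = "registered"
            self._drain()  # retry any messages we delayed earlier
        elif kind == "STATUS":
            if task_id not in self.tasks:
                # Order violation: STATUS arrived first. Delay, don't crash.
                self.delayed.append((kind, task_id))
            else:
                self.tasks[task_id] = "running"

    def _drain(self):
        for _ in range(len(self.delayed)):
            self.on_message(*self.delayed.popleft())

tm = TaskManager()
tm.on_message("STATUS", 7)    # arrives before REGISTER: the bad timing
tm.on_message("REGISTER", 7)  # registration unblocks the delayed message
print(tm.tasks[7])  # running
```

Without the `delayed` queue, the early STATUS would hit a missing-task error, which is the local-error symptom described in the Error section; the delay fix adds no new computation logic, matching the 40% "simple fix" finding.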