TaxDC - UCARE

Tanakorn Leesatapornwongsa,
Jeffrey F. Lukman, Shan Lu, Haryadi S.
Gunawi
Tanakorn Leesatapornwongsa,
Jeffrey F. Lukman, Shan Lu, Haryadi S.
Gunawi
TaxDC @ ASPLOS ‘16
 More
people develop distributed systems
 Distributed
 Hard
4/5/16
systems are hard
largely because of concurrency
 Concurrency
leads to unexpected timings
 X should arrive before Y, but X can arrive after Y
 Unexpected
timings lead to
distributed concurrency (DC) bugs
4
TaxDC @ ASPLOS ‘16
4/5/16
“… be able to reason about
the correctness of
increasingly more complex distributed
systems
that are used in production”
– Azure engineers & managers
Uncovering Bugs in Distributed Storage Systems during Testing
(Not in Production!) [FAST ‘16]
Understanding distributed system bugs is
important!
5
TaxDC @ ASPLOS ‘16
 Bugs
4/5/16
caused by non-deterministic timing
 Non-deterministic
timing of concurrent
events involving more than one node
 Messages, crashes, reboots, timeouts,
computations
6
TaxDC @ ASPLOS ‘16
(LC bug: multi-threaded single machine software)
Top 10 most cited ASPLOS paper
4/5/16
7
TaxDC @ ASPLOS ‘16
 104
4
4/5/16
bugs
varied distributed systems
 Bugs
 Study
in 2011-2014
description, source code, patches
8
TaxDC @ ASPLOS’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Input
Input
Error & Failure
Scope
Fault
Nodes
Atomicity
Violation
Reboot
Messag
es
Fault
Timing
Workload
Protocols
Reboot
Timing
Error
Error
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Glob
Miss
Glob
Silence
Failure
Failure
Fix
Timin
Timing
g
Handlin
g
Downtime
Global
Sync
Retry
Data
Loss
Local
Sync
Ignore
Op Fail
Accept
Performanc
e
Others
9
TaxDC @ ASPLOS ‘16
ZooKeeper-1264
1. Follower F crashes, reboots,
and joins cluster
2. Leader L sync snapshot with
F
3. Client requests new update, F
applies this only in memory
4. Sync finishes
5. Client requests other update,
F writes this to disk correctly
6. F crashes, reboots, and joins
cluster again
7. This time L sends only diff
after update in step 5.
8. F loses update in step 3.
4/5/16
10
TaxDC @ ASPLOS ‘16
4/5/16
ZooKeeper-1264
Input:
- 4 Protocols
- 2 faults
- 2 reboots
Fix:
Delay msg.
1. Follower F crashes, reboots,
and joins cluster
2. Leader L sync snapshot with
F
3. Client requests new update, F
applies this only in memory
4. Sync finishes
5. Client requests other update,
F writes this to disk correctly
6. F crashes, reboots, and joins
cluster again
7. This time L sends only diff
after update in step 5.
8. F loses update in step 3.
Timing:
- Atomicity
violation
- Fault Timing
Error:
- Global
Failure:
Data
inconsistenc
y
11
TaxDC @ ASPLOS’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Input
Input
Error & Failure
Scope
Fault
Nodes
Atomicity
Violation
Reboot
Messag
es
Fault
Timing
Workload
Protocols
Reboot
Timing
Error
Error
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Glob
Miss
Glob
Silence
Failure
Failure
Fix
Timin
Timing
g
Handlin
g
Downtime
Global
Sync
Retry
Data
Loss
Local
Sync
Ignore
Op Fail
Accept
Performanc
e
Others
12
TaxDC @ ASPLOS’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Input
Input
Error & Failure
Scope
Fault
Nodes
Atomicity
Violation
Reboot
Messag
es
Fault
Timing
Workload
Protocols
Reboot
Timing
Error
Error
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Failure
Failure
Fix
Timin
Timing
g
Downtime
Global
Sync
Retry
Data
Loss
Local
Sync
Ignore
Conditions that
make
Op Fail
bugs happen
Performanc
Glob
Miss
Glob
Silence
Handlin
g
e
Accept
Others
13
TaxDC @ ASPLOS’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Input
Input
Fault
Atomicity
Violation
Reboot
Fault
Timing
Workload
Reboot
Timing
Error & Failure
Error
Error
Scope
Nodes
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Failure
Failure
Fix
Timin
Timing
g
Downtime
Global
Sync
Messag UntimelyData
Local
What:
moment
Loss
es
Sync
that makes bugOphappens
Fail
Protocols
Why: Help Performanc
design
e
bug detection tools
Glob
Miss
Glob
Silence
Handlin
g
Retry
Ignore
Accept
Others
14
TaxDC @ ASPLOS ‘16
4/5/16
15
Trigger
Timing
Message
“Does the timing
involve many messages?”
Ex: MapReduce-3274
TaxDC @ ASPLOS ‘16
4/5/16
16
Trigger
Timing
Message
Order violation (44%)
“Does the timing
involve many messages?”
2 events, X and Y
Y must happen after X
But Y happens before X
Ex: MapReduce-3274
TaxDC @ ASPLOS ‘16
Trigger
4/5/16
Kill
Submit
Kill
Submit
17
Timing
Message
Order violation (44%)
Msg-msg race
“Does the timing
involve many messages?”
2 events, X and Y
Y must happen after X
But Y happens before X
Ex: MapReduce-3274
TaxDC @ ASPLOS ‘16
18
4/5/16
Trigger
A
Timing
Message
Order violation (44%)
Msg-msg race
A
B
Kill
Ne
w
key
Ne
w
End
report
A
Ne
w
key
(late
)
Receivereceive
race
MapReduce3274
B
Old
key
B
Expired!
Receivesend
HBaserace
5780
A
Kill
B
End
report
Kill
what
job?
Send-send
race
MapReduce5358
TaxDC @ ASPLOS ‘16
4/5/16
19
Trigger
Timing
Message
Order violation (44%)
Msg-msg race
Msg-compute race
Order violation:
2 events, X and Y
Y must happen after X
But Y happens before X
cmp
cmp
Ex: MapReduce-4157
TaxDC @ ASPLOS ‘16
Trigger
20
4/5/16
A
B
A
B
Timing
Message
Order violation (44%)
Atomicity
violation (20%)
A
B
A message comes in
the middle of atomic operation
Ex: Cassandra-1011, Hbase-4729, MapReduce-5009, Zookeeper-1496
TaxDC @ ASPLOS ‘16
Trigger
21
4/5/16
A
A
B
B
C
Timing
Message
Fault (21%)
Fault at specific timing
A
B
C
No fault timing in LC bugs
Only in DC bugs
Ex: Cassandra-6415, Hbase-5806, MapReduce-3858, Zookeeper-1653
TaxDC @ ASPLOS ‘16
Trigger
4/5/16
A
B
A
B
22
Timing
Message
Fault
Reboot (11%)
Reboot at specific timing
Ex: Cassandra-2083, Hadoop-3186, MapReduce-5489, Zookeeper-975
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Timing
Message
Fault
Reboot
Mix (4%)
ZooKeeper-1264
1. Follower F crashes, reboots,
and joins cluster
2. Leader L sync snapshot with
F
3. Client requests new update, F
applies this only in memory
(in the middle of sync snapshot)
4. Sync finishes
5. Client requests other update,
F writes this to disk correctly
6. F crashes, reboots, and joins
cluster again
7. This time L sends only diff
after update in step 5.
8. F loses update in step 3.
Atomicity
violation
Fault timing
Failure
23
TaxDC @ ASPLOS ‘16
Trigger
Timing
4/5/16
24
Implication: simple patterns can inform
pattern-based bug detection tools, etc.
cmp
cmp
Message timing
Fault timing
Reboot timing
TaxDC @ ASPLOS’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Input
Input
Error & Failure
Error
Error
Scope
Fault
Nodes
Atomicity
Violation
Reboot
Messag
es
Fault
Timing
Workload
Protocols
Reboot
Timing
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Failure
Failure
Fix
Timin
Timing
g
Downtime
Global
Sync
Data
Local
What: Input
Lossto
Sync
exercise buggy
code
Op Fail
Performanc
Why: Improve
e
testing coverage
Glob
Miss
Glob
Silence
Handlin
g
Retry
Ignore
Accept
Others
25
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Timing
Input
ZooKeeper-1264
1. Follower F crashes, reboots,
and joins cluster
2. Leader L sync snapshot with
F
3. Client requests new update, F
applies this update only in
memory
4. Sync finishes
5. Client requests other update,
F writes this to disk correctly
6. F crashes, reboots, and joins
cluster again
7. This time L sends only diff
after update in step 5.
8. F loses update in step 3.
Fault & reboot
2 crashes
2 reboots
26
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Timing
Input
Fault
“How many bugs require fault
injection?”
37% = No fault
63% = Yes
“What kind of fault? & How many
times?”
88% = No timeout
53% = No crash
35% = 1 crash
12%
12%
Real-world DC bugs are NOT just
about
message re-ordering, but faults as well
27
TaxDC @ ASPLOS ‘16
28
4/5/16
Trigger
Timing
Input
Fault
Reboot
“How many reboots?”
73% = No reboot
20% = 1
7%
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Cassandra Paxos bug
Timing
Input
Fault
Reboot
Workload
29
(Cassandra-6023)
3 concurrent
user requests!
n
m
“How many protocols to run
as input?”
20% = 1
p
o
r
q
80% = 2+ protocols
Implication: multiple protocols
for DC testing
TaxDC @ ASPLOS ’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Error & Failure
Input
Input
Scope
Fault
Nodes
Atomicity
Violation
Reboot
Messag
es
Fault
Timing
Workload
Protocols
Reboot
Timing
Error
Error
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Failure
Failure
Fix
Timin
Timing
g
Handlin
g
Downtime
Global
Sync
Retry
Data
Loss
Local
Sync
Ignore
Op Fail of
What: First effect
Accept
Performanc
untimely ordering
e
Others
Why: Help
failure diagnosis
Glob
and Miss
bug detection
Glob
Silence
30
TaxDC @ ASPLOS ‘16
4/5/16
31
Trigger
Error
Local
Error can be observed
in one triggering node
(46%)
Null pointer,
false assertion,
etc.
Implication: identify opportunities for
failure diagnosis and bug detection
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Error
Local
Global
Error cannot be
observed in one node
(54%)
??
Many are silent errors and hard to diagnose
(hidden errors, no error messages, long
debugging)
32
TaxDC @ ASPLOS’16
4/5/16
TaxDC
Trigger
Timin
Timin
gg
Order
Violation
Input
Input
Error & Failure
Scope
Fault
Nodes
Atomicity
Violation
Reboot
Messag
es
Fault
Timing
Workload
Protocols
Reboot
Timing
Error
Error
Loc
Mem
Loc
Sem
Loc
Hang
Loc
Silence
Glob
Wrong
Failure
Failure
Fix
Timin
Timing
g
Handlin
g
Downtime
Global
Sync
Retry
Data
Loss
Local
Sync
Ignore
Op Fail
What: How developers fix bugs Accept
Performanc
e
Others
Why: Help design runtime prevention
Glob
and automatic
patch generation
Miss
Glob
Silence
33
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Error
Fix
Comple
x
Add new
states
& transitions
Are patches
complicated?
Are patches adding
synch.?
Similar to fixing LC bugs:
add synchronization
e.g. lock()
Add
Global
Synchro
nization
34
TaxDC @ ASPLOS ‘16
Trigger
Error
Fix
Comple
x
Simple
Delay
4/5/16
35
TaxDC @ ASPLOS ‘16
Trigger
Error
Fix
Comple
x
Simple
Delay
Ignore/discard
4/5/16
36
TaxDC @ ASPLOS ‘16
Trigger
Error
Fix
Comple
x
Simple
Delay
Ignore/Discard
Retry
4/5/16
37
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Error
Fix
Comple
x
Simple
Delay
Ignore/Discard
Retry
Accept
f(msg);
g(msg);
38
TaxDC @ ASPLOS ‘16
4/5/16
39
Trigger
Error
Fix
Comple
x
Simple
Delay
Ignore
40% are easy to fix
(no new computation logic)
Implication:
many fixes can inform
automatic runtime
prevention
f(msg);
g(msg);
Retry
Accept
TaxDC @ ASPLOS ‘16
4/5/16
Trigger
Error
Fix
Comple
x
Simple
Fix
Sync.
Delay
Ignore/Discard
Retry
Accept
DC
bugs
vs.
LC
bugs
40
TaxDC @ ASPLOS ‘16
 Distributed
 Formal
 DC
system model checker
verification
bug detection
 Runtime
failure prevention
4/5/16
41
TaxDC @ ASPLOS ‘16
Event
4/5/16
Modist
Demeter
MaceMC
SAMC
NSDI’11
SOSP’11
NSDI’07
OSDI’14
Reality
Message
Crash
Let’s
Multiple find out how to re-order all events
crashes
without exploding the state space!
Reboot
Multiple
reboots
Timeout
Computation
Disk fault
42
TaxDC @ ASPLOS ‘16
 State-of-the-art
 Verdi [PLDI ‘15]
-
Raft update
~ 6,000 lines of proof
4/5/16
 Challenges
Foreground &
29% =
52% = BG
19%=FG Background
Let’s
find
out
how
to
better
verify
 IronFleet [SOSP ‘15]
more
protocol interactions!
- Paxos
update
-
-
Lease-based
read/write
~ 5,000 – 10,000 lines
of proof
Only verify
foreground
protocols
Mix
20%=
#Protocol
80%interactions
= 2+ Protocols
1
Foreground
& background
43
TaxDC @ ASPLOS ‘16
 State-of-the-art:
LC bug detection
4/5/16
 Opportunities:
DC bug detection?
 Pattern-based
 Pattern-based
Message timing
detection
detection
Fault timing and
Let’s
leverage
these
timing
patterns
 Error-based detection
Reboot timing
explicit
error
 Statistical
bug to do DC bug detection!
detection
53%
Error-based
detection
= Explicit
47% = Silent
44
TaxDC @ ASPLOS ‘16
 State-of-the-art:
LC bug prevention
4/5/16
 Opportunities:
DC bug prevention
 Deadlock Immunity
Fixes
[OSDI ‘08]
Let’s
build
runtime
prevention
60% = Complex
40% = Simple
 Aviso [ASPLOS ‘13]
technique
that leverage
this simplicity!
 ConAir [ASPLOS
‘13]
 Etc.
45
TaxDC @ ASPLOS ‘16
4/5/16
“Why seriously address DC bugs now?”
Everything is distributed and large-scale!
DC bugs are not uncommon!
“Why is tackling DC bugs possible now?”
1. Open access to source code
2. Pervasive documentations
3. Detailed bug descriptions
46
TaxDC @ ASPLOS ‘16
http://ucare.cs.uchicago.edu
4/5/16
47