Intel-Garg07 - The University of Texas at Austin

Paradise: A Toolkit for Building Reliable Concurrent Systems
Trace Verification for Parallel Systems
Vijay K. Garg
Department of Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX 78712
email: [email protected]
Parallel and Distributed Systems Laboratory
Talk Outline
 Motivation and Overview
 Instrumentation
– Clock : Tracking Dependency
 Property Checking
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
2
Motivation: Reliable System
 Concurrent systems are prone to errors.
– Concurrency, nondeterminism, process and channel
failures
Techniques to ensure correctness
 Modeling: Model Checking and Formal Verification
 Bug Hunting: Simulation, Debugging and Verification
 Fault-Tolerance
4
Paradise Environment
Predicate
Slicer
Observe
Monitor
Program
Control
5
Talk Outline
 Motivation and Overview
 Instrumentation
– Clock : Tracking Dependency
 Property Checking
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
6
Trace Model: Total Order vs Partial Order
 Total order: interleaving of events in a trace
 Partial order: Lamport’s happened-before model
Successful Trace
¬CS1
CS2
f1
e1
¬CS2
CS1
f2
e2
Partial Order Trace
¬ CS1
Specification:
CS1 Λ CS2
Faulty Trace
CS1
P1
e1
e2
CS2

¬CS2
P2
f1
f2
¬CS1
e1
CS2
CS1
¬ CS2
f1
e2
f2
7
Tracking Dependency
computation: a set of events ordered by “happened before”
relation
 Problem: Timestamp events to answer
– e happened before f ?
– e concurrent with f ?
8
Clocks in a Distributed System
P1
P2
P3
(1,0,0)
(0,1,0)
(0,0,1)
(2,1,0)
(3,1,0)
(0,2,0)
(0,0,2)
(2,1,3)
Result: s happened before t i the vector at s is less than the
vector at t.
Vector Clocks [Fidge 89, Mattern 89]
9
Dynamic Chain Clocks
 Problem with vector clocks: scalability, dynamic process
structure
 Idea: Computing the “chains” in an online fashion [Aggarwal and
Garg PODC 05] for relevant events
a
P1
a
P2
P3
P4
b
b
c
d
c
e
d
f
h
e
f
g
h
g
A computation with 4 processes
The relevant subcomputation
10
Experimental Results
Simulation of a computation with 1% relevant events
Measured
– number of components vs number of threads
– total time overhead vs number of threads
11
Talk Outline
 Motivation and Overview
 Instrumentation
– Clock : Tracking Dependency
 Property Checking
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
12
Global Property Detection
Predicate: A global condition expressed using variables
on processes
– e.g., more than one process is in critical section, there is
no token in the system
Problem: find a global state that satisfies the given
predicate
P1
G1
G2
P2
Critical section
13
The Main Difficulty in Partial Order
Algorithm for general predicate [Cooper and Marzullo 91]
{e2, e1, f2, f1, ┴
e1
P1
e2
┴
P2
{e2, e1, f1, ┴}
T
f1
f2
{e1, f2, f1, ┴}
{e1, f1, ┴}
{e2, e1, ┴}
{e1, ┴}
{f1, ┴}
{┴}
Too many global states : A computation may contain as
many as O(kn) global states
• k: maximum number of events on a process
• n: number of processes
14
Efficient Predicate Detection for Special Cases
 stable predicate: [Chandy and Lamport 85]
• once the predicate becomes true, it stays true e.g.,
deadlock
 unstable predicate:
• observer independent predicate [Charron-Bost et al 95]
occurs in one interleaving  occurs in all interleavings
e.g., any disjunction of local predicate
• linear predicate [Chase and Garg 95]
e.g., conjunctive predicates such as there is no leader in
the system
• relational predicate: x1 + x2 +…+ xn ≥ k [Chase and Garg 95]
e.g., violation of k-mutual exclusion
15
Algorithms for Conjunctive Predicates
Centralized Algorithm [Garg and Waldecker 92]
Each non-checker process maintains its local vector and sends
to the checker process the chain clock whenever
– local predicate is true
– at most once in each message interval.
Time complexity: Checker requires at most O(n2m) comparisons.
– token based algorithm [Garg and Chase 95]
– completely distributed algorithm [Garg and Chase 95]
– keeping queues shorter [Chiou and Korfhage 95]
– avoiding control messages [Hurfin, Mizuno, Raynal, Singhal 96]
16
Other Special Classes of Predicates
 Relational Predicates
– Let xi: number of token at Pi
– Σxi < k: loss of tokens
– Algorithms: max-flow techniques [Groselj 93, Chase and Garg
95, Wu and Chen 98]
– Dilworth's partition [Tomlinson and Garg 96]
17
Talk Outline
 Motivation and Overview
 Instrumentation
– Clock : Tracking Dependency
 Property Checking
– Sensor : Detecting Global Properties
– Slicer : Computation Slicing
18
The Main Idea of Computation Slicing
state explosion
Partial order
trace
slicing
keep all red
global states
slice
19
How does Computation Slicing Help?
Partial order
trace
check b1 Λ b2
slicing for b1
satisfy b1
retain all
global states
satisfying b1
slice
check b2
20
Example
 Detect predicate (x*y + z < 5) Λ (x ≥1) Λ (z ≤3)
P1 x
P2 y
P3 z
1
2
-1
0
a
b
c
d
{a,e,f,u,v}
0
2
1
3
e
f
g
h
4
1
2
4
u
v
w
x
Computation
{b}
{w}
{g}
Slice with respect to
(x ≥1) Λ (z ≤3)
21
Computation Slice
computation slice: a sub-computation such that: [Mittal
and Garg 01]
1. it contains all global states of the computation
satisfying the given predicate, and
2. it contains the least number of global states
22
POTA Architecture
[Sen Dissertation 04]
Predicate (Specification)
Program
Analyzer
Slice
Slicer
Instrumentor
Instrumented
Program
Execute
Program
Trace
Trace
Specification
Predicate
Detector
yes/
witness
no/
counter
example
Slice
Promela
Translator
Execute
SPIN
yes
no/
counter
example
23
Results
Efficient polynomial-time algorithms for computing the slice for:
– linear predicates: [Garg and Mittal 01]
• time-complexity: O(n2m)
– general predicate:
• Theorem: Given a computation, if a predicate b can be detected
efficiently then the slice for b can also be computed efficiently.
[Mittal,Sen and Garg 03]
– combining slices: Boolean operators
– temporal logic operators: EF, AG, EG
– approximate slice: For arbitrary boolean expression
• n: number of processes
• m: number of events
24
Experiments: Dining Philosophers Trace Verification
 POTA: Partial Order Trace Analyzer (based on slicing) [Sen and Garg 03]
 SPIN: A widely used model checking tool
[Holzmann 97]
– SPIN: 250 seconds for n = 6, runs out of memory for n > 6.
– POTA: can handle n= 200. Used 400 seconds.
Predicate: Two neighboring dining philosophers do not eat concurrently
25
Conclusions
 Bug-hunting in concurrent systems
 Total order vs. Partial Order
 Abstraction like slicing to combat state space
explosion problem
26
Questions
?
27