Slides - National Taiwan University

Symbolic Program Consistency Checking of OpenMP Parallel Programs with Relaxed Memory Models
Based on an LCTES 2012 paper.
Fang Yu (National Cheng Chi University)
Shun-Ching Yang, Guan-Cheng Chen, Che-Chang Chan (National Taiwan University)
Farn Wang (National Taiwan University & Academia Sinica)
Outline
• Introduction
– Motivation
– Parallel program correctness
– Related work
• 2-step program consistency checking
– Step 1: Static race constraint solution
– Step 2: Guided simulation
• Extended finite-state machine (EFSM), relaxed memory models
• Implementation
• Experiments
• Conclusion
Motivation (1/4)
• Parallel Programming
– Multi-cores,
– General purpose computation on GPU (GPGPU)
– Distributed computing, cloud computing
• Challenges:
– Parallel loops, chunk sizes, # threads, schedules
– Arrays, pointer aliases,
– Relaxed memory models
Motivation (2/4)
A Running example of C & OpenMP
for (k = 0; k < size - 1; k++) {
#pragma omp parallel for default(none) shared(M, L, size, k) \
        private(i, j) schedule(static, 1) num_threads(4)
    for (i = k + 1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k + 1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}
Motivation (3/4)
for (k = 0; k < size - 1; k++) {
#pragma omp parallel for default(none) shared(M, L, size, k) \
        private(i, j) schedule(static, c) num_threads(4)
    for (i = k + 1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k + 1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}
Iterations of i assigned to each thread by schedule(static,c):
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…….
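The round-robin chunk assignment listed on this slide can be written as a small function. This is only a sketch: the name static_owner and its parameters are made up for illustration, and threads are numbered from 1 as on the slide (OpenMP itself numbers them from 0).

```c
#include <assert.h>

/*
 * Returns the 1-based thread that executes iteration i of a loop whose first
 * iteration is `first`, under schedule(static, chunk) with `nthreads` threads.
 * Chunks are handed out round-robin in thread order.
 * All names here are illustrative, not taken from the slides.
 */
int static_owner(int i, int first, int chunk, int nthreads)
{
    assert(i >= first && chunk > 0 && nthreads > 0);
    return ((i - first) / chunk) % nthreads + 1;
}
```

For example, with k = 0 (so first = 1), c = 2 and 4 threads, static_owner(9, 1, 2, 4) is 1: iteration 9 opens the second chunk of Thread 1, matching the "Thread 1: k+1+4c, …" line.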
Motivation (4/4)
Many frameworks support parallel programming:
• forks & joins
• Pthreads
• Open Multi-Processing (OpenMP)
• Threading Building Blocks
• Microsoft …
Parallel Program Correctness (1/4)
Program level, what users care about
• Determinism:
– For all input, all executions yield the same output.
• Consistency:
– All executions yield the same output as the sequential
execution.
• Race-freedom:
– Parallel executions do not yield different results.
All seemingly equivalent at program level.
• unless sequential execution is not a parallel execution.
Parallel Program Correctness (2/4)
[Figure: a program as a sequence of parallel regions: parallel for, parallel while, parallel for, parallel for]
• Checking the correctness property of each parallel region (PR)
• Correctness at PRs ⇒ correctness of the program
Parallel Program Correctness (3/4)
In practice
• It may be unclear what the program result is.
• Instead, properties for correctness at PR level
are usually checked.
– determinism
– consistency
– race-freedom
• At RW schedule levels, values do not count.
– linearizability (transaction levels)
Parallel Program Correctness (4/4)
Linearizability (transaction level)
⇒ race-freedom (PR RW level)
⇒ determinism (PR level)
= consistency (PR level)
⇒ race-freedom (program level)
= determinism (program level)
= consistency (program level)
⇒ program correctness
Related Work (1/4)
• Thread analyzer of Sun Studio [Lin 2008]
– Static race detection, no arrays
• Intel Thread Checker [Petersen & Shah 2003]
– Dynamic approach
• Instrumentation approach on client-server for
race detection [Kang et al. 2009]
– Run-time monitoring in OpenMP programs
• OmpVerify [Basupalli et al. 2011]
– Polyhedral analysis for Affine Control Loops
Related Work in PLDI 2012 (2/4)
no simulation as the 2nd step
• Detect races via liquid effects [Kawaguchi,
Rondon, Bakst, Jhala]
– type inferencing for precise race detection.
– no arrays.
• Speculative Linearizability [Guerraoui,Kuncak,Losa]
• Reasoning about Relaxed Programs [Carbin, Kim,
Misailovic, Rinard]
• Parallelizing Top-Down Interprocedural Analysis
[Albarghouthi, Kumar, Nori, Rajamani]
Related Work in PLDI 2012 (3/4)
no simulation as the 2nd step
• Sound and Precise Analysis of Parallel
Programs through Schedule-Specialization
[Wu, Tang, Hu, et al]
• Race Detection for Web Applications [Petrov,
Vechev, Sridharan, Dolby]
• Concurrent Data Representation Synthesis
[Hawkins, Aiken, Fisher, et al]
• Dynamic Synthesis for Relaxed Memory
Models [Liu, Nedev, Prisadnikov, et al]
Related Work in PLDI 2012 (4/4)
no simulation as the 2nd step
Tools:
• Parcae [Raman, Zaks, Lee, et al]
• Chimera [Lee, Chen, Flinn, Narayanasamy]
• Janus [Tripp, Manevich, Field, Sagiv]
• Reagents [Turon]
Methodology (1/2)
Assumptions:
• Arrays do not overlap.
• No pointers other than arrays.
• Fixed #threads, chunk size, scheduling policy.
– We analyze consistency of program implementation.
• Focusing on OpenMP.
– The techniques should be applicable to other
frameworks.
• Output result prescribed by users.
Why OpenMP ?
• Complicated enough
• Practical enough
– Parallelizes programs automatically;
– Is an industry standard of application
programming interface (API);
– Is supported by Sun Studio, Intel Parallel Studio,
Visual C++, GNU Compiler Collection (GCC).
Methodology (2/2)
2-step program consistency checking:
Program consistency checking
→ Step 1: potential race analysis at PR level
→ potential race report
→ Step 2: guided simulation for program consistency violations
→ end
Step 1: Potential Races at PR level
Necessary constraints as Presburger formulas
• A race constraint between each pair of
memory references to the same location by
different threads.
• Solution of the pairwise constraints via
Presburger formula solving.
Step 1: Potential Race Analysis
C program with OpenMP
→ Pairwise Constraints Generator
→ Pairwise Race Constraints
→ Constraint Solver
→ Sat?
   No: race freedom
   Yes: potential races (truth assignment)
Potential Race Constraint
A potential race constraint = thread path condition ∧ race condition
• Thread Path Condition
– Necessary for a thread to access a memory location in a
statement
– Obtained by symbolic postcondition analysis
• Race Condition
– The necessary condition of an access by two threads in a
parallel region
Running example
for (k = 0; k < size - 1; k++) {
#pragma omp parallel for default(none) shared(M, L, size, k) \
        private(i, j) schedule(static, c) num_threads(4)
    for (i = k + 1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k + 1; j < size; j++) {
            M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
    }
}
Iterations of i assigned to each thread by schedule(static,c):
Thread 1: k+1, …, k+1+c-1
Thread 2: k+1+c, …, k+1+2c-1
Thread 3: k+1+2c, …, k+1+3c-1
Thread 4: k+1+3c, …, k+1+4c-1
Thread 1: k+1+4c, …, k+1+5c-1
…….
Thread Path Condition of L[i][k]
For the write L[i][k] in L[i][k] = M[i][k]/M[k][k] of the running example, executed by Thread 1:
(i_t1 - (k+1)) % 4 = 0 ∧ k+1 ≤ i_t1 < size
The modulus 4 matches 4 threads with chunk size 1, as in schedule(static,1).
Thread Path Conditions of L[i-1][k]
For the read L[i-1][k] in M[i][j] = M[i][j] - L[i-1][k]*M[k][j] of the running example, executed by Thread 2:
(i_t2 - (k+1) - 1) % 4 = 0
∧ k+1 ≤ i_t2 < size
∧ k+1 ≤ j_t2 < size
Race Condition of L[i][k] & L[i-1][k]
Both thread path conditions hold and the two accesses refer to the same element of L:
(i_t1 - (k+1)) % 4 = 0
∧ k+1 ≤ i_t1 < size
∧ (i_t2 - (k+1) - 1) % 4 = 0
∧ k+1 ≤ i_t2 < size
∧ k+1 ≤ j_t2 < size
∧ k = k ∧ i_t1 = i_t2 - 1
Potential Race Constraint Solving
All constraints are Presburger formulas:
(i_t1 - (k+1)) % 4 = 0
∧ k+1 ≤ i_t1 < size
∧ (i_t2 - (k+1) - 1) % 4 = 0
∧ k+1 ≤ i_t2 < size
∧ k+1 ≤ j_t2 < size
∧ k = k ∧ i_t1 = i_t2 - 1
Potential races (Omega lib.):
. . i_1 = k+1+4alpha
. . i_2 = k+2+4alpha
. . i_2 = i_1+1
. . i_1 < size
. . i_2 < size
. . k+1 <= i_1
. . k+1 <= i_2
. . k+1 <= j_2
. . j_2 < size
i_1 – 0 [0,), not_tight
i_2 – 0 [0,), not_tight
Step 2: Guided symbolic simulation
• Program models:
– Extended finite-state machine (EFSM)
– Relaxed memory model
• Simulator of EFSM
– Stepwise, backtrack, fixed point
• Witness of program consistency violations
– comparison with the sequential execution result.
Guided Simulation
C program with OpenMP
→ Model Generator
→ Model (EFSM)
Model (EFSM) + potential races (from step 1)
→ Simulation
→ Consistency?
   No: consistency violations
   Yes: fixed point?
      No: back to Simulation
      Yes: consistency (w. benign races)
C Program Model Construction (1/2)
Example: #pragma omp for schedule(static, c) num_threads(m)
for(x=i;x<=j;x++) S
t is the serial number of the thread.
y is an auxiliary local variable for the position within a chunk.
EFSM transitions (guard, then action):
start: (true) x=(t-1)*c+i; y=0;
S: executes the loop body
(x<j ∧ y<c-1) x++; y++;
(x-y+m*c ≤ j ∧ y=c-1) x=x-y+m*c; y=0;
(x-y+m*c > j ∧ y=c-1) stop
(x>j) stop
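The transitions above can be walked directly to list the iterations that one thread executes. This is a minimal sketch with a hypothetical function name; lo and hi stand for the loop bounds i and j of the slide, and the two stop guards are folded into the loop condition.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walks the loop EFSM of thread t (1-based) for
 *     for (x = lo; x <= hi; x++) S
 * under schedule(static, c) with m threads, and records every x for which
 * S runs. Returns the number of recorded iterations (at most cap).
 * Names are illustrative, not taken from the tool.
 */
size_t efsm_iterations(int t, int c, int m, int lo, int hi,
                       int *out, size_t cap)
{
    assert(t >= 1 && c >= 1 && m >= 1 && out != NULL);
    size_t n = 0;
    int x = (t - 1) * c + lo;   /* start transition */
    int y = 0;                  /* position inside the current chunk */
    while (x <= hi && n < cap) {
        out[n++] = x;           /* S executes for this x */
        if (y < c - 1) {        /* stay inside the chunk */
            x++;
            y++;
        } else {                /* y == c-1: jump to this thread's next chunk */
            x = x - y + m * c;
            y = 0;
        }
    }
    return n;
}
```

For t = 1, c = 2, m = 4 and the range 1..12, the walk visits 1, 2, 9, 10, which is the first and fifth chunk of the round-robin assignment.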
C Program Model Construction (2/2)
To model races in a C statement:
y = f(x1, x2, …, xn)
assume reads x1, x2,…, xn in order.
– other orders can also be modeled.
Translate to the following n+1 EFSM transitions:
a1=x1; a2=x2; …; an=xn; y=f(a1,…,an);
a1, a2, …, an are auxiliary variables in EFSM.
Relaxed Memory Models
• Out-of-order execution of accesses to the
memory for hardware efficiency.
– local caches, multiprocessors
– for customized synchronizations, controlled races
• May lead to unexpected results.
A classical example:
initially x = 0 ∧ y = 0
thread 1: x = 1; z = y;
thread 2: y = 1; w = x;
assert z = 1 ∨ w = 1
Relaxed Memory Models
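A classical example:
initially x = 0 ∧ y = 0
thread 1 (cache 1): x = 1; z = y;
thread 2 (cache 2): y = 1; w = x;
assert z = 1 ∨ w = 1
[Figure: each thread reaches memory through its own cache via store and load operations.]
Cached operations:
thread 1: x.c1 = 1; load(z.c1, y);
thread 2: y.c2 = 1; load(w.c2, x);
store(x.c1): x = x.c1
store(y.c2): y = y.c2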
Relaxed Memory Models
Total store order (TSO)
• From SPARC
• Adapted to Intel 80x86 series
• Description:
– Local reads can use pending writes in the local
store.
• Problem: Peer reads are not aware of the local pending
writes.
– Local stores must be FIFO.
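The effect of pending writes on the classical two-thread example (thread 1: x=1; z=y; thread 2: y=1; w=x) can be checked by exhaustive search. The sketch below is a hand-written toy, not the REDLIB model: each thread has a one-entry store buffer, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Global state of the two-thread example. */
typedef struct {
    int pc1, pc2;     /* per thread: 0 = store next, 1 = load next, 2 = done */
    bool buf1, buf2;  /* the thread's store is still pending in its buffer */
    int x, y;         /* memory copies */
    int z, w;         /* values obtained by the two loads */
} Litmus;

/* Depth-first search over all interleavings.
 * Returns true if some complete run ends with z == 0 and w == 0. */
static bool search(Litmus s, bool tso)
{
    assert(s.pc1 >= 0 && s.pc1 <= 2 && s.pc2 >= 0 && s.pc2 <= 2);
    if (s.pc1 == 2 && s.pc2 == 2 && !s.buf1 && !s.buf2)
        return s.z == 0 && s.w == 0;

    bool found = false;
    Litmus n;

    if (s.pc1 == 0) {            /* thread 1: x = 1 */
        n = s; n.pc1 = 1;
        if (tso) n.buf1 = true; else n.x = 1;
        found = found || search(n, tso);
    } else if (s.pc1 == 1) {     /* thread 1: z = y, read from memory */
        n = s; n.pc1 = 2; n.z = s.y;
        found = found || search(n, tso);
    }
    if (s.pc2 == 0) {            /* thread 2: y = 1 */
        n = s; n.pc2 = 1;
        if (tso) n.buf2 = true; else n.y = 1;
        found = found || search(n, tso);
    } else if (s.pc2 == 1) {     /* thread 2: w = x, read from memory */
        n = s; n.pc2 = 2; n.w = s.x;
        found = found || search(n, tso);
    }
    if (s.buf1) {                /* flush thread 1's pending write of x */
        n = s; n.buf1 = false; n.x = 1;
        found = found || search(n, tso);
    }
    if (s.buf2) {                /* flush thread 2's pending write of y */
        n = s; n.buf2 = false; n.y = 1;
        found = found || search(n, tso);
    }
    return found;
}

/* tso = false models sequential consistency (stores hit memory at once). */
bool both_zero_reachable(bool tso)
{
    Litmus init = {0, 0, false, false, 0, 0, 0, 0};
    return search(init, tso);
}
```

both_zero_reachable(false) is false, so the assertion z = 1 ∨ w = 1 holds when stores reach memory immediately. both_zero_reachable(true) is true: both stores can sit in their buffers while each peer reads the old value from memory.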
Modeling TSO w. m threads (1/4)
• An array x[0..m] for each shared variable x
– x[0] is the memory copy.
– x[i] is the cache copy of x of thread i ∈ [1,m]
– x now becomes an address variable instead of the
value variable for x.
Modeling TSO w. m threads (2/4)
• An array ls[0..n] of objects for the load-store (LS) buffer of size n+1.
– ls_st[k]: status of load-store buffer cell k
• 0: not used, 1: load, 2: store
– ls_th[k]: thread that uses load-store buffer cell k.
– ls_dst[k], ls_src[k]: destination and source addresses
– ls_value: value to store
A single LS array is used purely for convenience.
It can be changed to m load-store buffers, one for each thread.
Need to know the mappings from threads to cores.
Modeling TSO w. m threads (3/4)
Load a ← x by thread J; 'a' is private.
Case: pending write (PW)
Step 1
  Thread J: !load@Q → ls_src@(Q) = &x; ls_dst = &a;
  LS Q: must be the largest PW LS object.
        ?load@J → ls_th=J; ls_status = 1;
Step 2
  Thread J: ?load_finish
  LS Q: !load_finish@(ls_th) → ls_dst[0]=ls_value; ls_th=0; ls_status=0; compact LS array;
Case: no pending write
Step 1
  Thread J: !load@Q → ls_src@(Q) = &x; ls_dst = &a;
  LS Q: must be the smallest idle LS obj.
        ?load@J → ls_th=J; ls_status = 1;
Step 2
  Thread J: ?load_finish
  LS Q: !load_finish@(ls_th) → ls_dst[0]=ls_src[0]; ls_th=0; ls_status=0; compact LS array;
Modeling TSO w. m threads (4/4)
Store a → x by thread J; 'a' is private.
Step 1
  Thread J: !store@Q → ls_dst@(Q) = &x; ls_value = a;
  LS Q: must be the smallest idle LS obj.
        ?store@J → ls_th=J; ls_status = 2;
Step 2
  LS Q: ls_dst[0] = ls_value; ls_th=0; ls_status=0; compact LS array;
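The buffer discipline of the load and store slides can be sketched as a plain data structure. This is a simplification under stated assumptions, not the EFSM itself: only stores are buffered, a load completes in one step instead of two synchronized transitions, and the type and function names (LsBuffer, ls_store, ls_load, ls_commit_oldest) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LS_CELLS 8

/* One shared FIFO of pending stores; cells 0..used-1 are occupied. */
typedef struct {
    int  th[LS_CELLS];     /* ls_th: thread that issued the store */
    int *dst[LS_CELLS];    /* ls_dst: destination address */
    int  value[LS_CELLS];  /* ls_value: value to store */
    int  used;
} LsBuffer;

/* Store: claim the smallest idle cell. Returns 0, or -1 if the buffer is full. */
int ls_store(LsBuffer *b, int thread, int *addr, int value)
{
    assert(b != NULL);
    if (b->used == LS_CELLS)
        return -1;
    int q = b->used++;
    b->th[q] = thread;
    b->dst[q] = addr;
    b->value[q] = value;
    return 0;
}

/* Load: use the youngest pending write of the same thread to the same
 * address if there is one, otherwise read the memory copy. */
int ls_load(const LsBuffer *b, int thread, const int *addr)
{
    assert(b != NULL);
    for (int q = b->used - 1; q >= 0; q--)
        if (b->th[q] == thread && b->dst[q] == addr)
            return b->value[q];
    return *addr;
}

/* Commit cell 0 to memory, then compact the array (FIFO order). */
int ls_commit_oldest(LsBuffer *b)
{
    assert(b != NULL);
    if (b->used == 0)
        return -1;
    *b->dst[0] = b->value[0];
    for (int q = 1; q < b->used; q++) {
        b->th[q - 1] = b->th[q];
        b->dst[q - 1] = b->dst[q];
        b->value[q - 1] = b->value[q];
    }
    b->used--;
    return 0;
}
```

After thread 1 calls ls_store for x, its own ls_load sees the new value while thread 2 still reads the old memory copy until ls_commit_oldest runs. That is the "peer reads are not aware of the local pending writes" behaviour of TSO.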
Guided Simulation
• For each pairwise race condition truth
assignment, perform a simulation session.
• Use a stack to explore the simulation paths.
• Explore all paths compatible with the truth
assignment.
• Check consistency at the end of each path.
• Mark benign races.
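The stack-based exploration and the end-of-path consistency check can be illustrated on a two-thread fragment of the running example. Everything in this sketch is an assumption made for illustration: the concrete values (M[i][k] = 6, M[k][k] = 2, M[k][j] = 2, initial M2 = 10, initial L = 0), the names, and the fact that it enumerates all interleavings instead of only those compatible with a truth assignment.

```c
#include <assert.h>

#define MAX_STACK 16

typedef struct {
    int pc1;        /* thread 1: 0 = about to write L, 1 = done */
    int pc2;        /* thread 2: 0 = about to read L, 1 = about to write M2, 2 = done */
    int L, M2, a;   /* shared L, shared M2, thread 2's auxiliary copy of L */
} Sim;

static void push(Sim *stack, int *top, Sim s)
{
    assert(*top < MAX_STACK);
    stack[(*top)++] = s;
}

/*
 * Explores every interleaving with an explicit stack. Returns the number of
 * complete paths whose final M2 differs from the sequential result and
 * stores the total number of complete paths in *paths.
 */
int count_violations(int *paths)
{
    const int golden = 10 - (6 / 2) * 2;   /* sequential result for M2 */
    Sim stack[MAX_STACK];
    int top = 0, bad = 0;

    *paths = 0;
    push(stack, &top, (Sim){0, 0, 0, 10, 0});
    while (top > 0) {
        Sim s = stack[--top];              /* backtrack to the latest open state */
        if (s.pc1 == 1 && s.pc2 == 2) {    /* end of a path: check consistency */
            (*paths)++;
            if (s.M2 != golden)
                bad++;
            continue;
        }
        if (s.pc1 == 0) {                  /* thread 1: L = M[i][k] / M[k][k] */
            Sim n = s; n.L = 6 / 2; n.pc1 = 1;
            push(stack, &top, n);
        }
        if (s.pc2 == 0) {                  /* thread 2: a = L (read) */
            Sim n = s; n.a = s.L; n.pc2 = 1;
            push(stack, &top, n);
        } else if (s.pc2 == 1) {           /* thread 2: M2 = M2 - a * M[k][j] */
            Sim n = s; n.M2 = s.M2 - s.a * 2; n.pc2 = 2;
            push(stack, &top, n);
        }
    }
    return bad;
}
```

Of the three interleavings, only the one in which thread 1 writes L before thread 2 reads it reproduces the sequential result; the other two are consistency violations. A guided session would start from the race witness and skip paths that contradict it.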
Implementation
Pathg – path generator
• Potential race condition solving
– Presburger Omega library
• Model construction:
– REDLIB for EFSM with synchronizations, arrays,
variable declarations, address arithmetics
• Guided EFSM simulation
– REDLIB semi-symbolic simulator
– step, backtrack, check fixpoint/consistency
Implementation
Guided Symbolic Simulation
[Figure: the sequential execution and the guided multi-threaded simulation of the same program, each with its memory accessing sequence and its output.
Sequential execution (golden model): Master Thread, Parallel Task 1, Parallel Task 2, Parallel Task 3, Master Thread, output.
Memory accessing sequence: Read:L[2][1], Read:L[2][1], Write:L[2][1], Read:L[2][1], Write:L[2][1], ...
Guided multi-threaded simulation: Master Thread, Parallel Task 1, Parallel Task 2, Parallel Task 3, Master Thread, output.
Memory accessing sequence: Read:L[2][1], Write:L[2][1], Read:L[2][1], Read:L[2][1], Write:L[2][1], ...]
Implementation
Potential Race Report
===tg:i_4,i_1=====tw:i_4
Race::L[5][1]
===tg:i_3,i_4=====tw:i_3
Race::L[4][1]
===tg:i_2,i_3=====tw:i_2
Race::L[3][1]
===tg:i_1,i_2=====tw:i_1
Race::L[2][1]
tg indicates the threads involved in the race.
tw indicates the thread that writes the memory address.
Race indicates the memory location where the race occurs.
We enumerate variables to limit the solution.
Experiments
• Environment
– Ubuntu 9.10 64bit
– i5-760 2.8GHz and 2GB RAM
• Benchmarks
– OpenMP Source Code Repository (OmpSCR)
– NAS Parallel Benchmarks (NPB)
Constraint Solving of OmpSCR
• Bug v1: races manually introduced (between any two threads dealing with consecutive iterations)
• Bug v2: rare races introduced (only between two specific threads on a particular shared memory location)
• Fixed: a barrier statement manually inserted (removes the race in Bug v2)
Benchmark   Original                Bug v1                  Bug v2                  Fixed
            #Const.  #Sat  Time     #Const.  #Sat  Time     #Const.  #Sat  Time     #Const.  #Sat  Time
c_lu.c      71       0     0.18s    629      29    1.810s   935      30    4.110s   935      0     5.15s
c_ja01.c    95       0     0.39s    95       8     0.42s    155      1     0.75s    95       0     0.77s
c_ja02.c    95       0     0.03s    95       8     0.35s    155      1     0.67s    95       0     1.03s
c_loopA.c   17       0     0.04s    47       4     0.07s    95       1     0.32s    17       0     0.84s
c_loopB.c   17       0     0.03s    29       4     0.08s    95       1     0.15s    17       0     1.13s
c_md.c      65       0     0.25s    77       4     0.30s    131      1     0.53s    65       0     1.25s
Symbolic Simulation of OmpSCR
• Blind simulation needs to explore (much) more traces to hit a consistency violation!
• Standard OpenMP tools fail to report races of these benchmarks.
Benchmarks    Guided simulation  Random simulation  Sun Studio  Intel Thread Checker
              #Traces  Time      #Traces  Time      Race        Race/total
c_lu_bug1     1        23.35s    25.3     52.11s    N           4/10
c_lu_bug2     1        23.22s    178.9    110.58s   N           1/10
c_ja01_bug1   1        6.65s     10.6     26.60s    N           4/10
c_ja01_bug2   1        13.91s    42.1     58.16s    N           3/10
c_ja_02_bug1  1        14.86s    25       28.83s    N           2/10
c_ja_02_bug2  1        15.19s    41.3     52.25s    N           2/10
c_loopA_bug1  1        10.76s    11.7     36.82s    N           3/10
c_loopA_bug2  1        56.86s    27.6     98.40s    N           2/10
c_loopB_bug1  1        14.54s    9.4      29.58s    N           2/10
c_loopB_bug2  1        41.50s    38.6     66.48s    N           2/10
c_md_bug1     1        12.19s    10.4     26.21s    N           3/10
c_md_bug2     1        19.38s    44.3     83.52s    N           2/10
NAS Parallel Benchmarks
• Middle-size benchmarks (1200+~3500+ loc)
• Efficient race constraint solving
– e.g., 150000+ race constraints solved in 38 minutes
by omega library
• Satisfiable constraints are rare
– 8/85067 constraints of nas_lu.c
Benchmark  #loc  #Access  #Const.  #Sat  Time
nas_lu.c   3481  13736    85067    8     27m30.37s
bt.c       3616  15916    157047   0     37m33.32s
mg.c       1250  4636     2269     0     0m17.19s
sp.c       2983  13604    45209    0     4m0.32s
nas_lu.c
• Slice the program to the segment of the parallel region with satisfiable race conditions
• Construct the symbolic model of the sliced
segment:
– 35 Modes (EFSM)
– Reaching the fixed point without consistency violation after
205 steps and 16.93secs
• Benign races
– All of them are used as mutual exclusion semaphores
– nas_lu.c is consistent
Conclusion
• Static analysis of program consistency
– for real C/C++ programs with OpenMP directives
• Highly automated solution
– Constraint solving
– Symbolic simulation
• High precision: relaxed memory models
• High efficiency
• Extension to TBB, other memory models?
• Partial order reduction?
Conclusion
Symbolic approach for static consistency checking
– Detect and identify races by solving race constraints
(Presburger formulas)
– Construct symbolic models and perform guided simulation
with races
– Support relaxed memory models
– Find consistency violations effectively (when they exist)