SPIRE: A Methodology for Sequential to Parallel
Intermediate Representation Extension
Dounia Khaldi Pierre Jouvelot
François Irigoin
Corinne Ancourt
CRI, Mathématiques et systèmes
MINES ParisTech
35 rue Saint-Honoré, 77300 Fontainebleau, France
17th Workshop on Compilers for Parallel Computing, Lyon, France
July 3-5, 2013
Context and Motivation
Cilk, UPC, X10, Habanero-Java, Chapel, OpenMP, MPI, OpenCL...
Parallelism handling in compilers
Towards a High-level Parallel Intermediate Representation
Trade-off between expressibility and conciseness of representation
Simplification of transformations and semantic analyses
Huge compiler platforms
GCC (more than 7 million lines of code)
LLVM (more than 1 million lines of code)
PIPS (600 000 lines of code)
OSCAR (400 000 lines of code)
Proposal
SPIRE: Sequential to Parallel Intermediate Representation Extension
methodology
Related Work
Existing parallel IRs, grouped by the kind of sequential IR they extend (sequential IR → parallel IR: parallelism representation)

AST-based IRs:
- GIMPLE [Merrill, 2003] → GCC (GOMP) [Novillo, 2006]: additional built-ins (e.g., __sync_fetch_and_add)
- Habanero-Java AST IR [Cavé et al., 2011] → HPIR [Zhao and Sarkar, 2011]: new nodes (AsyncRegionEntry/AsyncRegionExit, FinishRegionEntry/FinishRegionExit)
- InsPIRe [Insieme]: built-ins
- PLASMA [Pai et al., 2010]: vector operators (add, reduce, par)
- LLVM IR [llv, 2010] → LLVM PIR: metadata (llvm.loop.parallel)

Graph-based IRs:
- Program Dependence Graph [Ferrante et al., 1987] → Parallel Program Graph [Sarkar and Simons, 1993]: mgoto control and synchronization edges
- Hierarchical Task Graph [Girkar and Polychronopoulos, 1992]: new vertices for tasks
- Macro Task Graph [Kasahara et al., 1992]: vertices for macro-tasks (basic, repetition and subroutine blocks)
- Stream Graph [Choi et al., 2009]: nodes for streams
Outline
1. Design Approach
2. PIPS (Sequential) IR
3. SPIRE, a Sequential to Parallel IR Extension
4. SPIRE Operational Semantics
5. Validation: Application to LLVM
6. Conclusion
Design Approach
“Parallelism or concurrency are operational concepts that refer not
to the program, but to its execution.” [Dijkstra, 1977]
Mapping of parallel language constructs onto SPIRE's concepts, in three groups: execution, synchronization, and memory

Cilk (MIT): task creation: spawn; task join: sync; atomic section: cilk_lock; memory: shared.
Chapel (Cray): parallelism: forall, coforall, cobegin; task creation: begin; task join: sync; point-to-point: sync; atomic section: atomic; memory: PGAS (Locales, on).
X10 (IBM), Habanero-Java (Rice): parallelism: foreach; task creation: async, future; task join: finish; point-to-point: next, force/get; atomic section: atomic, isolated; memory: PGAS (Places, at).
OpenMP: parallelism: omp for, omp sections; task creation: omp task, omp section; task join: omp taskwait, omp barrier; atomic section: omp critical, omp atomic; memory: shared (private, shared...).
OpenCL: parallelism: EnqueueNDRangeKernel; task creation: EnqueueTask; task join: Finish, EnqueueBarrier; point-to-point: events; atomic section: atom_add, ...; memory: shared and distributed (ReadBuffer, WriteBuffer).
MPI: parallelism: MPI_Init; task creation: MPI_spawn; task join: MPI_Finalize, MPI_Barrier; memory: distributed (MPI_Send, MPI_Recv...).
SPIRE: execution: sequential, parallel; task creation: spawn; task join: barrier; point-to-point: signal, wait; atomic section: atomic; memory: shared and distributed (send, recv).
PIPS (Sequential) IR in Newgen
instruction = call + forloop + sequence ;
statement   = instruction x declarations:entity* ;
entity      = name:string x type x initial:value ;
forloop     = index:entity x
              lower:expression x upper:expression x
              step:expression x body:statement ;
sequence    = statements:statement* ;
SPIRE: Execution
Add execution to the control constructs forloop and sequence:

execution = sequential:unit + parallel:unit ;

forall in Chapel, and its SPIRE core language representation:

forall i in 1..n do
  t[i] = 0;

forloop(i, 1, n, 1,
  t[i] = 0,
  parallel)
sections in OpenMP, and its SPIRE representation:

#pragma omp sections
{
  #pragma omp section
  x = foo();
  #pragma omp section
  y = bar();
}
z = baz(x, y);

sequence(
  sequence(x = foo();
           y = bar(),
           parallel);
  z = baz(x, y),
  sequential)
SPIRE: Synchronization
Add synchronization to statement:

synchronization = none:unit +
                  spawn:entity + barrier:unit +
                  single:bool + atomic:reference ;

OpenCL example illustrating spawn and barrier statements, and its SPIRE core language representation:

mode = OUT_OF_ORDER_EXEC_MODE_ENABLE;
c = clCreateCommandQueue(context,
                         device_id, mode, &err);
clEnqueueTask(c, k_A, 0, NULL, NULL);
clEnqueueTask(c, k_B, 0, NULL, NULL);
clEnqueueBarrier(c);
clEnqueueTask(c, k_C, 0, NULL, NULL);

barrier(
  spawn(zero, k_A());
  spawn(one, k_B())
);
spawn(zero, k_C(..))
SPIRE: Point-to-Point Synchronization (Event API)
Add event as a new type:

event newEvent(int i);
void freeEvent(event e);
void signal(event e);
void wait(event e);

A phaser in Habanero-Java, and its SPIRE core language representation:

finish {
  phaser ph = new phaser();
  for (j = 1; j <= n; j++) {
    async phased(ph<SIG_WAIT>) {
      S; next; S′;
    }
  }
}

barrier(
  ph = newEvent(-(n-1));
  forloop(j, 1, n, 1,
    spawn(j, S;
          signal(ph);
          wait(ph);
          signal(ph);
          S′),
    parallel);
  freeEvent(ph)
)
Sequential Core Language
S ∈ Stmt ::= nop | I = E | S1; S2 | loop(E, S)

S ∈ SPIRE(Stmt) ::= nop | I = E | S1; S2 | loop(E, S) |
                    spawn(I, S) | barrier(S) |
                    wait(I) | signal(I) |
                    send(I, I′) | recv(I, I′)

m ∈ Memory = Ide → Value
κ ∈ Configuration = Memory × Stmt
ζ ∈ Exp → Memory → Value

The assignment rule, for instance:

         v = ζ(E)m
  ────────────────────────────
  (m, I = E) → (m[I → v], nop)
SPIRE as a Language Transformer
κ ∈ Configuration = Memory × Stmt
π ∈ State = Proc → Configuration × Procs
i ∈ Proc = ℕ
c ∈ Procs = ℘(Proc)
dom(π) = {i ∈ Proc | π(i) is defined}
π[i → (κ, c)] is the state π extended at i with (κ, c)

A sequential step is lifted to states:

            κ → κ′
  ────────────────────────────────
  π[i → (κ, c)] ↪ π[i → (κ′, c)]

spawn creates a new process n:

            n = ζ(I)m
  ──────────────────────────────────────────────
  π[i → ((m, spawn(I, S)), c)] ↪
    π[i → ((m, nop), c ∪ {n})][n → ((m, S), ∅)]
Validation: LLVM
function = blocks:block* ;
block    = label:entity x phi_nodes:phi_node* x
           instructions:instruction* x terminator ;
phi_node = call ;
instruction = call ;
terminator = conditional_branch +
             unconditional_branch + return ;
conditional_branch = value:entity x label_true:entity x
                     label_false:entity ;
unconditional_branch = label:entity ;
return = value:entity ;

A C loop, and its LLVM IR:

sum = 42;
for (i = 0; i < 10; i++) {
  sum = sum + 2;
}

entry:
  br label %bb1
bb:                               ; preds = %bb1
  %0 = add nsw i32 %sum.0, 2
  %1 = add nsw i32 %i.0, 1
  br label %bb1
bb1:                              ; preds = %bb, %entry
  %sum.0 = phi i32 [42, %entry], [%0, %bb]
  %i.0 = phi i32 [0, %entry], [%1, %bb]
  %2 = icmp slt i32 %i.0, 10
  br i1 %2, label %bb, label %bb2
Validation: SPIRE (LLVM)
SPIRE extends the LLVM Newgen definitions of the previous slide with execution and synchronization attributes:

function    = function x execution ;
block       = block x execution ;
instruction = instruction x synchronization ;

Intrinsic functions: send, recv, signal, wait
Conclusion
SPIRE methodology as an IR transformer:
10 concepts collected in three groups:
execution, synchronization, data distribution
SPIRE (PIPS), SPIRE (LLVM)
Operational semantics for SPIRE [Khaldi et al., 2013]
Trade-off between expressibility and conciseness of representation
Applications:
Implementation of SPIRE (PIPS)
Implementation of a new BDSC-based task parallelization
algorithm [Khaldi et al., 2012]
Generation of OpenMP and MPI code from the same parallel IR
Future Work
Transformations for parallel languages (see [Nandivada et al., 2013])
encoded in SPIRE
Representation of the PGAS memory model
Addition of programming features such as exceptions and speculation
References I
The LLVM Development Team (2010).
The LLVM Reference Manual (Version 2.6).
Cavé, V., Zhao, J., and Sarkar, V. (2011).
Habanero-Java: the New Adventures of Old X10.
In 9th International Conference on the Principles and Practice of Programming in Java (PPPJ).
Choi, Y., Lin, Y., Chong, N., Mahlke, S., and Mudge, T. (2009).
Stream Compilation for Real-Time Embedded Multicore Systems.
In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’09,
pages 210–220, Washington, DC, USA.
Dijkstra, E. W. (1977).
Re: “Formal Derivation of Strongly Correct Parallel Programs” by Axel van Lamsweerde and M.Sintzoff.
circulated privately.
Ferrante, J., Ottenstein, K. J., and Warren, J. D. (1987).
The Program Dependence Graph And Its Use In Optimization.
ACM Trans. Program. Lang. Syst., 9(3):319–349.
Girkar, M. and Polychronopoulos, C. D. (1992).
Automatic Extraction of Functional Parallelism from Ordinary Programs.
IEEE Trans. Parallel Distrib. Syst., 3:166–178.
Insieme.
Insieme - an Optimization System for OpenMP, MPI and OpenCL Programs.
http://www.dps.uibk.ac.at/insieme/architecture.html.
References II
Kasahara, H., Honda, H., Mogi, A., Ogura, A., Fujiwara, K., and Narita, S. (1992).
A Multi-Grain Parallelizing Compilation Scheme for OSCAR (Optimally Scheduled Advanced Multiprocessor).
In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages
283–297, London, UK. Springer-Verlag.
Khaldi, D., Jouvelot, P., and Ancourt, C. (2012).
Parallelizing with BDSC, a Resource-Constrained Scheduling Algorithm for Shared and Distributed Memory Systems.
Technical Report CRI/A-499 (Submitted to Parallel Computing), MINES ParisTech.
Khaldi, D., Jouvelot, P., Irigoin, F., and Ancourt, C. (2013).
SPIRE: A Methodology for Sequential to Parallel Intermediate Representation Extension.
In Proceedings of the 17th Workshop on Compilers for Parallel Computing, CPC’13.
Merrill, J. (2003).
GENERIC and GIMPLE: a New Tree Representation for Entire Functions.
In GCC developers summit 2003, pages 171–180.
Nandivada, V. K., Shirako, J., Zhao, J., and Sarkar, V. (2013).
A Transformation Framework for Optimizing Task-Parallel Programs.
ACM Trans. Program. Lang. Syst., 35(1):3:1–3:48.
Novillo, D. (2006).
OpenMP and Automatic Parallelization in GCC.
In the Proceedings of the GCC Developers Summit.
Pai, S., Govindarajan, R., and Thazhuthaveetil, M. J. (2010).
PLASMA: Portable Programming for SIMD Heterogeneous Accelerators.
In Workshop on Language, Compiler, and Architecture Support for GPGPU, held in conjunction with HPCA/PPoPP
2010, Bangalore, India.
References III
Sarkar, V. and Simons, B. (1993).
Parallel Program Graphs and their Classification.
In Banerjee, U., Gelernter, D., Nicolau, A., and Padua, D. A., editors, LCPC, volume 768 of Lecture Notes in Computer
Science, pages 633–655. Springer.
Zhao, J. and Sarkar, V. (2011).
Intermediate Language Extensions for Parallelism.
In Proceedings of the compilation of the co-located workshops on DSM’11, TMC’11, AGERE!’11, AOOPES’11,
NEAT’11, & VMIL’11, SPLASH ’11 Workshops, pages 329–340, New York, NY, USA. ACM.
Backup Slides
SPIRE: Execution
Also add execution to the unstructured control construct of instruction:

A C code, and its unstructured parallel control flow graph representation:

x = 3;
y = 5;
z = x + y;
u = x + 1;

[Control flow graph: entry → nop → {x = 3 ; y = 5} executed in parallel → {z = x + y ; u = x + 1} executed in parallel → nop → exit]
SPIRE: Data Distribution
void send(int dest, entity buf);
void recv(int source, entity buf);

An MPI point-to-point exchange, and its SPIRE representation:

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0)
  MPI_Recv(sum, sizeof(sum), MPI_FLOAT,
           1, 1, MPI_COMM_WORLD, &stat);
else if (rank == 1) {
  sum = 42;
  MPI_Send(sum, sizeof(sum), MPI_FLOAT,
           0, 1, MPI_COMM_WORLD);
}
MPI_Finalize();

forloop(rank, 0, size, 1,
  test(rank == 0,
    recv(one, sum),
    test(rank == 1,
      sum = 42;
      send(zero, sum),
      nop)),
  parallel)
SPIRE: Data Distribution
Non-blocking send ≡

spawn(new, send(zero, sum))

Non-blocking receive ≡

finish_recv = false;
spawn(new, recv(one, sum);
      atomic(finish_recv = true))

Broadcast ≡

test(rank == 0,
  forloop(rank, 1, size, 1,
    send(rank, sum),
    parallel),
  recv(zero, sum))