
Parallel Computing Lab 10:
Parallel Game of Life (GOL) with structured grid
November 17, 2012
This lab requires writing a program in the Bulk Synchronous Parallel (BSP) style, in particular a parallel
Game of Life (GOL) simulation on a regular grid. It also shows how to use the VTK visualization library (no
details yet).
1 Code and setup
A skeleton code is given (see gol1-template.tgz), in case you don’t want to write everything
from scratch. You can use any part of the code, although it is easier to evaluate a solution if it
follows my code structure.
Unzip the code on aur and compile with
module load intelmpi
module load VTK-5.10.1
make
In case you want to compile on your own computer, change the VTK directories in the Makefile.
1.1 Sequential GOL
There are two sequential versions:
1. Fortran with terminal output – ./main_gol-seq
(a) sources are main_gol-seq.f90, gol-seq.f90, commonf.f90, and common.h
(b) this program prints the new state of the GOL field to the console every time a key is pressed
2. C++ with VTK visualization – ./gol-visual-seq
(a) you need to pass the -X option to ssh and possibly set the LIBGL_ALWAYS_INDIRECT
environment variable to 1
(b) sources are gol-visual-seq.cc
(c) this program draws the new state of the GOL field with VTK every second
(d) you can use the mouse to rotate and zoom the field (it is 3D)
As you can see, the grid is structured. If 3D rendering is too slow over ssh, try NX; see the instructions on the HPC
web site [1].
Figure 1: Processes and communicators. The compComm communicator groups the compute processes (compute 1–4) inside MPI_COMM_WORLD; the vis process belongs only to MPI_COMM_WORLD.
2 Parallel GOL
There is one “parallel” version, which is incomplete (just a template) – main_gol-mpi.f90. It may
run in two modes:
1. mpirun -np 1 ./main_gol-mpi – initializes MPI, runs the sequential GOL and prints output to the
console
2. mpirun -np 1 ./main_gol-mpi : -np 1 ./gol-visual-mpi – runs one visualizer MPI process (gol-visual-mpi.cc) and one compute MPI process (main_gol-mpi.f90)
• the visualizer waits for messages from the compute process
• the compute process calculates the new state of the GOL field on a key press
In either case the compComm communicator is created from MPI_COMM_WORLD (see Figure 1); it
combines only the MPI processes with the ’compute’ role (main_gol-mpi.f90). Use this communicator in
the following task.
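For reference, one common way to build such a communicator is MPI_Comm_split. The following is only a sketch of the idea (the template may create compComm differently); isCompute and the other variable names are illustrative.

! Sketch: build compComm from MPI_COMM_WORLD by splitting on the process role.
! Assumption: each process knows its role (e.g. from the executable it runs).
program comm_split_sketch
  use mpi
  implicit none
  integer :: worldRank, color, compComm, compRank, compSize, ierr
  logical :: isCompute

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, worldRank, ierr)

  isCompute = .true.                        ! the visualizer would set this to .false.
  color = merge(0, MPI_UNDEFINED, isCompute)
  call MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, compComm, ierr)

  if (isCompute) then
    call MPI_Comm_rank(compComm, compRank, ierr)
    call MPI_Comm_size(compComm, compSize, ierr)
    print '(a,i3,a,i3)', 'compute process ', compRank, ' of ', compSize
  end if
  call MPI_Finalize(ierr)
end program comm_split_sketch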
Task 1
Rewrite the program so that GOL runs in parallel. For that, create gol-mpi.f90 and write
code similar to gol-seq.f90 that runs in parallel by decomposing the domain into n × n subdomains and exchanging neighbour (ghost) values as necessary (see the lecture slides). You may
completely ignore the visual VTK C++ part if you don’t like it and just use the console for
output.
An example of splitting the field into 3 × 3 parts for 9 processes is shown in Figure 2. The idea is
to create a larger local (lXN+2)×(lYN+2) data field that comprises:
• local values, which are computed by the current process,
• inner values, which in addition are not needed by any other process and can also be computed
without any ghost values,
• ghost values, a thin region of non-local values along the boundary that are needed by
local computations and must eventually be copied from other processes.
Figure 2: Decomposition of the XN × YN GOL field between 9 processes; for one process, all values are data(1:lXN+2,1:lYN+2), local values are data(2:lXN+1,2:lYN+1), and inner values are data(3:lXN,3:lYN).
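As an illustration of this layout, the local block size and the field with its ghost layer could be set up as sketched below, assuming the global XN × YN field is divided evenly by an n × n process grid; the subroutine and helper names are made up and need not match the template.

! Sketch: local block size and allocation with a one-cell ghost layer on each
! side, following the index ranges of Figure 2 (illustrative names only).
subroutine parfield_setup_sketch(compRank, n, XN, YN, lXN, lYN, data)
  implicit none
  integer, intent(in)  :: compRank, n, XN, YN
  integer, intent(out) :: lXN, lYN
  integer, allocatable, intent(out) :: data(:,:)
  integer :: px, py                 ! coordinates of this process in the n x n grid

  px  = mod(compRank, n)
  py  = compRank / n
  lXN = XN / n                      ! local block size without ghosts
  lYN = YN / n

  ! all values data(1:lXN+2,1:lYN+2); local values data(2:lXN+1,2:lYN+1);
  ! the global position of local cell (2,2) is (px*lXN+1, py*lYN+1)
  allocate(data(1:lXN+2, 1:lYN+2))
  data = 0                          ! start with dead cells, ghost layer included
end subroutine parfield_setup_sketch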
It is possible to exchange the latest ghost values every time before doing the local computations. In this
case the code I have is the following:
!> Do one step in gol
subroutine gol_parfield_step(pf)
  type(ParField_t),intent(inout) :: pf
  type(GhostInfo_t) :: ghostInfo
  ! exchange ghosts
  call gol_parfield_exchange_ghosts__start(pf, ghostInfo)
  call gol_parfield_exchange_ghosts__finish(pf, ghostInfo)
  ! make the step
  call gol_field_calculate(pf%base, (/2,2,pf%lXN+1,pf%lYN+1/))
  call gol_field_step_finish(pf%base)
end subroutine gol_parfield_step
The second argument of the gol_field_calculate routine specifies the region to compute. In this
case these are the local values, but in the next task we will need finer control. The ghostInfo value
holds all MPI request handles, buffers, and other useful information between the MPI_Isend/MPI_Irecv and
MPI_Wait calls.
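To illustrate what the start/finish pair does, here is a simplified sketch of a non-blocking exchange for the left and right ghost columns only (the top/bottom rows and corner values are handled in the same way); the buffer and neighbour-rank names are made up, and the real routines keep this state in GhostInfo_t instead of local variables.

! Sketch: non-blocking exchange of the left/right ghost columns, assuming
! integer cell values and neighbour ranks leftRank/rightRank in compComm
! (MPI_PROC_NULL can be used at the domain boundary). Illustrative names only.
subroutine exchange_ghost_columns_sketch(data, lXN, lYN, leftRank, rightRank, compComm)
  use mpi
  implicit none
  integer, intent(in)    :: lXN, lYN, leftRank, rightRank, compComm
  integer, intent(inout) :: data(1:lXN+2, 1:lYN+2)
  integer :: sendL(lYN), sendR(lYN), recvL(lYN), recvR(lYN)
  integer :: req(4), ierr

  ! pack our own boundary columns
  sendL = data(2,     2:lYN+1)
  sendR = data(lXN+1, 2:lYN+1)

  ! post receives for the ghost columns and sends of our boundary columns
  call MPI_Irecv(recvL, lYN, MPI_INTEGER, leftRank,  0, compComm, req(1), ierr)
  call MPI_Irecv(recvR, lYN, MPI_INTEGER, rightRank, 1, compComm, req(2), ierr)
  call MPI_Isend(sendL, lYN, MPI_INTEGER, leftRank,  1, compComm, req(3), ierr)
  call MPI_Isend(sendR, lYN, MPI_INTEGER, rightRank, 0, compComm, req(4), ierr)

  ! the "finish" part: wait and unpack (inner computations could go in between,
  ! which is exactly the optimization of the next section)
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)
  data(1,     2:lYN+1) = recvL
  data(lXN+2, 2:lYN+1) = recvR
end subroutine exchange_ghost_columns_sketch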
3 Parallel optimized GOL
One optimization idea is to initiate the ghost value exchange and meanwhile compute the inner local
values, then, once the ghost values are available, compute the non-inner local values. The code of one GOL
step then changes to the following:
subroutine gol_parfield_step_async(pf)
  type(ParField_t),intent(inout) :: pf
  type(GhostInfo_t) :: ghostInfo
  ! start exchanging ghost values and calculate inner part
  call gol_parfield_exchange_ghosts__start(pf, ghostInfo)
  call gol_field_calculate(pf%base, (/3,3,pf%lXN,pf%lYN/))
  call gol_parfield_exchange_ghosts__finish(pf, ghostInfo)
  ! calculate outer part
  ! top and bottom borders
  call gol_field_calculate(pf%base, (/2,2,pf%lXN+1,2/))
  call gol_field_calculate(pf%base, (/2,pf%lYN+1,pf%lXN+1,pf%lYN+1/))
  ! left and right borders
  call gol_field_calculate(pf%base, (/2,3,2,pf%lYN/))
  call gol_field_calculate(pf%base, (/pf%lXN+1,3,pf%lXN+1,pf%lYN/))
  ! finish the step
  call gol_field_step_finish(pf%base)
end subroutine gol_parfield_step_async
Task 2
Rewrite the GOL code so that each process first calculates its boundary values needed by
other processes, sends them out and then calculates the rest. Benchmark the new code
against the non-optimized one for a large GOL field and provide a table with run times for 1, 4, 9,
and 16 processes for both versions. Disable visualization and provide event charts from Intel
ITAC for 4, 9 and 16 MPI processes for both versions. Is the optimization helpful?
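A possible way to time the two versions is sketched below, assuming the step routines above and the compComm communicator; nSteps and the other names are illustrative.

! Sketch: time nSteps GOL steps and report the slowest process, which
! determines the parallel run time. Illustrative names; assumes compComm
! and the step routines from the previous sections.
subroutine benchmark_steps_sketch(pf, compComm, nSteps)
  use mpi
  implicit none
  type(ParField_t), intent(inout) :: pf
  integer, intent(in) :: compComm, nSteps
  integer :: i, rank, ierr
  double precision :: t0, tLocal, tMax

  call MPI_Comm_rank(compComm, rank, ierr)
  call MPI_Barrier(compComm, ierr)          ! start all processes together
  t0 = MPI_Wtime()
  do i = 1, nSteps
    call gol_parfield_step(pf)              ! or gol_parfield_step_async(pf)
  end do
  tLocal = MPI_Wtime() - t0
  call MPI_Reduce(tLocal, tMax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, compComm, ierr)
  if (rank == 0) print '(a,i6,a,f10.3,a)', 'steps: ', nSteps, ', time: ', tMax, ' s'
end subroutine benchmark_steps_sketch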
For the numbers I tried (a 3000 × 1500 grid) and 4, 9, 16, and 36 processes I did not notice any substantial
win in time. The ITAC charts are shown in Figure 3.
Figure 3: Intel ITAC event charts for parallel Game of Life with 9 processes: (a) simple ghost exchange; (b) ghost exchange overlapped with inner computations.
References
[1] http://www.hpc.ut.ee/en/user_guides/using_nx
Appendix A: Using VTK
Here follow the details of VTK programming.