
Hybrid MPI/CUDA
Scaling accelerator code
Blue Waters Undergraduate Petascale Education Program
May 29 – June 10 2011
Why Hybrid CUDA?
• CUDA is fast! (for some problems)
• CUDA on a single card is like OpenMP: it doesn't scale beyond one node
• MPI can only scale so far
  • Excessive power consumption
  • Communication overhead
  • A large amount of work still remains for each node
• What if you could harness the power of multiple accelerators across multiple MPI processes?
Hybrid Architectures
• Tesla S1050 connected to nodes
  • 1 GPU, connected directly to a node
  • Al-Salam @ Earlham (as11 & as12)
• Tesla S1070
  • A server node with 4 GPUs, typically connected via PCI-E to 2 nodes
  • Sooner @ OU has some of these
  • Lincoln @ NCSA (192 nodes)
  • Accelerator Cluster (AC) @ NCSA (32 nodes)
MPI/CUDA Approach
• CUDA will be:
  • Doing the computational heavy lifting
  • Dictating your algorithm & parallel layout (data parallel)
• Therefore:
  • Design the CUDA portions first
  • Use MPI to move work to each node (see the sketch below)
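As a minimal sketch of that last point, assuming a hypothetical one-dimensional problem (n_total and all variable names are illustrative, not from any lab code), each rank can work out which slice of the global problem it owns before handing that slice to its GPU:

/* decompose.c -- illustrative only: split n_total items across the MPI ranks */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long n_total = 1000000;               /* global problem size (example value) */
    long chunk   = n_total / size;        /* items per rank                      */
    long start   = rank * chunk;          /* first item this rank owns           */
    long count   = (rank == size - 1)     /* last rank absorbs the remainder     */
                   ? n_total - start : chunk;

    printf("Rank %d handles items [%ld, %ld)\n", rank, start, start + count);
    /* ...the [start, start + count) range is what this rank would copy to its GPU... */

    MPI_Finalize();
    return 0;
}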
BWUPEP2011, UIUC, May 29 - June 10 2011
4
Implementation
• Do as much work as possible on the GPU before bringing data back to the CPU and communicating it
• Sometimes you won’t have a choice…
• Debugging tips:
  • Develop/test/debug a one-node version first
  • Then test it with multiple nodes to verify communication
move data to each node
while not done:
    copy data to GPU
    do work <<< >>>
    get new state out of GPU
    communicate with others
aggregate results from all nodes
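A minimal compilable sketch of that loop is below. The kernel body, chunk size, and iteration count are illustrative assumptions, and an MPI_Barrier stands in for whatever inter-node exchange a real problem needs. Everything is in one .cu file purely for brevity; the Compiling slide shows the usual split.

/* hybrid_skeleton.cu -- illustrative sketch of the pseudocode above */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N_PER_RANK 1024   /* illustrative chunk size per MPI process */
#define N_STEPS    10     /* illustrative number of iterations       */

__global__ void do_work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   /* stand-in for the real computation */
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "move data to each node": rank 0 scatters one chunk to every process */
    float *all = NULL;
    if (rank == 0)
        all = (float *)calloc((size_t)size * N_PER_RANK, sizeof(float));
    float *chunk = (float *)malloc(N_PER_RANK * sizeof(float));
    MPI_Scatter(all, N_PER_RANK, MPI_FLOAT,
                chunk, N_PER_RANK, MPI_FLOAT, 0, MPI_COMM_WORLD);

    float *d_chunk;
    cudaMalloc(&d_chunk, N_PER_RANK * sizeof(float));

    for (int step = 0; step < N_STEPS; step++) {
        /* copy data to GPU */
        cudaMemcpy(d_chunk, chunk, N_PER_RANK * sizeof(float),
                   cudaMemcpyHostToDevice);
        /* do work <<< >>> */
        do_work<<<(N_PER_RANK + 255) / 256, 256>>>(d_chunk, N_PER_RANK);
        /* get new state out of GPU */
        cudaMemcpy(chunk, d_chunk, N_PER_RANK * sizeof(float),
                   cudaMemcpyDeviceToHost);
        /* communicate with others (barrier stands in for a real exchange) */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    /* aggregate results from all nodes */
    MPI_Gather(chunk, N_PER_RANK, MPI_FLOAT,
               all, N_PER_RANK, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("first element after %d steps: %f\n", N_STEPS, all[0]);
        free(all);
    }
    cudaFree(d_chunk);
    free(chunk);
    MPI_Finalize();
    return 0;
}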
Multi-GPU Programming
• A CPU thread can only have a single active context for communicating with a GPU
• cudaGetDeviceCount(int *count)
• cudaSetDevice(int device)
• Be careful using MPI rank alone: the device count only includes the cards visible from each node
• Use MPI_Get_processor_name() to determine which processes are running where
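One common pattern (a sketch of one way to do it, not the only one) is to gather every rank's processor name, count how many lower-numbered ranks share the same node to get a per-node "local rank", and hand that to cudaSetDevice:

/* pick_gpu.c -- illustrative device selection by per-node local rank */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    memset(name, 0, sizeof(name));            /* zero-pad so strcmp below is clean */
    MPI_Get_processor_name(name, &name_len);

    /* Gather every rank's node name so each process can compute its local rank. */
    char *all_names = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all_names, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    int local_rank = 0;                       /* lower-numbered ranks on my node */
    for (int i = 0; i < rank; i++)
        if (strcmp(&all_names[i * MPI_MAX_PROCESSOR_NAME], name) == 0)
            local_rank++;

    int device_count = 0;
    cudaGetDeviceCount(&device_count);
    if (device_count > 0)                     /* map local rank onto visible cards */
        cudaSetDevice(local_rank % device_count);

    printf("Rank %d on %s -> GPU %d (of %d visible)\n",
           rank, name, device_count ? local_rank % device_count : -1, device_count);

    free(all_names);
    MPI_Finalize();
    return 0;
}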
Compiling
• CUDA needs nvcc, MPI needs mpicc
• Dirty trick: wrap mpicc with nvcc
    nvcc --compiler-bindir mpicc main.c kernel.cu
• nvcc processes the .cu files and sends the rest to the compiler it wraps
• The kernel, kernel invocations, and cudaMalloc are all best off in a .cu file somewhere (see the sketch below)
• MPI calls should be in .c files
• There are workarounds, but this is the simplest approach
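A small illustration of that split (file and function names here are invented for the example): the kernel and a C-callable launcher live in the .cu file, and main.c only sees an ordinary function declaration.

/* kernel.cu -- compiled by nvcc */
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

/* C-callable wrapper: moves data to the GPU, runs the kernel, copies it back */
extern "C" void gpu_scale(float *host_data, int n, float factor)
{
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d_data, n, factor);
    cudaMemcpy(host_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}

/* main.c -- compiled by mpicc through the wrapped nvcc command above */
#include <mpi.h>

void gpu_scale(float *host_data, int n, float factor);   /* defined in kernel.cu */

int main(int argc, char **argv)
{
    float data[256] = {0.0f};
    MPI_Init(&argc, &argv);
    gpu_scale(data, 256, 2.0f);    /* all CUDA work hides behind this call */
    MPI_Finalize();
    return 0;
}

Both files then build together with the wrapped nvcc command shown above.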
Executing
• Typically one MPI process per available GPU
• On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:
    #BSUB -R "select[cuda > 0]"
    #BSUB -R "rusage[cuda=2]"
    #BSUB -l nodes=1:ppn=2
• On AC, each node has 4 GPUs, and the number of GPUs assigned corresponds to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:
    #BSUB -l nodes=2:tesla:cuda3.2:ppn=4
Hybrid CUDA Lab
• We already have Area Under a Curve code for MPI and CUDA independently.
• You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area (a sketch follows below).
• Otherwise, feel free to take any code we’ve used so far and experiment!
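One possible shape for that hybrid is sketched below. The curve f(x) = x*x, the rectangle count, and all names are assumptions for illustration rather than the actual lab code: each rank's GPU fills in the rectangle areas for its slice of [a, b], the host sums them, and MPI_Reduce combines the subtotals.

/* hybrid_area.cu -- illustrative hybrid area-under-a-curve sketch */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__device__ float f(float x) { return x * x; }   /* example curve */

/* Each thread computes the area of one rectangle of a left Riemann sum. */
__global__ void rect_areas(float *area, float x0, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        area[i] = f(x0 + i * dx) * dx;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const float a = 0.0f, b = 1.0f;          /* integrate f over [a, b]             */
    const int   n_total = 1 << 20;           /* total rectangles (example value,    */
    int   n  = n_total / size;               /*  assumes size divides it evenly)    */
    float dx = (b - a) / n_total;
    float x0 = a + rank * n * dx;            /* left edge of this rank's slice      */

    float *d_area;
    cudaMalloc(&d_area, n * sizeof(float));
    rect_areas<<<(n + 255) / 256, 256>>>(d_area, x0, dx, n);

    float *area = (float *)malloc(n * sizeof(float));
    cudaMemcpy(area, d_area, n * sizeof(float), cudaMemcpyDeviceToHost);

    float subtotal = 0.0f;                   /* sum this GPU's rectangles on the host */
    for (int i = 0; i < n; i++)
        subtotal += area[i];

    float total = 0.0f;                      /* combine subtotals across ranks */
    MPI_Reduce(&subtotal, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Approximate area: %f\n", total);

    cudaFree(d_area);
    free(area);
    MPI_Finalize();
    return 0;
}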