Hybrid MPI/CUDA: Scaling Accelerator Code
Blue Waters Undergraduate Petascale Education Program (BWUPEP2011), UIUC, May 29 - June 10, 2011

Why Hybrid CUDA?
- CUDA is fast! (for some problems)
- CUDA on a single card is like OpenMP: it doesn't scale beyond that card.
- MPI can only scale so far:
  - Excessive power consumption
  - Communication overhead
  - A large amount of work still remains on each node
- What if you could harness the power of multiple accelerators across multiple MPI processes?

Hybrid Architectures
- Tesla S1050: 1 GPU, connected directly to a node
  - Al-Salam @ Earlham (as11 & as12)
- Tesla S1070: a server unit with 4 GPUs, typically connected via PCI-E to 2 nodes
  - Sooner @ OU has some of these
  - Lincoln @ NCSA (192 nodes)
  - Accelerator Cluster (AC) @ NCSA (32 nodes)
- [Diagram: two nodes, each with its own RAM, sharing the four GPUs of an S1070]

MPI/CUDA Approach
- CUDA will be:
  - doing the computational heavy lifting
  - dictating your algorithm & parallel layout (data parallel)
- Therefore:
  - design the CUDA portions first
  - use MPI to move work to each node

Implementation
- Do as much work as possible on the GPU before bringing data back to the CPU and communicating it. Sometimes you won't have a choice…
- Debugging tips:
  - Develop/test/debug a one-node version first.
  - Then test it with multiple nodes to verify communication.
- Overall pattern:
    move data to each node
    while not done:
        copy data to GPU
        do work <<< >>>
        get new state out of GPU
        communicate with others
    aggregate results from all nodes

Multi-GPU Programming
- A CPU thread can only have a single active context to communicate with a GPU.
- cudaGetDeviceCount(int *count)
- cudaSetDevice(int device)
- Be careful using MPI rank alone: the device count only covers the cards visible from each node.
- Use MPI_Get_processor_name() to determine which processes are running where (see the device-selection sketch after the slides).

Compiling
- CUDA needs nvcc, MPI needs mpicc.
- Dirty trick: wrap mpicc with nvcc:
    nvcc --compiler-bindir mpicc main.c kernel.cu
- nvcc processes the .cu files and sends the rest to its wrapped compiler.
- The kernel, kernel invocation, and cudaMalloc are all best off in a .cu file somewhere; MPI calls should be in .c files (see the main.c / kernel.cu sketch after the slides).
- There are workarounds, but this is the simplest approach.

Executing
- Typically one MPI process per available GPU.
- On Sooner (OU), each node has 2 GPUs available, so ppn should be 2:
    #BSUB -R "select[cuda > 0]"
    #BSUB -R "rusage[cuda=2]"
    #BSUB -l nodes=1:ppn=2
- On AC, each node has 4 GPUs, and the GPUs correspond to the number of processors requested, so this requests a total of 8 GPUs on 2 nodes:
    #BSUB -l nodes=2:tesla:cuda3.2:ppn=4

Hybrid CUDA Lab
- We already have Area Under a Curve code for MPI and CUDA independently. You can write a hybrid code that has each GPU calculate a portion of the area, then use MPI to combine the subtotals into the complete area (a sketch follows the slides).
- Otherwise, feel free to take any code we've used so far and experiment!
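
Code Sketches

Device selection per rank. A minimal sketch of the Multi-GPU Programming slide, assuming one MPI process per GPU and the usual "nvcc wrapping mpicc" build; the file name and local-rank computation via MPI_Get_processor_name are illustrative, not part of the original slides.

    /* device_select.cu -- map each MPI rank to a GPU on its own node. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Find out which node this process is running on. */
        char name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(name, &name_len);

        /* Gather all node names so each rank can count how many
           lower-numbered ranks share its node: that is its local rank. */
        char *all = (char *)malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);
        int local_rank = 0;
        for (int i = 0; i < rank; i++)
            if (strcmp(all + (size_t)i * MPI_MAX_PROCESSOR_NAME, name) == 0)
                local_rank++;
        free(all);

        /* The device count only covers the cards visible from this node,
           so use the local rank, not the global MPI rank. */
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev > 0)
            cudaSetDevice(local_rank % ndev);

        printf("rank %d on %s -> GPU %d of %d\n",
               rank, name, ndev > 0 ? local_rank % ndev : -1, ndev);
        MPI_Finalize();
        return 0;
    }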
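
Splitting MPI and CUDA code. One way to arrange the .c / .cu split described on the Compiling slide: the kernel, cudaMalloc, and kernel launch live in kernel.cu behind an extern "C" wrapper, and the plain-C MPI program calls that wrapper. The function scale_on_gpu and the scaling example are made up for illustration.

    /* kernel.cu -- CUDA side: kernel, cudaMalloc, and launch live here. */
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= s;
    }

    /* extern "C" gives the C file a plain symbol to link against. */
    extern "C" void scale_on_gpu(float *h, int n, float s)
    {
        float *d;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d, n, s);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
    }

    /* main.c -- MPI side: plain C, no CUDA syntax. */
    #include <mpi.h>
    void scale_on_gpu(float *h, int n, float s);   /* defined in kernel.cu */

    int main(int argc, char **argv)
    {
        int i;
        float data[256];
        MPI_Init(&argc, &argv);
        for (i = 0; i < 256; i++) data[i] = (float)i;
        scale_on_gpu(data, 256, 2.0f);   /* GPU does the heavy lifting */
        MPI_Finalize();
        return 0;
    }

Built exactly as on the Compiling slide: nvcc --compiler-bindir mpicc main.c kernel.cu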
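
Hybrid Area Under a Curve. A sketch of one way the lab could be put together: each rank integrates its own slice of the interval on its GPU, the CPU sums that rank's rectangle areas, and MPI_Reduce combines the subtotals. The choices of f(x) = x*x on [0,1], the left-rectangle rule, and the problem size are assumptions for illustration, not the course's reference code.

    /* area.cu -- each GPU computes part of the area, MPI combines totals. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __device__ float f(float x) { return x * x; }

    /* Each thread computes the area of one rectangle in this rank's slice. */
    __global__ void rect_areas(float *area, float x0, float dx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) area[i] = f(x0 + i * dx) * dx;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* cudaSetDevice() as in the device-selection sketch would go here. */

        const float a = 0.0f, b = 1.0f;
        const int n = 1 << 20;                 /* rectangles per process */
        float width = (b - a) / size;          /* this rank's slice of [a,b] */
        float x0 = a + rank * width;
        float dx = width / n;

        /* GPU fills in the rectangle areas for this slice. */
        float *d_area, *h_area = (float *)malloc(n * sizeof(float));
        cudaMalloc((void **)&d_area, n * sizeof(float));
        rect_areas<<<(n + 255) / 256, 256>>>(d_area, x0, dx, n);
        cudaMemcpy(h_area, d_area, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* CPU sums its own rectangles... */
        double subtotal = 0.0;
        for (int i = 0; i < n; i++) subtotal += h_area[i];

        /* ...and MPI combines the subtotals into the complete area. */
        double total = 0.0;
        MPI_Reduce(&subtotal, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("area = %f\n", total);

        cudaFree(d_area); free(h_area);
        MPI_Finalize();
        return 0;
    }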