Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing
on Modern GPU-enabled Systems

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh,
Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion
Drivers of Modern HPC Cluster Architectures

• Multi-core Processors: multi-core processors are ubiquitous
• High-Performance Interconnects (InfiniBand): <1 us latency, >100 Gbps bandwidth; very popular in HPC clusters
• Accelerators/Coprocessors: high compute density, high performance/watt, >1 TFlop/s double precision on a chip; becoming common in high-end systems
• Pushing the envelope for Exascale computing

[Figure: example systems Tianhe-2, Titan, Stampede, Tianhe-1A]
Accelerators in HPC Systems

• Growth of accelerator-enabled clusters in the last 3 years
  – 22% of Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
  – From the Top500 list (http://www.top500.org)
[Chart: System count of accelerator-enabled Top500 clusters from June 2013 to Nov 2015, broken down by NVIDIA Kepler, NVIDIA Fermi, and Intel Xeon Phi]
Motivation
• Parallel applications on GPU clusters
– CUDA (Compute Unified Device Architecture):
• Kernel computation on NVIDIA GPUs
– CUDA-Aware MPI (Message Passing Interface):
• Communications across processes/nodes
• Non-blocking communication to overlap with CUDA
kernels
MPI_Isend(Buf1, ...,request1);
MPI_Isend(Buf2, ...,request2);
/* Independent computations on CPU/GPU */
MPI_Wait (request1, status1);
MPI_Wait (request2, status2);
Motivation
• Use of non-contiguous data becoming common
– Easy to represent complex data structure
• MPI Datatypes
– E.g., fluid dynamics, image processing…
• What if the data are on GPU memory?
  1. Copy data to the CPU to perform the packing/unpacking
     • Slower for large messages
     • Data movements between GPU and CPU are expensive
  2. Utilize a GPU kernel to perform the packing/unpacking* (see the sketch below)
     • No explicit copies, faster for large messages
* R. Shi et al., "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," in 43rd ICPP, Sept. 2014, pp. 221–230.
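To make the kernel-based approach concrete, here is a minimal, hypothetical sketch of a CUDA packing kernel for a strided (vector-like) layout; the kernel and variable names are illustrative and are not the HAND implementation.

/* Minimal sketch (hypothetical names): pack a strided column of doubles on
 * the GPU so no intermediate copy through host memory is needed. */
__global__ void pack_vector(const double *src, double *packed,
                            size_t count, size_t stride)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        packed[i] = src[i * stride];   /* gather strided elements contiguously */
}

/* Launched on a dedicated stream so packing can overlap other work:
 * pack_vector<<<(count + 255) / 256, 256, 0, stream>>>(d_src, d_packed,
 *                                                      count, stride);      */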
Motivation – Non-Contiguous Data Movement in MPI

• Common scenario: non-blocking sends of non-contiguous MPI datatypes*, application work on the CPU/GPU, then MPI_Waitall
• Waste of computing resources on CPU and GPU

MPI_Isend(Buf1, ..., req1);
MPI_Isend(Buf2, ..., req2);
/* Application work on the CPU/GPU */
MPI_Waitall(req, ...);

* Buf1, Buf2, … contain non-contiguous MPI datatypes
Problem Statement

• Low overlap between CPU and GPU for applications
  – Packing/unpacking operations are serialized
  – GPU threads remain idle for most of the time
  – Low utilization, low efficiency
• CPU/GPU resources are not fully utilized
• Can we have designs that leverage new GPU technology to address these issues?

[Radar chart comparing User Naive, User Advanced, and Proposed approaches along four axes: Productivity, Overlap, Performance, and Resource Utilization; farther from the center is better]
Goals of this Work

• Propose new designs that leverage new NVIDIA GPU technologies
  Ø Hyper-Q technology (multi-streaming)
  Ø CUDA Events and Callbacks
• Achieving
  Ø High performance and resource utilization for applications
  Ø High productivity for developers
Outline
• Introduction
• Proposed Designs
  – Event-based
  – Callback-based
• Performance Evaluation
• Conclusion
Overview

[Timeline figure comparing the existing and proposed designs for three MPI_Isend operations on non-contiguous GPU data:
• Existing design: for each Isend, the CPU initiates the packing kernel, waits for the kernel (WFK), and then starts the send; kernels and sends are serialized, and the CPU then blocks in Wait.
• Proposed design: the CPU initiates all three packing kernels back-to-back on separate streams, the kernels run concurrently on the GPU, and WFK/Start Send are overlapped with progress.
• Expected benefit: the proposed design finishes earlier than the existing design.]
Event-based Design

• CUDA Event Management
  – Provides a mechanism to signal when tasks have occurred in a CUDA stream
• Basic design idea
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU creates a CUDA event and then returns immediately
     • GPU sets the status to 'completed' when the kernel is completed
  3. In MPI_Wait/MPI_Waitall:
     • CPU queries the events when the packed/unpacked data is required for communication (see the sketch below)
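A minimal sketch of how this event-based flow could look inside the library, assuming hypothetical helper names (gpu_req_t, isend_gpu_datatype, progress_gpu_request); only the CUDA event calls are the real API.

#include <cuda_runtime.h>

/* Hypothetical per-request state: the recorded event plus send metadata. */
typedef struct { cudaEvent_t ev; /* plus packed buffer, peer, tag, ... */ } gpu_req_t;

void isend_gpu_datatype(gpu_req_t *req, cudaStream_t stream)
{
    /* 1. Launch the packing kernel asynchronously on its own stream:
     *    pack_kernel<<<grid, block, 0, stream>>>(d_src, d_packed, ...);     */

    /* 2. Record an event right after the kernel and return immediately.     */
    cudaEventCreateWithFlags(&req->ev, cudaEventDisableTiming);
    cudaEventRecord(req->ev, stream);
}

int progress_gpu_request(gpu_req_t *req)
{
    /* 3. Called from MPI_Wait/MPI_Waitall: query instead of synchronizing.  */
    if (cudaEventQuery(req->ev) == cudaSuccess) {
        /* Packed data is ready; hand the buffer to the network send path.   */
        return 1;
    }
    return 0;   /* not ready yet; keep progressing other requests */
}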
Event-based Design

[Sequence diagram across GPU, CPU, and HCA: the CPU launches pack_kernel1/2/3 on the GPU; each MPI_Isend records a CUDA event (cudaEventRecord) and returns; in MPI_Waitall the CPU queries the events and progresses, issuing each send through the HCA as its packing kernel completes; requests complete after send completion.]
Event-basedDesign
• Major benefits
  – Overlap between CPU communication and GPU packing kernel
  – GPU resources are highly utilized
• Limitation
  – CPU is required to keep checking the status of the event
    • Lower CPU utilization
MPI_Isend(Buf1, ...,request1);
MPI_Isend(Buf2, ...,request2);
MPI_Wait (request1, status1);
MPI_Wait (request2, status2);
Callback-basedDesign
• CUDA Stream Callback
  – Launches work automatically on the CPU when preceding work on the CUDA stream has completed
  – Restrictions:
    • Callbacks are processed by a driver thread, where no CUDA APIs can be called
    • Overhead when initializing the callback function
• Basic design idea (sketched below)
  1. CPU launches a CUDA packing/unpacking kernel
  2. CPU adds a callback function and then returns immediately
  3. The callback function wakes up a helper thread to process the communication
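A minimal sketch of the callback-based flow, assuming hypothetical helper names (pack_done_cb, isend_gpu_datatype_cb, helper_thread); cudaStreamAddCallback and the POSIX semaphore calls are real APIs, everything else is illustrative.

#include <cuda_runtime.h>
#include <semaphore.h>

/* The callback runs on a CUDA driver thread where no CUDA calls are allowed,
 * so it only signals a helper thread that then drives the MPI send.          */
static sem_t send_ready;               /* sem_init(&send_ready, 0, 0) at init */

static void CUDART_CB pack_done_cb(cudaStream_t stream, cudaError_t status,
                                   void *user_data)
{
    (void)stream; (void)status; (void)user_data;
    sem_post(&send_ready);             /* wake the helper; no CUDA APIs here  */
}

void isend_gpu_datatype_cb(void *request, cudaStream_t stream)
{
    /* 1. Launch the packing kernel asynchronously:
     *    pack_kernel<<<grid, block, 0, stream>>>(...);                       */

    /* 2. Enqueue the callback behind it and return immediately.              */
    cudaStreamAddCallback(stream, pack_done_cb, request, 0);
}

void *helper_thread(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&send_ready);         /* 3. Woken once packing completes     */
        /* Start the network send for the corresponding request here.         */
    }
    return NULL;
}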
Callback-based Design

[Sequence diagram across the main CPU thread, a helper thread, the GPU, and the HCA: the main thread launches pack_kernel1/2/3, each MPI_Isend adds a callback (addCallback) and returns, and the main thread proceeds with CPU computations; as each packing kernel completes, its callback wakes the helper thread, which issues the send through the HCA; requests complete after send completion, observed in MPI_Waitall.]
Callback-basedDesign
• Major benefits
  – Overlap between CPU communication and GPU packing kernel
  – Overlap between CPU communication and other computations
  – Higher CPU and GPU utilization
MPI_Isend(Buf1, ..., &requests[0]);
MPI_Isend(Buf2, ..., &requests[1]);
MPI_Isend(Buf3, ..., &requests[2]);
// Application work on the CPU
MPI_Waitall(3, requests, statuses);
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – Benchmark
  – Halo Exchange-based Application Kernel
• Conclusion
Overview of the MVAPICH2 Project

• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI+PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,575 organizations in 80 countries
  – More than 376,000 (0.37 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th ranked 519,640-core cluster (Stampede) at TACC
    • 13th ranked 185,344-core cluster (Pleiades) at NASA
    • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
Experimental Environments

1. Wilkes cluster @ University of Cambridge
   – 2 NVIDIA K20c GPUs per node
   – Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
   – Cray CS-Storm system
   – 8 NVIDIA K80 GPUs per node
   – Up to 96 GPUs over 12 nodes
Benchmark-level Evaluation - Performance

• Modified 'CUDA-Aware' DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)
[Chart: Normalized execution time (lower is better) for Default, Event-based, and Callback-based designs across input sizes; improvements of 1.5X for NAS_MG_y, 2.6X for SPECFEM3D_OC, 3.4X for WRF_sa, and 2.7X for SPECFEM3D_CM]
Benchmark-level Evaluation - Overlap

• Modified 'CUDA-Aware' DDTBench for NAS_MG_y test
  – Injected dummy computations
MPI_Isend(Buf1, ..., &requests[0]);
MPI_Isend(Buf2, ..., &requests[1]);
MPI_Isend(Buf3, ..., &requests[2]);
Dummy_comp(); // Application work on the CPU
MPI_Waitall(3, requests, statuses);
[Chart: Achieved overlap (%) across input sizes for Default, Event-based, and Callback-based designs; higher is better]
Application-level Evaluation - Halo Data Exchange

• MeteoSwiss weather forecasting COSMO* application kernel @ CSCS cluster
• Multi-dimensional data
  – Contiguous on one dimension
  – Non-contiguous on other dimensions
• Halo data exchange (see the datatype sketch below)
  – Duplicate the boundary
  – Exchange the boundary
*http://www.cosmo-model.org/
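A minimal sketch of how one non-contiguous halo face could be described with an MPI vector datatype; the nx/ny/nz dimensions and the helper name are hypothetical and not taken from the COSMO kernel.

#include <mpi.h>

/* Hypothetical: a y-z face of an nx x ny x nz field (x fastest-varying) is
 * ny*nz single elements separated by a stride of nx elements. A CUDA-aware
 * MPI can then pack this layout with a GPU kernel, as in the proposed designs. */
void build_halo_face_type(int nx, int ny, int nz, MPI_Datatype *halo_face)
{
    MPI_Type_vector(ny * nz, 1, nx, MPI_DOUBLE, halo_face);
    MPI_Type_commit(halo_face);
}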
Application-level (Halo Exchange) Evaluation
MPI_Isend(Buf1, ...,request1);
MPI_Isend(Buf2, ...,request2);
// Computations on GPU
MPI_Wait (request1, status1);
MPI_Wait (request2, status2);
[Charts: Normalized execution time (lower is better) for Default, Event-based, and Callback-based designs. Wilkes GPU cluster: 4 to 32 GPUs, up to 2X improvement. CSCS GPU cluster: 16 to 96 GPUs, up to 1.6X improvement.]
Conclusion
• Proposed designs can improve the overall
performance and utilization of CPU as well as GPU
– Event-based design: Overlapping CPU communication with
GPU computation
– Callback-based design: Further overlapping with CPU
computation
• Future Work
– Non-blocking collective operations
– Contiguous data movements
– Next generation GPU architectures
– Will be available in the MVAPICH2-GDR library
Thank You!
Ching-HsiangChu
[email protected]
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/
Motivation – NVIDIA GPU Feature

• NVIDIA CUDA Hyper-Q (multi-stream) technology
  – Allows multiple CPU threads/processes to launch kernels on a single GPU simultaneously (see the sketch below)
  – Increases GPU utilization and reduces CPU idle times

[Figure: http://www.hpc.co.jp/images/hyper-q.png]
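A minimal sketch of using multiple CUDA streams so that independent packing kernels can execute concurrently under Hyper-Q; pack_kernel and the launch parameters are placeholders.

#include <cuda_runtime.h>

#define NUM_MSGS 3

void launch_concurrent_packing(void)
{
    cudaStream_t streams[NUM_MSGS];

    for (int i = 0; i < NUM_MSGS; i++) {
        cudaStreamCreate(&streams[i]);
        /* One packing kernel per message, each on its own stream:
         * pack_kernel<<<grid, block, 0, streams[i]>>>(...);                 */
    }

    for (int i = 0; i < NUM_MSGS; i++) {
        cudaStreamSynchronize(streams[i]);   /* or query/callback as above   */
        cudaStreamDestroy(streams[i]);
    }
}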
Motivation – Non-Contiguous Data Movement in MPI

/* Manual packing/unpacking */
sbuf = malloc(...); rbuf = malloc(...);
/* Packing */
for (i = 1; i < n; i += 2)
    sbuf[i] = matrix[i][0];
MPI_Send(sbuf, n, MPI_DOUBLE, ...);
MPI_Recv(rbuf, n, MPI_DOUBLE, ...);
/* Unpacking */
for (i = 1; i < n; i += 2)
    matrix[i][0] = rbuf[i];
free(sbuf); free(rbuf);
/* Using MPI Datatypes */
MPI_Datatype nt;
MPI_Type_vector(n, 1, n, MPI_DOUBLE, &nt);
MPI_Type_commit(&nt);
MPI_Send(matrix, 1, nt, ...);
MPI_Recv(matrix, 1, nt, ...);

Using MPI Datatypes
• No explicit copies in applications
  Ø Better performance
• Less code
  Ø Higher productivity