Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems

Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Dip S. Banerjee, Hari Subramoni and Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
IPDPS 2016

Outline
• Introduction
• Proposed Designs
• Performance Evaluation
• Conclusion

Drivers of Modern HPC Cluster Architectures
• Multi-core processors are ubiquitous
• High-performance interconnects: InfiniBand is very popular in HPC clusters (< 1 us latency, > 100 Gbps bandwidth)
• Accelerators/coprocessors are becoming common in high-end systems (high compute density, high performance per watt, > 1 TFlop/s double precision on a chip)
• Together they are pushing the envelope for Exascale computing
• Representative systems: Tianhe-2, Titan, Stampede, Tianhe-1A

Accelerators in HPC Systems
• Growth of accelerator-enabled clusters in the last 3 years
  – 22% of the Top 50 clusters are boosted by NVIDIA GPUs in Nov '15
  – From the Top500 list (http://www.top500.org)
• [Figure: system counts of accelerator-enabled Top500 systems from June 2013 to Nov 2015, broken down by NVIDIA Kepler, NVIDIA Fermi, and Intel Xeon Phi]

Motivation
• Parallel applications on GPU clusters
  – CUDA (Compute Unified Device Architecture): kernel computation on NVIDIA GPUs
  – CUDA-aware MPI (Message Passing Interface): communication across processes/nodes
    • Non-blocking communication to overlap with CUDA kernels

    MPI_Isend(Buf1, ..., &request1);
    MPI_Isend(Buf2, ..., &request2);
    /* Independent computations on CPU/GPU */
    MPI_Wait(&request1, &status1);
    MPI_Wait(&request2, &status2);

Motivation
• Use of non-contiguous data is becoming common
  – MPI Datatypes make it easy to represent complex data structures
  – E.g., fluid dynamics, image processing, ...
• What if the data are in GPU memory?
  1. Copy the data to the CPU to perform the packing/unpacking
     • Slower for large messages; data movement between GPU and CPU is expensive
  2. Utilize a GPU kernel to perform the packing/unpacking*
     • No explicit copies; faster for large messages

*R. Shi et al., "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters," in 43rd ICPP, Sept 2014, pp. 221–230.

Motivation – Non-Contiguous Data Movement in MPI
• Common scenario (Buf1, Buf2, ... contain a non-contiguous MPI Datatype):

    MPI_Isend(Buf1, ..., &req1);
    MPI_Isend(Buf2, ..., &req2);
    /* Application work on the CPU/GPU */
    MPI_Waitall(req, ...);

• Waste of computing resources on CPU and GPU
• [Figure: timeline showing packing/unpacking and communication serialized with the application work]
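To make the kernel-based packing option above concrete, the following is a minimal sketch of a GPU packing kernel for a strided (vector-like) layout: every stride-th double is gathered into a contiguous staging buffer that the network can send directly. The kernel and helper names are illustrative assumptions, not the actual HAND/MVAPICH2-GDR implementation.

    #include <cuda_runtime.h>

    /* Packing-kernel sketch (illustrative, not the library code): gather
     * every 'stride'-th double from a non-contiguous source buffer in GPU
     * memory into a contiguous staging buffer for communication. */
    __global__ void pack_vector(const double *src, double *dst,
                                size_t count, size_t stride)
    {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < count)
            dst[i] = src[i * stride];
    }

    /* Launch on a caller-supplied stream so packing can overlap with other work. */
    void launch_pack(const double *src, double *dst, size_t count,
                     size_t stride, cudaStream_t stream)
    {
        int threads = 256;
        int blocks  = (int)((count + threads - 1) / threads);
        pack_vector<<<blocks, threads, 0, stream>>>(src, dst, count, stride);
    }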
Problem Statement
• Low overlap between CPU and GPU for applications
  – Packing/unpacking operations are serialized
  – GPU threads remain idle for most of the time
  – Low utilization, low efficiency
• CPU/GPU resources are not fully utilized
• [Figure: radar chart comparing "User Naive", "User Advanced", and the "Proposed" designs along Productivity, Overlap, Performance, and Resource Utilization axes; farther from the center is better]
• Can we have designs that leverage new GPU technology to address these issues?

Goals of this Work
• Propose new designs that leverage new NVIDIA GPU technologies
  Ø Hyper-Q technology (multi-streaming)
  Ø CUDA Event and Callback
• Achieve
  Ø High performance and resource utilization for applications
  Ø High productivity for developers

Outline
• Introduction
• Proposed Designs
  – Event-based
  – Callback-based
• Performance Evaluation
• Conclusion

Overview
• [Figure: timeline comparison of the existing and proposed designs. In the existing design, each MPI_Isend initiates a packing kernel on a stream, waits for the kernel (WFK), and only then starts the send, so kernels and sends are serialized. In the proposed design, the MPI_Isend calls initiate their kernels back to back, and the WFK/Start-Send steps are handled during progress, so kernels on different streams overlap. Expected benefit: the proposed design finishes earlier than the existing one.]

Event-based Design
• CUDA Event Management
  – Provides a mechanism to signal when tasks have occurred in a CUDA stream
• Basic design idea
  1. The CPU launches a CUDA packing/unpacking kernel
  2. The CPU creates a CUDA event and then returns immediately
     • The GPU sets the status to 'completed' when the kernel is completed
  3. In MPI_Wait/MPI_Waitall:
     • The CPU queries the events when the packed/unpacked data is required for communication

Event-based Design
• [Figure: sequence diagram across CPU, GPU, and HCA. Each MPI_Isend launches pack_kernel<<< >>> and records an event (cudaEventRecord). MPI_Waitall queries the events and progresses communication, starting each send on the HCA once the corresponding packing kernel has completed; the request completes after the send completion.]

Event-based Design
• Major benefits
  – Overlap between CPU communication and the GPU packing kernel
  – GPU resources are highly utilized
• Limitation
  – The CPU is required to keep checking the status of the event
    • Lower CPU utilization

    MPI_Isend(Buf1, ..., &request1);
    MPI_Isend(Buf2, ..., &request2);
    MPI_Wait(&request1, &status1);
    MPI_Wait(&request2, &status2);
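A minimal sketch of this event-based pattern, assuming illustrative function names rather than the library's internal API: an event is recorded on the packing stream at MPI_Isend time, and the wait/progress path polls it non-blockingly before handing the packed buffer to the network.

    #include <cuda_runtime.h>

    /* Sketch (assumption, not the MVAPICH2-GDR internals): record an event
     * right after the packing kernel has been enqueued on 'stream', so the
     * progress engine can later test for completion without blocking. */
    void record_pack_event(cudaStream_t stream, cudaEvent_t *ev)
    {
        cudaEventCreateWithFlags(ev, cudaEventDisableTiming);
        cudaEventRecord(*ev, stream);   /* fires once the packing kernel finishes */
    }

    /* Called from the MPI_Wait/MPI_Waitall progress path: returns 1 when the
     * packed data is ready to be sent, 0 if the kernel is still running. */
    int pack_done(cudaEvent_t ev)
    {
        return cudaEventQuery(ev) == cudaSuccess;   /* non-blocking poll */
    }

The repeated cudaEventQuery polling in the progress loop is exactly the limitation noted above: the CPU spends cycles checking the event, which motivates the callback-based design that follows.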
Callback-based Design
• CUDA Stream Callback
  – Launches work on the CPU automatically when preceding work on the CUDA stream has completed
  – Restrictions:
    • Callbacks are processed by a driver thread, where no CUDA APIs can be called
    • Overhead when initializing the callback function
• Basic design idea
  1. The CPU launches a CUDA packing/unpacking kernel
  2. The CPU adds a callback function and then returns immediately
  3. The callback function wakes up a helper thread to process the communication

Callback-based Design
• [Figure: sequence diagram across the main CPU thread, a helper thread, the GPU, and the HCA. Each MPI_Isend launches pack_kernel<<< >>> and adds a callback (addCallback). When a packing kernel completes, its callback fires and the helper thread starts the corresponding send on the HCA while the main thread continues with CPU computations; MPI_Waitall completes the requests after the send completions.]

Callback-based Design
• Major benefits
  – Overlap between CPU communication and the GPU packing kernel
  – Overlap between CPU communication and other computations
  – Higher CPU and GPU utilization

    MPI_Isend(Buf1, ..., &requests[0]);
    MPI_Isend(Buf2, ..., &requests[1]);
    MPI_Isend(Buf3, ..., &requests[2]);
    /* Application work on the CPU */
    MPI_Waitall(3, requests, status);
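A minimal sketch of the callback path, under the stated restriction that no CUDA calls may be issued from the driver thread: the callback only signals a helper thread, which then takes over the communication. The names (pack_state_t, start_send) are hypothetical placeholders, not the library's API.

    #include <cuda_runtime.h>
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  ready;
        int             packed;   /* set once the packing kernel has finished */
    } pack_state_t;

    /* Runs on the CUDA driver thread: only flip a flag and wake the helper,
     * since no CUDA (or blocking MPI) calls are allowed here. */
    static void CUDART_CB pack_done_cb(cudaStream_t stream, cudaError_t status,
                                       void *arg)
    {
        pack_state_t *st = (pack_state_t *)arg;
        pthread_mutex_lock(&st->lock);
        st->packed = 1;
        pthread_cond_signal(&st->ready);
        pthread_mutex_unlock(&st->lock);
    }

    /* Helper thread: sleeps until the callback signals, then starts the send
     * (start_send() stands in for the real network path). */
    static void *helper_thread(void *arg)
    {
        pack_state_t *st = (pack_state_t *)arg;
        pthread_mutex_lock(&st->lock);
        while (!st->packed)
            pthread_cond_wait(&st->ready, &st->lock);
        pthread_mutex_unlock(&st->lock);
        /* start_send(...); */
        return NULL;
    }

    /* After the packing kernel has been enqueued on 'stream':
     *     cudaStreamAddCallback(stream, pack_done_cb, &state, 0);          */

Because the main thread never has to poll, it stays free for application work between the MPI_Isend calls and MPI_Waitall, which is the additional overlap the callback-based design provides.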
Outline
• Introduction
• Proposed Designs
• Performance Evaluation
  – Benchmark
  – Halo Exchange-based Application Kernel
• Conclusion

Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,575 organizations in 80 countries
  – More than 376,000 (0.37 million) downloads directly from the OSU site
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

Experimental Environments
1. Wilkes cluster @ University of Cambridge
   – 2 NVIDIA K20c GPUs per node
   – Up to 32 GPU nodes
2. CSCS cluster @ Swiss National Supercomputing Centre
   – Cray CS-Storm system
   – 8 NVIDIA K80 GPUs per node
   – Up to 96 GPUs over 12 nodes

Benchmark-level Evaluation – Performance
• Modified 'CUDA-aware' DDTBench (http://htor.inf.ethz.ch/research/datatypes/ddtbench/)
• [Figure: normalized execution time (lower is better) versus input size for the Default, Event-based, and Callback-based designs on four tests: NAS_MG_y (up to 1.5X improvement), SPECFEM3D_OC (2.6X), WRF_sa (3.4X), and SPECFEM3D_CM (2.7X)]

Benchmark-level Evaluation – Overlap
• Modified 'CUDA-aware' DDTBench for the NAS_MG_y test
  – Injected dummy computations:

    MPI_Isend(Buf1, ..., &requests[0]);
    MPI_Isend(Buf2, ..., &requests[1]);
    MPI_Isend(Buf3, ..., &requests[2]);
    Dummy_comp();   /* Application work on the CPU */
    MPI_Waitall(3, requests, status);

• [Figure: achieved overlap percentage (higher is better) versus input size for the Default, Event-based, and Callback-based designs]

Application-level Evaluation – Halo Data Exchange
• MeteoSwiss weather-forecasting COSMO* application kernel @ CSCS cluster
• Multi-dimensional data
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary
*http://www.cosmo-model.org/

Application-level (Halo Exchange) Evaluation

    MPI_Isend(Buf1, ..., &request1);
    MPI_Isend(Buf2, ..., &request2);
    /* Computations on GPU */
    MPI_Wait(&request1, &status1);
    MPI_Wait(&request2, &status2);

• [Figure: normalized execution time (lower is better) for the Default, Callback-based, and Event-based designs. On the Wilkes GPU cluster (4-32 GPUs), up to 2X improvement; on the CSCS GPU cluster (16-96 GPUs), up to 1.6X improvement.]

Conclusion
• The proposed designs improve the overall performance and the utilization of the CPU as well as the GPU
  – Event-based design: overlaps CPU communication with GPU computation
  – Callback-based design: additionally overlaps with CPU computation
• Future work
  – Non-blocking collective operations
  – Contiguous data movements
  – Next-generation GPU architectures
  – The designs will be available in the MVAPICH2-GDR library

Thank You!
Ching-Hsiang Chu, [email protected]
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/

Motivation – NVIDIA GPU Feature
• NVIDIA CUDA Hyper-Q (multi-stream) technology
  – Allows multiple CPU threads/processes to launch kernels on a single GPU simultaneously
  – Increases GPU utilization and reduces CPU idle time
• [Image: http://www.hpc.co.jp/images/hyper-q.png]

Motivation – Non-Contiguous Data Movement in MPI
• Manual packing/unpacking, e.g., exchanging one column of an n x n matrix:

    sbuf = malloc(...); rbuf = malloc(...);
    /* Packing */
    for (i = 0; i < n; i++)
        sbuf[i] = matrix[i][0];
    MPI_Send(sbuf, n, MPI_DOUBLE, ...);
    MPI_Recv(rbuf, n, MPI_DOUBLE, ...);
    /* Unpacking */
    for (i = 0; i < n; i++)
        matrix[i][0] = rbuf[i];
    free(sbuf); free(rbuf);

• The same exchange using MPI Datatypes:

    MPI_Datatype nt;
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &nt);
    MPI_Type_commit(&nt);
    MPI_Send(matrix, 1, nt, ...);
    MPI_Recv(matrix, 1, nt, ...);

• Using MPI Datatypes in applications
  – No explicit copies -> better performance
  – Less code -> higher productivity
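To tie the Hyper-Q feature from the backup slide back to the proposed designs, here is a minimal sketch of the multi-stream launch pattern: independent packing kernels are issued on separate non-blocking streams so Hyper-Q can execute them concurrently. It reuses the illustrative pack_vector kernel from the earlier sketch and is an assumption-laden illustration, not the MVAPICH2-GDR implementation.

    #include <cuda_runtime.h>

    /* From the earlier packing sketch (illustrative). */
    __global__ void pack_vector(const double *src, double *dst,
                                size_t count, size_t stride);

    #define NUM_STREAMS 3

    /* Sketch: one non-blocking stream per pending Isend, so independent
     * packing kernels can overlap on the GPU via Hyper-Q. */
    void pack_on_streams(const double *src[], double *dst[],
                         size_t count, size_t stride)
    {
        cudaStream_t streams[NUM_STREAMS];
        int threads = 256;
        int blocks  = (int)((count + threads - 1) / threads);

        for (int s = 0; s < NUM_STREAMS; s++) {
            cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking);
            pack_vector<<<blocks, threads, 0, streams[s]>>>(src[s], dst[s],
                                                            count, stride);
        }
        for (int s = 0; s < NUM_STREAMS; s++) {
            /* In the proposed designs, completion would be detected with an
             * event or a stream callback instead of a blocking synchronize. */
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
        }
    }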