LLVM-based dynamic dataflow compilation for heterogeneous targets

V. Ducrot, K. Juilly, S. Monot, G. Bayle Des Courchamps (AS+ Groupe Eolen)
T. Goubier (CEA List/DACLE/LCE)
Benoit Da Mota (Angers University)

Context: the MACH Project
  [Diagram labels: Methods and algorithms for metagenomics; R (statistics DSL); LLVM IRVec; Frontend R to IRVec + LLVM compiler infrastructure; Frontend IR to LLVM; Heterogeneous HPC-aware frontend R to LLVM; Multiplatform binaries; Acceleration on heterogeneous targets]

R: the dominant language for statistical analysis
  • Used by everyone, everywhere
  • Fast to use (easy scripting)
  • Slow to use (with large datasets)

MACH: DSeLs for heterogeneous computing
  • R is a DSL (statistics)
  • R can be used to target accelerated heterogeneous computing

R in MACH
  • Extract/transform data parallelism in R scripts
    • In an R front-end
  • Specify it to target:
    • GPUs (Nvidia/AMD)
    • CPU accelerators (Intel MIC)

Compilation + runtime toolchain
  • Complex system
    • Toolchain to simplify programming
  • Task management
    • Automated task extraction from the code
    • Non-trivial algorithmics
  • Automated insertion of runtime control functions
  • Multi-target implementation
  • Constraints on data structures to simplify analysis and give better performance

Three-stage compilation system
  • Frontend
    • Goes from R to middle-end IR
  • Middle end
    • Split for multi-target management
    • Re-express code as standard LLVM adapted to the target
  • Backend
    • Standard LLVM passes and backend
    • A specific pass to insert runtime management calls

Dataflow runtime
  • Parallelism is expressed as tasks and data dependencies
  • Easy to generate parallelism from the compiler
  • Execution is out-of-order with sequential consistency guarantees
  • Efficient
  • Hard to debug
  • Natural auto-tuning application
  • Memory needs to be managed

Managed memory
  Managed memory
  • A data-driven execution model
  • Unified view on memory
  Induced constraints
  • Referenced memory
  • No pointer arithmetic
  • No globals
  • Library calls must be wrapped (thread safety)

Runtime insertion at middle-end level
  • Easier manipulation of multiple implementations
  • Simplified frontend by removing most of the runtime knowledge from it
  • Simplified way to add hardware-specific analysis by leveraging the LLVM infrastructure
  • Target runtime is currently StarPU from Inria Bordeaux:
    http://starpu.gforge.inria.fr

Compilation: middle end and backend
  [Diagram: Middle-end IR + annotations feed one specialization pipeline per target, plus a parallelizer]
  • Specialization X86_64 → LLVM + Optimizer → LLVM X86_64 ISA → Binary
  • Specialization Xeon Phi → LLVM + Optimizer → LLVM Xeon Phi ISA → Binary
  • Specialization Nvidia GPU → LLVM + Optimizer → LLVM PTX ISA → Binary
  • Parallelizer: tasks graph, data transformers, library calls → equivalent in chosen runtime
  • Result: heterogeneous application

Middle-end IR
  • Built on top of the existing LLVM IR
  • Adds support for arbitrary length vectors
  • Adds support for managed containers
  • Adds intent markers on function (task) declarations
  • Adds task declaration/submit markers
  • Adds intrinsic vector operations

Middle-end IR: arbitrary length vectors
  • Arbitrary length vectors (ALV)
    • Marked as 0-length in the IR
    • Managed-data-specific load/store operations use them (effective sizes are derived from the managed data at runtime)

        %f0v = call <0 x float> (%nd_array_float_t*)* @ndarray.load.float(%nd_array_float_t* %f0)
        call void @ndarray.store.float(%nd_array_float_t* %u1, <0 x float> %u1v)

  • Masking intrinsics

        %mr = call {}* @llvm.mach.mask.activate.v0i1(<0 x i1> %alltrue)
        %merge2 = call <0 x i32> @llvm.mach.mask.merge.v0i32({}* %mr, <0 x i32> %r, <0 x i32> %alvizero)
        call … @…({}* %mr)

  • Reduce/scan intrinsics

        %v3 = call <0 x float> @llvm.mach.alv.reduce.max.v0f32(<0 x float> %v2)

  • All classical vector operations are supported on ALV

Middle-end IR: managed data containers
  • ND-arrays
    • Python-like ND-arrays as standard containers for tables
    • Views support
    • Manipulation functions for copy, extraction…
  • Raw data
    • Managed segment of memory without an attached layout
    • Tasks that need to use raw data cannot be written with arbitrary length vectors
  • All data containers also provide functions for accessing them outside the runtime.
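To make the lane-wise behaviour of the ALV masking and reduce intrinsics listed above more concrete, here is a minimal Python model. It is only a sketch under stated assumptions: the function names `mask_merge` and `reduce_max` are hypothetical, the merge is assumed to select the first operand in lanes where the mask is true and the second operand elsewhere, and the reduction is modelled as returning a scalar. None of this is taken from the MACH implementation; the slides do not spell out the exact semantics.

```python
from typing import List, Sequence


def mask_merge(mask: Sequence[bool],
               active: Sequence[int],
               inactive: Sequence[int]) -> List[int]:
    """Assumed semantics of a masked merge on arbitrary length vectors.

    Lane i takes active[i] where mask[i] is true and inactive[i] otherwise.
    The vector length is not fixed in advance; it is only checked at runtime,
    mirroring the "0-length in the IR, effective size at runtime" idea.
    """
    if not (len(mask) == len(active) == len(inactive)):
        raise ValueError("all operands must have the same runtime length")
    return [a if m else b for m, a, b in zip(mask, active, inactive)]


def reduce_max(values: Sequence[float]) -> float:
    """Assumed semantics of a max reduction: the largest lane value."""
    if len(values) == 0:
        raise ValueError("cannot reduce an empty vector")
    return max(values)


if __name__ == "__main__":
    r = [7, 8, 9, 10]
    zeros = [0, 0, 0, 0]
    print(mask_merge([True, False, True, False], r, zeros))
    print(reduce_max([1.5, 4.0, 2.0]))
```

The same two functions work for any vector length, which is the point of marking the vectors as 0-length in the IR: the code is written once and the effective size comes from the managed data at runtime.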

Middle-end IR: task management
  • Metadata for marking task calls
  • Metadata for expressing patterns on task implementations
    • ufunc
    • rfunc
    • scan
  • Intents on managed data (read, write, scratch…)
    • Generated by an analysis pass

IR specializing passes
  • Task specializing
    • Architecture-dependent rewriting of middle-end IR to IR
    • Outputs standard LLVM IR adapted to a given target
  • Workflow management
    • Takes the code with calls marked as tasks
    • Replaces calls by task preparation and submission
  • Multi-implementation management
    • Creates initialization/finalization calls to the runtime referencing each specialized implementation

Application and performance tuning
  • The runtime supports multiple implementations for a given task on a given hardware
  • Our pass generates multiple implementations
  • The runtime chooses the best implementation according to the data sizes

Performance and results
  We compared the execution time of benchmarks implemented in C with the same benchmarks implemented in middle-end IR.

  Code                GCC 4.9   icc 13   clang 3.6   IR version
  Jacobi                28.71    31.38       41.9         29.72
  Lattice Boltzmann     59.63    71.10       74.64        59.43

Conclusion
  • We proposed an infrastructure to compile heterogeneous programs on a dataflow runtime
  • The middle-end IR enables us to compile for multiple targets at reasonable performance
  • Porting to a new target doesn't change the frontend
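
As an illustration of the task model summarised in these slides (calls replaced by task submission, intents on managed data, out-of-order execution with sequential consistency), the following Python sketch infers dependencies from read/write intents in submission order. It is a toy model under stated assumptions, not the StarPU API and not the MACH passes: the `Runtime` class, the `submit` and `run` methods, and the intent strings are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

READ, WRITE = "read", "write"  # hypothetical intent markers


@dataclass
class Task:
    name: str
    func: Callable[[dict], None]
    accesses: List[Tuple[str, str]]          # (data handle name, intent)
    deps: Set[int] = field(default_factory=set)


class Runtime:
    """Toy dataflow runtime: dependencies come from intents, not from the caller."""

    def __init__(self) -> None:
        self.tasks: List[Task] = []
        self._last_writer: Dict[str, int] = {}
        self._readers: Dict[str, List[int]] = {}

    def submit(self, name, func, accesses) -> int:
        idx = len(self.tasks)
        task = Task(name, func, list(accesses))
        for handle, intent in task.accesses:
            writer = self._last_writer.get(handle)
            if writer is not None:
                task.deps.add(writer)                             # read/write after write
            if intent == WRITE:
                task.deps.update(self._readers.get(handle, []))   # write after read
                self._last_writer[handle] = idx
                self._readers[handle] = []
            else:
                self._readers.setdefault(handle, []).append(idx)
        task.deps.discard(idx)
        self.tasks.append(task)
        return idx

    def run(self, data: dict) -> List[str]:
        """Run every task once; deliberately prefer the latest ready task
        to show that execution order may differ from submission order."""
        done: Set[int] = set()
        pending = set(range(len(self.tasks)))
        order: List[str] = []
        while pending:
            ready = [i for i in sorted(pending, reverse=True)
                     if self.tasks[i].deps <= done]
            i = ready[0]
            self.tasks[i].func(data)
            done.add(i)
            pending.discard(i)
            order.append(self.tasks[i].name)
        return order


def build_example() -> Tuple[Runtime, dict]:
    data = {"a": [1, 2, 3], "b": None, "c": None, "total": None}
    rt = Runtime()
    rt.submit("scale", lambda d: d.update(b=[2 * x for x in d["a"]]),
              [("a", READ), ("b", WRITE)])
    rt.submit("shift", lambda d: d.update(c=[x + 1 for x in d["a"]]),
              [("a", READ), ("c", WRITE)])
    rt.submit("combine", lambda d: d.update(total=sum(d["b"]) + sum(d["c"])),
              [("b", READ), ("c", READ), ("total", WRITE)])
    return rt, data


if __name__ == "__main__":
    rt, data = build_example()
    print(rt.run(data), data["total"])
```

In this example `scale` and `shift` only read `a`, so they are independent and may run in either order, while `combine` waits for both because it reads what they write. The result is the same as a sequential run in submission order, which is what sequential consistency means here.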