LLVM-based dynamic dataflow compilation for heterogeneous targets

V. Ducrot, K. Juilly, S. Monot, G. Bayle Des Courchamps
T. Goubier
Benoit Da Mota
AS+ Groupe Eolen ("Donnons de la suite à vos idées…")
CEA List / DACLE / LCE
Angers University

Context: the MACH Project
[Figure: "Accelerating R on heterogeneous targets". Diagram labels: Methods, algorithms for Metagenomics; R (statistics DSL); front end R to IR Vec; LLVM IR Vec; front end IR to LLVM; LLVM compiler infrastructure; heterogeneous, HPC-aware front end R to LLVM; multiplatform binaries.]
• R: the dominant language for statistical analysis
  - Used by everyone, everywhere
  - Fast to use (easy scripting)
  - Slow to use (with large datasets)
• MACH: DSeLs for heterogeneous computing
  - R is a DSL (statistics)
  - R can be used to target accelerated heterogeneous computing
• R in MACH
  - Extract and transform the data parallelism in R scripts
    - In an R front end
    - Specialize it for the target:
      - GPUs (Nvidia/AMD)
      - CPU accelerators (Intel MIC)

Compilation + runtime toolchain
• Complex system
• Toolchain to simplify programming
• Task management
• Automated task extraction from the code
• Non-trivial algorithms
• Automated insertion of runtime control functions
• Multi-target implementation
• Constraints on data structures to simplify analysis and give better performance

Three-stage compilation system
• Front end
  - Goes from R to the middle-end IR
• Middle end
  - Split for multi-target management
  - Re-express the code as standard LLVM IR adapted to the target
• Back end
  - Standard LLVM passes and back end
  - A specific pass to insert runtime management calls

Dataflow runtime
• Parallelism is expressed as tasks and data dependencies
  - Easy to generate parallelism from the compiler
• Execution is out-of-order with sequential consistency guarantees
  - Efficient
  - Hard to debug
• Natural fit for auto-tuning
• Memory needs to be managed
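
As a minimal sketch of the task-and-data-dependency model described above (not the MACH toolchain or the StarPU API), the Python below infers dependencies from the data each task declares it reads and writes, then runs tasks in any order that respects them. The names `Task`, `build_dependencies` and `run` are hypothetical and exist only for this illustration.

```python
from collections import namedtuple

# A task lists the data handles it reads and writes (hypothetical structure).
Task = namedtuple("Task", ["name", "reads", "writes", "fn"])


def build_dependencies(tasks):
    """Return {task index: set of earlier task indices it must wait for}.

    Tasks are given in submission (program) order. A task waits for the
    last writer of anything it touches, and a writer also waits for the
    earlier readers of the same data. Respecting these edges gives the
    same result as sequential execution.
    """
    last_writer = {}   # handle -> index of the last task that wrote it
    readers = {}       # handle -> indices that read it since that write
    deps = {}
    for i, task in enumerate(tasks):
        waits_for = set()
        for handle in task.reads:
            if handle in last_writer:
                waits_for.add(last_writer[handle])       # read after write
        for handle in task.writes:
            if handle in last_writer:
                waits_for.add(last_writer[handle])       # write after write
            waits_for.update(readers.get(handle, ()))    # write after read
        deps[i] = waits_for
        for handle in task.reads:
            readers.setdefault(handle, set()).add(i)
        for handle in task.writes:
            last_writer[handle] = i
            readers[handle] = set()
    return deps


def run(tasks, data):
    """Run ready tasks, preferring the most recently submitted one, to show
    that execution order may differ from submission order."""
    deps = build_dependencies(tasks)
    done, order = set(), []
    while len(done) < len(tasks):
        ready = [i for i in range(len(tasks))
                 if i not in done and deps[i] <= done]
        i = ready[-1]
        tasks[i].fn(data)
        done.add(i)
        order.append(tasks[i].name)
    return order


data = {"a": 1, "b": 2, "c": 0, "d": 0}
tasks = [
    Task("c=a+b", ["a", "b"], ["c"], lambda d: d.update(c=d["a"] + d["b"])),
    Task("d=a*2", ["a"], ["d"], lambda d: d.update(d=d["a"] * 2)),
    Task("c=c+d", ["c", "d"], ["c"], lambda d: d.update(c=d["c"] + d["d"])),
]
order = run(tasks, data)
```

In this example the first two tasks are independent, so the sketch runs the second before the first; the third waits for both.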

Managed memory
• A data-driven execution model
• Unified view of memory
Induced constraints
• Referenced memory
• No pointer arithmetic
• No globals
• Library calls must be wrapped (thread safety)

Runtime insertion at middle-end level
• Easier manipulation of multiple implementations
• Simplified front end: most of the runtime knowledge is removed from it
• Simpler way to add hardware-specific analyses by leveraging the LLVM infrastructure
• The target runtime is currently StarPU from Inria Bordeaux
  - http://starpu.gforge.inria.fr

Compilation: middle end and back end
[Figure: the middle-end IR plus annotations is specialized for three targets. x86_64: LLVM + optimizer, LLVM, x86_64 ISA binary. Xeon Phi: LLVM + optimizer, LLVM, Xeon Phi ISA binary. Nvidia GPU: LLVM + optimizer, LLVM, PTX ISA binary. A parallelizer maps the task graph, data transformers and library calls to their equivalent in the chosen runtime. The result is the heterogeneous application.]

Middle-end IR
• Built on top of the existing LLVM IR
  - Adds support for arbitrary length vectors
  - Adds support for managed containers
  - Adds intent markers on function (task) declarations
  - Adds task declaration/submit markers
  - Adds intrinsic vector operations

Middle-end IR: arbitrary length vectors
• Arbitrary length vectors (ALV)
  - Marked as 0-length in the IR
  - Managed-data-specific loads/stores use them (the effective sizes are derived from the managed data at runtime)
      %f0v = call <0 x float> (%nd_array_float_t*)* @ndarray.load.float(%nd_array_float_t* %f0)
      call void @ndarray.store.float(%nd_array_float_t* %u1, <0 x float> %u1v)
• Masking intrinsics
      %mr = call {}* @llvm.mach.mask.activate.v0i1(<0 x i1> %alltrue)
      %merge2 = call <0 x i32> @llvm.mach.mask.merge.v0i32({}* %mr, <0 x i32> %r, <0 x i32> %alvizero)
      [email protected]({}* %mr)
• Reduce/scan intrinsics
      %v3 = call <0 x float> @llvm.mach.alv.reduce.max.v0f32(<0 x float> %v2)
• All classical vector operations are supported on ALV
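
To make the intent of the masking and reduce intrinsics concrete, here is a small Python model of their likely semantics on vectors whose length is only known at runtime. It is a sketch under assumptions: the slides do not define the intrinsics precisely, so the reading of `mask.merge` as a per-lane select, the scalar return of the reduction (the IR version returns a `<0 x float>` whose layout is not specified here), and the helper names `masked_merge` and `reduce_max` are all hypothetical.

```python
def masked_merge(mask, when_active, when_inactive):
    """Per-lane select: take when_active[i] where mask[i] is true,
    otherwise when_inactive[i]. All operands share one runtime length."""
    if not (len(mask) == len(when_active) == len(when_inactive)):
        raise ValueError("all operands must have the same runtime length")
    return [a if m else b
            for m, a, b in zip(mask, when_active, when_inactive)]


def reduce_max(values):
    """Max-reduction over a vector of any non-zero runtime length."""
    if not values:
        raise ValueError("cannot reduce an empty vector")
    result = values[0]
    for v in values[1:]:
        if v > result:
            result = v
    return result


merged = masked_merge([True, False, True], [1, 2, 3], [0, 0, 0])
largest = reduce_max([1.5, 4.0, 2.0])
```

The point of the model is that neither function hard-codes a vector width, which mirrors the 0-length marker used in the IR.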

Middle-end IR: managed data containers
• ND-arrays
  - Python-like ND-arrays as the standard containers for tables
  - Support for views
  - Manipulation functions for copy, extraction…
• Raw data
  - Managed segment of memory without an attached layout
  - Tasks that use raw data cannot be written with arbitrary length vectors
All data containers also provide functions for accessing them outside the runtime.
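
The difference between a view and a copy of an ND-array can be illustrated with a shape/strides/offset description over a flat buffer. This is a minimal sketch assuming a NumPy-style layout; the slides do not describe the actual container layout, and the `NDArray` class and its methods are hypothetical.

```python
from itertools import product


class NDArray:
    """N-dimensional array over a flat buffer (hypothetical, illustrative)."""

    def __init__(self, buffer, shape, strides=None, offset=0):
        self.buffer = buffer
        self.shape = tuple(shape)
        if strides is None:
            # Default to row-major strides, counted in elements.
            strides, step = [], 1
            for extent in reversed(self.shape):
                strides.insert(0, step)
                step *= extent
        self.strides = tuple(strides)
        self.offset = offset

    def _flat(self, index):
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        for i, extent in zip(index, self.shape):
            if not 0 <= i < extent:
                raise IndexError("index out of range")
        return self.offset + sum(i * s for i, s in zip(index, self.strides))

    def get(self, *index):
        return self.buffer[self._flat(index)]

    def set(self, value, *index):
        self.buffer[self._flat(index)] = value

    def column(self, j):
        """View of column j of a 2-D array; shares the buffer, no copy."""
        if len(self.shape) != 2 or not 0 <= j < self.shape[1]:
            raise IndexError("column() needs a 2-D array and a valid column")
        return NDArray(self.buffer, (self.shape[0],), (self.strides[0],),
                       self.offset + j * self.strides[1])

    def copy(self):
        """Materialize the elements into a new contiguous array."""
        values = [self.get(*idx) for idx in product(*map(range, self.shape))]
        return NDArray(values, self.shape)


table = NDArray(list(range(6)), (2, 3))   # rows: [0, 1, 2] and [3, 4, 5]
col = table.column(1)                     # view on elements 1 and 4
col.set(40, 1)                            # writes through to the table
snapshot = col.copy()                     # independent copy
snapshot.set(-1, 0)                       # does not touch the table
```

A view only changes the shape, strides and offset, so a runtime that tracks the underlying buffer can still see that `table` and `col` refer to the same data.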

Middle-end IR: task management
• Metadata for marking task calls
• Metadata for expressing patterns on task implementations
  - ufunc
  - rfunc
  - scan
• Intents on managed data (read, write, scratch…)
  - Generated by an analysis pass
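
The three pattern names are not defined in the slides. Assuming they follow the usual array-programming meanings (ufunc as an element-wise function, rfunc as a reduction, scan as a prefix operation), the patterns can be modelled in a few lines of Python. The helper names `apply_ufunc`, `apply_rfunc` and `apply_scan` are hypothetical.

```python
import operator
from functools import reduce
from itertools import accumulate


def apply_ufunc(fn, *vectors):
    """Element-wise pattern: each output element depends only on the
    input elements at the same position."""
    return [fn(*args) for args in zip(*vectors)]


def apply_rfunc(fn, vector, initial):
    """Reduction pattern: fold the whole vector into a single value."""
    return reduce(fn, vector, initial)


def apply_scan(fn, vector):
    """Scan pattern: inclusive prefix results, one per input element."""
    return list(accumulate(vector, fn))


sums = apply_ufunc(operator.add, [1, 2, 3], [10, 20, 30])
total = apply_rfunc(operator.add, [1, 2, 3, 4], 0)
prefix = apply_scan(operator.add, [1, 2, 3, 4])
```

Knowing which pattern a task follows tells a specializing pass how the work may be split: element-wise tasks can be partitioned freely, while reductions and scans need a combining step.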

IR specializing passes
• Task specialization
  - Architecture-dependent rewriting of middle-end IR to IR
  - Outputs standard LLVM IR adapted to a given target
• Workflow management
  - Takes the code with calls marked as tasks
  - Replaces the calls with task preparation and submission
• Multi-implementation management
  - Creates initialization/finalization calls to the runtime that reference each specialized implementation

Application and performance tuning
• The runtime supports multiple implementations of a given task on a given hardware
• Our pass generates multiple implementations
• The runtime chooses the best implementation according to the data sizes
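
One way a runtime could choose an implementation from the data size is to keep timing history per implementation and per size class, then pick the fastest on record. The sketch below illustrates that idea only; it is not the StarPU API, and the `ImplementationSelector` class, its methods and the power-of-two size classes are assumptions made for this example.

```python
from collections import defaultdict


class ImplementationSelector:
    """Pick the implementation with the lowest mean recorded time for
    the size class of the input (hypothetical, illustrative)."""

    def __init__(self, names):
        self.names = list(names)
        self.history = defaultdict(list)   # (name, size class) -> seconds

    @staticmethod
    def size_class(size):
        # Group sizes by power of two so nearby sizes share measurements.
        return max(size, 1).bit_length()

    def record(self, name, size, seconds):
        self.history[(name, self.size_class(size))].append(seconds)

    def choose(self, size):
        cls = self.size_class(size)
        best_name, best_mean = None, None
        for name in self.names:
            samples = self.history.get((name, cls))
            if not samples:
                # Never measured at this size class: try it to calibrate.
                return name
            mean = sum(samples) / len(samples)
            if best_mean is None or mean < best_mean:
                best_name, best_mean = name, mean
        return best_name


selector = ImplementationSelector(["cpu", "gpu"])
# Made-up timings for illustration: the GPU only pays off on large inputs.
selector.record("cpu", 100, 0.001)
selector.record("gpu", 100, 0.010)
selector.record("cpu", 1_000_000, 1.0)
selector.record("gpu", 1_000_000, 0.1)

small_choice = selector.choose(120)
large_choice = selector.choose(900_000)
```

The timings above are invented for the example and are not measurements from the project.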

Performance and results
We compared the execution time of benchmarks implemented in C with that of the same benchmarks implemented in middle-end IR.

Code               GCC 4.9   icc 13   clang 3.6   IR version
Jacobi             28.71     31.38    41.9        29.72
Lattice Boltzmann  59.63     71.10    74.64       59.43

Conclusion
• We proposed an infrastructure to compile heterogeneous programs for a dataflow runtime
• The middle-end IR enables us to compile for multiple targets with reasonable performance
• Porting to a new target does not change the front end