Efficient Kernel Synthesis for Performance Portable Programming
Li-Wen Chang¹, Izzat El Hajj¹, Christopher Rodrigues², Juan Gómez-Luna³, Wen-Mei Hwu¹
¹University of Illinois at Urbana-Champaign, ²Huawei America Research Lab, ³Universidad de Córdoba
Motivation
• Maintaining optimized programs for different devices is costly
• Programs written once should run on different devices with high performance, a property known as performance portability
Limitations of Current Practice
• OpenCL is not performance portable
[Figure: Performance (GFLOPS, higher is better) of Parboil's default naïve OpenCL version, the OpenCL version optimized for the Tesla GPU, and reference implementations (MKL for CPU, CUBLAS for GPU) on a Tesla GPU (GTX 280), a Fermi GPU (C2050), and a Sandy Bridge CPU (i7-3820). The Tesla-optimized version achieves ~90% of reference performance on the Tesla GPU, but only ~60% and ~30% on the other devices.]
• Composition-based languages rely heavily on high-performance base-rule implementations
(b) Atomic cooperative codelet
__codelet __coop __tag(kog)
int sum(const Array<1,int> in) {
  __shared int tmp[coopDim()];
  unsigned len = in.size();
  unsigned id = coopIdx();
  tmp[id] = (id < len) ? in[id] : 0;
  for (unsigned s = 1; s < coopDim(); s *= 2) {
    if (id >= s)
      tmp[id] += tmp[id-s];
  }
  return tmp[coopDim()-1];
}
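For intuition, the Kogge-Stone combining that this cooperative codelet performs can be sketched in plain C++ by simulating the lockstep lanes sequentially. The function name coop_sum and the double-buffered tmp are illustrative assumptions, not TANGRAM API:

```cpp
#include <vector>

// Sequential model of the cooperative sum codelet: lane id adds
// tmp[id - s] into tmp[id] for strides s = 1, 2, 4, ...; after
// ceil(log2(coopDim)) steps, tmp[coopDim-1] holds the total.
int coop_sum(const std::vector<int>& in, unsigned coopDim) {
    std::vector<int> tmp(coopDim);
    for (unsigned id = 0; id < coopDim; ++id)
        tmp[id] = (id < in.size()) ? in[id] : 0;   // pad with the identity, as in the codelet
    for (unsigned s = 1; s < coopDim; s *= 2) {
        std::vector<int> next = tmp;               // double-buffer to model lockstep reads
        for (unsigned id = s; id < coopDim; ++id)
            next[id] = tmp[id] + tmp[id - s];
        tmp = std::move(next);
    }
    return tmp[coopDim - 1];
}
```

Each lane performs O(log coopDim) additions, which is why this codelet suits SIMD execution even though it does more total work than the serial codelet.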
TANGRAM Workflow
[Figure: TANGRAM workflow. Codelets written in the TANGRAM language go through Rule Extraction to produce Program Composition Rules; Rule Specialization combines them with an Architectural Hierarchy Model to produce Specialized Composition Rules; Composition turns these into Composition Plans; and Device-specific Codegen emits the final Kernel Versions.]
[Figure: example composition plans for sum, combining the partitionings pc and pd, the autonomous codelet ca, the cooperative codelet cb, and synchronization points (sync, barrier, terminate/launch).]
Composition plan: SG, regroup(pc,G), distribute(B), SB, regroup(pd,B), distribute(T), compute(ca,SET), compute(cb,VEB), SG, devolve(B), SB, regroup(pd,B), distribute(T), compute(ca,SET), compute(cb,VEB)

First kernel
SG             : // No sync needed at beginning
regroup(pc,G)  : unsigned p_c = gridDim.x;
regroup(pc,G)  : unsigned len_c = in_size;
regroup(pc,G)  : unsigned tile_c = (len_c + p_c - 1) / p_c;
distribute(B)  : unsigned k = blockIdx.x;
SB             : // No sync needed at beginning
regroup(pd,B)  : unsigned p_d = blockDim.x;
regroup(pd,B)  : unsigned len_d = tile_c;
regroup(pd,B)  : unsigned tile_d = (len_d + p_d - 1) / p_d;
distribute(T)  : unsigned j = threadIdx.x;
compute(ca,SET): unsigned len_a = tile_d;
compute(ca,SET): int accum_a = 0;
compute(ca,SET): for (unsigned i = 0; i < len_a; ++i) {
compute(ca,SET):   accum_a += in[k*tile_c + j + p_d*i];
compute(ca,SET): }
compute(ca,SET): ret_a = accum_a;
compute(cb,VEB): __shared__ int tmp[blockDim.x];
compute(cb,VEB): unsigned len_b = p_d;
compute(cb,VEB): unsigned id = threadIdx.x;
compute(cb,VEB): tmp[id] = ret_a;
compute(cb,VEB): __syncthreads();
compute(cb,VEB): for (unsigned s = 1; s < blockDim.x; s *= 2) {
compute(cb,VEB):   if (id >= s)
compute(cb,VEB):     tmp[id] += tmp[id-s];
compute(cb,VEB):   __syncthreads();
compute(cb,VEB): }
compute(cb,VEB): ret_b[k] = tmp[blockDim.x-1];
SG             : return; // Terminate kernel

Second kernel
devolve(B)     : if (blockIdx.x == 0)
SB until end   : ... // Similar to first kernel
Experimental Results
TANGRAM Platform
• TANGRAM adopts a codelet programming model
- A codelet is defined as a code snippet reusable in one or many kernels
• Users write interchangeable alternative codelets, and corresponding composition and partition rules, for a computation pattern (called a spectrum)
- We do not ask users to write multiple versions of kernels
• TANGRAM supports recursive composition to adapt to different device hierarchies, and cooperative codelets for SIMD architectures
• TANGRAM also provides performance tuning annotations to enable parameterization
Device Specification:
G := CG = none, (ℓG, SG) = (B, terminate/launch) // grid
B := CB = VEB,  (ℓB, SB) = (T, __syncthreads())  // block
T := CT = SET,  (ℓT, ST) = none                  // thread
Specialized Composition Rules:
G rules: G1: compose(sum,G) → SG, devolve(B), compose(sum,B)
         G4: compose(sum,G) → SG, regroup(pc,G), distribute(B), compose(sum,B), compose(sum,G)
         G5: compose(sum,G) → SG, regroup(pd,G), distribute(B), compose(sum,B), compose(sum,G)
B rules: B1: compose(sum,B) → SB, devolve(T), compose(sum,T)
         B3: compose(sum,B) → compute(cb,VEB)
         B4: compose(sum,B) → SB, regroup(pc,B), distribute(T), compose(sum,T), compose(sum,B)
         B5: compose(sum,B) → SB, regroup(pd,B), distribute(T), compose(sum,T), compose(sum,B)
T rules: T2: compose(sum,T) → compute(ca,SET)
(d) Compound codelet using strided tiling
__codelet __tag(stride_tiled)
int sum(const Array<1,int> in) {
  __tunable unsigned p;
  unsigned len = in.size();
  unsigned tile = (len + p - 1) / p;
  return sum(map(sum, partition(in,
    p, sequence(0,1,p), sequence(p), sequence((p-1)*tile,1,len+1))));
}
Program Composition Rules: (sum)
Rule 1: compose(sum,L) → SL, devolve(ℓL), compose(sum,ℓL)
Rule 2: compose(sum,L) → compute(ca,SEL)
Rule 3: compose(sum,L) → compute(cb,VEL)
Rule 4: compose(sum,L) → SL, regroup(pc,L), distribute(ℓL), compose(sum,ℓL), compose(sum,L)
Rule 5: compose(sum,L) → SL, regroup(pd,L), distribute(ℓL), compose(sum,ℓL), compose(sum,L)
Example for Deriving Composition Rules from Compound Codelets: (codelet cc)
compose(sum,L) → compose(cc,L)
→ compose(sum(map(sum, partition(…,pc))), L)
→ compose(map(sum, partition(…,pc)), L), compose(sum,L)
→ compose(partition(…,pc), L), compose(map(sum,…), L), compose(sum,L)
→ SL, regroup(pc,L), distribute(ℓL), compose(sum,ℓL), compose(sum,L)
(c) Compound codelet using adjacent tiling
__codelet __tag(asso_tiled)
int sum(const Array<1,int> in) {
  __tunable unsigned p;
  unsigned len = in.size();
  unsigned tile = (len + p - 1) / p;
  return sum(map(sum, partition(in,
    p, sequence(0,tile,len), sequence(1), sequence(tile,tile,len+1))));
}
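A host-side reading of adjacent tiling: partition k owns the contiguous tile starting at k·tile. sum_adjacent is an illustrative C++ model, not TANGRAM output:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Adjacent tiling: partition k owns the contiguous range
// [k*tile, min((k+1)*tile, len)), mirroring partition(in, p,
// sequence(0,tile,len), sequence(1), sequence(tile,tile,len+1)).
int sum_adjacent(const std::vector<int>& in, unsigned p) {
    size_t len = in.size();
    size_t tile = (len + p - 1) / p;          // ceiling division, as in the codelet
    std::vector<int> partial(p, 0);
    for (unsigned k = 0; k < p; ++k) {
        size_t lo = std::min(k * tile, len);
        size_t hi = std::min(lo + tile, len);
        for (size_t i = lo; i < hi; ++i)
            partial[k] += in[i];
    }
    return std::accumulate(partial.begin(), partial.end(), 0);  // outer sum
}
```

Contiguous tiles keep each worker within a small cache footprint, the usual choice for CPU threads; strided tiling is its GPU-friendly complement.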
(a) Atomic autonomous codelet
__codelet
int sum(const Array<1,int> in) {
  unsigned len = in.size();
  int accum = 0;
  for (unsigned i = 0; i < len; ++i) {
    accum += in[i];
  }
  return accum;
}
• TANGRAM delivers 70% or higher performance compared to highly-optimized libraries, such as Intel MKL, NVIDIA CUBLAS, CUSPARSE, or Thrust, or experts' optimized benchmarks in Rodinia
[Figure: Normalized performance (higher is better) of TANGRAM-generated kernels versus reference implementations on Kepler and Fermi GPUs and a CPU, for Scan, SGEMV-TS, SGEMV-SF, DGEMM, SpMV, KMeans, and BFS; CPU baselines also include MKL and Petabricks (with and without MKL).]
ca • TANGRAM’sdevicespecificaOonmodelishighlyextensibletosupportCPU
SIMDunit,GPUWarp,ILP,andGPUDynamicParallelism
Conclusion
• We propose TANGRAM, a programming system for performance portability across devices
• Our results show TANGRAM can achieve promising performance across devices