Towards a Formally Verified Skeleton-based
Data Parallel DSL for GPGPU
Izumi Asakura, Hidehiko Masuhara, Tomoyuki Aotani
(Tokyo Institute of Technology)
Background & Our Work
Technical Contributions:
High-level Lang. - An Approach to Formally Verified Compiler for Coqfor GPGPU
embedded DSLs (compilation by proving)
(e.g., Accelerate
- Source code: a pure Coq function
Low-level
GPGPU Lang. [Chakravarty et al. 11]
- Compiler: reification and verified compiler in pure Coq
[Springer
et
al'16]
Sequential Lang. (e.g.,CUDA) Ikra
)
- Compositional Verification of Template-based Codegen.
App. level
GPUVerify
- Compiler specification in Hoare-style logic
[Betts et al.'12]
Verification
Many
(we use GPUCSL [Asakura et al.'16] , a CSL for GPGPU)
PUG [Li et al.'14]
Verified CompCert [Leroy'06]
- Compositional verification for each compiler component
CertSkel
[Kumar
et
al.'14]
Compiler CakeML
(similar to Cito [Wang et al.'14], a cross-language linking compiler)
Programming in CertSkel
Source program:
A pure Coq function on lists
written in skeletons
Compilation by proving
Proving a lemma: there exists a
GPGPU program equivalent to
the source function
Extracting GPGPU program (CUDA C)
by Coq extraction mechanism
- The extracted OCaml program
saves program to "./argmax.cu"
Compilation by Proving
(* argmax.v (a Coq source file) *)
Definition argmax xs :=
reduce (fun (i,x) (j,y) =>
if y <= x then (i,x) else (j,y))
(zip (seq 0 (length xs)) xs)
{p | argmax โ๐ถ๐บ p}
reifyFunc
Definition argmaxGPGPU:
{p : GPGPU.prog | argmax โ๐ถ๐บ p}.
Proof.
reifyFunc.
eexists; apply compileOK.
(* proof of the safety *)
Defined.
{p | argmaxR โ๐ผ๐บ p}
Normalized syntax tree
where argmaxR =
of original argmax
Let(int, seq ...,
Let(tup(int, int), zip ...,
Let(tup(int, int), reduce ..., tmp3)))
Apply the compiler
eexists;
correctness lemma:
apply compileOK.
Lemma compileOK:
โs, safe(s) โ
Defined
Definition res := generateGPGPUFile
"./argmax.cu" argmaxGPGPU.
Separate Extract res.
compileIR: a Certified
Template-based
Code Generator
s โ๐ฐ๐ฎ compileIR s
Reified syntax tree of the source function
let tmp1 := seq 0 (length xs) in
let tmp2 := zip tmp1 xs in
let tmp3 := reduce (fun (i,x) (j,y) => ...) tmp2 in
tmp3
(written in Coq)
func. and GPGPU
s
โ
c
โ
๐ผ๐บ
Compositional โl, s(l) โ r โ
correctness
{array(inp, l)}
ฮ โจ c(inp)
conditions
{array(inp, l) โ
array(res, r)}
seq. func. and code frag.
fs
โ๐
fc โ
skel. and template
S
โ๐
Tโ
โl fs fc. S fs l โ r โง fs โ๐ fc โ
{array(inp, l) โ
array(res, -)}
โจ T[fc](res, inp, len, ...)
{array(inp, l) โ
array(res,r)}
- โx : Var. writes (fc(x).1) โ GVars
- โx : Var. fc(x).2 โ GVars
- โx : Var. x โ GVars โ
โจ {x=v}fc(x).1{fc(x).2=fs(v)}
- โx : Var. fs(x).1 has no barrier
reduce
reducetemplate
template
T* argmax(...) {
out1 := alloc(len);
seqKernel<<<...>>>(out1,len,xs);
out2 := alloc (len);
zipKernel<<<...>>>(out2,len,xs,out1)
out3 := alloc(nThrd);
out4 := alloc(1);
pFoldg1<<<...>>>(out3, out1);
pFoldb1<<<...>>>(out4, out2);
return 3; }
Hand-written code template
Instantiate
(optimized e.g., AoS to SoA, complex
array accessing pattern and sync.)
invokes kernels
A host function
(executed on CPU)
What We Have Done
Reflection &
Normalization
(in Ltac)
(if
(if(y
(x<=
>=x)
y){res1
{res1==i;x;res2
res2= =x}i}
else
= j;y;res
2==y},
else{res1
{res 1=
res2
i},
(res1,
(res1,res2))
res2))
Code fragment
reduce template
reduce template
(if (y <= x) {res1 = i; res2 = x}
else {res1 = j; res 2= y},
(res1, res2))
Several kernel functions
(executed on GPU)
Future Work
- Finish the proof (top level function compilation)
- Precise semantics: machine integer does not have infinite precision
- More optimizations (e.g., fusion transformation)
- By using helper tactic library GPUVeLib (symbolic execution on statements) - More features (e.g., nested parallelism, advanced skeletons)
- Implemented: all compiler components &
simple fusion trans. by rewriting (done before reification)
- Proved: โ ๐ for some skels. (map, reduce) and seq. func. compiler
© Copyright 2026 Paperzz