Triangular Linear System Solver for GPU with CUDA and OpenCL
Peng Du, Stanimire Tomov, Piotr Luszczek, Jack Dongarra
Innovative Computing Laboratory (ICL), University of Tennessee, Knoxville
We present an algorithm for solving systems of equations with a triangular matrix (TRSM for short). Our implementation targets graphics processing units (GPUs)
and is based on CUDA and OpenCL. TRSM is an important step in obtaining the solution of a linear system of equations after a matrix factorization such as LU,
Cholesky, or QR. The relatively low performance of the current TRSM implementation in CUBLAS forces applications to perform TRSM on the host (CPU) for faster
execution. Our implementation is designed to remove the need to engage the CPU by executing TRSM on the GPU with high performance and with numerical stability
comparable to that of the CUBLAS implementation.
• All 16 variants of TRSM
• In-place GEMM
• TRMM vs. GEMM
• Arbitrary problem size

[Figures: block algorithm, diagonal block inversion, and recursion for multiple block sizes]

Testbed
• GPU: GTX 470 (Fermi)
• SGEMM peak: ~600 Gflop/s
• CPU: Q6600 @ 3.2 GHz
• CUDA: 3.1; OS: Ubuntu 9.04
[Figure: OpenCL program runtime analysis]
[Figures: TRSM performance vs. matrix sizes M and N, comparing our implementation against cublasStrsm, cublasDtrsm, and GotoBLAS2]
Problem
Given a triangular system of equations:
op(A) X = B   or   X op(A) = B

where A is a triangular matrix of size M × M, X is the unknown matrix, and B is the right-hand side matrix, both of size M × N. op(A) denotes either A or its
transpose A^T. The triangular matrix A can be upper or lower triangular, with either a unit or non-unit diagonal.
In all, there are 16 variants of the problem (2 sides × 2 triangle shapes × 2 transposes × 2 diagonal types = 16).
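For reference, these variants map onto the four mode parameters of the BLAS-style interface; a sketch of the prototype as in the legacy CUBLAS interface (comments ours):

    /* The four mode parameters select among the 16 TRSM variants (2 x 2 x 2 x 2). */
    void cublasStrsm(char side,   /* 'L': op(A) X = B,   'R': X op(A) = B */
                     char uplo,   /* 'U': A upper,       'L': A lower     */
                     char transa, /* 'N': op(A) = A,     'T': op(A) = A^T */
                     char diag,   /* 'U': unit diagonal, 'N': non-unit    */
                     int m, int n, float alpha,
                     const float *A, int lda,
                     float *B, int ldb);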
Algorithm
The algorithm has three main components.
(1) Block Algorithm
The equation is split into blocks as

    [ A11   0  ] [ X1 ]   [ B1 ]
    [ A21  A22 ] [ X2 ] = [ B2 ]

(lower triangular case shown), and the solution is X1 = A11⁻¹ B1 and X2 = A22⁻¹ (B2 − A21 X1).
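A host-side sketch of this blocked solve for the lower triangular, left-side, non-transposed case, assuming the diagonal blocks have already been inverted (see component (2) below). The names blocked_trsm and invL, and the use of CBLAS in place of the GPU GEMM kernel, are illustrative only:

    #include <cblas.h>
    #include <stdlib.h>
    #include <string.h>

    /* Sketch: solve L * X = B (L lower triangular, M x M; B is M x N and is
     * overwritten by X). Column-major storage; M assumed divisible by nb.
     * invL holds the pre-inverted nb x nb diagonal blocks, block i stored
     * contiguously at invL + i*nb*nb. */
    void blocked_trsm(int M, int N, const float *L, int ldl,
                      const float *invL, int nb, float *B, int ldb)
    {
        float *X = malloc((size_t)nb * N * sizeof *X); /* scratch block row */
        for (int i = 0; i < M; i += nb) {
            /* X_i = inv(L_ii) * B_i : the small solve becomes a GEMM. */
            cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        nb, N, nb, 1.0f,
                        invL + (size_t)(i / nb) * nb * nb, nb,
                        B + i, ldb, 0.0f, X, nb);
            for (int j = 0; j < N; ++j)      /* copy X_i back into B_i */
                memcpy(B + i + (size_t)j * ldb, X + (size_t)j * nb,
                       nb * sizeof *B);
            /* Trailing update: B_{i+nb:M} -= L_{i+nb:M, i} * X_i (GEMM). */
            if (i + nb < M)
                cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                            M - i - nb, N, nb, -1.0f,
                            L + (i + nb) + (size_t)i * ldl, ldl,
                            B + i, ldb, 1.0f, B + i + nb, ldb);
        }
        free(X);
    }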
(2) Diagonal Block Inversion
On the GPU, the diagonal block inversions can be done in parallel, and the rest of the work is matrix multiplication (GEMM), which can achieve high
performance.
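A scalar sketch of inverting one small lower triangular diagonal block in place (on the GPU, one thread block inverts each diagonal block in shared memory); this is a textbook column-oriented inversion, not the authors' exact kernel:

    /* Invert a small non-unit lower triangular block A (nb x nb,
     * column-major, leading dimension lda) in place. Columns are
     * processed right to left so entries already converted to the
     * inverse can be reused; rows bottom-up so column j is read
     * before it is overwritten. */
    void invert_lower(int nb, float *A, int lda)
    {
        for (int j = nb - 1; j >= 0; --j) {
            float djj = 1.0f / A[j + j * lda];
            for (int i = nb - 1; i > j; --i) {
                float s = 0.0f;
                for (int k = j + 1; k <= i; ++k)  /* inv(A) row i * L column j */
                    s += A[i + k * lda] * A[k + j * lda];
                A[i + j * lda] = -s * djj;
            }
            A[j + j * lda] = djj;
        }
    }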
(3) Recursion for Multiple Block Sizes
Most of the TRSM performance comes from GEMM, and on the GPU, GEMM performs best on large problems. The size of GPU shared memory (commonly 16 KB, 48 KB on
Fermi), however, limits the diagonal block inversion method to a block size of up to 32. We therefore designed a recursive method to build inverses of larger
blocks. The recursion runs in parallel on the GPU: after the diagonal blocks are inverted at a small size (normally 16), the recursion applies GEMM on problems
of growing size until the desired block size is reached.
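The recursion rests on the identity (lower triangular case)

    [ A11   0  ]⁻¹   [       A11⁻¹           0    ]
    [ A21  A22 ]   = [ −A22⁻¹ A21 A11⁻¹    A22⁻¹  ]

so each doubling step fills in the off-diagonal block of the larger inverse with two GEMMs. A sketch of one step (the function name and the use of CBLAS in place of the GPU GEMM kernel are ours):

    #include <cblas.h>
    #include <stdlib.h>

    /* invA is a 2nb x 2nb column-major block (leading dimension ldi) that
     * already holds inv(A11) at (0,0) and inv(A22) at (nb,nb); A21 is the
     * original off-diagonal block of A. Fill in the (2,1) block of the
     * inverse: -inv(A22) * A21 * inv(A11). */
    void double_inverse(int nb, const float *A21, int lda,
                        float *invA, int ldi)
    {
        float *T = malloc((size_t)nb * nb * sizeof *T);
        /* T = A21 * inv(A11) */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, nb, nb, nb,
                    1.0f, A21, lda, invA, ldi, 0.0f, T, nb);
        /* invA(2,1) = -inv(A22) * T */
        cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, nb, nb, nb,
                    -1.0f, invA + nb + (size_t)nb * ldi, ldi, T, nb,
                    0.0f, invA + nb, ldi);
        free(T);
    }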
Performance Tuning
The overall performance of TRSM is affected by both the recursive diagonal block inversion and the GEMM that follows it. As the block size increases, the time
spent in inversion grows while the time spent in GEMM shrinks. So, depending on the problem size, a certain block size yields the best TRSM performance at that
size. A memory copy is also involved, but we handle it with a dedicated kernel whose runtime is negligible.
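In practice this trade-off can be captured by a per-GPU lookup; a minimal sketch, with hypothetical cutoffs to be determined by offline tuning on each device:

    /* Pick the TRSM block size for a given problem size M. The cutoffs
     * below are placeholders; real values come from benchmarking
     * inversion time vs. GEMM time on the target GPU. */
    static int pick_block_size(int M)
    {
        if (M <= 512)  return 32;
        if (M <= 2048) return 64;
        return 128;
    }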
Numerical Stability
Forward and backward substitution are normally used for TRSM because of their good numerical stability. For our diagonal block inversion, we pick a variant of
triangular matrix inversion that has been proven to have stability similar to that of forward substitution. Experimental results confirm that, while achieving
higher performance than the TRSM in CUBLAS, our algorithm makes no compromise in backward stability.
Application Speedup
TRSM is widely used in factorization routines such as Cholesky and LU. In solver routines, TRSM is also used to solve the triangular systems of equations
produced by the preceding factorization step. The Cholesky experiment shows the performance speedup of our algorithm over cublas TRSM in both single and
double precision.
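For example, after a Cholesky factorization A = L L^T, solving A X = B is exactly two TRSM calls; a sketch using the legacy CUBLAS interface (dL and dB are device pointers):

    #include <cublas.h>

    /* Solve A * X = B given the Cholesky factor L (A = L * L^T).
     * B (n x nrhs) is overwritten by the solution X. */
    void cholesky_solve(int n, int nrhs, const float *dL, int ldl,
                        float *dB, int ldb)
    {
        cublasStrsm('L', 'L', 'N', 'N', n, nrhs, 1.0f, dL, ldl, dB, ldb); /* L   * Y = B */
        cublasStrsm('L', 'L', 'T', 'N', n, nrhs, 1.0f, dL, ldl, dB, ldb); /* L^T * X = Y */
    }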
OpenCL Comparison
CUDA is limited to GPUs from NVIDIA, while OpenCL provides a portable solution for GPUs from different vendors. We port our TRSM algorithm to OpenCL and
compare the performance of the CUDA and OpenCL implementations.
Unlike CUDA, OpenCL uses a compile-on-the-fly model: OpenCL programs and kernels are built at runtime, which, according to our experiments, incurs non-trivial
overhead. This affects the design of a math library in OpenCL. With this overhead removed, and with exactly the same C++ code, OpenCL achieves roughly half the
performance of CUDA on the GTX 280, GTX 470, and GTX 480, and remains at least 3 times faster than cublas. To make the comparison fair, a modified GEMM based on
Volkov's matrix multiply is used rather than the one from cublas. The performance difference comes from multiple sources, including high-overhead OpenCL
utility routines such as memory release, and the different 'assembly' code generated by the CUDA and OpenCL compilers.
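To illustrate the compile-on-the-fly model that causes this overhead, a minimal host-side sketch (toy kernel, error handling omitted); a library can pay this cost once at initialization and cache the built program and kernels:

    #include <CL/cl.h>

    /* A toy kernel: this string is handed to the vendor compiler at runtime. */
    static const char *src =
        "__kernel void scale(__global float *x, float a) {\n"
        "    size_t i = get_global_id(0);\n"
        "    x[i] *= a;\n"
        "}\n";

    cl_kernel build_scale_kernel(cl_context ctx, cl_device_id dev)
    {
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        /* clBuildProgram invokes the OpenCL compiler at runtime; this is the
         * non-trivial overhead measured above. Build once, reuse the kernel. */
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return clCreateKernel(prog, "scale", &err);
    }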