Triangular Linear System Solver for GPU with CUDA and OpenCL

Peng Du, Stanimire Tomov, Piotr Luszczek, Jack Dongarra
Innovative Computing Laboratory (ICL), University of Tennessee, Knoxville

We present an algorithm for solving systems of equations with a triangular matrix (TRSM for short). Our implementation targets graphics processing units (GPUs) and is based on CUDA and OpenCL. TRSM is an important step in obtaining the solution of a linear system of equations after a matrix decomposition such as LU, Cholesky, or QR. The relatively low performance of the current TRSM implementation in CUBLAS forces applications to perform TRSM on the host (CPU) for faster execution. Our implementation removes the need to engage the CPU by executing TRSM on the GPU, with high performance and numerical stability comparable to that of the CUBLAS implementation.

• All 16 variants of TRSM
• In-place GEMM
• TRMM vs. GEMM
• Arbitrary problem size
• Block algorithm
• Diagonal block inversion
• Recursion for multiple block sizes

Testbed
• GPU: GTX 470 (Fermi)
• SGEMM peak: ~600 Gflop/s
• CPU: Q6600 @ 3.2 GHz
• CUDA: 3.1, OS: Ubuntu 9.04

[Figures: performance over problem sizes M and N of our TRSM vs. cublasStrsm/cublasDtrsm and GotoBLAS2, and an OpenCL program runtime analysis.]

Problem
Given a triangular system of equations:

    op(A) X = B

where A is a triangular matrix of size N x N, and X (the unknown matrix) and B (the right-hand-side matrix) are both of size N x M. op(A) means either A or the transpose of A, and the equation to be solved could also be of the form

    X op(A) = B

The triangular matrix A can also have a unit or non-unit diagonal. In all, there are 16 variants of the problem (left/right side, upper/lower triangular, transposed/non-transposed, unit/non-unit diagonal).

Algorithm
The algorithm has three main components.

(1) Block Algorithm
For a lower triangular A solved from the left, the equation is split into blocks as

    [ A11   0  ] [ X1 ]   [ B1 ]
    [ A21  A22 ] [ X2 ] = [ B2 ]

and the solution is

    X1 = A11^-1 B1
    X2 = A22^-1 (B2 - A21 X1)

(2) Diagonal Block Inversion
On the GPU, the inversion of the diagonal blocks can be done in parallel, and the rest of the work is matrix multiplication (GEMM), which can achieve high performance.
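The block algorithm above can be sketched in NumPy. This is a minimal illustration of the two-step solve, not the GPU implementation; the function name, the block size `nb`, and the use of `np.linalg.inv` for the base case are our own assumptions, standing in for the shared-memory diagonal-block inversion kernel:

```python
import numpy as np

def blocked_trsm_lower(A, B, nb):
    """Solve A X = B for X, with A lower triangular, via the block
    algorithm: X1 = inv(A11) B1, X2 = inv(A22) (B2 - A21 X1)."""
    n = A.shape[0]
    if n <= nb:
        # Base case: stands in for the inverted diagonal block on the GPU.
        return np.linalg.inv(A) @ B
    A11, A21, A22 = A[:nb, :nb], A[nb:, :nb], A[nb:, nb:]
    X1 = np.linalg.inv(A11) @ B[:nb]        # diagonal block inverse applied to B1
    # GEMM update of the right-hand side, then solve the trailing system.
    X2 = blocked_trsm_lower(A22, B[nb:] - A21 @ X1, nb)
    return np.vstack([X1, X2])

rng = np.random.default_rng(0)
n, m, nb = 64, 8, 16
# Well-conditioned lower triangular test matrix.
A = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)
B = rng.standard_normal((n, m))
X = blocked_trsm_lower(A, B, nb)
print(np.allclose(A @ X, B))
```

Note how all the heavy work outside the small base case is the `A21 @ X1` update, i.e. GEMM, which is what makes the blocked formulation fast on the GPU.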
(3) Recursion for Multiple Block Sizes
Most of the performance comes from GEMM, and on the GPU GEMM performs best on large problems, while the size of GPU shared memory (16 KB on most devices, 48 KB on Fermi) limits the direct diagonal block inversion to block sizes of up to 32. We therefore designed a recursive method that extends the inversion to larger block sizes. The recursion runs in parallel on the GPU: after the diagonal blocks are inverted at a small size (normally 16), the recursion applies GEMM to problems of growing size until the desired block size is reached.

Performance Tuning
The overall performance of TRSM is affected by both the recursive diagonal block inversion and the GEMM that follows it. As the block size increases, the time spent in the inversion grows while the time spent in GEMM shrinks. Hence, for each problem size there is a block size that yields the best TRSM performance. A memory copy is also involved, but we handle it with a dedicated kernel whose runtime is negligible.

Numerical Stability
Forward and backward substitution are normally used for TRSM because of their good numerical stability. For our diagonal block inversion, we pick a variant of the triangular matrix inversion algorithm that is proved to have stability similar to that of forward substitution. Experimental results confirm that, while achieving higher performance than the TRSM in CUBLAS, our algorithm makes no compromise in backward stability.

Application Speedup
TRSM is widely used in factorization routines such as Cholesky and LU. In solver routines, TRSM is also used to solve the triangular systems of equations produced by the preceding factorization. The Cholesky experiment shows the speedup obtained with our algorithm versus the CUBLAS TRSM in both single and double precision.

OpenCL Comparison
CUDA is limited to GPUs from NVIDIA, while OpenCL provides a portable solution for GPUs from different vendors. We ported our TRSM algorithm to OpenCL and compare the performance of the CUDA and OpenCL implementations.
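The recursive inversion can be sketched as follows (again a NumPy illustration under our own assumptions, not the GPU kernel). It rests on the identity that for a lower triangular matrix split into blocks, the off-diagonal block of the inverse is -A22^-1 A21 A11^-1, so two already-inverted diagonal blocks can be combined into a larger inverse using only GEMM:

```python
import numpy as np

def tri_inv_recursive(A, nb=16):
    """Invert a lower triangular matrix by recursive blocking.
    Blocks of size <= nb are inverted directly (standing in for the
    shared-memory kernel); larger inverses are assembled with GEMM."""
    n = A.shape[0]
    if n <= nb:
        return np.linalg.inv(A)            # small direct inversion
    h = n // 2
    inv11 = tri_inv_recursive(A[:h, :h], nb)
    inv22 = tri_inv_recursive(A[h:, h:], nb)
    out = np.zeros_like(A)
    out[:h, :h] = inv11
    out[h:, h:] = inv22
    # Two GEMMs build the off-diagonal block of the inverse.
    out[h:, :h] = -inv22 @ A[h:, :h] @ inv11
    return out

A = np.tril(np.random.default_rng(1).standard_normal((64, 64))) + 64 * np.eye(64)
Ainv = tri_inv_recursive(A, nb=16)
print(np.allclose(A @ Ainv, np.eye(64)))
```

Because the combining step is pure GEMM on blocks of doubling size, the recursion matches the poster's description: invert at a small size (here 16), then grow the inverted block until the desired block size is reached.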
Unlike CUDA, OpenCL uses a compile-on-the-fly model: OpenCL programs and kernels are built at runtime, which, according to our experimental results, incurs non-trivial overheads. This affects the design of a math library in OpenCL. With this overhead removed, and with exactly the same C++ code, OpenCL achieves roughly half the performance of CUDA on the GTX 280, GTX 470, and GTX 480, and is still at least 3 times faster than the CUBLAS TRSM. To make a fair comparison, a modified GEMM based on Volkov's matrix multiply is used rather than the one from CUBLAS. The performance difference comes from multiple sources, including high-overhead OpenCL utility routines such as memory release, and the different 'assembly' codes generated by the CUDA and OpenCL compilers.