Auto-tuning Dense Matrix Multiplication for GPGPU with Cache
ICPADS 2010: International Conference on Parallel and Distributed Systems
Xiang Cui, Yifeng Chen, Changyou Zhang and Hong Mei
Presenter: Shih-Meng Teng

Outline
◦ Introduction
◦ Fermi's new features from the programmer's perspective
◦ Related work of GEMM on GPUs
◦ Auto-tuned matrix multiplication on Fermi
  - CUDA GEMM code template
  - Auto-tuning the GEMM code template
◦ Experiment
◦ Conclusion

Introduction
GPU architecture: GT200 -> Fermi. Compared with the GT200 architecture, the new features include:
1. Improved double-precision performance
2. An L1/L2 cache hierarchy
3. A larger register file
4. More shared memory
5. ECC support
6. Faster atomic operations

The added cache takes advantage of data locality at run time, but it also makes performance less predictable. Programmers must understand these hardware constraints well to achieve high performance on Fermi.

Automatic performance tuning, or auto-tuning for short, is a practical technique for obtaining near-optimal code on complex and unpredictable architectures. In this work, auto-tuning is used to optimize GEMM code on Fermi.

Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on Tesla C2050, which is about a 1.7x and a 1.6x speedup with respect to CUBLAS 3.0, making it, to our knowledge, the fastest GEMM implementation written in CUDA and C without tuning at the level of binary code.

Fermi's new features from the programmer's perspective
Programmers must understand the following features:
1. L1/L2 cache
2. Register file usage
3. 32/64-bit device code
4. Global memory access
5. Bank conflicts
6. Concurrent execution of multiple kernels

L1/L2 cache
◦ Compared to the GT200 architecture, Fermi comes with an L1/L2 cache for local and global memory accesses.
◦ Programmers have some control over L1 caching: the same on-chip memory is used for both L1 and shared memory, and how much of it serves as L1 or as shared memory is configurable per kernel call. Kernels that use a lot of local memory could benefit from 48 KB of L1 cache.
◦ In addition to the L1 cache, Fermi also features a unified L2 cache of 768 KB.
◦ Considering the unpredictability of the cache's effect, auto-tuning is a practical approach to obtaining high-performance CUDA code.

Register file
◦ The register file should be used as the primary on-chip storage; if the algorithm permits, shared memory should be used less intensively in favor of the register file.
◦ Number of cores per multiprocessor: GT200: 8; Fermi: 32.
◦ Number of registers per multiprocessor: GT200: 16K; Fermi: 32K.
◦ This means the number of registers per core is halved: GT200: 16K / 8 = 2K; Fermi: 32K / 32 = 1K.

32/64-bit device code
◦ On the Fermi architecture, if the application is built in 64-bit mode, the nvcc compiler compiles both the host code and the device code in 64-bit mode.
◦ The larger pointers in the device code incur a performance penalty because of the extra space those pointers occupy in the register file.
◦ In this implementation of GEMM on Fermi, the host code and the device code are therefore always compiled separately to avoid this 64-bit pointer performance penalty.
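A rough illustration of this split (the file names, kernel name and flags below are assumptions for the sketch, not taken from the paper): the device code can be compiled on its own in 32-bit mode, for example to a cubin, while the host application is built in 64-bit mode and loads the kernel at run time through the CUDA driver API (the relevant launch functions are listed on the backup slides at the end).

    # Hypothetical build commands: device code compiled separately in 32-bit mode
    nvcc -m32 -arch=sm_20 -cubin -o sgemm_kernel.cubin sgemm_kernel.cu
    # Host application built in 64-bit mode; it loads sgemm_kernel.cubin at run time
    g++ -m64 -o sgemm_host host.cpp -lcuda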
Global memory access
◦ GT200: global memory accesses are processed per half-warp.
◦ Fermi: global memory accesses are processed per warp.
◦ Two-dimensional thread blocks, for example, should therefore have their x dimension be a multiple of the full warp size, rather than of half the warp size, so that each warp addresses a single cache line when accessing global memory.

Bank conflicts
◦ GT200: shared memory has 16 banks, and accesses are processed per half-warp.
◦ Fermi: shared memory has 32 banks, and accesses are processed per warp; 64-bit accesses are specifically handled to minimize bank conflicts.

Concurrent execution of multiple kernels
◦ Fermi supports concurrent kernel execution: different kernels of the same application context can execute on the GPU at the same time.
◦ A number of small kernels can thus be executed together to utilize the whole GPU's resources.

Related work of GEMM on GPUs
CUBLAS 1.0
◦ NVIDIA provides an example in the SDK that explains how to utilize shared memory to hide the overhead of memory latency.
◦ The SGEMM in CUBLAS 1.0, which uses shared memory to store sub-matrices of A and B, is not fully optimized.

Volkov and Demmel (CUBLAS 2.0)
◦ Modern GPUs should be viewed as multithreaded vector units, and their algorithms for matrix multiplication resemble earlier ones developed for vector processors.
◦ Their kernel stores a sub-matrix of A in shared memory but uses registers to hold the sub-matrices of B and C.
◦ This modification saves one movement from shared memory to a register per MAD (multiply-add) operation.
◦ (Figure: data placement of subA, subB and subC across shared memory and global memory.)

Authors' earlier work
◦ In [6], the authors discuss their experience in improving the performance of SGEMM.
◦ The following factors contributed to its better performance:
1. Reducing the data transfer volume,
2. Adjusting the thread block size to achieve the peak device-memory bandwidth,
3. Reducing the total number of synchronizations.

Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template
◦ Asub is m x k, Bsub is k x n and Csub is m x n; A, B and C are partitioned into M x K, K x N and M x N of these sub-blocks, respectively.
◦ Computing one Csub requires fetching K blocks of Asub and K blocks of Bsub from A and B.
◦ The total number of elements read from device memory is therefore
  M * N * K * (m * k) + M * N * K * (k * n) = (M * m) * (N * n) * (K * k) * (1/m + 1/n).
◦ In [6], we described an SGEMM kernel that achieves a peak performance of 393 GFlops on GTX280. In this work, we take this kernel as the overall template.
◦ Matrix B always resides in global memory, and the threads in one column of thread blocks need to read the data of one column of Bsub blocks, as shown in Figure 4(a). If the thread blocks can be scheduled in column-major order, better cache hit rates can be achieved when reading the data of matrix B.
◦ To this end, the code template permutes the thread-block indices, i.e., the built-in blockIdx.x and blockIdx.y variables. This modification is shown in Figure 4(b).
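A minimal sketch of what such a permutation can look like inside the kernel (the PERM macro and the kernel skeleton below are illustrative assumptions, not the paper's generated code):

    // Hypothetical sketch: PERM stands in for the auto-tuner's "perm" parameter
    // and would be supplied at compile time, e.g. with -DPERM=1.
    #ifndef PERM
    #define PERM 0
    #endif

    __global__ void gemm_kernel(/* tile arguments omitted */)
    {
        // Remap the built-in block indices.  Swapping blockIdx.x and blockIdx.y
        // changes the order in which Csub tiles -- and hence columns of Bsub --
        // are visited by consecutively scheduled thread blocks, which is what
        // improves the cache hit rate when reading matrix B.
    #if PERM
        const int bx = blockIdx.y;   // permuted mapping
        const int by = blockIdx.x;
    #else
        const int bx = blockIdx.x;   // default mapping
        const int by = blockIdx.y;
    #endif
        // The rest of the kernel uses bx and by wherever it would otherwise use
        // blockIdx.x and blockIdx.y to locate its m x n tile of C.
        (void)bx; (void)by;          // placeholder: body of the GEMM omitted
    }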
Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template
◦ A GEMM auto-tuner is designed to auto-tune this code template by automatically generating and searching a space of parameters.
◦ It has two components:
  - A code generator, which generates parameterized code according to the pre-defined code template.
  - An execution engine, which runs these codes and finds the best one.
◦ The tuning parameters are:

  Parameter        Explanation
  m, k, n          Sub-matrix (tile) sizes
  tx               X dimension of the thread block
  ty               Y dimension of the thread block
  perm             Whether to permute the built-in blockIdx.x and blockIdx.y variables
  cachePreferred   Whether to prefer L1 cache or shared memory

◦ Steps:
1. The code generator checks the validity of the input parameters; they must conform to the hardware constraints, e.g., the maximum number of threads per thread block requires tx * ty <= 1024.
2. The code generator takes the seven parameters as inputs and generates the kernel code.
3. By changing the input parameters, different kernel codes are generated.
4. The performance of the generated kernels is evaluated in order to identify the best combination.
◦ Figure 9 shows an example of auto-tuned DGEMM code for matrix size 2048 and its calling code.

Experiment
Tip: GEMM = GEneral Matrix Multiply; SGEMM = single precision; DGEMM = double precision.
◦ (Figure: SGEMM performance on Tesla C2050, peaking at 563 GFlops.)
◦ (Figure: DGEMM performance on Tesla C2050, peaking at 253 GFlops.) Up to 25% of the performance is lost if the code does not work in a cache-friendly way.
◦ The peak performance of SGEMM on Tesla C2050 is 563 GFlops, which is about a 1.3x speedup with respect to GTX285 and a 1.4x speedup with respect to Tesla C1060.
◦ The peak performance of DGEMM on Tesla C2050 is 253 GFlops, which is a 3x speedup with respect to GTX285 and a 3.4x speedup with respect to Tesla C1060.
◦ These results confirm the vendor's assertion that the Fermi architecture has been specifically designed to offer unprecedented performance in double precision.

Conclusion
◦ Our focus is to study the techniques that can assist common programmers in obtaining high-performance code on the Fermi architecture.
◦ The new features of the Fermi architecture, especially the cache hierarchy, make the performance of a given piece of code harder to predict.
◦ Auto-tuning therefore becomes a reasonable approach to achieving high-performance code on the Fermi architecture.

Q & A

GDDR3 vs. GDDR5
◦ GDDR3 is a modified form of DDR2; GDDR5 is a modified form of DDR3.
◦ GDDR5 is about 2.0x faster than GDDR3.
◦ GDDR5 consumes less power than GDDR3.
◦ GDDR5 offers higher memory bandwidth than GDDR3.

CUresult cuFuncSetBlockShape(CUfunction hfunc, int x, int y, int z)
◦ Specifies the x, y and z dimensions of the thread blocks that are created when the kernel given by hfunc is launched.

CUresult cuLaunchGrid(CUfunction f, int grid_width, int grid_height)
◦ Invokes the kernel f on a grid_width x grid_height grid of blocks.
◦ Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape().
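As a rough usage sketch (the cubin file name, kernel name and launch dimensions below are illustrative assumptions, not taken from the paper), a separately compiled kernel would be loaded and launched through these driver-API calls roughly as follows:

    #include <cuda.h>

    /* Hypothetical sketch: load a separately compiled cubin and launch its
       kernel with the two driver-API functions described above.
       Error checking is omitted for brevity. */
    int main(void)
    {
        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction kernel;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        cuModuleLoad(&mod, "sgemm_kernel.cubin");           /* assumed file name   */
        cuModuleGetFunction(&kernel, mod, "sgemm_kernel");  /* assumed kernel name */

        /* Kernel arguments (device pointers, matrix sizes, ...) would be set
           here with cuParamSetv / cuParamSeti / cuParamSetSize. */

        cuFuncSetBlockShape(kernel, 64, 4, 1);  /* e.g. tx = 64, ty = 4 threads  */
        cuLaunchGrid(kernel, 32, 32);           /* a 32 x 32 grid of blocks      */

        cuCtxDestroy(ctx);
        return 0;
    }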