Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

ICPADS 2010:
International Conference on Parallel and Distributed Systems
Xiang Cui, Yifeng Chen, Changyou Zhang and Hong Mei
Presenter: Shih-Meng Teng
1
Outline
• Introduction
  ◦ Fermi’s new features from the programmer’s perspective
  ◦ Related work of GEMM on GPUs
• Auto-tuned matrix multiplication on Fermi
  ◦ CUDA GEMM code template
  ◦ Auto-tuning the GEMM code template
• Experiment
• Conclusion
2
Introduction

• GPU architecture:
  ◦ GT200 -> Fermi
• Compared with the GT200 architecture, the new features include:
  1. Improved double-precision performance,
  2. L1/L2 cache hierarchy,
  3. Larger register file,
  4. More shared memory,
  5. ECC support,
  6. Faster atomic operations.
3
Introduction (cont.)

• The added cache exploits data locality at run time, but on the other hand it makes performance less predictable.
• Programmers must understand the constraints of the hardware platform well to achieve high performance on Fermi.
4
Introduction (cont.)

• Automatic performance tuning, or auto-tuning for short, is a practical technique for obtaining near-optimal code on complex and unpredictable computational architectures.
• In this work, auto-tuning is used to optimize GEMM code on Fermi.
5
Introduction (cont.)

• Our auto-tuned SGEMM and DGEMM reach 563 GFlops and 253 GFlops respectively on a Tesla C2050, which is about a 1.7x and 1.6x speedup with respect to CUBLAS 3.0.
• CUBLAS 3.0 is the fastest GEMM implementation written in CUDA and C without tuning at the level of binary code.
6
Fermi’s new features from the programmer’s perspective

• Programmers must understand the following features:
  1. L1/L2 cache
  2. Register file usage
  3. 32/64-bit device code
  4. Global memory access
  5. Bank conflicts
  6. Concurrent execution of multiple kernels
7
Fermi’s new features from the programmer’s perspective (cont.)

• L1/L2 cache
  ◦ Compared to the GT200 architecture, Fermi comes with an L1/L2 cache hierarchy for local and global memory accesses.
8
Fermi’s new features from the programmer’s perspective (cont.)

• L1/L2 cache (cont.)
  ◦ Programmers have some control over L1 caching (see the sketch below):
    - The same on-chip memory is used for both L1 cache and shared memory.
    - How much of it serves as L1 cache versus shared memory is configurable per kernel call.
    - Kernels that use a lot of local memory can benefit from 48 KB of L1 cache.
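For illustration, the per-kernel preference can be set with the CUDA runtime call cudaFuncSetCacheConfig. The kernel below and the chosen preference are assumptions for the sketch, not code from the paper.

#include <cuda_runtime.h>

// Hypothetical kernel, used only to show where the cache preference applies.
__global__ void dummyKernel(float *out)
{
    out[threadIdx.x] = (float)threadIdx.x;
}

int main(void)
{
    // On Fermi, prefer 48 KB of L1 cache and 16 KB of shared memory for this
    // kernel; cudaFuncCachePreferShared would flip the split the other way.
    cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferL1);

    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    dummyKernel<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}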
9
Fermi’s new features from the programmer’s perspective (cont.)

• L1/L2 cache (cont.)
  ◦ In addition to the L1 cache, Fermi also features a unified L2 cache of 768 KB.
  ◦ Given the unpredictability of the cache’s effects, auto-tuning is a practical approach to obtaining high-performance CUDA code.
10
Fermi’s new features from the programmer’s perspective (cont.)

• Register file
  ◦ The register file should be used as the primary on-chip storage space; if the algorithm permits, shared memory should be used less intensively in favor of the register file.
11
Fermi’s new features from the programmer’s perspective (cont.)

• Register file (cont.)
  ◦ Number of cores per multiprocessor
    - GT200: 8
    - Fermi: 32
  ◦ Number of registers per multiprocessor
    - GT200: 16K
    - Fermi: 32K
  ◦ This means the number of registers per core is halved:
    - GT200: 16K / 8 = 2K
    - Fermi: 32K / 32 = 1K
12
Fermi’s new features from the programmer’s perspective (cont.)

• 32/64-bit device code
  ◦ On the Fermi architecture, if the application is built in 64-bit mode, the nvcc compiler compiles both the host code and the device code in 64-bit mode.
  ◦ The larger pointers in the device code incur a performance penalty due to the extra space those pointers occupy in the register file.
13
Fermi’s new features from the programmer’s perspective (cont.)

• 32/64-bit device code (cont.)
  ◦ In this implementation of GEMM on Fermi, the host code and the device code are always compiled separately to avoid this 64-bit pointer performance penalty.
14
Fermi’s new features from the programmer’s perspective (cont.)

• Global memory access
  ◦ GT200: global memory accesses are processed per half-warp.
  ◦ Fermi: global memory accesses are processed per warp.
15
Fermi’s new features from the programmer’s perspective (cont.)

• Global memory access (cont.)
  ◦ Two-dimensional thread blocks, for example, should have their x dimension be a multiple of the warp size, as opposed to half the warp size, so that each warp addresses a single cache line when accessing global memory (a block-shape sketch follows below).
[Figure: dimBlock.x and threads per block]
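As a hedged illustration of this guideline (the kernel, matrix size, and block shape below are placeholder values, not the paper’s code), the block’s x dimension is chosen as one full warp:

#include <cuda_runtime.h>

// Placeholder kernel: each thread touches one element of a row-major matrix.
__global__ void touch(float *a, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)
        a[row * width + col] += 1.0f;   // consecutive threads hit consecutive addresses
}

int main(void)
{
    const int width = 1024, height = 1024;   // assumed sizes
    float *d_a;
    cudaMalloc(&d_a, width * height * sizeof(float));

    // blockDim.x = 32 (one full warp) rather than 16 (half a warp): on Fermi
    // each warp then maps onto a single 128-byte line of global memory.
    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    touch<<<grid, block>>>(d_a, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}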
16
Fermi’s new features from the programmer’s perspective (cont.)

• Bank conflicts
  ◦ Number of shared-memory banks
    - GT200: 16 banks; accesses are processed per half-warp.
    - Fermi: 32 banks; accesses are processed per warp.
    - On Fermi, 64-bit accesses are specifically handled to minimize bank conflicts.
17
Fermi’s new features from the programmer’s perspective (cont.)

• Concurrent execution of multiple kernels
  ◦ Fermi supports concurrent kernel execution.
    - Different kernels of the same application context can execute on the GPU at the same time.
  ◦ An application can therefore execute a number of small kernels to utilize the whole GPU’s resources.
18
Related work of GEMM on GPUs

• CUBLAS 1.0
  ◦ NVIDIA provides an example in the SDK that explains how to use shared memory to hide the overhead of memory latency.
  ◦ The SGEMM in CUBLAS 1.0, which uses shared memory to store sub-matrices of A and B, is not fully optimized (a sketch of this shared-memory scheme follows below).
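A minimal shared-memory tiled SGEMM in the spirit of the SDK example, not the CUBLAS 1.0 source; it assumes row-major storage and N divisible by TILE. Both the A tile and the B tile are staged in shared memory:

#define TILE 16

__global__ void sgemm_shared(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * N + k0 + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)   // every operand comes from shared memory
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

It would be launched with dim3 grid(N / TILE, N / TILE) and dim3 block(TILE, TILE) over device buffers allocated by the caller.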
19
Related work of GEMM on GPUs (cont.)

• Volkov and Demmel (CUBLAS 2.0)
  ◦ Modern GPUs should be viewed as multi-threaded vector units, and their matrix multiplication algorithms resemble the earlier ones developed for vector processors.
20
Related work of GEMM on GPUs (cont.)

• Volkov and Demmel (CUBLAS 2.0)
  ◦ Stores the sub-matrix of A in shared memory but uses registers to hold the sub-matrices of B and C.
  ◦ This modification saves one move from shared memory to a register per MAD operation (MAD: one multiply + add); a much-simplified sketch of this register blocking follows below.
[Figure: data placement of subA (shared memory), subB, and subC, with A, B, and C in global memory]
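The following sketch illustrates the idea only and is not the CUBLAS 2.0 kernel: the A tile is staged in shared memory, while each B operand is read straight into a register and the C accumulators never leave registers. It assumes row-major storage and N divisible by 64; each 16x16 thread block computes a 64x16 tile of C, four C elements per thread.

#define KTILE 16

__global__ void sgemm_regblock(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[64][KTILE];          // 64 x 16 tile of A in shared memory

    int tx = threadIdx.x;                    // 0..15, column within the C tile
    int ty = threadIdx.y;                    // 0..15
    int row0 = blockIdx.y * 64 + ty * 4;     // first of the 4 C rows of this thread
    int col  = blockIdx.x * 16 + tx;         // C column of this thread

    float c[4] = {0.f, 0.f, 0.f, 0.f};       // C accumulators stay in registers

    for (int k0 = 0; k0 < N; k0 += KTILE) {
        // Each thread stages 4 elements of the A tile into shared memory.
        for (int i = 0; i < 4; ++i)
            As[ty * 4 + i][tx] = A[(row0 + i) * N + k0 + tx];
        __syncthreads();

        // One B element is read into a register and reused for 4 accumulators,
        // so no second shared-memory staging step is needed for B.
        for (int k = 0; k < KTILE; ++k) {
            float b = B[(k0 + k) * N + col];
            for (int i = 0; i < 4; ++i)
                c[i] += As[ty * 4 + i][k] * b;
        }
        __syncthreads();
    }

    for (int i = 0; i < 4; ++i)
        C[(row0 + i) * N + col] = c[i];
}

A launch such as sgemm_regblock<<<dim3(N / 16, N / 64), dim3(16, 16)>>>(dA, dB, dC, N), with device buffers allocated and filled by the caller, computes C = A * B under these assumptions.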
21
Related work of GEMM on GPUs (cont.)

• The authors’ research
  ◦ In [6], the authors discuss their experience in improving the performance of SGEMM.
  ◦ The following factors contributed to its better performance:
    1. Reducing the data transfer volume,
    2. Adjusting the thread block size to achieve the peak device-memory bandwidth,
    3. Reducing the total number of synchronizations.
22
Auto-tuned matrix multiplication on
Fermi - CUDA GEMM code template
23
Auto-tuned matrix multiplication on
Fermi - CUDA GEMM code template
• Asub: m * k, Bsub: k * n, Csub: m * n.
• A, B, and C are partitioned into M * K, K * N, and M * N grids of these sub-blocks.
• Computing one Csub requires fetching K blocks of Asub and K blocks of Bsub from A and B.
• Total elements read from device memory (a worked example follows below):
  M * N * K * (m * k) + M * N * K * (k * n) = (M * m) * (N * n) * (K * k) * (1/m + 1/n)
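As an added observation (not from the slides): for square sub-blocks with m = n, the read volume simplifies to

  (M * m) * (N * n) * (K * k) * (1/m + 1/n) = 2 * (M * m) * (N * n) * (K * k) / m

so doubling the sub-block edge m halves the device-memory traffic, which is one reason the auto-tuner searches over the sub-matrix sizes m, k, and n.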
24
Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template

• In [6], we described an SGEMM kernel that achieves a peak performance of 393 GFlops on a GTX 280. In this work, we take this kernel as the overall template.
25
Auto-tuned matrix multiplication on
Fermi - CUDA GEMM code template
26
Auto-tuned matrix multiplication on
Fermi - CUDA GEMM code template
• Matrix B always resides in global memory; thread blocks in one column of the grid need to read the data of one column of Bsub blocks, as shown in Figure 4(a).
• If the thread blocks can be scheduled in column-major order, better cache hit rates can be achieved when reading the data of matrix B.
27
Auto-tuned matrix multiplication on Fermi - CUDA GEMM code template

• In this code template the thread-block indices are permuted.
  ◦ The built-in blockIdx.x and blockIdx.y variables are permuted so that blocks are effectively scheduled in column-major order.
• This modification is shown in Figure 4(b); a minimal sketch follows below.
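One simple way to realize such a permutation (a hypothetical kernel fragment, not the paper’s code) is to exchange the roles of blockIdx.x and blockIdx.y, so that consecutively scheduled blocks walk down a column of C tiles and reuse the B data that is still resident in cache:

// Illustration only; assumes a square grid so exchanging the indices is valid.
__global__ void kernel_with_permuted_blocks(float *C, int n, bool perm)
{
    // With perm == true, blocks that the hardware schedules consecutively
    // (consecutive blockIdx.x) cover consecutive block rows of C, i.e. the
    // C tiles are traversed in column-major order.
    int blockCol = perm ? blockIdx.y : blockIdx.x;
    int blockRow = perm ? blockIdx.x : blockIdx.y;

    int col = blockCol * blockDim.x + threadIdx.x;
    int row = blockRow * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        C[row * n + col] = 0.0f;   // placeholder work on the permuted tile
}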
28
Auto-tuned matrix multiplication on
Fermi - CUDA GEMM code template
29
Auto-tuned matrix multiplication on Fermi
- Auto-tuning the GEMM code template
• A GEMM auto-tuner is designed to tune this code template by automatically generating and searching a space of parameter values.
• It has two components (a sketch of the search loop is given below):
  ◦ A code generator
    - It generates parameterized code according to the pre-defined code template.
  ◦ An execution engine
    - It runs these code variants and finds the best one.
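As a hedged illustration of what the execution engine might do (the launchCandidate helper and the timing scheme are assumptions, not the paper’s implementation), each generated variant can be timed with CUDA events and the fastest one retained:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: runs one generated GEMM variant, identified by an
// index into the set of compiled candidate kernels.
void launchCandidate(int variant, float *dA, float *dB, float *dC, int n);

// Time each candidate with CUDA events and report the fastest one.
int pickBestVariant(int numVariants, float *dA, float *dB, float *dC, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best = -1;
    float bestMs = 1e30f;
    for (int v = 0; v < numVariants; ++v) {
        launchCandidate(v, dA, dB, dC, n);      // warm-up run
        cudaEventRecord(start);
        launchCandidate(v, dA, dB, dC, n);      // timed run
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("variant %d: %.3f ms\n", v, ms);
        if (ms < bestMs) { bestMs = ms; best = v; }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best;
}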
30
Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template

• Define parameters:

  Parameter        Explanation
  m, k, n          Sub-matrix (block) sizes
  tx               X dimension of the thread block
  ty               Y dimension of the thread block
  perm             Whether to permute the built-in blockIdx.x and blockIdx.y variables
  cachePreferred   Whether to prefer L1 cache or shared memory
31
Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template

• Steps (a validity-check sketch is given after this list):
  1. The code generator checks the validity of the input parameters.
     - By validity we mean that the parameters must conform to hardware constraints, e.g., the maximum number of threads per thread block: tx * ty <= 1024.
  2. The code generator takes the seven parameters as inputs and generates the kernel code.
  3. By changing the input parameters, we can generate different kernel codes.
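A minimal sketch of such a validity check; the parameter struct, the assumed limits (1024 threads per block, 48 KB or 16 KB of shared memory depending on the cache preference), and the shared-memory estimate are illustrative assumptions, not the authors’ generator code:

#include <stdbool.h>

/* Illustrative container for the seven tuning parameters from the table. */
struct GemmParams {
    int m, k, n;            /* sub-matrix sizes                                      */
    int tx, ty;             /* thread-block dimensions                               */
    int perm;               /* permute blockIdx.x / blockIdx.y?                      */
    int cachePreferred;     /* nonzero: prefer L1 over shared memory (assumed meaning) */
};

static bool isValid(const struct GemmParams *p)
{
    if (p->tx * p->ty > 1024)                        /* too many threads per block */
        return false;
    long smemLimit = p->cachePreferred ? 16 * 1024 : 48 * 1024;
    if ((long)p->m * p->k * (long)sizeof(float) > smemLimit)
        return false;                                /* assumed: one m*k tile of A must fit */
    if (p->m % p->ty != 0 || p->n % p->tx != 0)      /* illustrative divisibility rule */
        return false;
    return true;
}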
32
Auto-tuned matrix multiplication on Fermi - Auto-tuning the GEMM code template

• Steps (cont.):
  4. Evaluate their performance in order to identify the best combination.
• Figure 9 shows an example of auto-tuned DGEMM code for matrix size 2048 and its calling code.
33
34
Experiment
Tip: GEMM = GEneral Matrix Multiply; SGEMM = single precision; DGEMM = double precision.
[Figure: SGEMM performance results, peaking at 563 GFlops]
35
Experiment (cont.)
[Figure: DGEMM performance results, peaking at 253 GFlops]
Up to 25% of the performance is lost if the code does not work in a cache-friendly way.
36
Experiment (cont.)
The peak performance of SGEMM on the Tesla C2050 is 563 GFlops, which is about a 1.3x speedup with respect to the GTX 285 and a 1.4x speedup with respect to the Tesla C1060.
37
Experiment (cont.)
The peak performance of DGEMM on the Tesla C2050 is 253 GFlops, which is a 3x speedup with respect to the GTX 285 and a 3.4x speedup with respect to the Tesla C1060.
38
Experiment (cont.)

• These results confirm the vendor’s assertion that the Fermi architecture has been specifically designed to offer unprecedented double-precision performance.
39
Conclusion

• Our focus is to study techniques that can help ordinary programmers obtain high-performance code on the Fermi architecture.
• The new features of the Fermi architecture, especially the cache hierarchy, make it harder to predict the performance of a given piece of code.
• Auto-tuning therefore becomes a reasonable way to achieve high-performance code on the Fermi architecture.
40
Q & A
41
GDDR3 vs. GDDR5

• GDDR3 is a modified form of DDR2.
• GDDR5 is a modified form of DDR3.
• GDDR5 offers roughly a 2.0x speedup with respect to GDDR3.
• The power consumption of GDDR5 is lower than that of GDDR3.
• The memory bandwidth of GDDR5 is larger than that of GDDR3.
42
CUresult cuFuncSetBlockShape(CUfunction hfunc, int x, int y, int z)

• Specifies the x, y, and z dimensions of the thread blocks that are created when the kernel given by hfunc is launched.
43
CUresult cuLaunchGrid(CUfunction f, int grid_width, int grid_height)

• Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape() (a usage sketch follows below).
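For context, a hedged sketch of how these driver-API calls fit together when device code is compiled and loaded separately; the module file name, kernel name, and block/grid sizes are assumptions, and the kernel-argument setup via the cuParamSet* family is omitted for brevity:

#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Load separately compiled device code (hypothetical file and kernel names). */
    cuModuleLoad(&mod, "gemm_kernel.cubin");
    cuModuleGetFunction(&kernel, mod, "sgemm_kernel");

    /* Thread-block shape, e.g. tx = 64, ty = 4, z = 1 (illustrative values). */
    cuFuncSetBlockShape(kernel, 64, 4, 1);

    /* Kernel arguments would be set here with cuParamSetv()/cuParamSetSize()
       before a real launch. */

    /* Launch on a 32 x 32 grid of blocks (illustrative values). */
    cuLaunchGrid(kernel, 32, 32);
    cuCtxSynchronize();

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}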
44