
AN INTEGER PROGRAMMING
FRAMEWORK FOR OPTIMIZING
SHARED MEMORY USE ON GPUS
Wenjing Ma
Gagan Agrawal
The Ohio State University
HiPC 2010
GPGPU
General-purpose programming on GPUs (accelerators)
- High performance/price ratio
- High-level language support
- CUDA
- Performance vs. productivity
- Hard to program
- Memory hierarchy to manage
- ...
Get High Performance from GPU
Automatic code generation
Device memory access is expensive
Using shared memory
Texture and constant memory
Coalescing device memory access
...
And Make the Programming Simple!
FEATURES OF SHARED MEMORY
Small, fast, like a cache
- 16KB on each multiprocessor (no more than 48KB even on the latest GPUs)
- Read-write
Software controlled
- __shared__ float data[n][n]; (see the sketch after this list)
Allocating shared memory:
- Similar to register allocation
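To make the declaration above concrete, here is a minimal CUDA kernel sketch (the kernel name, tile size, and indexing are illustrative, not from the paper; a statically declared __shared__ array needs compile-time sizes):

#define TILE 16

__global__ void tile_sum(float *out, const float *in) {
    /* one TILE x TILE tile staged in fast on-chip shared memory */
    __shared__ float data[TILE][TILE];

    int x = threadIdx.x, y = threadIdx.y;
    data[y][x] = in[y * TILE + x];        /* each thread loads one element from device memory */
    __syncthreads();                      /* make all loads visible before cross-thread reuse */
    out[y * TILE + x] = data[y][x] + data[x][y];   /* reuse data loaded by another thread */
}

For sizes known only at run time, CUDA instead offers a dynamically sized extern __shared__ array whose size is supplied at kernel launch.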
Problem Formulation for Shared Memory Arrangement
Consider variables and basic blocks in a function
- An element of an array, a whole array, or a section of an array
Each variable can have several live ranges in the function
- Access feature of a live range: read, write, read-write, temp
Determine in which basic block each variable is allocated to shared memory
- assign_point[i][k]: variable i, basic block k
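Written as 0-1 decision variables (just a renaming of assign_point so the later formulas read naturally):

x_{i,k} = assign_point[i][k] ∈ {0, 1}, where x_{i,k} = 1 means variable i is placed into shared memory at basic block k.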
Integer Programming Problem
Integer Linear Programming
- A special case of linear programming
- All the unknown variables are integers (0-1 in our case)
- Solvable for problems of reasonable size
Objective function
- Maximize z = c^T x
Constraints
- Ax <= b
Solution
- Values of x
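As a toy instance of such a 0-1 program (an illustration only, not taken from the paper):

maximize  z = 4*x1 + 3*x2
subject to  2*x1 + x2 <= 2,  x1, x2 ∈ {0, 1}

Here (x1, x2) = (1, 1) violates the constraint, so the optimum is (1, 0) with z = 4.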
Integer Programming for Shared Memory Arrangement
Objective Function
- Maximize shared memory usage
- Minimize data transfer between levels of the memory hierarchy
Integer Programming for Shared Memory Arrangement (cnt’d)
Objective Function
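As an illustration of the general shape such an objective can take (a sketch only, not the slide's actual formula), each candidate placement can be weighted by the benefit it brings:

maximize  z = Σ_i Σ_k gain_{i,k} * x_{i,k}

where gain_{i,k} would combine how often variable i is accessed in basic block k with size_alloc(i, k), the amount of shared memory the placement occupies.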
An Example to Show size_alloc
for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];
......
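One plausible reading of size_alloc for this loop, consistent with the generated code shown later (each thread handles one value of i per pass, so per-thread copies of A and C are staged):

size_alloc(A) = NUM_THREADS * r elements per thread block
size_alloc(B) = m * r elements, shared by all threads
size_alloc(C) = NUM_THREADS * r elements (per-thread partial sums, combined at the end)

With the parameters of the later example (m = 3, r = 3, NUM_THREADS = 256), A and C each need 256 * 3 * 4 bytes = 3 KB and B needs only 9 floats, comfortably within 16 KB of shared memory.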
Integer Programming for Shared Memory Arrangement (cnt’d)
Constraints
- Total allocation does not exceed the limit of shared memory at any time
- At most one assign_point is 1 in each live range
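In the 0-1 notation introduced earlier, these two constraints take roughly the following form (a sketch; S is the shared-memory capacity and size_alloc(i, k) is the footprint of variable i when placed at block k):

for every basic block k:  Σ over variables i resident in shared memory during k of size_alloc(i, k_i)  <=  S
for every live range L of variable i:  Σ_{k ∈ L} x_{i,k}  <=  1

where variable i counts as resident during k if x_{i,k'} = 1 for some block k' at or before k within the live range that contains k (a variable keeps occupying space until its live range ends).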
Integer Programming for Shared Memory Arrangement (cnt’d)
Obtaining parameters
- Using the LLVM compiler framework
- Pass 1: get access features (read, write, read-write, temp)
- Pass 2: get live ranges, loop information, indices, and all other parameters
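A rough sketch of the per-variable summary such passes could hand to the integer programming model (the field names are hypothetical, not taken from the paper's implementation):

/* access feature of a live range, as listed above */
enum access_feature { READ, WRITE, READ_WRITE, TEMP };

/* hypothetical record produced by the two analysis passes */
struct live_range_info {
    const char *var_name;        /* array, array section, or scalar */
    enum access_feature access;  /* from Pass 1 */
    int first_block, last_block; /* live range in basic blocks, from Pass 2 */
    long size_alloc;             /* shared-memory footprint if placed */
    long access_count;           /* how often the range touches the data */
};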
Code Generation
- According to the shared memory arrangement obtained from the integer programming model
- Within the framework from previous work
- Move data to cover the gap caused by data evicted from shared memory
An Example
for (int i = 0; i < n; i++)
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      C[k] += A[i][k] - B[j][k];
......
Sizes: A: n*r, B: m*r, C: r
Parameters: n = 2048, m = 3, r = 3, NUM_THREADS = 256
Output of the Integer Programming Solver:
assign_point[0][1] = 1;
assign_point[1][0] = 1;
assign_point[2][0] = 1;
/* all other elements of assign_point are 0; variables: 0 = A, 1 = B, 2 = C */
An Example (cnt’d)
Generated Code:
__shared__ float s_B[m][r];
__shared__ float s_C[r*NUM_THREADS];
__shared__ float s_A[r*NUM_THREADS];
/* B is small and read-only: staged once, before the i loop */
for (int j = 0; j < m; j++)
  for (int k = 0; k < r; k++) s_B[j][k] = B[j][k];
for (int i = 0; i < n; i += NUM_THREADS) {
  /* each thread stages its own row of A */
  for (int j = 0; j < r; j++)
    s_A[tid*r+j] = A[tid+i][j];
  for (int j = 0; j < m; j++)
    for (int k = 0; k < r; k++)
      s_C[tid*r+k] += s_A[tid*r+k] - s_B[j][k];
  ......
}
/* Synchronize and combination of C */
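The last comment is left abstract on the slide; one possible shape of that step, continuing the kernel above (a sketch, not the paper's generated code; it assumes C is the global result array of length r and lets thread 0 do the summation for clarity):

__syncthreads();                      /* all per-thread partial sums in s_C are ready */
if (tid == 0) {
    for (int k = 0; k < r; k++) {
        float sum = 0.0f;
        for (int t = 0; t < NUM_THREADS; t++)
            sum += s_C[t*r + k];      /* combine the NUM_THREADS partial copies */
        C[k] = sum;
    }
}

With several thread blocks, a further cross-block combination of C (for example, per-block partial results reduced on the host) would still be needed.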
Suggesting Loop Transformation
Original loop:
for (int rc = 0; rc < nRowCl; rc++) {
  tempDis = 0;
  for (int c = 0; c < numCol; c++)
    tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}
Transformed loop (rc and c interchanged, tempDis expanded to an array):
for (int rc = 0; rc < nRowCl; rc++)
  tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
  /* load into shared memory */
  for (int rc = 0; rc < nRowCl; rc++) {
    tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
  }
}
Experiments
Effectiveness of using shared memory
Compare with the intuitive approach from previous work
- Greedy sorting: sort all the variables in increasing order of size, and allocate them in shared memory until the limit of shared memory is reached
Effectiveness of the loop transformation suggested by the integer programming model
Experiment Results
[Charts: execution time in seconds for EM and K-means over configurations (threads_per_block * blocks) from 256*4 to 256*256, comparing the no-shared-memory, basic, and Int-solved versions.]
Experiment Results (cnt’d)
[Charts: execution time in seconds for Co-clustering and PCA over configurations (threads_per_block * blocks) from 128*1 to 128*64, comparing the no-shared-memory, basic, and Int-solved versions.]
Effect of Loop Transformation
[Charts: execution time in seconds for Co-clustering and PCA over configurations (threads_per_block * blocks), comparing the non-transformed and transformed versions.]
Conclusion and Future Work
Proposed an integer programming model for shared memory arrangement on GPUs
Considers numeric variables, whole arrays, and sections of arrays
Suggested loop transformations for further optimization
Obtained better results than the intuitive method
Future work: automate the code generation and the selection of loop transformations
THANK YOU!
Questions?