AN INTEGER PROGRAMMING
FRAMEWORK FOR OPTIMIZING
SHARED MEMORY USE ON GPUS
Wenjing Ma
Gagan Agrawal
The Ohio State University
HiPC 2010
GPGPU
General Purpose Programming on GPUs (accelerators)
High performance/price ratio
Good language support
CUDA
Performance vs Productivity
Hard to program
Memory hierarchy to manage
...
Get High Performance from the GPU
Automatic code generation
Device memory access is expensive; mitigate it by:
Using shared memory
Using texture and constant memory
Coalescing device memory accesses
...
And Make the Programming Simple!
FEATURES OF SHARED MEMORY
Small, fast, like a cache
16KB on each multiprocessor (no more than 48KB even on the latest GPUs)
Read-write
Software controlled:
__shared__ float data[N][N];  /* N must be a compile-time constant */
Allocating shared memory: similar to register allocation
Problem Formulation for Shared Memory Arrangement
Consider variables and basic blocks in a function
A variable can be an element of an array, a whole array, or a section of an array
Each variable can have several live ranges in the function
Access feature of a live range: read, write, read-write, temp
Determine in which basic block each variable is allocated to shared memory
assign_point[i][k]: 1 if variable i is assigned to shared memory in basic block k
Integer Programming Problem
Integer Linear Programming
Objective function
Maximize z = c^T x
Constraints
Solution
Values of x
A special case of linear programming
All the unknown variables are integers (0-1 in our case)
Solvable for reasonably sized problems
Integer Programming for Shared Memory Arrangement
Objective Function
Maximize shared memory usage
Minimize data transfer between memory hierarchies
Integer Programming for Shared Memory Arrangement (cnt'd)
Objective Function
[formula shown as an equation image in the original slide]
An Example to Show size_alloc
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
Integer Programming for Shared Memory Arrangement (cnt'd)
Constraints
Total allocation does not exceed the shared memory limit at any time
At most one assign_point is 1 in each live range
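Writing assign_point[i][k] as a 0-1 variable, the two constraints can be sketched as follows (a simplified rendering; the paper's full model also accounts for live ranges at loop level):

```latex
% capacity: variables resident in shared memory at basic block k must fit
\sum_{i \,\text{live at}\, k} \mathrm{size\_alloc}_{i} \cdot \mathrm{assign\_point}_{i,k} \;\le\; \mathrm{SM\_SIZE}

% at most one assignment point per live range of variable i
\sum_{k \,\in\, \mathrm{live\_range}(i)} \mathrm{assign\_point}_{i,k} \;\le\; 1,
\qquad \mathrm{assign\_point}_{i,k} \in \{0, 1\}
```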
Integer Programming for Shared Memory Arrangement (cnt'd)
Obtaining parameters
Using LLVM compiler framework
Pass 1: get access features
Read, write, read-write, temp
Pass 2: get live ranges, loop information, indices, and all other parameters
Code Generation
According to the shared memory arrangement obtained from the integer programming model
Within the framework of our previous work
Move data to cover the gap caused by data evicted from shared memory
An Example
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            C[k] += A[i][k] - B[j][k];
......
Sizes: A: n*r, B: m*r, C: r
Parameters: n = 2048, m = 3, r = 3, NUM_THREADS = 256
        |
        | Integer Programming Solver
        v
assign_point[0][1] = 1;
assign_point[1][0] = 1;
assign_point[2][0] = 1;
/* all other elements of assign_point are 0 */
An Example (cnt’d)
Generated Code:
__shared__ float s_B[m*r];
__shared__ float s_C[r*NUM_THREADS];
__shared__ float s_A[r*NUM_THREADS];
for (int i = 0; i < m*r; i++)
    s_B[i] = B[i];
for (int i = 0; i < n; i += NUM_THREADS) {
    for (int j = 0; j < r; j++)
        s_A[tid*r+j] = A[tid+i][j];
    for (int j = 0; j < m; j++)
        for (int k = 0; k < r; k++)
            s_C[tid*r+k] += s_A[tid*r+k] - s_B[j*r+k];
    ......
}
/* Synchronize and combine the per-thread copies of C */
Suggesting Loop Transformation
/* original: rc outer, c inner */
for (int rc = 0; rc < nRowCl; rc++) {
    tempDis = 0;
    for (int c = 0; c < numCol; c++)
        tempDis = tempDis + data[r][c] * Acomp[rc][colCL[c]];
}

/* transformed: c outer, so each data[r][c] can be staged in shared memory */
for (int rc = 0; rc < nRowCl; rc++)
    tempDis[rc] = 0;
for (int c = 0; c < numCol; c++) {
    /* load into shared memory */
    for (int rc = 0; rc < nRowCl; rc++)
        tempDis[rc] += data[r][c] * Acomp[rc][colCL[c]];
}
Experiments
Effectiveness of using shared memory
Compared with the intuitive approach in previous work
Greedy sorting: sort all the variables in increasing order of size, and allocate them to shared memory until the limit is reached
Effectiveness of the loop transformation suggested by the integer programming model
Experiment Results
[Charts: execution time (seconds) vs configuration (threads_per_block * blocks, 256*4 to 256*256) for EM and K-means, comparing the no-shared-memory, basic, and Int-solved versions]
Experiment Results (cnt'd)
[Charts: execution time (seconds) vs configuration (threads_per_block * blocks, 128*1 to 128*64) for Co-clustering and PCA, comparing the no-shared-memory, basic, and Int-solved versions]
Effect of Loop Transformation
[Charts: execution time (seconds) vs configuration (threads_per_block * blocks, 128*1 to 128*64) for Co-clustering and PCA, non-transformed vs transformed]
Conclusion and Future Work
Proposed an integer programming model for shared memory arrangement on GPUs
Considers numeric variables, whole arrays, and sections of arrays
Suggests loop transformations for optimization
Obtained better results than the intuitive greedy method
Future work: automate code generation and the selection of loop transformations
THANK YOU!
Questions?