Effect of Instruction Fetch and Memory Scheduling on
GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
2
GPU Architecture (based on Tesla Architecture)
SM – Streaming Multiprocessor
SP – Scalar Processor
SIMT – Single Instruction Multiple Thread
3
SM Architecture (based on Tesla Architecture)
• Fetch Mechanism
– Fetch 1 instruction for the selected warp
– Stall fetch for a warp when it executes a Load/Store or encounters a Branch
• Scheduler Policy
– Oldest first and in-order (within a warp)
• Caches
– I-Cache, Shared Memory, Constant Cache and Texture Cache
4
Handling Multiple Memory Requests
• MSHR/Memory Request Queue
– Allows merging of memory requests (Intra-core)
• DRAM Controller
– Allows merging of memory requests (Inter-core)
5
Intra-core Merging
6
Code Example - Intra-Core Merging
• From MonteCarlo in the CUDA SDK
• Notation: A(X,Y) is the value of A in block X, thread Y (X – Block Id, Y – Thread Id)

for(iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x)
{
    // iSum(0,2) = iSum(1,2) = iSum(2,2) = 2
    ...
    for(int i = iSum; i < pathN; i += SUM_N)
    {
        // i(0,2) = i(1,2) = i(2,2) = 2
        real r = d_Samples[i];   // r(0,2) = r(1,2) = r(2,2) = d_Samples[2]
        real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
        sumCall.Expected += callValue;
        sumCall.Confidence += callValue * callValue;
    }
    ...
}

• Multiple blocks are assigned to the same SM
• Threads with corresponding Ids in different blocks access the same memory locations
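To make the merging concrete, here is a minimal C++ sketch of an MSHR-style memory request queue (in the spirit of the MSHR/Memory Request Queue described earlier) that combines loads to the same cache line into one outstanding request. This is an illustration only, not the simulator's actual code; the MemRequest fields, the mergeOrInsert helper, and the 64-byte line size are assumptions.

// Hypothetical intra-core (per-SM) request merging. Loads from different warps
// or blocks that hit the same cache line share one outstanding DRAM request.
#include <cstdint>
#include <vector>

struct MemRequest {
    uint64_t addr;              // byte address of the first load to this line
    std::vector<int> warp_ids;  // warps waiting on this line
};

class RequestQueue {
    static constexpr uint64_t kLineSize = 64;  // assumed cache-line size
    std::vector<MemRequest> pending_;          // outstanding (MSHR-like) entries
public:
    // Returns true if the load was merged with an existing outstanding request.
    bool mergeOrInsert(uint64_t addr, int warp_id) {
        const uint64_t line = addr / kLineSize;
        for (auto& req : pending_) {
            if (req.addr / kLineSize == line) {   // same line already outstanding
                req.warp_ids.push_back(warp_id);  // merge: no new DRAM request
                return true;
            }
        }
        pending_.push_back(MemRequest{addr, {warp_id}});  // miss: allocate a new entry
        return false;
    }
};

In the MonteCarlo example above, thread 2 of blocks 0, 1, and 2 (all resident on the same SM) loads d_Samples[2]; the second and third loads would merge with the first entry instead of generating new DRAM requests.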
7
Inter-core Merging
8
Why look at Fetch?
Allows implicit control over resources allocated to a warp
• Can control progress of a warp
• Can boost performance by fetching more for critical warps
• Implicit resource control within a core
9
Why look at DRAM Scheduling?
• Memory System is a performance bottleneck for several applications
• DRAM scheduling decides the order in which memory requests are granted
• Can prioritize warps based on criticality
• Implicit performance control across cores
10
By controlling Fetch and DRAM Scheduling we can control performance
11
How is This Useful?
• Understand applications and their behavior better
• Detect patterns or behavioral groups across applications
• Design new policies for GPGPU applications to improve performance
12
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
13
Fetch Policies
• Round Robin (RR) [default in Tesla architecture]
• FAIR
– Ensures uniform progress of all warps
• ICOUNT [Tullsen’96]
– Same as ICOUNT in SMT
– Tries to increase throughput by giving priority to fast-moving threads
• Least Recently Fetched (LRF)
– Prevents starvation of warps (see the selection sketch below)
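As a rough illustration of how a fetch policy chooses the next warp, below is a small C++ sketch of LRF-style selection (hypothetical code, not the paper's simulator): among warps that are not stalled, fetch from the one fetched least recently. An ICOUNT-style variant would instead pick the ready warp with the fewest instructions in flight.

// Hypothetical per-SM warp selection under the LRF fetch policy.
#include <cstdint>
#include <vector>

struct WarpState {
    int      id;
    bool     ready;       // not stalled on a Load/Store or Branch
    uint64_t last_fetch;  // cycle of this warp's most recent fetch
};

// Returns the id of the warp to fetch from this cycle, or -1 if none is ready.
int selectWarpLRF(const std::vector<WarpState>& warps) {
    int best = -1;  // index of the current candidate
    for (int i = 0; i < (int)warps.size(); ++i) {
        if (!warps[i].ready) continue;
        if (best < 0 || warps[i].last_fetch < warps[best].last_fetch)
            best = i;
    }
    return best < 0 ? -1 : warps[best].id;
}

FAIR can be sketched the same way by replacing last_fetch with a per-warp progress counter and picking the ready warp that has made the least progress.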
14
New Oracle Based Fetch Policies
• ALL
– Gives priority to longer warps (total length until termination)
– Ensures all warps finish at the same time, which results in higher occupancy
Priorities:
warp 0 > warp 1 > warp 2 > warp 3
15
New Oracle Based Fetch Policies
• BAR
– Gives priority to warps with a greater number of instructions until the next barrier
– Idea is to reduce wait time at barriers
Priorities:
warp 0 > warp 1 > warp 2 > warp 3
Priorities:
warp 2 > warp 1 > warp 0 > warp 3
16
New Oracle Based Fetch Policies
• MEM_BAR
– Similar to BAR but gives higher priority to warps with more memory instructions
Priorities:
warp 0 > warp 2 > warp 1 = warp 3
Priorities:
warp 1 > warp 0 = warp 2 > warp 3
Priority(Wa) > Priority(Wb) if:
MemInst(Wa) > MemInst(Wb), or
MemInst(Wa) = MemInst(Wb) and Inst(Wa) > Inst(Wb)
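Read as code, the rule is a two-level comparison. A minimal C++ sketch (field names are hypothetical; in the oracle policy both counts would come from the pre-generated trace):

// Hypothetical MEM_BAR priority check: more remaining memory instructions wins;
// on a tie, the warp with more remaining instructions overall wins.
struct WarpCounts {
    int mem_insts;  // MemInst(W): memory instructions left before the next barrier
    int insts;      // Inst(W): all instructions left before the next barrier
};

// True if warp a should be given fetch priority over warp b.
bool memBarHigherPriority(const WarpCounts& a, const WarpCounts& b) {
    if (a.mem_insts != b.mem_insts)
        return a.mem_insts > b.mem_insts;
    return a.insts > b.insts;
}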
17
DRAM Scheduling Policies
• FCFS
• FRFCFS [Rixner’00]
• FR_FAIR (new policy)
– Row hit with fairness
– Ensures uniform progress of warps
• REM_INST (new Oracle-based policy)
– Row hit with priority for warps with a greater number of instructions remaining until termination
– Prioritizes longer warps (see the ordering sketch below)
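FR_FAIR and REM_INST share the same shape: service row-buffer hits first, then break ties with a per-warp metric. Below is a minimal C++ sketch of that ordering (an assumption about the mechanics, not the paper's implementation), where the metric is warp progress for FR_FAIR (least progress first) and the negated remaining-instruction count for REM_INST (longest warp first).

// Hypothetical DRAM scheduler decision for a row-hit-first policy with a
// policy-specific warp metric as the tie-breaker (lower metric = higher priority).
#include <cstdint>
#include <vector>

struct DramRequest {
    uint64_t row;      // DRAM row targeted by the request
    int      warp_id;
    long     metric;   // FR_FAIR: warp progress; REM_INST: -(instructions remaining)
};

// Returns the index of the request to service next, or -1 if the queue is empty.
int pickRequest(const std::vector<DramRequest>& queue, uint64_t open_row) {
    int best = -1;
    for (int i = 0; i < (int)queue.size(); ++i) {
        const bool hit      = queue[i].row == open_row;
        const bool best_hit = best >= 0 && queue[best].row == open_row;
        if (best < 0 ||
            (hit && !best_hit) ||                                       // row hits first
            (hit == best_hit && queue[i].metric < queue[best].metric))  // then warp metric
            best = i;
    }
    return best;
}

Plain FR-FCFS falls out of the same loop if the metric is simply the request's arrival time.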
18
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
19
Experimental Setup
• Simulated GPU Architecture
– 8 SMs
– Frontend: 1-wide, 1KB I-Cache, branch stall
– Execution: 8-wide SIMD execution unit, in-order scheduling, 4-cycle latency for most instructions
– Caches: 64KB software-managed cache, 8 load accesses/cycle
– Memory: 32B-wide bus, 8 DRAM banks
– RR fetch, FRFCFS DRAM scheduling (baseline)
• Trace-driven, cycle-accurate simulator
• Per-warp traces generated using GPU Ocelot [Kerr’09]
20
Benchmarks
• Taken from
– CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd
– PARBOIL [UIUC’09] – MRI-Q, MRI-FHD, CP, PNS
– RODINIA [Che’09] – Leukocyte, Cell, Needle
• Classification based on lengths of warps
– Symmetric, if <= 2% divergence
– Asymmetric, otherwise (results included in paper)
21
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
22
Results - Symmetric Applications
[Chart: Normalized Execution Duration (baseline: RR + FRFCFS) under FRFCFS DRAM scheduling, comparing fetch policies ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR across the symmetric benchmarks (including Leukocyte, CP, NBody, MRI-Q, MRI-FHD, MatrixMul, BlackScholes, MersenneTwister, MonteCarlo) and their average]
• Compute intensive – no variation with different fetch policies
• Memory bound – improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR
23
Results – Symmetric Applications
[Chart: Normalized Execution Duration (baseline: RR + FRFCFS) under FRFAIR DRAM scheduling, comparing fetch policies RR, ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR across the same symmetric benchmarks and their average]
• On average, better than FRFCFS
• MersenneTwister shows a huge improvement
• The REM_INST DRAM policy performs similarly to FRFAIR
24
Analysis: MonteCarlo
[Chart: MonteCarlo – Normalized Execution Time and Ratio of Merged Memory Requests for each fetch policy (RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM_BAR) under FRFCFS DRAM scheduling]
• Fairness-oriented fetch policies improve performance by increasing intra-core merging
25
Analysis: MersenneTwister
[Chart: MersenneTwister – Normalized Execution Time and DRAM Row Buffer Hit Ratio (baseline: RR + FRFCFS) for each fetch policy under FRFCFS, FRFAIR, and REM_INST DRAM scheduling]
• FAIR DRAM scheduling (FRFAIR, REM_INST) improves performance by increasing the DRAM Row Buffer Hit ratio
26
Analysis: BlackScholes
[Chart: BlackScholes – Normalized Execution Time and MLP/100 for each fetch policy (RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM_BAR) under FRFCFS DRAM scheduling]
• Fairness-oriented fetch policies increase MLP
• The increased MLP together with a higher Row Buffer Hit ratio improves performance
27
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
28
Conclusion
• Compute intensive applications
– Fetch and DRAM Scheduling do not matter
• Symmetric memory intensive applications
– Fairness-oriented Fetch (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement
• MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%)
• Asymmetric memory intensive applications
– No correlation between performance and Fetch and DRAM Scheduling policies
29
THANK YOU!
30