Introduction

Lecture Slides and Figures from MKP and Sudhakar Yalamanchili
(1)
Handling Branches
• CUDA code:
  if (…) … (true for some threads)
  else … (true for others)
  [Figure: some lanes of the warp take the branch, the others do not]
• What if threads take different branches?
• Branch divergence!
(2)
Branch Divergence
[Figure: control flow graph with basic blocks A–G; a thread warp of four threads (1–4) shares a common PC]
• Different threads follow different control flow paths through the kernel code
• Thread execution is (partially) serialized
   Subsets of threads that follow the same path execute in parallel
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(3)
Branch Divergence
• Occurs within a warp
• Branches lead to serialization of branch-dependent code
   Performance issue: low warp utilization
[Figure: while the if (…) {…} path executes, the else-path threads are idle; while the else {…} path executes, the if-path threads are idle; the paths then reconverge]
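The serialization above can be sketched in a few lines. This is a Python stand-in for the hardware behavior, not CUDA code; `run_divergent_warp` is a hypothetical helper. Both sides of the branch issue over the whole warp with inactive lanes masked off, so utilization drops.

```python
def run_divergent_warp(conds):
    """Serialize a divergent if/else over one warp.

    conds[i] is thread i's branch condition. Returns (passes, utilization):
    the number of serialized passes over the warp and the fraction of
    lane-slots doing useful work.
    """
    taken = list(conds)                    # lanes on the "if" path
    not_taken = [not c for c in conds]     # lanes on the "else" path
    passes = 0
    useful = 0
    for mask in (taken, not_taken):
        if any(mask):                      # a path issues only if some lane needs it
            passes += 1
            useful += sum(mask)            # active lanes during this pass
    return passes, useful / (passes * len(conds))

# Two threads take each path: both passes execute, half the lanes idle each time.
passes, util = run_divergent_warp([True, True, False, False])
```

With no divergence (`[True] * 4`) there is a single pass at full utilization; with a 2/2 split there are two passes at 50% utilization.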
(4)
Basic Implementation
• Show a basic implementation with no reconvergence
   The point is: what are the consequences?
• Focus is on correct execution
• Acknowledge work in prior SIMD processors
(5)
Thread Reconvergence
• Introduce the concept of thread reconvergence (using the same block-structured example)
• Fundamental problem:
   Merge threads with the same PC
   How do we sequence the execution of threads? This can affect the ability to reconverge
• Question: when can threads productively reconverge?
• Question: when is the best time to reconverge?
   Concludes with the intuitive notion of the post-dominator
• Tie this all back to how the architecture is supposed to work efficiently  the best case we aspire to
(6)
Immediate Postdominator
• Definition
• Computation
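A minimal sketch of the computation, assuming the example CFG from these slides (A → B → {C, D} → E → G, with a synthetic exit node X): post-dominator sets are computed by the standard iterative dataflow fixpoint, and the immediate post-dominator of n is the candidate post-dominated by all the others.

```python
def postdominators(succ, exit_node):
    """Fixpoint of pdom[n] = {n} | intersection of pdom[s] over successors s."""
    nodes = set(succ) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}   # optimistic initialization
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def immediate_postdominator(succ, exit_node, n):
    """The post-dominator of n (other than n) post-dominated by all the others."""
    pdom = postdominators(succ, exit_node)
    cands = pdom[n] - {n}
    for c in cands:
        if cands <= pdom[c]:
            return c

# Example CFG from the slides, plus a synthetic exit X after G.
cfg = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'],
       'E': ['G'], 'G': ['X'], 'X': []}
```

For the divergent branch at B, the immediate post-dominator is E: the reconvergence point used by the PDOM stack on the next slide.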
(7)
Baseline: PDOM
[Figure: control flow graph A/1111 → B/1111 → {C/1001, D/0110} → E/1111 → G/1111 for a thread warp of four threads (1–4) with a common PC. The per-warp stack has columns (Reconv. PC, Next PC, Active Mask); after the divergent branch at B it holds, bottom to top: (-, E, 1111), (E, D, 0110), (E, C, 1001) ← TOS. Resulting execution order over time: A, B, C, D, E, G, A]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(8)
More Complex Example
• Stack-based implementation for nested control flow
• Reconvergence at the immediate post-dominator of the branch
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow in SIMD Graphics Hardware,” ACM TACO, June 2009
(9)
Implementation
[Figure: SIMT core pipeline: I-Fetch → Decode → I-Buffer → Issue → operand read (PRF/RF) → scalar pipelines → All Hit?/D-Cache → Writeback, with pending warps; the SIMT stack (Reconv. PC, Next PC, Active Mask, TOS) sits at the issue stage]
• GPGPU-Sim model: implement the stack at the issue stage
• Implications for instruction fetch?
From GPGPU-Sim Documentation: http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
(10)
Implementation (2)
• warpPC (next instruction) is compared to the reconvergence PC
• On a branch
   Can store the reconvergence PC as part of the branch instruction
   The branch unit has the NextPC, TargetPC, and reconvergence PC needed to update the stack
• On reaching a reconvergence point
   Pop the stack
   Continue fetching from the NextPC of the next entry on the stack
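The stack operations in the bullets above can be sketched as a small Python model. The entry layout (reconvergence PC, next PC, active mask) follows the earlier PDOM slide; class and method names are illustrative, not hardware signal names.

```python
class SimtStack:
    """Per-warp reconvergence stack: entries are [reconv_pc, next_pc, mask]."""

    def __init__(self, start_pc, mask):
        self.stack = [[None, start_pc, mask]]

    def top(self):
        return self.stack[-1]

    def on_branch(self, reconv_pc, target_pc, fallthru_pc, cond):
        """Divergent branch: push one entry per path under a reconvergence entry."""
        rpc, _, mask = self.stack.pop()
        taken = [a and c for a, c in zip(mask, cond)]
        not_taken = [a and not c for a, c in zip(mask, cond)]
        self.stack.append([rpc, reconv_pc, mask])            # reconvergence entry
        if any(not_taken):
            self.stack.append([reconv_pc, fallthru_pc, not_taken])
        if any(taken):
            self.stack.append([reconv_pc, target_pc, taken])

    def advance(self, next_pc):
        """After executing an instruction at the TOS path."""
        if next_pc == self.top()[0]:
            self.stack.pop()          # reached the reconv. PC: pop and continue
        else:                         # fetching from the next entry's NextPC
            self.top()[1] = next_pc

# Divergent branch at B with reconvergence at E (masks as in the PDOM example):
s = SimtStack('B', [True] * 4)
s.on_branch('E', 'C', 'D', [True, False, False, True])
```

After executing C the warp reaches E, pops, and runs D; after D it pops again and all four threads reconverge at E, matching the A, B, C, D, E execution order of the earlier figure.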
(11)
Can We Do Better?
• Warps are formed statically
• Key idea of dynamic warp formation
   Show a pool of warps and how they can be merged
• At a high level, what are the requirements?
(12)
DWF: Example
[Figure: two warps, x and y, traverse the CFG A → B → {C, D} → E, F → G, annotated with per-block active masks such as x/1110 and y/0011. Legend: "Execution of Warp x at Basic Block A"; "Execution of Warp y at Basic Block A"; "A new warp created from scalar threads of both Warp x and y executing at Basic Block D".
Baseline schedule over time: A A B B C C D D E E F F G G A A
Dynamic Warp Formation schedule: A A B B C D E E F G G A A  fewer issue slots, because divergent warps at the same block are merged]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(13)
How Does This Work?
• Now discuss the criteria for merging
   Same PC
   Complementary sets of active threads in the two warps
   Remind people why there are so many threads with the same PC
• What information do we need to merge two warps?
   Need thread IDs and PCs
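The merge criterion above can be sketched as follows. This is a simplified Python model with hypothetical function names; it keeps each thread in its home lane (lane-wise complementary masks), since, as the microarchitecture slide shows, each lane has its own register file bank and two threads in the same lane would conflict.

```python
def can_merge(pc_a, mask_a, pc_b, mask_b):
    """Two warps are mergeable if at the same PC with no lane active in both."""
    same_pc = pc_a == pc_b
    no_lane_conflict = not any(a and b for a, b in zip(mask_a, mask_b))
    return same_pc and no_lane_conflict

def merge_lanes(mask_a, tids_a, mask_b, tids_b):
    """Combine per-lane thread IDs of two mergeable warps (None = idle lane)."""
    return [ta if a else (tb if b else None)
            for a, b, ta, tb in zip(mask_a, mask_b, tids_a, tids_b)]

# Warps x (mask 1001) and y (mask 0110) both at block D merge into one full warp:
merged = merge_lanes([1, 0, 0, 1], [0, None, None, 3],
                     [0, 1, 1, 0], [None, 5, 6, None])
```

This is why divergence helps DWF: many warps hold threads at the same PC with partially empty lanes, so complementary fragments are common.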
(14)
DWF: Microarchitecture Implementation
[Figure: DWF hardware. A branch (A: BEQ R2, B) splits a warp into taken and not-taken groups, captured in Warp Update Registers T and NT, each holding a REQ field, a target PC (B or C), thread IDs, and an active mask. The PC-Warp LUT maps a PC to a Warp Pool entry (OCC occupancy bits, IDX index), where arriving threads are merged into forming warps; a Warp Allocator creates new pool entries. Issue logic and the thread scheduler select ready warps, which read per-lane register files (RF 1–4) addressed by (TID, Reg#) and execute on scalar ALUs 1–4; threads are merged only into free lanes, so there is no lane conflict]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(15)
Hardware Consequences
• Expose the implications that warps have in the base design
   Implications for register file access
• Register bank conflicts
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow in SIMD Graphics Hardware,” ACM TACO, June 2009
(16)
Relaxing Implications of Warps
• Thread swizzling
• Lane swizzling in hardware
(17)
Limitations
• Warp-synchronous programming
• Memory coalescing
(18)