Introduction
Lecture Slides and Figures from MKP and Sudhakar Yalamanchili
(1)
Handling Branches
• CUDA code:
if(…) … (True for some threads)
else … (True for others)
[Figure: threads in a warp split between the taken and not-taken paths]
• What if threads take different branches?
• Branch divergence!
(2)
Branch Divergence
[Figure: control-flow graph with basic blocks A–G; a thread warp of threads 1–4 shares a common PC]
• Different threads follow different control-flow paths through the kernel code
• Thread execution is (partially) serialized
• Subsets of threads that follow the same path execute in parallel
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(3)
Branch Divergence
• Occurs within a warp
• Branches lead to serialization of branch-dependent code
• Performance issue: low warp utilization
if(…)
{ … }   (threads on the other path sit idle)
else {
… }
Reconvergence!
(4)
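The serialization described above can be sketched in a few lines. This is an illustrative simulation, not the hardware mechanism: the warp size, function name, and utilization metric are assumptions for the example.

```python
# Sketch (illustrative, not vendor-accurate): serialized execution of a
# divergent if/else in a 4-thread warp.

WARP_SIZE = 4  # assumed warp width for the example

def run_divergent_branch(cond):
    """Execute an if/else over a warp; each path runs as a separate pass
    with only a subset of lanes active."""
    taken = list(cond)                        # lanes taking the if-path
    not_taken = [not c for c in cond]         # lanes taking the else-path
    passes = []
    if any(taken):                            # pass 1: if-path, others idle
        passes.append(("if", taken))
    if any(not_taken):                        # pass 2: else-path, others idle
        passes.append(("else", not_taken))
    # utilization: fraction of active lanes over the serialized passes
    util = sum(sum(mask) for _, mask in passes) / (len(passes) * WARP_SIZE)
    return passes, util

passes, util = run_divergent_branch([True, False, False, True])
# Two serialized passes, each with 2 of 4 lanes active: 50% utilization
```

With a uniform condition (all lanes agree) only one pass is issued and utilization stays at 100%, which is why divergence is a per-warp, data-dependent cost.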
Basic Implementation
• Show a basic implementation with no reconvergence; the point is to see the consequences
• Focus is on correct execution
• Acknowledge work in prior SIMD processors
(5)
Thread Reconvergence
• Introduce the concept of thread reconvergence (using the same block-structured example)
• Fundamental problem: merge threads with the same PC
How do we sequence the execution of threads? This can affect the ability to reconverge
• Question: When can threads productively reconverge?
• Question: When is the best time to reconverge?
Concludes with the intuitive notion of the post-dominator
• Tie this all back to how the architecture is supposed to work efficiently: the best case we aspire to
(6)
Immediate Postdominator
• Definition
• Computation
(7)
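The computation can be done with the standard iterative dataflow formulation of postdominators. The CFG below is assumed to be the block-structured example used in these slides (A → B, B diverges to C and D, both rejoin at E, then G); the code is a sketch of the textbook algorithm, not a production implementation.

```python
# Sketch: immediate postdominators by iterative dataflow on a small CFG
# (assumed to match the slides' example: A -> B -> {C, D} -> E -> G).

succ = {"A": ["B"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": ["G"], "G": []}
nodes = list(succ)
EXIT = "G"

# pdom[n] = set of nodes appearing on every path from n to the exit
pdom = {n: set(nodes) for n in nodes}
pdom[EXIT] = {EXIT}
changed = True
while changed:
    changed = False
    for n in nodes:
        if n == EXIT:
            continue
        new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
        if new != pdom[n]:
            pdom[n], changed = new, True

def ipdom(n):
    """Immediate postdominator: the strict postdominator of n that every
    other strict postdominator of n postdominates."""
    strict = pdom[n] - {n}
    return next(p for p in strict
                if all(q == p or q in pdom[p] for q in strict))
```

For the branch at B, ipdom("B") is E: the earliest point every thread of the warp must reach regardless of which side of the branch it took, and hence the reconvergence point used by the PDOM stack on the next slide.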
Baseline: PDOM
[Figure: PDOM reconvergence stack example. CFG: A/1111 → B/1111; B diverges to C/1001 and D/0110; both paths reconverge at E/1111 → G/1111. The thread warp of threads 1–4 shares a common PC.]

Stack after the divergent branch at B (TOS = top of stack):

Reconv. PC | Next PC | Active Mask
–          | E       | 1111
E          | D       | 0110
E          | C       | 1001   ← TOS

Execution over time: A, B (1111) → C (1001) → D (0110) → E, G (1111) → A
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(8)
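The stack discipline in the figure can be replayed in a short simulation. The CFG, masks, and push/pop behavior come from the slide; the data layout and helper function are illustrative assumptions.

```python
# Sketch: replaying the slide's PDOM stack example. Each stack entry is
# [reconv PC, next PC, active mask]; the TOS entry is always the one executing.

succ = {"A": "B", "C": "E", "D": "E", "E": "G", "G": None}  # fall-through edges

def execute():
    trace = []
    stack = [[None, "A", "1111"]]             # full warp starts at A
    while stack:
        entry = stack[-1]
        rpc, pc, mask = entry
        if pc == rpc:                          # reached this entry's reconv point
            stack.pop()                        # pop and resume the entry below
            continue
        trace.append((pc, mask))
        if pc == "B":                          # divergent branch at B
            entry[1] = "E"                     # this entry resumes at the ipdom
            stack.append(["E", "D", "0110"])   # not-taken side
            stack.append(["E", "C", "1001"])   # taken side: TOS, runs first
        else:
            entry[1] = succ[pc]
    return trace
```

Running `execute()` reproduces the timeline from the figure: A and B with the full mask 1111, C with 1001, D with 0110, then E and G with 1111 again after both pops.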
More Complex Example
• Stack-based implementation for nested control flow
• Reconvergence at the immediate post-dominator of the branch
From Fung et al., "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009
(9)
Implementation
[Figure: SIMT core pipeline: I-Fetch → Decode → I-Buffer → Issue → operand read (PRF/RF) → scalar pipelines → D-Cache (All hit?) → Writeback, with pending warps looping back to fetch. The reconvergence stack (Reconv. PC / Next PC / Active Mask, with a TOS pointer) sits beside the issue stage.]
• GPGPU-Sim model: implement the stack at the issue stage
• Implications for instruction fetch?
From GPGPU-Sim Documentation: http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
(10)
Implementation (2)
• warpPC (the next instruction) is compared to the reconvergence PC
• On a branch
The reconvergence PC can be stored as part of the branch instruction
The branch unit has NextPC, TargetPC, and the reconvergence PC to update the stack
• On reaching a reconvergence point
Pop the stack
Continue fetching from the NextPC of the next entry on the stack
(11)
Can We Do Better?
• Warps are formed statically
• Key idea of dynamic warp formation
Show a pool of warps and how they can be merged
• At a high level, what are the requirements?
(12)
DWF: Example
[Figure: DWF example. Per-block active masks for warp x and warp y:
A: x/1111, y/1111
B: x/1110, y/0011    F: x/0001, y/1100
C: x/1000, y/0010    D: x/0110, y/0001
E: x/1110, y/0011
G: x/1111, y/1111
Legend: execution of warp x at basic block A; execution of warp y at basic block A; a new warp created from scalar threads of both warp x and warp y executing at basic block D.
Baseline timeline: A A B B C C D D E E F F G G A A
DWF timeline:      A A B B C D E E F G G A A
DWF merges the sparsely populated warps at C, D, and F, shortening the schedule.]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(13)
How Does This Work?
• Now discuss the criteria for merging
Same PC
Complementary sets of active threads in each warp
Remind people why there are so many threads with the same PC
• What information do we need to merge two warps?
Need thread IDs and PCs
(14)
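The merge criteria above can be stated as a small predicate: same PC, and active masks that occupy disjoint lanes. This is a sketch of the criterion as the slide states it; the function names and mask representation are assumptions.

```python
# Sketch: DWF merge test per the slide's criteria (same PC, complementary
# active lanes). Masks are per-lane lists of 0/1.

def can_merge(pc_a, mask_a, pc_b, mask_b):
    """Two warps can fuse when they sit at the same PC and no lane is
    active in both (their active-thread sets are complementary)."""
    same_pc = pc_a == pc_b
    disjoint = all(not (a and b) for a, b in zip(mask_a, mask_b))
    return same_pc and disjoint

def merge(mask_a, mask_b):
    """Active mask of the fused warp: union of the two lane sets."""
    return [a or b for a, b in zip(mask_a, mask_b)]

# Warp x at C with mask 1000 and warp y at C with mask 0010 fuse into 1010.
```

If a lane is occupied in both warps the threads compete for the same execution lane (and, in the base register file layout, the same bank), so the hardware must either leave one thread out or pay for the conflict; that case is handled by the warp pool on the next slide.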
DWF: Microarchitecture Implementation
[Figure: DWF microarchitecture. A branch (A: BEQ R2, B) splits a warp into taken and not-taken groups, written into Warp Update Registers T and NT, each holding a target PC, the thread IDs, and an occupancy mask. A PC-Warp LUT (hashed on PC, with REQ/OCC/IDX fields) maps each outcome PC to an entry in the Warp Pool, where incoming scalar threads are merged into partially filled warps (PC, per-lane TIDs, priority); the Warp Allocator creates a new pool entry when no compatible warp exists. Issue logic / thread scheduler selects a warp from the pool; operands are addressed as (TID, Reg#) pairs in lane-aligned register files RF 1–4 feeding ALUs 1–4, through decode and commit/writeback. The example shown has no lane conflicts.]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
(15)
Hardware Consequences
• Expose the implications that warps have in the base design
Implications for register file access
• Register bank conflicts
From Fung et al., "Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware," ACM TACO, June 2009
(16)
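The bank-conflict point can be made concrete with a toy model. The lane assignment rule below (a thread's registers live in the bank of its home lane, lane = tid mod warp size) is an assumption for illustration of why dynamically formed warps stress the register file.

```python
# Sketch: register bank conflicts under dynamic warp formation.
# Assumption for illustration: thread tid's registers live in the bank of
# its home lane, lane = tid % WARP_SIZE, as in the lane-aligned base design.

WARP_SIZE = 4

def bank_conflicts(tids):
    """Extra serialized accesses when a warp's threads map to the same bank:
    every access beyond the first to a given bank costs an extra cycle."""
    lanes = [tid % WARP_SIZE for tid in tids]
    return len(lanes) - len(set(lanes))

# A static warp holds consecutive tids 0..3: one access per bank, no conflict.
# A DWF warp built from tids {0, 2, 5, 6} maps to lanes {0, 2, 1, 2}:
# threads 2 and 6 collide on bank 2.
```

Static warps avoid conflicts by construction, since consecutive thread IDs map one-to-one onto lanes; DWF gives that guarantee up, which is what motivates the thread/lane swizzling on the next slide.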
Relaxing Implications of Warps
• Thread swizzling
• Lane swizzling in hardware
(17)
Limitations
• Warp-synchronous programming
• Memory coalescing
(18)
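The memory-coalescing limitation can be illustrated with a simplified model: treating a warp's loads as coalesced when they fall in one aligned 128-byte segment (the segment size and the one-transaction-per-segment rule are simplifying assumptions, not the exact hardware rules). Regrouping threads dynamically can scatter a warp's addresses across segments even when the original static warps were perfectly coalesced.

```python
# Sketch: a simplified coalescing check. Assumption: one memory transaction
# per aligned 128-byte segment touched by the warp's addresses.

SEGMENT = 128

def segments_touched(addrs):
    """Number of aligned segments (hence transactions) a warp's loads hit."""
    return len({a // SEGMENT for a in addrs})

# Static warp: tids 0..3 each load the 4-byte word a[tid] -> one segment.
static_addrs = [tid * 4 for tid in range(4)]
# Dynamic warp mixing tids 0, 1, 34, 35 -> addresses in two segments,
# so the single coalesced load becomes two transactions.
dynamic_addrs = [tid * 4 for tid in (0, 1, 34, 35)]
```

The same effect applies to warp-synchronous code: programs that rely on which threads share a warp (implicit lockstep, shuffle patterns) are no longer guaranteed their expected grouping once warps are formed dynamically.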