General-purpose SIMT processors
Sylvain Collange
INRIA Rennes / IRISA
[email protected]
From GPU to heterogeneous multi-core

Yesterday (2000-2010): homogeneous multi-core
    Discrete components: Central Processing Unit (CPU) and Graphics Processing Unit (GPU)
Today (2011-...): heterogeneous multi-core
    Physically unified: CPU + GPU on the same chip
    Logically separated: different programming models, compilers, instruction sets
Tomorrow
    Unified programming models?
    Single instruction set?

[Figure: heterogeneous multi-core chip combining latency-optimized cores, throughput-optimized cores and hardware accelerators]

Outline

Performance or efficiency?
    Latency-oriented architectures
    Throughput-oriented architectures
    Heterogeneous architectures
Dynamic SPMD vectorization
    Traditional dynamic vectorization
    More flexibility with state-free dynamic vectorization
New CPU-GPU hybrids
    DITVA: CPU with dynamic vectorization
    SBI: GPU with parallel path execution

The 1980s: pipelined processor

Example: scalar-vector multiplication X ← a∙X

    for i = 0 to n-1
        X[i] ← a * X[i]

    Source code

        move  i ← 0
    loop:
        load  t ← X[i]
        mul   t ← a×t
        store X[i] ← t
        add   i ← i+1
        branch i<n? loop

    Machine code

[Figure: sequential CPU — Fetch, Decode, Execute and Load/Store stages each holding one instruction (e.g. add i←18 in Fetch, store X[17] in Decode, mul in Execute), connected to Memory]

The 1990s: superscalar processor

Goal: improve performance of sequential applications
    Latency: time to get the result
Exploits Instruction-Level Parallelism (ILP)
Uses many tricks
    Branch prediction, out-of-order execution, register renaming, data prefetching, memory disambiguation…
Basis: speculation
    Take a bet on future events
    If right: time gain
    If wrong, roll back: energy loss

What makes speculation work: regularity

Application behavior is likely to follow regular patterns.

    for(i…) {
        if(f(i)) {
            j = g(i);
            x = a[j];
        }
    }

Control regularity: in the regular case, the branch on f(i) is taken on every iteration i=0..3; in the irregular case the outcomes alternate (not taken, taken, taken, not taken).
Memory regularity: in the regular case, j takes consecutive values (17, 18, 19, 20); in the irregular case it jumps around (21, 4, 17, 2).

Speculation exploits such patterns to guess accurately
    Applications: caches, branch prediction, instruction prefetch, data prefetch, write combining…

The 2000s: going multi-threaded

Memory wall
    Growing gap between compute performance and memory performance: more and more difficult to hide memory latency
Power wall
    Performance is now limited by power consumption
ILP wall
    Law of diminishing returns on Instruction-Level Parallelism

[Plots: compute vs. memory performance over time; transistor density, per-transistor power and total power over time; cost vs. serial performance of simultaneous multi-threading and homogeneous multi-core]

Result: a gradual transition from latency-oriented to throughput-oriented architectures

Homogeneous multi-core

Replication of the complete execution engine
Requires multi-threaded software: each thread works on its own slice of the data

        move  i ← slice_begin
    loop:
        load  t ← X[i]
        mul   t ← a×t
        store X[i] ← t
        add   i ← i+1
        branch i<slice_end? loop

    Machine code

[Figure: two cores running threads T0 and T1, each with its own fetch/decode/execute and load-store pipeline, sharing Memory]

Improves throughput thanks to explicit parallelism

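As a software-side illustration, here is a minimal C++ sketch (our own example, not from the talk; the name scale_slice is illustrative) of the slice_begin/slice_end partitioning above, using one std::thread per slice:

    #include <cstddef>
    #include <thread>
    #include <vector>

    // One software thread scales its own contiguous slice: X[i] <- a * X[i].
    void scale_slice(float a, float* X, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            X[i] = a * X[i];
    }

    int main() {
        const std::size_t n = 1 << 20;
        std::vector<float> X(n, 1.0f);
        const unsigned nthreads = 4;

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t begin = n * t / nthreads;        // slice_begin for thread t
            std::size_t end   = n * (t + 1) / nthreads;  // slice_end for thread t
            pool.emplace_back(scale_slice, 2.0f, X.data(), begin, end);
        }
        for (auto& th : pool) th.join();
    }
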
Simultaneous multi-threading (SMT)

Time-multiplexing of the processing units
Same software view as multi-core: same machine code as the multi-threaded loop above

[Figure: one pipeline shared by four threads T0-T3 — instructions from different threads (mul, add i←73, add i←50, load X[89], store X[72], load X[17], store X[49]) interleave across the Fetch, Decode, Execute and Load/Store stages, sharing Memory]

Hides latency thanks to explicit parallelism

Outline — next: Throughput-oriented architectures

Throughput-oriented architectures

Also known as GPUs, but they do more than just graphics
Target: highly parallel sections of programs
Programming model: SPMD — one function, run by many threads

    For n threads:
        X[tid] ← a * X[tid]

Goal: maximize the computation / energy consumption ratio
    Many-core approach: many independent, multi-threaded cores
    Can we be more efficient? Exploit regularity

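To make the SPMD model concrete, a minimal C++ sketch (our illustration: the kernel name is ours, and the "launch" is simulated by a plain loop rather than by hardware threads):

    #include <vector>

    // SPMD kernel: the same function body, instantiated once per thread id.
    void kernel(int tid, float a, float* X) {
        X[tid] = a * X[tid];
    }

    int main() {
        const int n = 1024;
        std::vector<float> X(n, 1.0f);
        // Conceptually: launch n threads, all running kernel().
        // Here the launch is simulated sequentially for clarity.
        for (int tid = 0; tid < n; ++tid)
            kernel(tid, 2.0f, X.data());
    }
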
Parallel regularity

Similarity in behavior between threads

Control regularity: do threads 1-4 take the same path through switch(i)?
    Regular: i=17, 17, 17, 17 — every thread executes the same case
    Irregular: i=21, 4, 17, 2 — threads scatter across the cases

    switch(i) {
        case 2:  ...
        case 17: ...
        case 21: ...
    }

Memory regularity: where do the threads' accesses r=A[i] land in memory?
    Regular: load A[8], A[9], A[10], A[11] — consecutive addresses
    Irregular: load A[8], A[0], A[11], A[3] — scattered addresses

Data regularity: do threads compute r=a*b on the same values?
    Regular: a=32 and b=52 in every thread
    Irregular: a=17, -5, 11, 42 and b=15, 0, -2, 52 — different in every thread

Dynamic SPMD vectorization, aka SIMT

Run SPMD threads in lockstep
    Mutualize the fetch/decode and load-store units
    Fetch 1 instruction on behalf of several threads
    Read 1 memory location and broadcast it to several registers

[Figure: threads T0-T3 share one Fetch and Decode stage — a single "(0-3) load" and "(0-3) store" serve all four threads, while four parallel lanes execute "(0) mul" … "(3) mul" and one load-store unit talks to Memory]

SIMT: Single Instruction, Multiple Threads
A wave of synchronized threads is called a warp
Improves area/power efficiency thanks to regularity

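A minimal C++ sketch of the lockstep idea (our own illustration, not hardware pseudocode): one fetched and decoded operation is applied across all active lanes of a warp.

    #include <array>

    constexpr int kWarpSize = 4;

    // One SIMT step: a single instruction (here: reg <- a * reg) executes on
    // every lane whose activity bit is set. In hardware the loop body runs on
    // parallel lanes, not sequentially.
    void simt_mul(float a, std::array<float, kWarpSize>& regs, unsigned active) {
        for (int lane = 0; lane < kWarpSize; ++lane)
            if (active & (1u << lane))
                regs[lane] = a * regs[lane];
    }
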
Example GPU: NVIDIA GeForce GTX 980

SIMT: warps of 32 threads
16 SMs per chip, 4×32 cores per SM, 64 warps per SM
4612 Gflop/s
Up to 32768 threads in flight (16 SMs × 64 warps × 32 threads)

[Figure: 16 SMs, each time-multiplexing its 64 warps over four groups of 32 cores]

SIMT vs. multi-core + explicit SIMD

SIMT
    All parallelism expressed using threads
    Warp size is implementation-defined
    Dynamic vectorization: the hardware groups threads into warps

Multi-core + explicit SIMD
    Combination of threads and vectors
    Vector length fixed at compile time
    Static vectorization

SIMT benefits
    Easier programming
    Retains binary compatibility across warp sizes

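For contrast, a minimal sketch of static vectorization (our example, assuming an AVX-capable x86 CPU): the 8-wide vector length is hard-coded into the binary, which is exactly the compatibility constraint SIMT avoids.

    #include <immintrin.h>

    // Explicit SIMD: X[i] <- a * X[i] with 8-wide AVX vectors.
    // The width 8 is fixed at compile time; a wider ISA needs a recompile.
    void scale_avx(float a, float* X, int n) {
        __m256 va = _mm256_set1_ps(a);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(X + i);
            _mm256_storeu_ps(X + i, _mm256_mul_ps(va, vx));
        }
        for (; i < n; ++i)   // scalar tail for leftover elements
            X[i] = a * X[i];
    }
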
Outline — next: Heterogeneous architectures

Heterogeneity: causes and consequences

Amdahl's law: with a fraction P of the program parallelizable over N cores,

    S = 1 / ((1 − P) + P/N)

where (1 − P) is the time to run sequential sections and P/N the time to run parallel sections.

Latency-optimized multi-core (CPU)
    Low efficiency on parallel sections: spends too many resources
Throughput-optimized multi-core (GPU)
    Low performance on sequential sections
Heterogeneous multi-core (CPU+GPU)
    Use the right tool for the right job
    Resources saved in parallel sections can be devoted to accelerating sequential sections

M. Hill, M. Marty. Amdahl's law in the multicore era. IEEE Computer, 2008.

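A quick worked instance of the formula (illustrative numbers, not from the talk): with P = 0.9 and N = 8,

    S = 1 / ((1 − 0.9) + 0.9/8) = 1 / 0.2125 ≈ 4.7

so even a 90%-parallel program gains less than 5× on 8 cores; the sequential 10% dominates, which is why accelerating sequential sections pays off.
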
Single-ISA heterogeneous architectures

Proposed in academia, now found in embedded systems-on-chip
Example: ARM big.LITTLE
    High-performance CPU cores: Cortex A57 (big cores)
    Low-power CPU cores: Cortex A53 (LITTLE cores)
Different cores, same instruction set
Application threads migrate dynamically between core types according to demand

Single-ISA heterogeneous CPU-GPU

Enables dynamic thread migration between CPU cores and GPU cores
    Use the best core for the current job
    Load-balance work over the available resources

Option 1: static vectorization

Extend CPU SIMD instruction sets to throughput-oriented cores
    Scalar + wide vector ISA on both the latency core and the throughput cores, e.g. x86-64 + AVX-512
Issue: conflicting requirements
    Either the same suboptimal SIMD vector length for all cores,
    or loss of binary compatibility and ISA fragmentation

Intel. Intel® Advanced Vector Extensions 2015/2016 Support in GNU Compiler Collection. GNU Tools Cauldron 2014.

Our proposal: dynamic vectorization

Extend the SIMT execution model to general-purpose cores
    Scalar ISA on both the latency core and the throughput cores
Flexibility advantage: SIMD width optimized for each core type
Challenge: generalize dynamic vectorization to general-purpose instruction processing

Outline — next: Dynamic SPMD vectorization, starting with traditional dynamic vectorization

Capturing instruction regularity

How to keep threads synchronized? Challenge: conditional branches

Rules of the game
    One thread per SIMD lane
    Same instruction on all lanes
    Lanes can be individually disabled

Running example (threads 0-3, one per lane):

    x = 0;
    // Uniform condition
    if(tid > 17) {
        x = 1;
    }
    // Divergent conditions
    if(tid < 2) {
        if(tid == 0) {
            x = 2;
        }
        else {
            x = 3;
        }
    }

[Figure: threads 0-3 mapped onto lanes 0-3; one instruction is broadcast to all lanes]

Most common: mask stack

One activity bit per thread; a stack of masks handles nested divergence.
Trace on the running example, threads tid=0..3 (mask written thread 0 first):

    Code                      Operation      Active mask   Stack
    x = 0;                                   1111
    if(tid > 17) { x = 1; }   uniform: skip  1111
    if(tid < 2) {             push           1100          1111
        if(tid == 0) {        push           1000          1111 1100
            x = 2;
        }                     pop            1100          1111
        else {                push           0100          1111 1100
            x = 3;
        }                     pop            1100          1111
    }                         pop            1111

A. Levinthal and T. Porter. Chap - a SIMD graphics processor. SIGGRAPH '84, 1984.

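A minimal software model of this mechanism (our sketch; real GPUs do this in the instruction sequencer hardware, and here bit 0 is thread 0, so masks print in the reverse order of the table above):

    #include <bitset>
    #include <stack>

    using Mask = std::bitset<4>;   // one activity bit per thread, bit 0 = thread 0

    struct MaskStack {
        Mask active{0b1111};       // all four threads start active
        std::stack<Mask> saved;

        // Enter an if: save the current mask, keep only lanes whose condition holds.
        void push_if(Mask cond) { saved.push(active); active &= cond; }
        // Enter the else: lanes active at entry of the if, minus the taken ones.
        void do_else(Mask cond) { active = saved.top() & ~cond; }
        // Leave the if/else: restore the entry mask (reconvergence point).
        void pop() { active = saved.top(); saved.pop(); }
    };

    int main() {
        MaskStack s;
        s.push_if(Mask(0b0011));   // if(tid < 2):  threads 0,1 active
        s.push_if(Mask(0b0001));   // if(tid == 0): thread 0 active   (x = 2)
        s.do_else(Mask(0b0001));   //               thread 1 active   (x = 3)
        s.pop();                   // back to threads 0,1
        s.pop();                   // back to all four threads
    }
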
Traditional SIMT pipeline

The instruction sequencer keeps the PC and the mask stack. Instruction fetch broadcasts each instruction together with its activity mask; a lane whose activity bit is 0 discards the instruction.

[Figure: Instruction Sequencer (PC + mask stack) → Instruction Fetch → broadcast of (instruction, activity bit) to each Exec lane]

Used in NVIDIA GPUs

Outline — next: More flexibility with state-free dynamic vectorization

Goto considered harmful?

Control instructions in some CPU and GPU instruction sets:

    MIPS: j, jal, jr, syscall
    NVIDIA Tesla (2007): bar, bra, brk, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap, .s
    NVIDIA Fermi (2010): bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmx, kil, pbk, pret, ret, ssy, .s
    Intel GMA Gen4 (2006): jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop
    Intel GMA SB (2011): jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork
    AMD R500 (2005): jump, loop, endloop, rep, endrep, breakloop, breakrep, continue
    AMD R600 (2007): push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after
    AMD Cayman (2011): all of the R600 list, plus push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate_wqm

Why so many? They expose the control-flow structure to the instruction sequencer.

SIMD is so last century

Maspar MP-1 (1990)
    1 instruction for 16 384 processing elements (PEs)
    PE: ~1 mm² in a 1.6 µm process
    SIMD programming model, centralized control

NVIDIA Fermi (2010)
    1 instruction for 16 PEs (÷1000 fewer PEs per instruction)
    PE: ~0.03 mm² in a 40 nm process (×50 bigger, process-normalized)
    Threaded programming model, more divergence

From centralized control to flexible distributed control

Moving away from the vector model

Requirements for a single-ISA CPU+GPU
    Run general-purpose applications
    Switch freely back and forth between SIMT and MIMD modes
Conventional techniques do not meet these requirements

Solution: stateless dynamic vectorization
Key idea
    Maintain 1 Program Counter (PC) per thread
    Each cycle, elect one master PC (MPC) to fetch from
    Activate all threads whose PC matches the MPC

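In software terms, the election step looks like this (a minimal C++ sketch of the idea, using min(PC) as a stand-in for the election policy discussed later):

    #include <array>
    #include <cstdint>

    constexpr int kWarpSize = 4;

    // One PC per thread: no stack, no shared divergence state.
    struct WarpPCs {
        std::array<uint32_t, kWarpSize> pc;

        // Elect a master PC; here simply the minimum PC.
        uint32_t elect_master() const {
            uint32_t mpc = pc[0];
            for (uint32_t p : pc)
                if (p < mpc) mpc = p;
            return mpc;
        }

        // The activity mask is derived on the fly from the PCs.
        unsigned active_mask(uint32_t mpc) const {
            unsigned mask = 0;
            for (int t = 0; t < kWarpSize; ++t)
                if (pc[t] == mpc) mask |= 1u << t;
            return mask;
        }
    };
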
1 PC per thread

On the running example, each thread tid=0..3 keeps its own PC:
    While control flow is uniform (x = 0; the uniform if), all PCs are equal
    After the divergent if(tid < 2) / if(tid == 0), the PCs split: threads whose PC matches the master PC are active, the others stay inactive until their own PC is elected

[Figure: the example code annotated with per-thread PCs — PC0 points at x = 2 and is elected master PC (match → active), while PC1-PC3 point elsewhere (no match → inactive)]

Our new SIMT pipeline

Each cycle: a vote unit elects the master PC (MPC) from the per-thread PCs PC0..PCn; instruction fetch reads at MPC and broadcasts (instruction, MPC) to all lanes; each lane compares MPC against its own PC — on a match it executes and updates its PC, otherwise it discards the instruction.

[Figure: PC0..PCn → Vote → MPC → Instruction Fetch → broadcast of (Insn, MPC) → per-lane "MPC = PC?" check → Exec and Update PC on match, discard on mismatch]

Benefits of stateless dynamic vectorization

Before: stack or counters
    O(n) or O(log n) memory, where n = nesting depth
    1 read/write port to memory
    Exceptions: stack overflow, underflow
    Vector semantics
    Structured control flow only
    Specific instruction sets

After: multiple PCs
    O(1) memory
    No shared state: allows thread suspension, restart, migration
    Multi-thread semantics
    Traditional languages, compilers and instruction sets
    Can be mixed with MIMD

Scheduling policy: min(SP:PC)

Which PC to choose as the master PC?

Conditionals and loops: follow the order of code addresses → min(PC)
    if/else compiles to "p? br else … br endif; else: … endif:" — fetching the lowest address first walks the then-block, then the else-block, and reconverges at endif
    while loops compile to a backward branch "p? br start" — min(PC) keeps threads inside the loop together until all exit

Functions: favor the maximum nesting depth → min(SP)
    On "call f … ret", the thread inside f runs first (deeper call, hence lower SP on a downward-growing stack) and reconverges with the others at the return

With compiler support
    Unstructured control flow too
    No code duplication
    Full backward and forward compatibility

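A sketch of the election with the full ordering (our illustration): concatenate SP and PC into one key, SP in the high bits, and take the minimum across the warp.

    #include <array>
    #include <cstdint>

    constexpr int kWarpSize = 4;

    // min(SP:PC): SP in the high bits, so deeper calls (smaller SP on a
    // downward-growing stack) win first; ties are broken by the lowest PC.
    uint64_t key(uint32_t sp, uint32_t pc) {
        return (uint64_t(sp) << 32) | pc;
    }

    // Returns the index of the thread whose (SP, PC) is elected master.
    int elect(const std::array<uint32_t, kWarpSize>& sp,
              const std::array<uint32_t, kWarpSize>& pc) {
        int best = 0;
        for (int t = 1; t < kWarpSize; ++t)
            if (key(sp[t], pc[t]) < key(sp[best], pc[best])) best = t;
        return best;
    }
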
Potential of min(SP:PC)

Comparison of fetch policies on SPMD benchmarks
    PARSEC and SPLASH benchmarks for CPU, using pthreads, OpenMP, TBB
    Microarchitecture-independent model: ideal SIMD machine

[Plot: average number of active threads under each fetch policy]

min(SP:PC) achieves reconvergence at minimal cost

T. Milanez et al. Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads. Parallel Computing 40(9):548-558, 2014.

Simty: a synthesizable SIMT CPU

Proof of concept for dynamic vectorization
    Written in synthesizable VHDL
    Runs the RISC-V instruction set (RV32I)
    Fully parameterizable warp size and warp count
    10-stage pipeline

FPGA prototype on an Altera Cyclone IV-based DE-2 board
    8-warp × 4-thread Simty: 7151 LEs, 12 M9Ks (6%), 72 MHz
    8-warp × 8-thread Simty: 12765 LEs, 24 M9Ks (11%), 63 MHz
    The overhead of per-PC control is small

Outline — next: DITVA: CPU with dynamic vectorization

DITVA: Dynamic Inter-Thread Vectorization Architecture

Adds dynamic vectorization capability to an in-order SMT CPU
    Runs existing parallel programs compiled for x86
    Scheduling policy: alternate min(SP:PC) and round-robin

[Figure: baseline 4-thread 4-issue in-order core with 4 scalar units and 2 explicit-SIMD units, versus 4-warp × 4-thread 4-issue DITVA with 4 SIMT units]

DITVA performance

Speedup of 4-warp × 2-thread DITVA and 4-warp × 4-thread DITVA over the baseline 4-thread processor: +18% and +30% performance, respectively, on SPMD workloads

Outline — next: SBI: GPU with parallel path execution

Simultaneous Branch Interweaving (SBI)

Co-issue instructions from divergent branches
    Fill inactive execution units using the parallelism of divergent paths: two instructions, one from each side of a divergence, issue in the same cycle

[Figure: control-flow graph with blocks 1-7; baseline SIMT serializes the divergent paths, while SBI interleaves them, issuing two instructions in the same cycle]

N. Brunie, S. Collange, G. Diamos. Simultaneous branch and warp interweaving for sustained GPU performance. ISCA 2012.

Conclusion: the missing link

CPU today: multi-core, multi-threaded, with a conventional CPU ISA
GPU today: SIMT execution model
DITVA and SBI sit in between, connecting the two

New design space
    New range of architecture options between multi-cores and GPUs
    Enables heterogeneous platforms with a unified instruction set