CHAINSAW Von-Neumman Accelerators To Leaverage

CHAINSAW
Von-Neumann Accelerators To
Leverage Fused Instruction Chains
Amirali Sharifian, Snehasish Kumar, Apala Guha,
Arrvindh Shriraman
1
AXC Challenge 1: Idleness
Application
DFG
1
Spatial Fabric
81
2
92
12
3
3
Fabric Size
Dataflow graph size
More dataflow dependencies
More Idleness
2
AXC Challenge 2: Data movement
Compute
81
92
30%
12
3
70%
Communication
70% energy for moving data
3
Von-Neumann Features
DFG
Ins. Buffer
1
1
2
Reg
2
3
ALU
3
Temporal Mapping = Less Idleness
Central Register File
Fetch and Decode
4
Our Approach : Fused Instruction Chains
Von-Neumann + Chains
CHAIN DFG
Reg.
1
1
2
2
3
3
Compiler
exposed
Bypass
Temporal Mapping = Less Idleness
Bypass = Internalize communication
5
Our Approach : Fused Instruction Chains
Von-Neumann w/ Chains
CHAIN DFG
 Do chains exist in a DFG?
Reg.
1
1

How
to
form
the
chains?
Compiler
2
2
exposed
Bypass Reg.

What
are
the
challenges?
3
3
 Modeling
and Evaluation
Temporal
Mapping
= Less Idleness
Central Register File
6
CHAINs vs VLIW
Chains
VLIW
Finding dependent
instructions
Finding independent
instructions
Vertical Fusion
Horizontal Fusion 7
Do chains exist in a DFG ?
100
1-2 Op
3-4 Ops
5+ Ops
50
0
50–80% of DFG part of 3+ op chains
8
How to form chains? Reduce Communication
Chained DFG
C1
C2
1
4
2
5
3
6
Schedule
C1
1
2
3
C2
4
5
6
Internalize communication
May fail to discover ILP
9
How to form chains? Optimize for ILP
Chained DFG
C1
Schedule
C1
C2
1
1
4
C3
C3
2
5
3
2
3
C2
4
5
6
6
Same ILP as the prog.
Increased communication
10
How much communication is within chains?
Inter-Chain
Intra-Chain
100
80
60
40
20
0
ILP+
SIZE+
9 Workloads
(dwt, soplex)
ILP+
SIZE+
9 Workloads
(parser, craft)
ILP+
SIZE+
8 Workloads
(gcc, namd)
40-60% of communication localized
11
How to extract – longer – Chains
Control Flow
12
How to extract – longer – Chains
GUARD
Control Flow
Larger Superblocks/Paths ⇒ Larger chains
13
CHAINSAW is an Accelerator
IN/ OUT Reg
IN/ OUT Reg
OOO
Core
 Control
free
HOT PATH
Ins.
Buffer
Ins.
Buffer
Reg.
File
Fetch/ Decode
Fetch/ Decode
Bypass Reg
BUS
Cache Mem.
Reg.
File
Bypass Reg
CHAINSAW
WORKLOAD
 Only hot
paths
 Limited
inst.
buffer
14
Multi-Lane CHAINSAW Execution
Dataflow Graph
1
D2
C2
C0
Lane 1
Lane 2
Ins. Buffer
Ins. Buffer
4
C1
D1
4
2
3
5
1
2
6
C2
C0
C1
5
3
6
D1 D2
Register file
15
Chainsaw – Fetch and Decode
Dataflow Graph
Instruction Fields
C1
D1
4
Op
4
5
5
6
6
IN0/1
WR
FWD
L/R
OUT0/1
1
0
X
0
1
0
0
1
1
X
1
X
0
1
0
Only 13bits is needed to decode!
16
Evaluation – Dynamic Energy
• Chainsaw adds Fetch/Decode cost for dynamic energy
• CGRA network overhead dominate Chainsaw F/D cost
100.0
Communication
80.0
Computation
F/D
OOO-4
F/D cost = 8%
60.0
40.0
20.0
0.0
gzip
bzip2
mcf2k
lbm
parser
45% less than 4-way OOO
14% less than CGRA8x8
dwt53
17
Evaluation – Data movement energy
100
Inter-Chain
80
Intra-Chain
CGRA 8X8
60
40
20
0
gzip
bzip2
mcf2k
lbm
parser
dwt53
Chainsaw reduces 40% of energy
18
Evaluation – Performance
CHAINSAW8
CGRA 8X8
100
80
60
40
20
0
gzip
bzip2
hmmer
lbm
parser
Within 73% of CGRA8x8
20% better than OOO core
dwt53
19
Chainsaw is a Von-Neumman accelerator
 Chains sequentially dependent
operations.
 Chainsaw Accelerator:
 Exploit lack of ILP
 Reduce communication energy
 Reuse functional units
Energy < CGRA
Performance ≃ CGRA
IN/ OUT Reg
IN/ OUT Reg
Ins.
Buffer
Ins.
Buffer
Reg.
File
Reg.
File
Fetch/ Decode
Fetch/ Decode
Bypass Reg
Bypass Reg
BUS
81
92
3
20
Q&A
github.com/sfu-arch/chainsaw
21
AXC Challenge 2: Data movement
Spatial Fabric
81
COMPUTE
92
12
3
SWITCH
50% Energy overhead for
data movement
22
Evaluation – Data movement energy
Intra-Chain
1
0.8
0.6
0.4
0.2
0
gzip
bzip2
mcf2k
lbm
parser
dwt53
Reduced energy in
Chainsaw
Chainsaw internalizes 50%+ of comm.
24
Evaluation – Dynamic Energy
• Chainsaw adds Fetch/Decode cost for dynamic energy
• CGRA network overhead dominate Chainsaw F/D cost
1
0.8
0.6
0.4
0.2
0
Communication
gzip
bzip2
mcf2k
Ops
lbm
F/D
parser
OOO-4
dwt53
13% less than CGRA
45% less than 4-way OOO
25