gem5-Aladdin Integration for Heterogeneous SoC Modeling

Y. Sophia Shao, Sam Xi,
Gu-Yeon Wei, David Brooks
Harvard University
More accelerators.
[Figure: annotated die photo showing out-of-core accelerators. Die photo from Chipworks; accelerators annotated by Sophia Shao @ Harvard; estimates from Maltiel Consulting] [Shao, et al., IEEE Micro]
Accelerator-CPU Integration:
Today’s Conventional SoCs
• Easy to integrate lots of IP, simple accelerator
design
• Hard to program and share data
[Figure: block diagram of a conventional SoC. Labels: Core, L2 $, L3 $, Acc #1 … Acc #n, Scratchpad (one per accelerator), On-Chip System Bus, DMA]
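The "hard to program and share data" point can be illustrated with a small toy. The Python sketch below is an illustration only and is not taken from the slides: `dma_copy`, `run_offload`, and the list-based memories are hypothetical, and timing is ignored. It shows the explicit copy-in, compute, copy-out staging that software has to manage when an accelerator can only see its private scratchpad.

```python
def dma_copy(src, src_off, dst, dst_off, n):
    """Hypothetical DMA transfer: copy n words from src to dst."""
    dst[dst_off:dst_off + n] = src[src_off:src_off + n]


def run_offload(host_mem, in_off, out_off, n, spad_size, kernel):
    """Process n words of host memory through a scratchpad of spad_size words.

    The accelerator (modeled by `kernel`) never touches host_mem directly;
    every chunk is staged in and out of the scratchpad by explicit DMA calls.
    """
    spad = [0] * spad_size
    for base in range(0, n, spad_size):
        chunk = min(spad_size, n - base)
        dma_copy(host_mem, in_off + base, spad, 0, chunk)    # copy-in
        for i in range(chunk):                               # compute in place
            spad[i] = kernel(spad[i])
        dma_copy(spad, 0, host_mem, out_off + base, chunk)   # copy-out


# Example: double 10 input words using a 4-word scratchpad.
host_mem = list(range(10)) + [0] * 10
run_offload(host_mem, in_off=0, out_off=10, n=10, spad_size=4,
            kernel=lambda x: 2 * x)
print(host_mem[10:])
```

Even in this toy, the caller has to know the scratchpad size and manage offsets by hand, which is the programming burden the slide points to.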
Accelerator Integration Trend
• Users design application-specific hardware accelerators.
• System vendors provide Host Service Layer with virtual
memory and cache coherence support
– Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
– IBM POWER8’s Coherent Accelerator Processor Interface
(CAPI)
[Figure: block diagram. Labels: Main CPU/SoC (Core, L2 $, L3 $, Acc Agent); FPGA or user-defined ASIC (Accelerator, Host Service Layer)]
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator
[Figure: Aladdin takes unmodified C code and accelerator design parameters (e.g., # FU, mem. BW) and reports power/area and performance. Other labels: accelerator-specific datapath, private L1/scratchpad, shared memory/interconnect models]
• “Accelerator Simulator”: design accelerator-rich SoC fabrics and memory systems
• “Design Assistant”: understand the algorithmic-HW design space before RTL
[Figure labels: Flexibility, Programmability, Design Cost]
Aladdin Overview
• Optimization Phase: C Code → Optimistic IR → Initial DDDG → Idealistic DDDG
• Realization Phase: Idealistic DDDG + Acc Design Parameters → Program-Constrained DDDG → Resource-Constrained DDDG
• Outputs: Performance; Activity → Power/Area Models → Power/Area
DDDG = Dynamic Data Dependence Graph
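As a rough illustration of what a resource-constrained DDDG implies for cycle counts, the sketch below schedules a tiny dependence graph under a limit on functional units. It assumes unit-latency operations and a simple first-come list scheduler; the function name `schedule_dddg` and the example graph are hypothetical, and this is not Aladdin's actual implementation.

```python
from collections import defaultdict


def schedule_dddg(num_nodes, edges, num_fus):
    """Schedule unit-latency nodes, issuing at most num_fus per cycle.

    edges is a list of (producer, consumer) pairs. Returns a tuple
    (cycle_of, total_cycles), where cycle_of[n] is the issue cycle of node n.
    """
    preds_left = [0] * num_nodes
    succs = defaultdict(list)
    for src, dst in edges:
        succs[src].append(dst)
        preds_left[dst] += 1

    ready = [n for n in range(num_nodes) if preds_left[n] == 0]
    cycle_of = [None] * num_nodes
    cycle = 0
    while ready:
        # Issue up to num_fus ready nodes; the rest wait for a later cycle.
        issued, ready = ready[:num_fus], ready[num_fus:]
        for n in issued:
            cycle_of[n] = cycle
        # Consumers become ready only after all producers have issued,
        # so they can start in the next cycle at the earliest.
        for n in issued:
            for s in succs[n]:
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)
        cycle += 1
    return cycle_of, cycle


# Example graph: four independent ops (0-3) feeding a two-level reduction.
EDGES = [(0, 4), (1, 4), (2, 5), (3, 5), (4, 6), (5, 6)]
for fus in (1, 2, 4):
    print(fus, "FUs ->", schedule_dddg(7, EDGES, fus)[1], "cycles")
```

With this graph the schedule takes 7, 4, and 3 cycles for 1, 2, and 4 functional units, which shows how a design parameter such as the number of FUs changes the estimated cycle count for the same dependence graph.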
Aladdin Take-Away
• Compared to HLS and hand-written RTL for SHOC benchmarks and custom accelerator designs:
– Cycle counts: within 2%
– Power: within 5%
– Area: within 7%
• Large design space exploration (DSE) in minutes instead of
hours/days with unmodified C/C++ algorithm description
• Limitations
– Dynamic approach → Aladdin depends on realistic workload inputs
– Algorithm dependent → Aladdin enables DSE/algorithm exploration
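A design space sweep of the kind described above can be sketched as follows. This is a minimal sketch under stated assumptions: `toy_model` is a made-up analytical stand-in for a simulator run (its cycle and power formulas are invented for illustration), and filtering to a Pareto front is one common way to summarize a sweep, not something the slides prescribe.

```python
import math


def toy_model(num_fus, work_items=12):
    """Hypothetical stand-in for one simulator run: returns (cycles, power_mw)."""
    cycles = math.ceil(work_items / num_fus)
    power_mw = 5.0 + 2.0 * num_fus  # invented static + per-FU cost
    return cycles, power_mw


def pareto_front(points):
    """Keep (label, cycles, power) points that no other point dominates."""
    front = []
    for p in points:
        dominated = any(
            q[1] <= p[1] and q[2] <= p[2] and (q[1] < p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Sweep one design parameter (number of functional units) from 1 to 8.
sweep = [(fus, *toy_model(fus)) for fus in range(1, 9)]
for fus, cycles, power in pareto_front(sweep):
    print(f"{fus} FUs: {cycles} cycles, {power} mW")
```

In this toy sweep, configurations with 5, 7, and 8 FUs drop out because a cheaper configuration reaches the same cycle count.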
Aladdin enables pre-RTL simulation of
accelerators with the rest of the SoC.
[Figure: simulated SoC components. Big cores (gem5), small cores (gem5), GPU (GPGPU-Sim), shared resources (Ruby/GARNET), memory interface (DRAMSim2), and a sea of fine-grained accelerators]
gem5-Aladdin Integration
[Figure: one accelerator integrated with the CPU. Labels: Acc Datapath, Scratchpad, DMA Engine, TLB, Cache (accelerator side); CPU, Cache; LLC; DRAM]
[Figure: multiple accelerators integrated with the CPU. Labels: Acc Datapath, Scratchpad, DMA Engine, TLB, Cache (per accelerator); Acc Shared Cache; CPU, Cache; LLC; DRAM]
[Figure labels: Acc, Cache, Memory; CPU, Cache, Memory]
Heterogeneous SoC Modeling
• An increasing number of accelerators are being
integrated into both mobile SoCs and servers.
• gem5-Aladdin integration enables rapid design
space exploration of future accelerator-centric
platforms.
• Download Aladdin at
http://vlsiarch.eecs.harvard.edu/aladdin