Tutorial Outline
Time                   Topic
9:00 am – 9:30 am      Introduction
9:30 am – 10:10 am     Standalone Accelerator Simulation: Aladdin
10:10 am – 10:30 am    Standalone Accelerator Generation: High-Level Synthesis
10:30 am – 11:00 am    HLS-Based Accelerator-Rich Architecture Simulation: PARADE
11:00 am – 11:30 am    Break
11:30 am – 12:00 pm    Pre-RTL SoC Simulation: gem5-Aladdin
12:00 pm – 12:30 pm    FPGA Prototyping: ARACompiler
12:30 pm – 2:00 pm     Lunch
2:00 pm – 3:00 pm      Panel on Accelerator Research
3:00 pm – 3:30 pm      Accelerator Benchmarks and Workload Characterization
3:30 pm – 4:00 pm      Break
4:00 pm – 5:00 pm      Hands-on Exercise
Integration for Heterogeneous SoC Modeling
Yakun Sophia Shao, Sam Xi,
Gu-Yeon Wei, David Brooks
Harvard University
Accelerator-CPU Integration: Today’s Conventional SoCs
• Easy to integrate lots of IP, simple accelerator design
• Hard to program and share data
[Figure labels: Core, L2 $, …, L3 $; Acc #1 … Acc #n, each with a Scratchpad; DMA; On-Chip System Bus.]
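The "hard to program and share data" point above can be illustrated with a minimal C sketch. It assumes a pure software model in which the scratchpad is a private buffer and a DMA transfer is a plain copy. All names here (SPAD_WORDS, dma_copy, acc_scale, offload_scale) are hypothetical and do not come from the slides or from any real SoC API.

```c
#include <stddef.h>
#include <string.h>

#define SPAD_WORDS 64  /* hypothetical scratchpad capacity, in floats */

/* Private accelerator scratchpad: in this model the CPU never touches it directly. */
static float scratchpad[SPAD_WORDS];

/* Stand-in for a DMA transfer, modeled as a plain memory copy. */
static void dma_copy(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));
}

/* Toy accelerator kernel: scales the first n scratchpad words in place. */
static void acc_scale(size_t n, float k) {
    for (size_t i = 0; i < n; i++) {
        scratchpad[i] *= k;
    }
}

/* The programmer tiles the data and moves every tile explicitly.
 * Returns the number of DMA transfers issued. */
size_t offload_scale(float *data, size_t n, float k) {
    size_t transfers = 0;
    for (size_t base = 0; base < n; base += SPAD_WORDS) {
        size_t len = (n - base < SPAD_WORDS) ? (n - base) : SPAD_WORDS;
        dma_copy(scratchpad, data + base, len);  /* host memory -> scratchpad */
        acc_scale(len, k);
        dma_copy(data + base, scratchpad, len);  /* scratchpad -> host memory */
        transfers += 2;
    }
    return transfers;
}
```

In this sketch the tiling loop and both copies are the caller's responsibility, which is the programming burden the slide refers to.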
Accelerator Integration Trend
• Users design application-specific hardware accelerators.
• System vendors provide Host Service Layer with virtual memory and cache coherence support
– Intel QuickAssist QPI-Based FPGA Accelerator Platform (QAP)
– IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
[Figure labels: Main CPU/SoC; Core, L2 $, …, L3 $; FPGA or user-defined ASIC; Accelerator; Acc Agent; Host Service Layer.]
IBM CAPI: Two-part solution
• Example of state-of-the-art:
– IBM POWER8’s Coherent Accelerator Processor Interface (CAPI)
• Virtual Addressing & Data Caching
• Easier, Natural Programming Model
IBM CAPI: Two-part solution
• Coherent Accelerator Processor Proxy (CAPP)
– Snoops PowerBus on behalf of accelerator
• Power Service Layer (PSL)
– Performs address translations, page table walker support
– Provides cache and interface logic
[Figure labels: Core, L2 $, …, L3 $; Accelerator; PCIe; PSL (Cache, TLB, …); CAPP; On-Chip Coherent PowerBus; Memory.]
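The address-translation role listed for the PSL can be sketched as a TLB lookup that falls back to a page-table walk on a miss. This is a generic illustration, not the PSL's actual design. The 4 KiB page size, the direct-mapped 4-entry TLB, the flat 16-entry page table, and every identifier below are assumptions made for the example.

```c
#include <stdint.h>

#define PAGE_SHIFT  12                        /* assumed 4 KiB pages */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1)
#define TLB_ENTRIES 4                         /* assumed direct-mapped TLB */
#define NUM_PAGES   16                        /* assumed tiny address space */

typedef struct {
    int      valid;
    uint32_t vpn;   /* virtual page number */
    uint32_t ppn;   /* physical page number */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES];        /* flat table: index = vpn */
static unsigned tlb_misses;

/* Fill the page table with an arbitrary mapping and clear the TLB. */
void init_translation(void) {
    for (uint32_t v = 0; v < NUM_PAGES; v++) {
        page_table[v] = v + 100;
    }
    for (int i = 0; i < TLB_ENTRIES; i++) {
        tlb[i].valid = 0;
    }
    tlb_misses = 0;
}

unsigned get_tlb_misses(void) {
    return tlb_misses;
}

/* Translate a virtual address; addresses wrap into the 16-page space. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn = (vaddr >> PAGE_SHIFT) % NUM_PAGES;
    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];
    if (!e->valid || e->vpn != vpn) {
        /* TLB miss: "walk" the flat page table and refill the entry. */
        tlb_misses++;
        e->valid = 1;
        e->vpn = vpn;
        e->ppn = page_table[vpn];
    }
    return (e->ppn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}
```

A second access to the same page hits in the TLB and skips the walk, while two pages that map to the same TLB slot evict each other.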
But… accelerators are not one size fits all
• Problem: PSL layer consumes ~20-30% of FPGA resources… for one accelerator
• Applications have drastically different requirements.
• Memory design customization is often more important than datapath customization
gem5-Aladdin Integration
[Figure labels: Acc Datapath; Scratchpad; TLB; Cache; DMA Engine; CPU; Cache; LLC; DRAM.]
Code example: Sift
void imsmooth(F2D* array, float sigma, F2D* product);

void sift() {
  …
  imsmooth(I, temp, gss[0]);
  // Register each array the accelerator accesses, by name.
  mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
  mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
  // Invoke the accelerator and block until it completes.
  invokeAcceleratorAndBlock(imsmooth);
  …
}
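One detail in the mapping calls is worth a note. If I is a pointer, as the F2D* parameters of imsmooth suggest, then sizeof(I) evaluates to the size of the pointer, not the size of the image data. The slides do not say whether this is shorthand. The sketch below shows one way to compute the byte size of the whole array, assuming a hypothetical F2D layout (width, height, then the pixel data); the real F2D definition is not shown in the slides, and f2d_alloc and f2d_bytes are illustrative helpers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout; the real F2D definition is not given in the slides. */
typedef struct {
    int   width;
    int   height;
    float data[];   /* flexible array member holding width * height pixels */
} F2D;

/* Allocate a zero-initialized rows x cols image. Returns NULL on failure. */
F2D *f2d_alloc(int rows, int cols) {
    size_t count = (size_t)rows * (size_t)cols;
    F2D *img = calloc(1, sizeof(F2D) + count * sizeof(float));
    if (img != NULL) {
        img->height = rows;
        img->width = cols;
    }
    return img;
}

/* Total bytes occupied by the header plus the pixel data. */
size_t f2d_bytes(const F2D *img) {
    size_t count = (size_t)img->width * (size_t)img->height;
    return sizeof(F2D) + count * sizeof(float);
}
```

Under these assumptions, a mapping call would pass f2d_bytes(I) as its size argument where the slide passes sizeof(I).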
Code example: Sift
void imsmooth(F2D* array, float sigma, F2D* product);

void sift() {
  …
  // imsmooth(I, temp, gss[0]);
  // Register each array the accelerator accesses, by name.
  mapArrayToAccelerator(imsmooth, "array", (void *)I, sizeof(I));
  mapArrayToAccelerator(imsmooth, "product", (void *)product, sizeof(product));
  invokeAccelerator(imsmooth);
  …
}
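The two code slides differ in the final call: invokeAcceleratorAndBlock versus invokeAccelerator. Assuming the second form returns before the accelerator finishes (the slides do not state this), the difference can be sketched with a mock accelerator that raises a completion flag after a fixed latency. MockAccelerator, mock_invoke, mock_tick, run_blocking, and run_overlapped are hypothetical names for this illustration and are not part of gem5-Aladdin.

```c
/* Mock accelerator: finishes a fixed number of ticks after invocation. */
typedef struct {
    volatile int done;   /* completion flag set by the mock accelerator */
    int cycles_left;     /* remaining latency, in ticks */
} MockAccelerator;

/* Start the mock accelerator with the given latency. */
void mock_invoke(MockAccelerator *acc, int latency) {
    acc->done = 0;
    acc->cycles_left = latency;
}

/* Advance the mock accelerator by one tick. */
void mock_tick(MockAccelerator *acc) {
    if (!acc->done && --acc->cycles_left <= 0) {
        acc->done = 1;
    }
}

/* Blocking style: the caller only waits, so no CPU work overlaps.
 * Returns the units of independent CPU work completed (always 0). */
int run_blocking(MockAccelerator *acc, int latency) {
    mock_invoke(acc, latency);
    while (!acc->done) {
        mock_tick(acc);
    }
    return 0;
}

/* Non-blocking style: the caller does one unit of independent work
 * per tick and polls the flag. Returns the units of work completed. */
int run_overlapped(MockAccelerator *acc, int latency) {
    int cpu_work = 0;
    mock_invoke(acc, latency);
    while (!acc->done) {
        cpu_work++;      /* stand-in for useful CPU work */
        mock_tick(acc);
    }
    return cpu_work;
}
```

In this model the overlapped version completes one unit of CPU work per tick of accelerator latency, while the blocking version completes none.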
Start Aladdin Simulation
Simulating Accelerator with Memory System using Aladdin
[Figure, two panels. First panel labels: Acc, Cache, Memory. Second panel labels: Acc, Cache, Memory; CPU, Cache, Memory.]
Modeling Accelerators in an SoC-like Environment
[Figure: Without Memory Contention. Power (mW) versus Time (Million Cycles), with curves for block=16 and block=32. Diagram labels: Acc, Core, Cache, Memory.]
Modeling Accelerators in an SoC-like Environment
[Figure: With Memory Contention. Power (mW) versus Time (Million Cycles), with curves for block=16 and block=32. Diagram labels: Acc, Core, Cache, Memory.]
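The effect of memory contention on accelerator runtime can be illustrated with a toy bus model. This does not reproduce the plotted results. It assumes a single shared bus that serves one request per cycle, an accelerator that always has a request pending, and a CPU that claims the bus every cpu_period cycles and wins the arbitration. The function name and parameters are hypothetical.

```c
/* Toy model of a shared bus, one request served per cycle.
 * The CPU takes the bus on every cycle that is a multiple of cpu_period
 * (cpu_period == 0 means no CPU traffic). The accelerator uses every
 * other cycle. Returns the cycle on which the accelerator's last request
 * is served, or 0 if the accelerator would starve (cpu_period == 1). */
unsigned long acc_finish_cycle(unsigned acc_reqs, unsigned cpu_period) {
    unsigned long cycle = 0;
    unsigned served = 0;

    if (cpu_period == 1) {
        return 0;  /* the CPU owns every cycle in this model */
    }
    while (served < acc_reqs) {
        cycle++;
        int cpu_uses_bus = (cpu_period != 0) && (cycle % cpu_period == 0);
        if (!cpu_uses_bus) {
            served++;
        }
    }
    return cycle;
}
```

In this model, 100 accelerator requests finish at cycle 100 with no CPU traffic, and at cycle 199 when the CPU takes every second cycle. The same work is stretched over a longer time, which is the kind of effect a standalone accelerator simulation cannot capture.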
Accelerator Research Infrastructure
[Figure: tool map with columns Standalone and System Integration. Modeling: Aladdin, gem5-Aladdin. RTL: High-Level Synthesis, PARADE. FPGA Prototyping.]
Tutorial References
• Y.S. Shao and D. Brooks, “ISA-Independent Workload Characterization and its Implications for Specialized Architectures,” ISPASS’13.
• B. Reagen, Y.S. Shao, G.-Y. Wei, D. Brooks, “Quantifying Acceleration: Power/Performance Trade-Offs of Application Kernels in Hardware,” ISLPED’13.
• Y.S. Shao, B. Reagen, G.-Y. Wei, D. Brooks, “Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures,” ISCA’14.
• B. Reagen, B. Adolf, Y.S. Shao, G.-Y. Wei, D. Brooks, “MachSuite: Benchmarks for Accelerator Design and Customized Architectures,” IISWC’14.