Slides

Dynamic Acceleration Management
of SystemC Emulation
Scott Sirowy, Chen Huang, and Frank Vahid†
Department of Computer Science and Engineering
University of California, Riverside
{ssirowy, chuang, vahid}@cs.ucr.edu
†Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported in part by the National Science
Foundation and the Office of Naval Research
Introduction: Prototyping Circuits and Systems
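[Figure: an Edge Detector connected to a Memory Controller through address, data, and go signals. The edge detector datapath takes pixel inputs s1 to s4 and s6 to s9 through adders and a subtractor, then a MIN against 255, to produce the pixel value. The circuit is described in SystemC.]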
Capture in HDL
class EDGE_DETECTOR : public sc_module {
    // signal declarations
    …
    EDGE_DETECTOR() {
        SC_METHOD(mainComp);
        sensitive << dataReady;
        SC_METHOD(getPixel);
        sensitive << clock.pos();
    }
};
SystemC
C++ based
Creation, instantiation, and connection of components
Precisely timed communication and execution among concurrently executing components
Supports both "software" and "hardware" constructs and semantics
2/17
Introduction: Prototyping Circuits and Systems
[Figure: the same Edge Detector and Memory Controller circuit, with its SystemC description (class EDGE_DETECTOR), as on the previous slide.]
Capture in HDL
In-System Emulation
Prior to time-consuming mapping and synthesis
Emulation: quickly obtained simulation interaction with real I/O, but slower
3/17
SystemC Emulation Engine

[Figure: block diagram of the emulation engine, representative of many systems.]
Emulation Engine blocks: Main Processor, Instruction Memory, Read Signal Memory, Write Signal Memory, Input Memory, Output Memory, USB Interface
Real I/O Peripherals: UART, Buttons, LEDs
Emulation Engine Kernel
Virtual Machine
Discrete Event Kernel
Peripheral Access and Hooks
USB Download Interface
[Figure: kernel layers labeled Event Kernel and Virtual Machine, I/O Peripherals, and Peripherals.]
4/17
Emulation Engine Acceleration
[Figure: the emulation engine block diagram, with the main processor executing SystemC bytecode.]
For some SystemC applications, emulation can be slow
An Edge Detection circuit required ~10 minutes to process a 320x240 image *
5/17
* On a 100 MHz/SRAM MicroBlaze SystemC Emulation Engine implementation
Emulation Engine Acceleration
[Figure: the emulation engine block diagram, with the main processor executing SystemC bytecode and an FPGA holding Accelerator 1, Accelerator 2, and Accelerator 3.]
For some SystemC applications, emulation can be slow
An Edge Detection circuit required ~10 minutes to process a 320x240 image *
If available, use platform FPGA to create bytecode accelerators
Accelerators execute SystemC bytecode natively
Accelerators speed up emulation by over 100X
* On a 100 MHz MicroBlaze SystemC Emulation Engine implementation
6/17
Emulation Engine Acceleration Management
Image Processing System
[Figure: the emulation engine block diagram with an FPGA offering Accelerator 1, Accelerator 2, and Accelerator 3 as available accelerators, and an event queue holding Edge and Blur processes.]
Emulate in software, or accelerate using a bytecode accelerator?
[Figure: total execution time for three options. Emulate every process in software. Accelerate every process, which adds communication and accelerator loading overhead. Dynamically manage accelerators.]
7/17
Emulation Engine Acceleration Management
Image Processing System
[Figure: the emulation engine block diagram with an FPGA offering Accelerator 1, Accelerator 2, and Accelerator 3, and an event queue holding Edge and Blur processes. An online decision chooses whether to accelerate each process.]
History Table
Process   # of uses   In Accelerator
Edge      8           Yes
Emboss    4           No
Mean      3           No
Blur      5           Yes
Radial    2           No
8/17
Dynamic Accelerator Management
Process   Emulation (ms)   Emulation + Accelerator (ms)   Loading Time (ms)
p1        80               10                             70
p2        60               20                             60
p3        70               25                             30
Available Acceleration Engines: Accelerator 1, Accelerator 2
Initial state: Accelerator 1 and Accelerator 2 are preloaded with processes p1 and p2 from the SystemC circuit.
[Figure: timeline (ms) of the event queue executed with statically preloaded accelerators.]
9/17
Dynamic Accelerator Management
Process   Emulation (ms)   Emulation + Accelerator (ms)   Loading Time (ms)
p1        80               10                             70
p2        60               20                             60
p3        70               25                             30
Available Acceleration Engines: Accelerator 1, Accelerator 2
[Figure: three timelines (ms) for the same event queue, with loading time marked. Statically preloaded schedule. Greedy schedule, annotated 3 -> 2, 2 -> 1, 1 -> 3, 3 -> 2. Better schedule, annotated 3 -> 1.]
10/17
Aggregate Gain Solution: AG Table
Gain = Emulation only - (Emulation + Accelerator)
Maintain a gain table for each process in the SystemC circuit: ag(i) = ag(i) + gain(i)
Fading process for temporal locality: ag(i) = ag(i) * f
How to define fading factor f? f ∝ min{Loading time / Gain, 1}

Process   Emulation_only   Emulation+Acc   Gain
p1        200              10              190
p2        100              20              80
p3        50               25              25

Without fading:
Q = <p1, p1, p3, p2, p2, p1, p3>
ag(1)   190  380  380  380  380  570  570
ag(2)     0    0    0   80  160  160  160
ag(3)     0    0   25   25   25   25   50

With fading, F = 0.5 (for example, the second ag(1) entry is 190*F + 190):
Q = <p1, p1, p3, p2, p2, p1, p3>
ag(1)   190  285  142   71   35  207  103
ag(2)     0    0    0   80  120   60   30
ag(3)     0    0   25   12    6    3   26
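The two ag traces above can be reproduced with a short sketch. This is a minimal illustration, not the authors' implementation: the function name ag_trace and its parameters are hypothetical, the fading factor is passed as a fraction (1/1 for no fading, 1/2 for F = 0.5), and integer division is assumed because it matches the truncated values shown on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Aggregate-gain (AG) table update, following the slide's worked example.
// gain[i]  : per-execution gain of process i (emulation-only time minus
//            accelerated time).
// queue    : process indices in event-queue order.
// fade_num / fade_den : fading factor as a fraction (assumed representation).
// Returns the AG table after each event. Integer division is an assumption
// chosen to match the truncated numbers on the slide.
std::vector<std::vector<int>> ag_trace(const std::vector<int>& gain,
                                       const std::vector<std::size_t>& queue,
                                       int fade_num, int fade_den) {
    assert(fade_den > 0 && fade_num >= 0 && fade_num <= fade_den);
    std::vector<int> ag(gain.size(), 0);
    std::vector<std::vector<int>> trace;
    for (std::size_t p : queue) {
        assert(p < gain.size());
        for (int& v : ag) {
            v = v * fade_num / fade_den;  // fade every entry: ag(i) = ag(i) * f
        }
        ag[p] += gain[p];                 // credit the executed process
        trace.push_back(ag);
    }
    return trace;
}
```

Calling ag_trace with gains {190, 80, 25} and the queue <p1, p1, p3, p2, p2, p1, p3> (indices 0, 0, 2, 1, 1, 0, 2) gives the unfaded table with the fraction 1/1 and the faded table with 1/2. Fading lets a process that has stopped appearing lose its claim on an accelerator over time.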
11/17
AG: Overheads And Replacement Policy
[Figure: the emulation engine block diagram with an FPGA holding Accelerator 1, and an event queue holding Edge and Blur processes.]

Process   Emulation_only   Emulation+Acc   Gain
Edge      100              20              80
Blur      200              10              190

ag(Edge)   80
ag(Blur)    0

Loading time (LT): accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
Load: ag(i) > Overhead
12/17
AG: Overheads And Replacement Policy
[Figure: the emulation engine block diagram with an FPGA whose Accelerator 1 holds Edge, and an event queue holding Blur and Edge processes.]

Process   Emulation_only   Emulation+Acc   Gain
Edge      100              20              80
Blur      200              10              190

ag(Edge)   80   80
ag(Blur)    0  190

Loading time (LT): accelerator loading time(i)
Communication overhead (CO): (Dλ' - Dλ) * runtime(i)
Overhead = LT + CO
Policies:
Load: ag(i) > Overhead
Replace: ag(i) > Overhead + ag(j) (j is the accelerator to be replaced)
Wait: ag(i) > Overhead + ag(j) + wait_time(j)
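The three inequalities can be sketched as a single decision function. This is an illustration under stated assumptions rather than the deck's implementation: the names Action, Candidate, Resident, and decide are hypothetical, and the slide does not say when each rule applies, so the sketch assumes Load is tested when an accelerator is free, Replace when the occupied accelerator is idle, and Wait when it is still busy for wait_time(j).

```cpp
#include <cassert>

// Possible outcomes for the process at the head of the event queue.
enum class Action { Emulate, Load, Replace, Wait };

// Candidate process i: its aggregate gain plus the two overhead terms (ms).
struct Candidate { int ag; int loading_time; int comm_overhead; };

// Process j currently held by the accelerator considered for replacement.
// wait_time is the time until that accelerator is free (0 if idle).
struct Resident { int ag; int wait_time; };

// Applies the slide's inequalities. Which rule is tested in which situation
// is an assumption of this sketch, not something the slide states.
Action decide(const Candidate& c, bool free_accelerator, const Resident& victim) {
    assert(c.loading_time >= 0 && c.comm_overhead >= 0 && victim.wait_time >= 0);
    const int overhead = c.loading_time + c.comm_overhead;  // Overhead = LT + CO
    if (free_accelerator) {
        return c.ag > overhead ? Action::Load : Action::Emulate;
    }
    if (victim.wait_time == 0) {
        return c.ag > overhead + victim.ag ? Action::Replace : Action::Emulate;
    }
    return c.ag > overhead + victim.ag + victim.wait_time ? Action::Wait
                                                          : Action::Emulate;
}
```

With the slide's values ag(Blur) = 190 and ag(Edge) = 80, and an illustrative overhead of 60 ms (not taken from the slides), Blur replaces Edge on an idle accelerator because 190 > 60 + 80. If the accelerator were busy for another 60 ms, the Wait test 190 > 200 would fail and Blur would be emulated in software.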
13/17
Comparison Solutions
[Figure: the emulation engine block diagram with an FPGA holding Accelerator 1, Accelerator 2, Accelerator 3, … Accelerator n.]
Base Emulation: Emulating every process on main processor
Infinite Accelerators: Accelerating every process w/o loading overhead
14/17
Comparison Solutions
[Figure: the emulation engine block diagram with an FPGA holding Accelerator 1, Accelerator 2, and Accelerator 3.]
Base Emulation: Emulating every process on main processor
Infinite Accelerators: Accelerating every process w/o loading overhead
Static preloaded: Each accelerator is statically assigned a process to execute when on the event queue and never changes
Greedy: Always assign the current process on the event queue to an accelerator
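The four comparison solutions can be contrasted with a small cost model. This is a sketch under several assumptions: the struct and function names are hypothetical, the greedy eviction order (round-robin over accelerators) is not specified in the slides, and the timings used in the usage note read the earlier p1 to p3 table as emulation, emulation + accelerator, and loading time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-process timings in ms (hypothetical struct; field names are illustrative).
struct Cost { int emu; int acc; int load; };
using Queue = std::vector<std::size_t>;  // process indices in event order
using Slots = std::vector<std::size_t>;  // process held by each accelerator

static bool resident(const Slots& s, std::size_t p) {
    return std::find(s.begin(), s.end(), p) != s.end();
}

// Base emulation: every process runs on the main processor.
int base_time(const std::vector<Cost>& c, const Queue& q) {
    int t = 0;
    for (std::size_t p : q) { assert(p < c.size()); t += c[p].emu; }
    return t;
}

// Infinite accelerators: every process is accelerated, no loading overhead.
int infinite_time(const std::vector<Cost>& c, const Queue& q) {
    int t = 0;
    for (std::size_t p : q) { assert(p < c.size()); t += c[p].acc; }
    return t;
}

// Statically preloaded: the slot assignment never changes.
int static_time(const std::vector<Cost>& c, const Queue& q, const Slots& slots) {
    int t = 0;
    for (std::size_t p : q) {
        assert(p < c.size());
        t += resident(slots, p) ? c[p].acc : c[p].emu;
    }
    return t;
}

// Greedy: always run the current process on an accelerator, loading it first
// if needed. Round-robin eviction is an assumption of this sketch.
int greedy_time(const std::vector<Cost>& c, const Queue& q, Slots slots) {
    assert(!slots.empty());
    int t = 0;
    std::size_t next_victim = 0;
    for (std::size_t p : q) {
        assert(p < c.size());
        if (!resident(slots, p)) {
            slots[next_victim] = p;
            next_victim = (next_victim + 1) % slots.size();
            t += c[p].load;  // pay the loading time on every miss
        }
        t += c[p].acc;
    }
    return t;
}
```

For costs {80, 10, 70}, {60, 20, 60}, {70, 25, 30}, two accelerators preloaded with p1 and p2, and an illustrative queue <p1, p2, p3, p1, p2, p3> (not from the slides), this model gives 420 ms for base emulation, 110 ms for infinite accelerators, 200 ms for static preloading, and 300 ms for greedy. Greedy loses to static preloading here because every miss pays the loading time, which is the cost that dynamic management tries to avoid.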
15/17
Experiments and Results
[Figure: two bar charts of total execution time (ms) for the Random, Biased, and Periodic categories, comparing Base emulator, Infinite Accelerators, Statically Preloaded, Greedy, and AG. (a) Virtex 4 ML403: 1 Accelerator. (b) Virtex2Pro: 3 Accelerators. Bars that exceed the plotted range are labeled with their values: 617, 397, 651, 389, 622, 428.]
Aggregate Gain Algorithm on average 3.8X faster than statically preloading accelerators
1.3X faster than greedily assigning accelerators
16/17
Conclusions

SystemC emulation can be improved by dynamically managing the SystemC bytecode accelerators
Applied the online Aggregate Gain Algorithm to the SystemC emulation framework
Improves emulation performance by 14X compared to emulating all of the SystemC on a base software emulation kernel
3.8X performance improvement over statically preloading the accelerator engines
17/17