Hy-C
A Compiler Retargetable for Single-Chip
Heterogeneous Multiprocessors
Philip Sweany
8/27/2010
No single architecture solves all power
problems
[Figure: power efficiency in mW/MIPS (log scale, 0.001 to 10000) versus year (1980 to 2010). Curves: General Purpose Processor, Software Programmable DSP, Hard-wired proxy. Annotations: 10X, 100X.]
• Industry has debated merits
of each architecture for
Retargetable Compilation
• Why ?
• Rocket
– C compiler, written in C++
– Retargetable for ILP computers
– Single machine description file
– Development 1989-2000
• GNU
Hybrid Computing
• Heterogeneous processors on single chip
– “CPU”
– FPGA
– ASIC
– N “CPU”s, M FPGAs, K ASICs
• Tradeoffs of performance, power, flexibility
Generic Hybrid Architecture
[Figure: CPU 1, CPU 2, ..., CPU m (Multi-CPU) and FPGA 1, FPGA 2, ..., FPGA n (Multi-FPGA), with Shared Memory]
Generic Hy-C Tools
[Figure: Hy-C tool flow. Blocks shown: Source Code, Objectives/Constraints, System Specification, Partitioning, CPU Compiler, FPGA Synthesis, CPU Power-Performance Model, FPGA Power-Performance Model, Optimization Control]
Intermediate Representations
• 3-address form
• Control flow graph
• SSA --- static single assignment
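As an illustration of 3-address form, the sketch below lowers a small arithmetic expression into instructions with at most one operator each. It is not Hy-C code: the function name `to_three_address` and the temporaries `t1`, `t2`, ... are hypothetical, and it borrows Python's own `ast` parser only to keep the example short.

```python
import ast

# Operators supported by this sketch (an assumption, not a Hy-C list).
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def to_three_address(expr):
    """Lower an arithmetic expression into 3-address instructions."""
    code = []

    def lower(node):
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left = lower(node.left)
            right = lower(node.right)
            temp = f"t{len(code) + 1}"  # fresh temporary for this result
            code.append(f"{temp} = {left} {OPS[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    result = lower(ast.parse(expr, mode="eval").body)
    return code, result

code, result = to_three_address("a + b * c - 4")
for line in code:
    print(line)
```

Each emitted line names one result and at most two operands, which is what makes later dataflow analysis straightforward.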
Control Flow Graph
• Nodes are Basic Blocks
– Single entry, single exit
– No branch except (possibly) at bottom
• Edges represent one possible flow of
execution between two basic blocks
• Whole CFG represents a function
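The basic-block rules above can be sketched as a small CFG builder. The instruction encoding (`"label"`, `"goto"`, `"if"`, `"ret"`, `"op"` tuples) and the name `build_cfg` are assumptions made for this example, not the Hy-C representation.

```python
def build_cfg(instrs):
    """Split a linear instruction list into basic blocks and connect them.

    Instructions are tuples: ("label", name), ("goto", name),
    ("if", cond, name), ("ret",), or ("op", text) for anything else.
    """
    if not instrs:
        return [], set()

    # A leader starts a block: the first instruction, any label,
    # and any instruction that follows a branch or return.
    leaders = {0}
    for i, ins in enumerate(instrs):
        if ins[0] == "label":
            leaders.add(i)
        if ins[0] in ("goto", "if", "ret") and i + 1 < len(instrs):
            leaders.add(i + 1)
    starts = sorted(leaders)
    blocks = [instrs[s:e] for s, e in zip(starts, starts[1:] + [len(instrs)])]

    # Map each label to the block that begins with it.
    label_block = {b[0][1]: n for n, b in enumerate(blocks) if b[0][0] == "label"}

    edges = set()
    for n, b in enumerate(blocks):
        last = b[-1]
        if last[0] in ("goto", "if"):
            edges.add((n, label_block[last[-1]]))  # taken branch
        if last[0] not in ("goto", "ret") and n + 1 < len(blocks):
            edges.add((n, n + 1))                  # fall-through
    return blocks, edges

prog = [
    ("op", "i = 0"),
    ("label", "loop"),
    ("if", "i >= n", "done"),
    ("op", "s = s + i"),
    ("op", "i = i + 1"),
    ("goto", "loop"),
    ("label", "done"),
    ("ret",),
]
blocks, edges = build_cfg(prog)
print(len(blocks), sorted(edges))
```

Every block produced this way has a single entry (its leader) and a branch only as its last instruction, if at all.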
Static Single Assignment
• SSA: A program is in SSA form iff
– Each variable is statically defined exactly
once, and
– Each use of a variable is dominated by that
variable’s definition.
7/29/2017
Example
X1 =
X2 =
X3 = φ(X1, X2)
X4 =
• In general, how can an arbitrary program be
transformed into SSA form?
• Does the definition of X2 dominate its use in the
example?
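One way to explore the dominance question is to compute dominator sets. The sketch below assumes the example is a diamond-shaped CFG, with X1 defined in one branch, X2 in the other, and the φ at the join; the slide text does not confirm that shape, and the block names are hypothetical.

```python
def dominators(succ, entry):
    """Iterative dominator sets.

    dom[n] = {n} united with the intersection of dom[p] over all
    predecessors p of n. `succ` maps every block to its successors.
    """
    nodes = set(succ)
    preds = {n: set() for n in nodes}
    for n, outs in succ.items():
        for m in outs:
            preds[m].add(n)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n}
            if preds[n]:
                new |= set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# Assumed diamond: X1 defined in "left", X2 in "right", phi in "join".
cfg = {"entry": ["left", "right"], "left": ["join"], "right": ["join"], "join": []}
dom = dominators(cfg, "entry")
print(sorted(dom["join"]))
```

Under that assumption the block defining X2 does not dominate the join block. SSA stays consistent because each φ operand is conventionally treated as used at the end of the corresponding predecessor block, which the definition does dominate.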
SSA: Motivation
• Provide a uniform basis of an IR to solve a wide
range of classical dataflow problems
• Encode both dataflow and control flow information
• An SSA form can be constructed and maintained
efficiently
• It's popular
• GCC uses SSA
Software Pipelining
• Schedule operations from multiple iterations
of a loop in parallel
• Hides latency
• Compiler “reorders” loop code to include:
– Prelude
– Kernel
– Postlude
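The prelude/kernel/postlude structure can be sketched by hand on a two-stage loop (load, then multiply). Python executes this sequentially, so the sketch shows only the reordering a compiler would perform, not the parallel issue; the function names are hypothetical.

```python
def scale_plain(xs, k):
    """Original loop: both stages of one iteration run back to back."""
    out = []
    for x in xs:
        loaded = x              # stage 1: load
        out.append(loaded * k)  # stage 2: multiply and store
    return out

def scale_pipelined(xs, k):
    """Same computation with stage 1 of iteration i overlapped
    with stage 2 of iteration i-1."""
    out = []
    if not xs:
        return out
    loaded = xs[0]               # prelude: stage 1 of iteration 0
    for i in range(1, len(xs)):  # kernel
        out.append(loaded * k)   #   stage 2 of iteration i-1
        loaded = xs[i]           #   stage 1 of iteration i
    out.append(loaded * k)       # postlude: stage 2 of the last iteration
    return out

data = [1, 2, 3, 4]
print(scale_pipelined(data, 10) == scale_plain(data, 10))
```

On an ILP machine the two statements in the kernel belong to different iterations and have no dependence on each other, so they can issue in the same cycle.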
Software Pipeline Benefit for “Typical”
Architecture and MMult
• “Typical” Architecture
– 8-wide Instruction-Level Parallel (ILP)
• Assuming 3000 x 3000 matrices
– Original requires 45 million cycles
– Pipelined version requires 3 million + 15 cycles
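One reading that reproduces these figures, assumed here rather than stated on the slide: about 3 million loop iterations, 15 cycles per iteration without overlap, and a pipelined kernel that starts one iteration per cycle with 15 cycles of fill and drain overhead. The function and parameter names are hypothetical.

```python
def unpipelined_cycles(iterations, cycles_per_iteration):
    """Each iteration runs to completion before the next one starts."""
    return iterations * cycles_per_iteration

def pipelined_cycles(iterations, initiation_interval, fill_cycles):
    """A new iteration starts every `initiation_interval` cycles,
    plus a fixed overhead to fill and drain the pipeline."""
    return iterations * initiation_interval + fill_cycles

n = 3_000_000  # assumed iteration count
original = unpipelined_cycles(n, 15)
pipelined = pipelined_cycles(n, 1, 15)
print(original, pipelined)
```

Under these assumptions the speedup approaches the per-iteration latency (about 15x), since the fill and drain overhead is negligible next to the iteration count.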
Current Compiler Projects
• Hy-C
– Build tools
– Partition algorithms
– Retargetability and constraint specification
– OMAP project
• Thread-level parallelism in imperative code
– Limit study
– Improved identification of threads
• Fast compiler-controlled memory
OMAP4 Sub-System Encapsulation
[Figure: OMAP4 block diagram. Labels shown: OMAP4 Application; Chiron (2xCortex-A9); HLOS; Apps / Frameworks; Camera Control; Ducati; Tesla; C64x; DSP/BIOS; IVA HD Video; ISS Camera Imaging; OMX Image; OMX Video; OMX Audio; Distributed OpenMAX; IPC; Image HWA; Video HWA; Programmable Image/Video; GFX OpenGL; DSS; Displays; LCD; HDMI; Analog TV; Audio; Audio Back-End; Storage; USB; I/O & Peripherals; Power; 3P extensions]
OMAP Resources
[Figure: Chiron, Tesla, and Ducati (Multi-CPU) with Shared Memory]
OMAP Processor Resources
• Chiron
– 2 x 600 MHz (2 symmetric processors each at 600 MHz
with shared L2)
– Power 600uW / MHz
• Tesla
– DSP Sub-System (C64x derivative); 400 MHz, 8-wide ILP
– Power 200uW / MHz
• Ducati
– 200 MHz (targeted for control, low latency code)
– Power 100uW / MHz
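The figures above can be turned into rough per-processor power estimates by multiplying clock rate by power per MHz. This assumes power scales linearly with frequency and that the Chiron figure applies per core; neither assumption is stated on the slide, and the names in the code are hypothetical.

```python
# (clock in MHz, power in uW per MHz), copied from the list above.
RESOURCES = {
    "Chiron (per core)": (600, 600),
    "Tesla": (400, 200),
    "Ducati": (200, 100),
}

def active_power_mw(freq_mhz, uw_per_mhz):
    """Estimated active power in milliwatts at full clock rate."""
    return freq_mhz * uw_per_mhz / 1000.0

for name, (mhz, uw) in RESOURCES.items():
    print(f"{name}: {active_power_mw(mhz, uw):.0f} mW")
```

Estimates like these are the kind of input a partitioner needs when trading performance against power across the three resources.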
Hy-C for OMAP
[Figure: Hy-C tool flow for OMAP. Blocks shown: Source Code, Objectives/Constraints, System Specification, Partitioning, Veyron, Ducati, Tesla, Optimization Control]
OMAP Project, Current State
• Use gcc to generate “readable” SSA graphs for C
programs
• Developing translator to convert SSA graphs to
Hy-C internal Control, Data Dependence Graphs
(CDDGs).
• Translator to Hy-C CDDGs successfully tested on
small C programs
Partition Algorithm
• Examine Control Flow Graph (CFG) for a
function
– Identify software pipelining possibility
– Build Dependence Graph (combining data and
control dependence)
• Choose one of three resources for the
function
Partition Algorithm (cont.)
• If software pipelining profitable, place
function on C64 DSP resource
• Else examine Dependence Graph
– if ( number of nodes / critical path length ) > 1.5,
place on double-issue ARM
– else place on single-issue ARM
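The decision rule on these two slides can be sketched as follows. The sketch assumes the dependence graph is acyclic, that every node has unit latency, and that critical path length is counted in nodes; the slides do not specify these details. Whether software pipelining is profitable is passed in as a flag, and all function names are hypothetical.

```python
def critical_path_length(nodes, edges):
    """Longest dependence chain, counted in nodes (unit latency assumed)."""
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
    memo = {}

    def longest_from(n):
        if n not in memo:
            memo[n] = 1 + max((longest_from(m) for m in succ[n]), default=0)
        return memo[n]

    return max((longest_from(n) for n in nodes), default=0)

def choose_resource(nodes, edges, pipelining_profitable):
    """Pick one of the three resources for a function."""
    if pipelining_profitable:
        return "C64 DSP"
    if not nodes:
        return "single-issue ARM"  # nothing to parallelize
    ratio = len(nodes) / critical_path_length(nodes, edges)
    return "double-issue ARM" if ratio > 1.5 else "single-issue ARM"

chain = (["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d")])
wide = (["a", "b", "c", "d", "e", "f"],
        [("a", "c"), ("b", "c"), ("d", "f"), ("e", "f")])
print(choose_resource(*chain, False))
print(choose_resource(*wide, False))
```

The ratio of node count to critical path length is a simple estimate of available instruction-level parallelism: a pure chain gives 1.0, while independent operations push it higher.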
Long-Term Future
• Automatic Code Generation (I don’t believe in
software)
• Visual Programming of Components