
A Compiler Infrastructure
for Stream Programs
Bill Thies
Joint work with Michael Gordon, Michal Karczmarek, Jasper Lin,
Andrew Lamb, David Maze, Rodric Rabbah and Saman Amarasinghe
Massachusetts Institute of Technology
IBM PL Day
May 21, 2004
Streaming Application Domain
• Based on audio, video, or data streams
• Increasingly prevalent and important
– Embedded systems
• Cell phones, handheld computers
– Desktop applications
• Streaming media
• Software radio
• Real-time encryption
• Graphics packages
– High-performance servers
• Software routers (Example: Click)
• Cell phone base stations
• HDTV editing consoles
Properties of Stream Programs
• A large (possibly infinite) amount of data
– Limited lifetime of each data item
– Little processing of each data item
• Computation: apply multiple filters to data
– Each filter takes an input stream, does some
processing, and produces an output stream
– Filters are independent and self-contained
• A regular, static communication pattern
– Filter graph is relatively constant
– A lot of opportunities for compiler optimizations
The StreamIt Project
• Goals:
– Provide a high-level stream programming model
– Invent new compiler technology for streams
• Contributions:
– Language Design, Structured Streams, Buffer Management (CC 2002)
– Exploiting Wire-Exposed Architectures (ASPLOS 2002, ISCA 2004)
– Scheduling of Static Dataflow Graphs (LCTES 2003)
– Domain Specific Optimizations (PLDI 2003)
– Public release: Fall 2003
Outline
• Introduction
• StreamIt Language
• Domain-specific Optimizations
• Targeting Parallel Architectures
• Public Release
• Conclusions
Model of Computation
• Synchronous Dataflow [Lee 1992]
– Graph of independent filters
– Communicate via channels
– Static I/O rates
[Figure: Freq band detector. A/D → Band pass → Duplicate → four parallel branches, each Detect → LED]
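Static I/O rates are what let a compiler schedule the graph ahead of time. A minimal Python sketch of that idea for a simple pipeline follows. It assumes each filter is described only by a hypothetical `(pop, push)` pair, and it is not the StreamIt scheduler. It solves the balance equations so that every channel ends a steady-state iteration with as many items as it started with.

```python
from fractions import Fraction
from math import gcd


def steady_state_repetitions(rates):
    """Smallest positive firing counts for a pipeline of filters.

    rates: list of (pop, push) pairs, source first. Every filter after the
    first is assumed to have a positive pop rate.
    """
    reps = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        # Items produced upstream must equal items consumed downstream.
        reps.append(reps[-1] * push_prev / pop_next)
    # Scale by the least common multiple of the denominators.
    scale = 1
    for r in reps:
        scale = scale * r.denominator // gcd(scale, r.denominator)
    return [int(r * scale) for r in reps]


# Source pushes 1; middle filter pops 3, pushes 2; sink pops 4.
print(steady_state_repetitions([(0, 1), (3, 2), (4, 1)]))  # [6, 2, 1]
```

In the example the source fires 6 times, the middle filter twice, and the sink once, so 6 items cross the first channel and 4 cross the second per iteration.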
Filter Example: LowPassFilter
float->float filter LowPassFilter (int N, float freq) {
    float[N] weights;

    // Runs once, before the first execution of work
    init {
        weights = calcWeights(N, freq);
    }

    // Each firing inspects N items, consumes 1, and produces 1
    work peek N pop 1 push 1 {
        float result = 0;
        for (int i=0; i<weights.length; i++) {
            result += weights[i] * peek(i);
        }
        push(result);
        pop();
    }
}
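The peek/pop/push behavior of the work function can be mimicked outside StreamIt. The Python sketch below is an illustration only: it uses a `deque` as the input channel and takes the weights as an argument, since the slide does not show `calcWeights`.

```python
from collections import deque


def run_low_pass(weights, samples):
    """Fire the work function for as long as N items can be peeked."""
    n = len(weights)
    channel = deque(samples)  # input channel (FIFO)
    output = []
    while len(channel) >= n:
        # peek(i) reads the i-th waiting item without consuming it
        result = sum(w * channel[i] for i, w in enumerate(weights))
        output.append(result)  # push(result)
        channel.popleft()      # pop()
    return output


print(run_low_pass([0.5, 0.5], [1, 3, 5, 7]))  # [2.0, 4.0, 6.0]
```

Because the filter peeks N items but pops only one, consecutive firings overlap on N-1 items.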
Composing Filters: Structured Streams
• Hierarchical structures:
– Pipeline
– SplitJoin
– Feedback Loop
• Basic programmable unit: Filter
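A rough Python sketch of how two of these structures compose is given below. It works on finite lists rather than unbounded streams and omits the feedback loop. The helper names (`pipeline`, `splitjoin_duplicate`, `map_filter`) are made up for illustration and are not StreamIt APIs.

```python
def map_filter(fn):
    """A stateless filter that pops 1 item and pushes 1 item per firing."""
    return lambda items: [fn(v) for v in items]


def pipeline(*stages):
    """Run the stages in sequence; the output of one feeds the next."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin_duplicate(*branches):
    """Duplicate splitter: every branch sees every item.
    Roundrobin joiner: take one item from each branch in turn."""
    def run(items):
        outputs = [branch(list(items)) for branch in branches]
        joined = []
        for group in zip(*outputs):
            joined.extend(group)
        return joined
    return run


stream = pipeline(
    map_filter(lambda v: v + 1),
    splitjoin_duplicate(map_filter(lambda v: v * 2),
                        map_filter(lambda v: -v)),
)
print(stream([1, 2]))  # [4, -2, 6, -3]
```

Each construct has one input and one output, which is what allows them to nest hierarchically.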
Freq Band Detector in StreamIt
void->void pipeline FrequencyBand {
    float sFreq = 4000;
    float cFreq = 500/(sFreq*2*pi);
    float wFreq = 100/(sFreq*2*pi);

    add D2ASource(sFreq);                               // A/D
    add BandPassFilter(100, cFreq-wFreq, cFreq+wFreq);  // Band pass
    add splitjoin {
        split duplicate;                                // Duplicate
        for (int i=0; i<4; i++) {
            add pipeline {
                add Detect (i/4);
                add LED (i);
            }
        }
        join roundrobin(0);
    }
}
Radar-Array Front End
Filterbank
FM Radio with Equalizer
Bitonic Sort
FFT
Block Matrix Multiply
MP3 Decoder
Conventional DSP Design Flow
Signal Processing Expert in Matlab:
• Spec. (data-flow diagram)
• Design the Datapaths (no control flow)
• DSP Optimizations
• Output: Coefficient Tables
Rewrite the program
Software Engineer in C and Assembly:
• Architecture-specific Optimizations (performance, power, code size)
• Output: C/Assembly Code
Any Design Modifications?
• Center frequency from 500 Hz to 1200 Hz?
– According to TI, in the conventional design-flow:
• Redesign filter in MATLAB
• Cut-and-paste values to EXCEL
• Recalculate the coefficients
• Update assembly
Source: Application Report SPRA414, Texas Instruments, 1999
[Figure: Freq band detector. A/D → Band pass → Duplicate → four parallel branches, each Detect → LED]
Ideal DSP Design Flow
Application Programmer:
• Application-Level Design
• Output: High-Level Program (dataflow + control)
Compiler:
• DSP Optimizations
• Architecture-Specific Optimizations
• Output: C/Assembly Code
Challenge: maintaining performance
Our Focus: Linear Filters
• Most common target of DSP optimizations
– FIR filters
– Compressors
– Expanders
– DFT/DCT
• Output is weighted sum of inputs
• Optimizations:
1. Combining Adjacent Nodes
2. Translating to Frequency Domain
3. Selecting the Best Transformations
Extracting Linear Representation
work peek N pop 1 push 1 {
    float sum = 0;
    for (int i=0; i<N; i++) {
        sum += h[i]*peek(i);
    }
    push(sum);
    pop();
}
Linear Dataflow Analysis extracts ⟨A, b⟩ from the work function, such that
y = x A + b   (x: input items, y: output items)
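For the FIR work function above, the linear representation is small enough to write out by hand. The Python sketch below assumes x is the row vector `[peek(0), ..., peek(N-1)]`; the ordering convention used by the actual analysis is not stated on the slide. Under that assumption A is an N-by-1 column holding `h`, and b is zero.

```python
def fir_as_linear(h):
    """Return (A, b) for a filter that pushes sum_i h[i] * peek(i)."""
    A = [[w] for w in h]  # N rows (peeked items), 1 column (pushed item)
    b = [0.0]
    return A, b


def apply_linear(x, A, b):
    """Compute y = x A + b for row vector x, matrix A, and offset b."""
    return [sum(x[i] * A[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


A, b = fir_as_linear([1.0, 2.0, 3.0])
print(apply_linear([4.0, 5.0, 6.0], A, b))  # [32.0]
```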
1) Combining Linear Filters
• Pipelines and splitjoins can be collapsed
• Example: pipeline
– Filter 1: y = xA
– Filter 2: z = yB
– Combined Filter: z = xAB = xC, where C = AB
Combination Example
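Filter 1: $A = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix}$
Filter 2: $B = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$
Original pipeline: 6 mults/output
Combined Filter: $C = AB = \begin{bmatrix} 32 \end{bmatrix}$, 1 mult/output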
Floating-Point Operations Reduction
[Figure: bar chart of Flops Removed (%) per Benchmark (FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA) for the "linear" optimization; vertical axis from -40% to 100%; one bar is labeled 0.3%]
2) From Time to Frequency Domain
• Convolutions can be done cheaply in the Frequency Domain
– Time domain: $\sum_i X_i \cdot W_{n-i}$
– Frequency domain: $X \leftarrow F(x)$ (FFT), $Y \leftarrow X \mathbin{.*} H$ (VVM), $y \leftarrow F^{-1}(Y)$ (IFFT)
• Painful to do by hand
– Blocking
– Multiple outputs
– Coefficient calculations
– Interfacing with FFT library
– Verification
– Startup
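The FFT, multiply, IFFT route can be illustrated with a self-contained Python sketch. It uses a textbook recursive radix-2 FFT and zero-pads to a power of two. It computes one full convolution of two finite sequences and leaves out the blocking, startup, and FFT-library details listed above, so it should be read as an illustration rather than the compiler's transformation.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    With invert=True the result is the unscaled inverse transform."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def convolve_direct(x, h):
    """Time-domain convolution: O(len(x) * len(h)) multiplications."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convolve_fft(x, h):
    """Same result via X = F(x), Y = X .* H, y = F^-1(Y)."""
    m = len(x) + len(h) - 1
    n = 1
    while n < m:
        n *= 2
    X = fft([complex(v) for v in x] + [0j] * (n - len(x)))
    H = fft([complex(v) for v in h] + [0j] * (n - len(h)))
    Y = [a * b for a, b in zip(X, H)]
    y = fft(Y, invert=True)
    return [v.real / n for v in y[:m]]


direct = convolve_direct([1, 2, 3, 4], [1, 0, -1])
fast = convolve_fft([1, 2, 3, 4], [1, 0, -1])
assert all(abs(a - b) < 1e-9 for a, b in zip(direct, fast))
```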
Floating-Point Operations Reduction
[Figure: bar chart of Flops Removed (%) per Benchmark (FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA) for the "linear" and "freq" optimizations; vertical axis from -40% to 100%; bars are labeled 0.3% and -140%]
3) When to Apply Transformations?
• Estimate minimal cost for each structure:
• Linear combination
• Frequency translation
• No transformation
• If hierarchical, consider all rectangular
groupings of children
• Overlapping sub-problems allow efficient
dynamic programming search
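A one-dimensional version of this search can be sketched in Python. The sketch covers only the children of a single pipeline, not the rectangular groupings needed for splitjoins. The cost numbers in `toy_cost` are invented for illustration and are not measurements from the compiler.

```python
def select_transformations(n, group_cost):
    """Cheapest way to cover children 0..n-1 of a pipeline with
    contiguous groups.

    group_cost(i, j) returns {option: cost} for treating children
    i..j-1 as one group. Returns (total_cost, [(start, end, option)]).
    """
    best = [0.0] + [float("inf")] * n
    choice = [None] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            option, cost = min(group_cost(start, end).items(),
                               key=lambda kv: kv[1])
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                choice[end] = (start, option)
    plan = []
    end = n
    while end > 0:
        start, option = choice[end]
        plan.append((start, end, option))
        end = start
    return best[n], plan[::-1]


def toy_cost(i, j):
    """Hypothetical costs: 10 per untouched filter, plus a few
    made-up entries for linear combination and frequency translation."""
    options = {"none": 10.0 * (j - i)}
    table = {(0, 2): {"linear": 8.0},
             (0, 3): {"freq": 25.0},
             (1, 3): {"linear": 30.0}}
    options.update(table.get((i, j), {}))
    return options


print(select_transformations(3, toy_cost))
# (18.0, [(0, 2, 'linear'), (2, 3, 'none')])
```

With these toy costs, combining the first two filters and leaving the third alone beats translating the whole pipeline to the frequency domain.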
Radar (Transformation Selection)
[Figure sequence: Radar stream graph. The original graph has a splitter feeding 12 parallel pipelines (Input → Dec → Cfilt → Dec → CFilt2), a roundrobin (RR) joiner, a Duplicate splitter feeding 4 parallel pipelines (BeamFrm → Filter → Mag → Detect), an RR joiner, and a Sink. The following slides show the same graph with progressively fewer filters.]
• Maximal Combination and Shifting to Frequency Domain: 2.4 times as many FLOPS
• Using Transformation Selection: half as many FLOPS
Floating-Point Operations Reduction
[Figure: bar chart of Flops Removed (%) per Benchmark (FIR, RateConvert, TargetDetect, FMRadio, Radar, FilterBank, Vocoder, Oversample, DToA) for the "linear", "freq", and "autosel" optimizations; vertical axis from -40% to 100%; bars are labeled 0.3% and -140%]
Compiling to the Raw Architecture
1. Partitioning: adjust granularity of graph
2. Layout: assign filters to tiles
3. Scheduling: route items across network
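As a toy illustration of step 1 only, the Python sketch below fuses adjacent filters of a pipeline until there are no more filters than tiles, always merging the adjacent pair with the smallest combined work estimate. This greedy rule and the work numbers are assumptions made for the example. The slides do not describe the partitioner's actual algorithm, and the sketch ignores fission, layout, and routing.

```python
def partition_pipeline(work, tiles):
    """Greedily fuse adjacent filters until len(groups) <= tiles.

    work: static work estimate per filter, in pipeline order.
    tiles: number of available tiles (assumed >= 1).
    Returns (groups, loads): filter indices per tile and their total work.
    """
    groups = [[i] for i in range(len(work))]
    loads = list(work)
    while len(groups) > tiles:
        # Pick the adjacent pair whose fusion creates the lightest node.
        k = min(range(len(loads) - 1),
                key=lambda i: loads[i] + loads[i + 1])
        groups[k:k + 2] = [groups[k] + groups[k + 1]]
        loads[k:k + 2] = [loads[k] + loads[k + 1]]
    return groups, loads


print(partition_pipeline([5, 1, 1, 8, 2], tiles=3))
# ([[0, 1, 2], [3], [4]], [7, 8, 2])
```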
Scalability Results
[Figure: six panels (FIR, Filterbank, Radar, FMRadio, Bitonic Sort, FFT), each plotting Throughput (axis 0.0 to 32.0) against Tiles (axis 0 to 16)]
Raw vs. Pentium III
[Figure: 16-Tile Raw vs. Pentium III. Bar chart of Speedup (cycles), axis 0 to 18, for FIR, Filterbank, Radar, FMRadio, Bitonic Sort, FFT]
StreamIt Compiler Infrastructure
• Built on Kopi Java compiler (GNU license)
– StreamIt frontend is under the MIT license
• High-level hierarchical IR for streams
– Host of graph transformations
• Filter fusion, filter fission
• Synchronization removal
• Splitjoin refactoring
• Graph canonicalization
• Low-level “flat graph” for backends
– Eliminates structure; point-to-point connections
• Streaming benchmark suite
Compiler Flow
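[Diagram, reconstructed from its labels]
• StreamIt code → StreamIt Front-End → Legal Java file
• Library path: Legal Java file → Any Java Compiler → Class file (with StreamIt Java Library)
• Compiler path: Legal Java file → Kopi Front-End → Parse Tree → SIR Conversion → SIR (unexpanded) → Graph Expansion → SIR (expanded)
• Passes on the SIR: Linear Optimizations, Scheduler
• Raw Backend → C code for tiles, Assembly code for switch
• Uniprocessor Backend → ANSI C code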
Building on StreamIt
• StreamIt to VIRAM [Yelick et al.]
– Automatically generate permutation instructions
• StreamBit: bit-level optimization [Bodik et al.]
• Integration with IBM Eclipse Platform
StreamIt Graphical Editor
StreamIt Component-Shortcuts
• Create Filters, Pipelines, SplitJoins, Feedback Loops, FIFOs
Juan C. Reyes
M.Eng. Thesis
StreamIt Debugging Environment
[Screenshot callouts]
• General Debugging Information
• StreamIt Graph Zoom Panel
• StreamIt Text Editor
• StreamIt Graph Components
– expanded and collapsed views of basic programmable unit
– communication buffer with live data
• Compiler and Output Consoles
• not shown: the StreamIt On-Line Help Manual
Kimberly Kuo
M.Eng. Thesis
Related Work
• Stream languages
– KernelC/StreamC, Brook: augment C with data-parallel kernels
– Cg: allows low-level programming of graphics processors
– SISAL, functional languages: expose temporal parallelism
– StreamIt exposes more task parallelism, easier to analyze
• Control languages for embedded systems
– LUSTRE, Esterel, etc.: can verify safety properties
– Do not expose high-bandwidth data flow for optimization
• Prototyping environments
– Ptolemy, Simulink, etc.: provide graphical abstractions
– StreamIt has more of a compiler focus
Future Work
• Backend optimizations for linear filters
– Template assembly code: asymptotically optimal
• Fault-tolerance on a cluster of workstations
– Automatically recover if a machine fails
• Supporting dynamic events
– Point-to-point control messages
– Re-initialization for parts of the stream
Conclusions
• StreamIt: compiler infrastructure for streams
– Raising the level of abstraction in stream programming
– Language design for both programmer and compiler
Compiler Analysis: Language features exploited
Linear Optimizations: peek primitive (data reuse); atomic work function; structured streams
Backend for Raw Architecture: exposed communication patterns; exposed parallelism; static work estimate
• Public release: many opportunities for collaboration
http://cag.csail.mit.edu/streamit
Extra Slides
StreamIt Language Summary
Feature: Autonomous Filters
– Programmer Benefit: Encapsulation; Automatic scheduling
– Compiler Benefit: Local memories; Exposes parallelism; Exposes communication
Feature: Peek primitive
– Programmer Benefit: Automatic buffer management
– Compiler Benefit: Exposes data reuse
Feature: Structured Streams
– Programmer Benefit: Natural syntax
– Compiler Benefit: Exposes symmetry; Eliminate corner cases
Feature: Scripts for Graph Construction
– Programmer Benefit: Reusable components; Easy to maintain
– Compiler Benefit: Two-stage compilation
AB for any A and B?
• Linear Expansion
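[Figure: Original vs. Expanded. The original matrix [A] has dimensions E by U; the expanded matrix contains four copies of [A]. The pop-rate annotation is not recoverable from the extraction.]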
Backend Support for Linear Filters
• Can generate custom code for linear filters
– Many architectures have special support for matrix mult.
– On Raw: assembly code templates for tiles and switch
• Substitute coefficients, peek/pop/push rates
• Preliminary result: FIR on Raw
– StreamIt code: 15 lines
– Manually-tuned C code: 352 lines
– Both achieve 99% utilization of Raw's FPUs
• Asymptotically optimal
• Current focus: integrating with general backend