Maryland DSPCAD Research Group
(http://www.ece.umd.edu/DSPCAD/home/dspcad.htm)
Department of Electrical and Computer Engineering, and
Institute for Advanced Computer Studies
University of Maryland, College Park
Model Based Design for DSP:
Presentation to Stevens
Will Plishker, Chung-Ching Shen, Nimish Sane,
George Zaki, Soujanya Kedilaya, Shuvra S. Bhattacharyya
Outline
Model Based Design
Dataflow Interchange Format
Multiprocessor Scheduling
Preliminary Setup and Results with GPUs
Future Directions
Introduction
In modern, complex systems we would like to:
Create an application description independent of the target
Interface with a diverse set of tools and teams
Achieve high performance
Arrive at an initial prototype quickly
But algorithms are far removed from their final implementation:
Low level programming environments
Diverse and changing platforms
Non-uniform functional verification
Entrenched design processes
Tool selection
Figure: an abstract representation of an algorithm (a trigger block diagram with a pattern comparator, threshold module, decision check, E and H adders, channel Et adders, a fine grain OR, and E/Gamma and fine grain decision outputs) is separated by the implementation gap from its low level, high performance implementation.
Model-Based Design for Embedded
Systems
High level application subsystems are specified in terms of
components that interact through formal models of
computation
C or other “platform-oriented” languages can be used to specify
intra-component behavior
Model-specific language can be used to specify inter-component
behavior
Object-oriented techniques can be used to maintain libraries of
components
Popular models for embedded systems
Dataflow and KPNs (Kahn process networks)
Continuous time, discrete event
FSM and related control formalisms
Dataflow-based Design: Related Trends
Dataflow-based design (in our context) is a specific form
of model-based design
Dataflow-based design is complementary to
Object-oriented design
DSP C compiler technology
Synthesis tools for hardware description languages (e.g.,
Verilog and VHDL)
Example: Dataflow-based design for
DSP
Example from Agilent ADS tool
Example: QAM Transmitter in
National Instruments LabVIEW
Figure: LabVIEW block diagram with Rate Control, QAM Encoder, Transmit Filters, and Passband Signal blocks.
Source: [Evans 2005]
Crossing the Implementation Gap:
Design Flow Using DIF
Figure: DSP designs (signal processing, image/video, and communication system algorithms) are captured in The DIF Language (TDL) as DIF specifications, covering dataflow models that range from static (SDF, CSDF, MDSDF, HSDF) and dynamic (CFDF, BDF) to meta-modeling (PDF, BLDF). The DIF Package (TDP) front-end builds a DIF representation, which connects through exporters/importers (Ptolemy Ex/Im, DIF-AT Ex/Im, and other Ex/Im modules) to dataflow-based DSP design tools such as Ptolemy II, the Autocoding Toolset, and other tools, and through DIF-to-C and AIF/porting to embedded processing platforms (Java VM, VDM, Ada, TI C DSP, DSP libraries such as VSIPL, and other embedded platforms).
Dataflow with Software Defined Radio:
DIF + GNU Radio
Figure: design flow coupling DIF with GNU Radio.
1) Convert or generate a .dif file from the GRC Python flowgraph (.py) or XML flowgraph (.grc) (complete)
2) Execute static schedules from DIF in the GNU Radio engine (Python/C++) through DIF Lite and a platform retargetable library (complete)
3a) Perform online scheduling
3b) Add an architecture specification (.arch?)
4) Architecture aware multiprocessor scheduling (assignment, ordering, invocation)
The DIF Package (TDP) performs uniprocessor scheduling on the DIF specification (.dif) and emits a schedule (.dif, .sched). Target platforms include multiprocessors, GPUs, Cell, and FPGAs, characterized by their processors, memories, and interconnect. Legend: steps 1 and 2 are existing or completed; the remaining steps are proposed.
Background: Dataflow Graphs
Vertices (actors) represent computation
Edges represent FIFO buffers
Edges may have delays, implemented as initial tokens
Tokens are produced and consumed on edges
Different models have different rules for production
(SDF=fixed, CSDF=periodic, BDF=dynamic)
Figure: example dataflow graph with actors X, Y, and Z; edge e1 from X to Y has production rate p1 and consumption rate c1, and edge e2 from Y to Z has production rate p2 and consumption rate c2.
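As a minimal illustration of these rules (a sketch in Python, not part of DIF or TDP; the actor names and rates below are hypothetical), an edge can be modeled as a token counter with delays as initial tokens, and an SDF actor fires only when every input edge holds at least its consumption rate:

class Edge:
    def __init__(self, src, snk, prd, cns, delay=0):
        self.src, self.snk = src, snk      # producing / consuming actors
        self.prd, self.cns = prd, cns      # tokens produced / consumed per firing
        self.tokens = delay                # delays appear as initial tokens

def can_fire(actor, edges):
    """An actor is fireable when every input edge holds enough tokens."""
    return all(e.tokens >= e.cns for e in edges if e.snk == actor)

def fire(actor, edges):
    """Consume from input edges, produce onto output edges (SDF: fixed rates)."""
    assert can_fire(actor, edges)
    for e in edges:
        if e.snk == actor:
            e.tokens -= e.cns
        if e.src == actor:
            e.tokens += e.prd

# Hypothetical rates: X --e1 (prd=2, cns=3)--> Y --e2 (prd=1, cns=1)--> Z
edges = [Edge("X", "Y", 2, 3), Edge("Y", "Z", 1, 1)]
fire("X", edges); fire("X", edges)   # after two firings of X, e1 holds 4 tokens
print(can_fire("Y", edges))          # True: 4 >= 3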
Evolution of Dataflow Models of
Computation for DSP: Examples
Computation graphs and marked graphs [Karp 1966, Reiter 1968]
Synchronous dataflow [Lee 1987]
Static multirate behavior
SPW (Cadence), National Instruments LabVIEW, and others
Multidimensional synchronous dataflow [Lee 1992]
Image and video processing
Well behaved stream flow graphs [1992]
Schemas for bounded dynamics
Scalable synchronous dataflow [Ritz 1993]
Block processing
COSSAP (Synopsys)
Boolean/integer dataflow [Buck 1994]
Turing complete models
Bounded dynamic data transfer [Pankert 1994]
Bounded dynamic dataflow
Cyclo-static dataflow [Bilsen 1996]
Phased behavior
Eonic Virtuoso Synchro, Synopsys El Greco and Cocentric, Angeles System Canvas
The processing graph method [Stevens 1997]
Reconfigurable dynamic dataflow
U.S. Naval Research Lab, MCCI Autocoding Toolset
Stream-based functions [Kienhuis 2001]
Parameterized dataflow [Bhattacharya 2001]
Reconfigurable static dataflow
Meta-modeling for more general dataflow graph reconfiguration
CAL [Eker 2003]
Actor-based dataflow language
Reactive process networks [Geilen 2004]
Blocked dataflow [Ko 2005]
Image and video through parameterized processing
Windowed synchronous dataflow [Keinert 2006]
Parameterized stream-based functions [Nikolov 2008]
Enable-invoke dataflow [Plishker 2008]
Variable rate dataflow [Wiggers 2008]
Modeling Design Space
Figure: models plotted by expressive power (vertical axis) versus verification/synthesis power (horizontal axis): C, BDF, and DDF; PCSDF; PSDF; MDSDF and WBDF; CSDF and SSDF; and SDF.
Dataflow Interchange Format
Describe DF graphs in text
Simple DIF file:
dif graph1_1 {
    topology {
        nodes = n1, n2, n3, n4;
        edges = e1 (n1, n2),
                e2 (n2, n1),
                e3 (n1, n3),
                e4 (n1, n3),
                e5 (n4, n3),
                e6 (n4, n4);
    }
}
More features of DIF
Ports
interface {
inputs = p1, p2:n2;
outputs = p3:n3, p4:n4;
}
Hierarchy
refinement {
graph2 = n3;
p1 : e3;
p2 : e4;
p3 : e5;
p4 : p3;
}
More features of DIF
Production and consumption
production {
e1 = 4096;
e10 = 1024;
...
}
consumption {
e1 = 4096;
e10 = 64;
...
}
Computation keyword
User defined attributes
The DIF Language Syntax
dataflowModel graphID {
    basedon { graphID; }
    topology {
        nodes = nodeID, ...;
        edges = edgeID (srcNodeID, snkNodeID), ...;
    }
    interface {
        inputs = portID [:nodeID], ...;
        outputs = portID [:nodeID], ...;
    }
    parameter {
        paramID [:dataType];
        paramID [:dataType] = value;
        paramID [:dataType] : range;
    }
    refinement {
        subgraphID = supernodeID;
        subPortID : edgeID;
        subParamID = paramID;
    }
    builtInAttr {
        [elementID] = value;
        [elementID] = id;
        [elementID] = id1, id2, ...;
    }
    attribute usrDefAttr {
        [elementID] = value;
        [elementID] = id;
        [elementID] = id1, id2, ...;
    }
    actor nodeID {
        computation = stringValue;
        attrID [:attrType] [:dataType] = value;
        attrID [:attrType] [:dataType] = id;
        attrID [:attrType] [:dataType] = id1, ...;
    }
}
Uniprocessor Scheduling for
Synchronous Dataflow
An SDF graph G = (V,E) has a valid schedule if it is
deadlock-free and is sample rate consistent (i.e., it has a
periodic schedule that fires each actor at least once and
produces no net change in the number of tokens on each
edge).
Balance equations: ∀ e ∈ E, prd(e) × q[src(e)] = cns(e) × q[snk(e)].
Repetition vector q is the minimum solution of balance eqs.
A valid schedule is then a sequence of actor firings where
each actor v is fired q[v] (repetition count) times and the
firing sequence obeys the precedence constraints imposed
by the SDF graph.
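As an illustrative sketch (in Python, not the TDP implementation), the repetitions vector can be computed by propagating rational firing rates from an arbitrary actor over the balance equations and scaling by the least common multiple of the denominators; the three-actor graph at the end is a made-up example:

from fractions import Fraction
from functools import reduce
from math import lcm

def repetitions(edges):
    """Minimum integer solution q of prd(e)*q[src(e)] = cns(e)*q[snk(e)] for all e.
    edges: list of (src, snk, prd, cns). Assumes a connected, consistent graph."""
    adj = {}
    for src, snk, prd, cns in edges:
        adj.setdefault(src, []).append((snk, Fraction(prd, cns)))
        adj.setdefault(snk, []).append((src, Fraction(cns, prd)))
    seed = edges[0][0]
    q = {seed: Fraction(1)}
    stack = [seed]
    while stack:                       # propagate rational rates over the graph
        v = stack.pop()
        for w, ratio in adj[v]:
            if w not in q:
                q[w] = q[v] * ratio    # q[snk]/q[src] = prd/cns along each edge
                stack.append(w)
    scale = reduce(lcm, (f.denominator for f in q.values()))
    return {v: int(f * scale) for v, f in q.items()}

# Hypothetical example: A -(prd 2, cns 3)-> B -(prd 1, cns 2)-> C
print(repetitions([("A", "B", 2, 3), ("B", "C", 1, 2)]))
# {'A': 3, 'B': 2, 'C': 1}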
Example: Sample Rate Conversion
CD to DAT: 44.1 kHz to 48 kHz sampling rate conversion.
Figure: CD (A) -> e1 (1, 1) -> FIR1 (B) -> e2 (2, 3) -> FIR2 (C) -> e3 (4, 7) -> FIR3 (D) -> e4 (5, 7) -> FIR4 (E) -> e5 (4, 1) -> DAT (F), where each edge is annotated with its (production, consumption) rates.
Flat strategy
Topologically sort the graph and iterate each actor v q[v] times.
Low context switching, but large buffer requirements and latency.
CD to DAT flat schedule: (147A)(147B)(98C)(56D)(40E)(160F)
Scheduling Algorithms
Acyclic pairwise grouping of adjacent nodes (APGAN): an adaptable (to different cost functions), low-complexity heuristic that computes a nested looped schedule of an acyclic graph while preserving the precedence constraints (a topological sort) throughout the scheduling process.
Dynamic programming post optimization (DPPO): dynamic programming over a given actor ordering (any topological sort); variants include GDPPO, CDPPO, and SDPPO.
Recursive procedure call (RPC) based MAS: generates MASs for a given R-schedule through recursive graph decomposition; the resulting schedule is bounded polynomially in the graph size.
Algorithm       Looped Schedule                              Buffer Size
Flat            (147A)(147B)(98C)(56D)(40E)(160F)            1273
APGAN           (49(3AB)(2C))(8(7D)(5E(4F)))                 438
DPPO            (7(7(3AB)(2C))(8D))(40E(4F))                 347
RPC-based MAS   ((2(((7((AB)(2(AB)C))D)D)(5E(4F)))(2(((7((AB)(2(AB)C))D)D)(5E(4F)))(E(4F))))((((7((AB)(2(AB)C))D)D)(5E(4F)))(E(4F))))   69
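The buffer sizes above can be reproduced by executing each looped schedule symbolically and summing, over all edges, the maximum number of tokens that accumulates. The following Python sketch is illustrative (it is not the tool that generated the table); nested loops are written as (count, body...) tuples, and the three schedules shown correspond to the Flat, APGAN, and DPPO rows:

# CD-to-DAT graph: edge -> (source actor, sink actor, production, consumption)
EDGES = {
    "e1": ("A", "B", 1, 1), "e2": ("B", "C", 2, 3), "e3": ("C", "D", 4, 7),
    "e4": ("D", "E", 5, 7), "e5": ("E", "F", 4, 1),
}

def run(schedule, tokens, peak):
    """Execute a looped schedule: an actor name fires once; a tuple
    (count, item, item, ...) repeats its body 'count' times."""
    if isinstance(schedule, str):                      # a single actor firing
        for e, (src, snk, prd, cns) in EDGES.items():
            if snk == schedule:
                tokens[e] -= cns
            if src == schedule:
                tokens[e] += prd
                peak[e] = max(peak[e], tokens[e])
    else:                                              # (count, body...) loop
        count, *body = schedule
        for _ in range(count):
            for item in body:
                run(item, tokens, peak)

def buffer_size(schedule):
    tokens = {e: 0 for e in EDGES}
    peak = {e: 0 for e in EDGES}
    run(schedule, tokens, peak)
    return sum(peak.values())

flat  = (1, (147, "A"), (147, "B"), (98, "C"), (56, "D"), (40, "E"), (160, "F"))
apgan = (1, (49, (3, "A", "B"), (2, "C")), (8, (7, "D"), (5, "E", (4, "F"))))
dppo  = (1, (7, (7, (3, "A", "B"), (2, "C")), (8, "D")), (40, "E", (4, "F")))

print(buffer_size(flat), buffer_size(apgan), buffer_size(dppo))  # 1273 438 347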
Representative Dataflow Analyses and
Optimizations
Bounded memory and deadlock detection: consistency
Buffer minimization: minimize communication cost
Multirate loop scheduling: optimize code/data trade-off
Parallel scheduling and pipeline configuration
Heterogeneous task mapping and co-synthesis
Quasi-static scheduling: minimize run-time overhead
Probabilistic design: adapt system resources and exploit slack
Data partitioning: exploit parallel data memories
Vectorization: improve context switching, pipelining
Synchronization optimization: self-timed implementation
Clustering of actors into atomic scheduling units
Multiprocessor Scheduling
Multiprocessor scheduling problem:
Actor assignment (mapping)
Actor ordering
Actor invocation
Approaches to each of these tend to be platform specific
Tools can be brought under a common formal umbrella
Multiprocessor Scheduling
Figure: an application model G(V, E, t(v), C(e)), with actor execution times t(v) and edge communication costs C(e), is mapped and scheduled onto processors P1 through P4.
Invocation Example: Self-Timed (ST)
scheduling
Assignment and ordering performed at compile-time.
Invocation performed at run-time (via synchronization)
Figure: Gantt chart for the self-timed schedule and the corresponding application graph, with actors A through H mapped onto five processors. Actor execution times: A, B, F: 3; C, H: 5; D: 6; E: 4.
Multicore Schedules
Traditional multicore scheduling:
Convert the application DAG to homogeneous synchronous dataflow (HSDF)
Perform HSDF mapping
Problem: exponential graph explosion
Our solution:
Represent the single processor schedule (SPS) as a generalized schedule tree (GST)
Generate an equivalent multiprocessor schedule (MPS) represented as a forest of GSTs
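As an illustrative Python sketch of the GST idea (not the TDP data structure; the schedules below are made up), internal nodes carry loop counts, leaves name actor firings, and a multiprocessor schedule is a forest of such trees, one per processor:

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class GSTNode:
    """Generalized schedule tree node: 'count' repetitions of the children.
    A leaf (no children) names an actor and fires it 'count' times."""
    count: int = 1
    actor: Union[str, None] = None
    children: List["GSTNode"] = field(default_factory=list)

    def firings(self):
        """Expand the (sub)tree into a flat firing sequence."""
        seq = []
        for _ in range(self.count):
            if self.actor is not None:
                seq.append(self.actor)
            for child in self.children:
                seq.extend(child.firings())
        return seq

# Single processor schedule (2(3A)B) as a GST ...
sps = GSTNode(count=2, children=[GSTNode(count=3, actor="A"), GSTNode(actor="B")])
print(sps.firings())   # ['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B']

# ... and a multiprocessor schedule as a forest of GSTs, one tree per processor.
mps = {"P1": GSTNode(count=2, children=[GSTNode(count=3, actor="A")]),
       "P2": GSTNode(count=2, actor="B")}
print({p: t.firings() for p, t in mps.items()})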
Traditional Dataflow Multiprocessor
Scheduling (MPS)
Synchronous Dataflow (SDF) representation of application
Homogeneous SDF representation of application
Figure: an SDF graph with multirate actors A, B, and C, and its HSDF expansion, in which each actor is replicated once per firing in the periodic schedule; the expansion illustrates the growth in graph size.
GST Representation for MPS - Simple Example
Figure: (a) an SDF graph, (b) its single processor schedule (SPS) represented as a GST, and (c) the corresponding MPS represented as a forest of GSTs, one per processor (P1, P2, P3).
P3
Demonstration on GPUs:
Start with parallel actors
Within an actor (FIR Filter):
y[n] = \sum_{i=0}^{N} b_i x[n-i]
Limitation (IIR Filter):
y[n] = \sum_{i=0}^{P} b_i x[n-i] + \sum_{j=1}^{Q} a_j y[n-j]
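The distinction can be seen directly in code: every FIR output sample depends only on the input and can be computed independently (for example, one output per GPU thread), while each IIR output depends on previously computed outputs, forcing a sequential recursion. The NumPy sketch below is purely illustrative and is unrelated to the CUDA kernels benchmarked in the results that follow:

import numpy as np

def fir(x, b):
    """FIR: y[n] = sum_i b[i]*x[n-i]. No feedback, so every output sample is an
    independent dot product and can be computed in parallel (e.g., per GPU thread)."""
    return np.convolve(x, b)[:len(x)]

def iir(x, b, a):
    """IIR: y[n] = sum_i b[i]*x[n-i] + sum_j a[j]*y[n-j] (a[0] unused here).
    The recursion on y forces a sequential loop, which limits data parallelism."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        y[n] += sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
    return y

x = np.random.randn(16)
print(fir(x, np.array([0.5, 0.25, 0.25])))
print(iir(x, np.array([0.5, 0.25]), np.array([0.0, 0.3])))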
Individual actor results:
CUDA FIR vs. Stock GR FIR
Individual Actor Results:
Turbo Block Decode
Future Direction: Tackling the general MP
scheduling problem with dataflow analysis
Many dataflow analysis techniques are available once the
problem is well defined in dataflow terms
Maximize multicore utilization by replicating and fusing
actors/blocks
Stateless vs. stateful
Computation to communication ratios
Firing rates/execution times to number of blocks
Once application is mapped to blocks/processors
Single processor scheduling to minimize buffering
Focus first on MP Scheduling for GPUs
Blocks
Threads
Memory
Refine to a simpler question:
When to off-load onto a GPU?
Given:
An application graph
Actor timing characteristics for communication and computation
A target architecture with heterogeneous multiprocessing
Find an optimal implementation with respect to:
Latency
Throughput
Figure: an example application graph (actors 1 through 9) in which each actor may be mapped to either the CPU or the GPU.
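One way to make this concrete, purely as an illustration and not the group's proposed algorithm, is a per-actor comparison: off-load an actor when its estimated GPU execution time plus host-device transfer time is less than its CPU execution time. All actor names and timings below are hypothetical:

# Hypothetical per-actor timing estimates (milliseconds per firing).
# transfer = cost of moving the actor's input/output tokens between CPU and GPU.
ACTORS = {
    # name: (cpu_time, gpu_time, transfer_time)
    "fir":    (4.0, 0.5, 1.0),
    "decim":  (0.3, 0.2, 1.0),
    "decode": (9.0, 1.5, 2.0),
}

def offload_plan(actors):
    """Greedy sketch: map an actor to the GPU only when the compute savings
    outweigh the extra communication. Ignores pipelining and contention."""
    plan = {}
    for name, (cpu, gpu, xfer) in actors.items():
        plan[name] = "GPU" if gpu + xfer < cpu else "CPU"
    return plan

print(offload_plan(ACTORS))   # {'fir': 'GPU', 'decim': 'CPU', 'decode': 'GPU'}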
Summary
Model Based Design
Dataflow Interchange Format
Multiprocessor Scheduling
Preliminary Setup and Results with GPUs
Future Directions