From dataflow basics to programming of ASICs
Marco Bekooij
[email protected]/[email protected]
May 30, 2016
Outline
- Challenges of multiprocessor programming
  - industrial practice of safety-critical modem/radar systems
- Message of the talk
- Robust system design
- Implications of timed-dataflow analysis
- Use of timed dataflow in compilers
- Future research directions
Why embedded multiprocessor systems
- CMOS scaling:
  - the number of transistors continues to increase due to CMOS scaling
  - the clock frequency of processors does not increase
- Embedded multiprocessor systems make it possible to exploit the potential of CMOS technology
Multiprocessor programming is challenging
- Challenges as a result of concurrency:
  - application partitioning: load balancing; dependencies limit parallelism
  - functional determinism: task-schedule-dependent behavior
  - deadlock/starvation: no progress
  - shared resources: e.g. processor sharing, bus sharing, memory-space sharing
  - worst-case performance: WCET analysis, throughput analysis
Towards autonomous driving
- The focus of NXP is on CAR2X modems/radar components
  - part of safety-critical systems
Industrial design practice
- Industrial practice: design, test, and debug until the behavior is deemed satisfactory
  - increasing system complexity results in relatively longer test and debug times
Problem
- Programming multicores is very challenging as a result of real-time requirements and the need to exploit concurrency
- Abstractions that steer design decisions are desirable
Message of the talk
- Data-driven task execution is preferred over time-triggered/periodic task execution
- Timed-dataflow models are useful abstractions for the design and programming of multicore systems
  - enforcing structure in the system can simplify programming and analysis
- Compilation tools can hide the modeling effort
Classical real-time task model
- Classical periodic real-time model: (P, C, D)
  - the period P
  - the release time r_i of the i-th execution of a task
  - the worst-case execution time C
  - the relative deadline D

[Figure: timeline with releases r_i, r_{i+1}, r_{i+2} spaced by the period P, an execution of length C that may be preempted, and the relative deadline D after each release.]
Periodic scheduling

[Figure: tasks τ1 and τ2 on processors 1 and 2, each released every period P; an execution that runs past the next release causes an error.]

- The execution time of a task must be smaller than its period
Alternative workload model
- Assume C_i + C_{i+1} ≤ 2 · P
- One execution every P on average is possible
- There is no feasible periodic schedule with period P

[Figure: timelines of τ1 and τ2 in which a long execution followed by a short one averages out to one execution per period P.]

- This requires a data-driven schedule
- Variation in the arrival times of data is absorbed by FIFO buffers
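The workload above can be sketched in a few lines of Python (my own illustration, not part of the slides): a self-timed task with alternating execution times 1.5·P and 0.5·P fits no periodic schedule with period P, yet sustains one token per P on average, with the 0.5·P of jitter absorbed by a FIFO.

```python
# Hypothetical sketch: alternating execution times with C_i + C_{i+1} = 2*P.
# The long executions (1.5*P) rule out a periodic schedule with period P,
# but self-timed (data-driven) execution still averages one token per P.
P = 1.0
exec_times = [1.5 * P, 0.5 * P] * 10      # C_i + C_{i+1} = 2*P

finish = 0.0
production_times = []                      # token arrival times at the FIFO
for c in exec_times:
    finish += c                            # next firing starts when the previous finishes
    production_times.append(finish)

avg_period = production_times[-1] / len(production_times)
max_lateness = max(t - (i + 1) * P for i, t in enumerate(production_times))
print(f"average production period: {avg_period:.2f}")   # 1.00, the target P
print(f"max lateness vs. strictly periodic: {max_lateness:.2f}")  # 0.50
```

The maximum lateness of 0.5·P is exactly the variation a downstream FIFO must absorb so that the consumer still sees one token per period on average.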
Average execution time

[Figure: execution-time estimates showing the average, the running average, the estimation inaccuracy/uncertainty, and the overestimation of the worst case.]

- The running average is closer to the average execution time
  - improved throughput guarantee
Time-triggered versus data-driven

[Figure: a pipeline of tasks τ1 and τ2 connected by FIFOs, executed time-triggered (periodic, fixed release times s_o) and data-driven (firings triggered by data arriving at s_i).]

- Data-driven task execution results in more scheduling freedom
  - this results in a higher guaranteed throughput and a lower latency
Guarantees
- Kopetz's principle: arguably, any definitive statement is in fact a statement about a model and not a statement about the thing being modeled
- Consequence: guarantees are given under a load hypothesis and a fault hypothesis
- The workload characterization, e.g. the WCET, is not guaranteed!
- How problematic is this?
Data-driven scheduling
- Assume a workload characterization of the tasks using their WCETs
- What happens if the WCET assumption is faulty?

[Figure: under time-triggered execution, an overrun of τ1 leads to undefined functional behavior; under data-driven execution, τ2 simply starts later and the data still arrives in time at s_i.]

- Data-driven schedules are more robust against faulty WCET assumptions
- Still, a buffer underrun/overflow can occur!
Robustness
- A system is robust if:
  - bounded disturbances have bounded effects
  - the effect of a sporadic disturbance disappears over time
- Disturbances:
  - noise
  - variation in parameters
  - faulty assumptions!
Deterministic embedded system HW/SW design

[Figure: layered view from signal-processing applications (non-deterministic due to noise and an uncertain environment) via periodic tasks, synchronous systems, synchronous CPUs, gates, and transistors down to CMOS technology (non-deterministic due to noise and parameter variations); the middle layers are 100% deterministic.]

- This puts difficult-to-meet requirements on hardware and software
  - the system design must guarantee that the WCETs hold
  - this is the classical real-time systems view
  - realization attempt: the PRET processor
Robust system design

[Figure: the same layered view, but with a faulty load characterization added at the top and with data-driven tasks on a GALS multiprocessor in the middle layers.]

- Relaxes the requirements on hardware and software
- Applied in practice: e.g. packet retransmission
- Complicates system evaluation
  - probabilistic characterization of the system
Classical task model
- Communication during a critical section
  - shared variables can be read and written

[Figure: two possible interleavings of τ1 (a = 3; wait(s0); x = b; signal(s0)) and τ2 (y = a; wait(s0); b = 2; signal(s0)); the values read depend on which task enters the critical section first.]

- Non-deterministic functional behavior because it is task-schedule dependent
Kahn Process Networks (KPN)

[Figure: processes P1–P4 connected by FIFO channels.]

- Functionally deterministic behavior
  - a producing process only writes and a consuming process only reads
- FIFOs enable pipeline parallelism
  - they also absorb variation
  - however, sufficient FIFO buffer capacities cannot be computed: KPNs are Turing complete
- Untimed
  - throughput, latency?
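A minimal KPN-style sketch (my own, using Python threads and bounded queues as FIFO channels — none of these names come from the talk): each process only reads from its input FIFO and only writes to its output FIFO, so the produced value stream is the same under any thread schedule.

```python
# KPN discipline: blocking reads on input FIFOs, blocking writes on
# output FIFOs, no shared variables. The result is schedule-independent.
import threading
import queue

def producer(out_fifo):
    for v in range(5):
        out_fifo.put(v)            # blocking write: waits if the FIFO is full
    out_fifo.put(None)             # end-of-stream marker

def square(in_fifo, out_fifo):
    while (v := in_fifo.get()) is not None:   # blocking read: waits for data
        out_fifo.put(v * v)
    out_fifo.put(None)

c1, c2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)  # small FIFO capacities
threads = [threading.Thread(target=producer, args=(c1,)),
           threading.Thread(target=square, args=(c1, c2))]
for t in threads:
    t.start()

result = []
while (v := c2.get()) is not None:
    result.append(v)
for t in threads:
    t.join()
print(result)                      # [0, 1, 4, 9, 16] under any schedule
```

The small `maxsize` values also illustrate the slide's caveat: the values are deterministic regardless of capacity, but whether the pipeline runs without blocking (and how much pipeline parallelism it gets) depends on the chosen FIFO capacities.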
Timed dataflow

[Figure: actors f and g with firing durations ρ0 and ρ1 process the input stream x = ⟨x0, x1, ...⟩ into the output stream y = ⟨y0, y1, ...⟩; an input token x0 = (t0, v0, i0) yields the output token y0 = (t0 + ρ0 + ρ1, (g ∘ f)(v0), i0).]

- Actors have a firing duration
- Actors consume tokens at the start and produce tokens at the finish of an actor firing
- Functionally deterministic behavior if the firing rules are sequential
  - the dataflow model is more expressive than KPN
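The token transformation on this slide can be spelled out in a few lines (my own rendering of the slide's notation; the functions and durations are illustrative): a timed token is a (timestamp, value, index) triple, and firing an actor adds its firing duration to the timestamp while applying its function to the value.

```python
# Timed-dataflow token sketch: tokens are (timestamp, value, index)
# triples; a pipeline of actors f and g with firing durations rho0 and
# rho1 maps x0 = (t0, v0, i0) to y0 = (t0 + rho0 + rho1, (g∘f)(v0), i0).
rho0, rho1 = 2.0, 3.0              # firing durations of f and g (example values)
f = lambda v: v + 1                # example token-value functions
g = lambda v: 2 * v

def fire(actor, rho, token):
    t, v, i = token
    return (t + rho, actor(v), i)  # token is produced at finish time t + rho

x0 = (0.0, 5, 0)                   # input token (t0, v0, i0)
y0 = fire(g, rho1, fire(f, rho0, x0))
print(y0)                          # (5.0, 12, 0): timestamps add, values compose
```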
Mismatch with reality

[Figure: an actor with a constant firing duration ρ̂, where enabling, start, and consumption coincide (ê(i) = ŝ(i) = ĉ(i)) and production coincides with the finish (f̂(i) = p̂(i)), versus a task whose enabling e(i), start s(i), consumption c(i), production p(i), and finish f(i) are all distinct and whose response time ρ(i) varies.]

- Actors have atomic consumption and production at the start and finish
- For analysis, the actors must have constant firing durations
The-earlier-the-better refinement

[Figure: abstraction/refinement of components: component A with events a(i) and b(j), and its refinement A' with events a'(i) and b'(j), where A' ⊑ A.]

- Component A' is better than A, and A must be temporally monotone
- Dataflow actors can be deterministic abstractions of tasks
Graph refinement

[Figure: graph G with components A, B, C and edges labeled a, b, d, e, and the refined graph G' with C' and edges a', b', d', e'.]

- Refined components imply refined graphs, i.e., G' is better than G
- Can be used to create a temporally deterministic dataflow-graph abstraction
  - one behavior simplifies analysis
Monotonicity

[Figure: a cyclic dataflow graph of actors v0, v1, v2, each with firing duration ρ = T, connected by single-token edges.]

- Lower firing durations result in earlier production of tokens
  - given a functionally deterministic dataflow graph
- It is sufficient to show that a schedule exists that meets the temporal requirements
  - impossible for Turing-complete dataflow graphs
  - also requires independent analysability of the execution times of tasks
Timed dataflow analysis
- Throughput, latency, and buffer capacity must be decidable
  - static applications: HSDF, SDF, CSDF, CSDFa
  - dynamic applications: VRDF, VPDF, SADF
- VRDF example:

[Figure: actors v0 and v1 connected by an edge with production rate 1 and consumption rate n, where n ∈ {1, 2}.]
Denotational semantics of timed dataflow

[Figure: an event-triggered system is abstracted by a timed-dataflow model, whose semantics can be given as a labeled transition system, max-plus algebra, a set of (convex) constraints, or trace algebra.]

- Labeled transition system (find the worst-case behavior by means of execution or model checking)
- Max-plus algebra (symbolic evaluation, linear system theory)
- System of inequalities (reason about precedence constraints and periodic schedules; convex optimization in polynomial time; closed-form expressions)
- Trace algebra (analysis of non-deterministic behavior)
Discrete event model
- The dataflow model is a discrete event model
  - closely related to timed Petri nets, in particular marked graphs
- Abstraction/refinement theory is used to:
  - create a deterministic queueing-system abstraction
  - which can be analysed using max-plus linear system theory:

        x(k + 1) = A ⊗ x(k) ⊕ B ⊗ u(k)
        y(k) = C ⊗ x(k) ⊕ D ⊗ u(k)

  - with A, B, C, and D respectively the dynamics, input, output, and feed-through matrices
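A small sketch of the max-plus recurrence (the recurrence structure follows the slide; the matrix values are illustrative, not from the talk): in (max, +) algebra, ⊗ is ordinary addition and ⊕ is the maximum, with −∞ as the additive identity, so the state update propagates worst-case token timestamps through the graph.

```python
# Max-plus linear system sketch: x(k+1) = A ⊗ x(k) ⊕ B ⊗ u(k).
# In (max,+) algebra, "multiply" is +, "add" is max, and -inf is the
# additive identity (an absent edge).
NEG_INF = float("-inf")

def mp_matvec(M, x):
    """(max,+) matrix-vector product: (M ⊗ x)_i = max_j (M[i][j] + x[j])."""
    return [max(m + xj for m, xj in zip(row, x)) for row in M]

def mp_add(x, y):
    """(max,+) vector sum ⊕: element-wise maximum."""
    return [max(a, b) for a, b in zip(x, y)]

# Illustrative example: two actors with firing durations 3 and 2 in a cycle,
# plus an external input feeding the first actor.
A = [[NEG_INF, 3],
     [2, NEG_INF]]
B = [[0], [NEG_INF]]

x = [0, 0]                          # initial token timestamps
for k in range(3):
    u = [k * 5]                     # an input token arrives every 5 time units
    x = mp_add(mp_matvec(A, x), mp_matvec(B, u))
print(x)                            # [10, 7]: each entry is a worst-case start time
```

Because the update is linear in (max, +), the long-run growth rate of x(k) (the max-plus eigenvalue of A) directly gives the guaranteed throughput of the modeled graph.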
Independent analysability
- Components are first characterized as functions
- Then the overall behavior is determined, i.e., f(g(x)) = (f ∘ g)(x)
- The assumption is that the behavior of the functions is not affected by composition!
  - does this hold for the execution times of tasks?
Memory port sharing breaks independent analysability

[Figure: processors 1 and 2 share one memory through a static-priority (HP/LP) arbiter.]

- The use of static priorities breaks compositionality
Multiport memory

[Figure: processors 1 and 2 each have a dedicated port into a multiport memory.]

- Expensive and not scalable
Memory port sharing does not break independent analysability

[Figure: processors 1 and 2 share one memory through a round-robin arbiter.]

- Results in more variation in the execution times
Efficient memory port sharing

[Figure: producer (P) and consumer (C) processors communicating through FIFOs; latency-critical (LC) and latency-tolerant (LT) requests are handled by the memory arbiter.]

- The WCET of a task can be determined in isolation, i.e., without knowledge about the characteristics of other tasks
  - the arbiter guarantees a reserved budget ⇒ independently analysable
  - posting of writes minimizes the number of processor stall cycles
Run-time scheduling
- Starvation-free scheduling, e.g. time-division multiplexing, round-robin
  - the worst-case response time can be computed independently of the execution rates of other tasks:

        ρ̂ = C + (P − B)⌈C/B⌉

  - throughput analysis and buffer sizing using linear programs
  - convex search space
- Non-starvation-free scheduling, e.g. fixed-priority preemptive
  - response times depend on the execution rates of other tasks
  - throughput and buffer capacities can be computed using an iterative algorithm
    - relies on iterative fixed-point computation of monotonic functions
    - non-monotone behavior: e.g. smaller buffer capacities can improve the throughput
    - only applicable given static dataflow graphs, e.g. SDF
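The response-time bound ρ̂ = C + (P − B)⌈C/B⌉ on this slide is easy to evaluate directly; a small sketch with illustrative numbers (the example values are mine):

```python
# Worst-case response time under TDM arbitration, following the slide's
# formula rho_hat = C + (P - B) * ceil(C / B): a task with execution
# time C that receives a budget of B time units out of every wheel
# period P waits at most (P - B) between consecutive budget slices.
from math import ceil

def tdm_response_time(C, B, P):
    """Upper bound on the response time of a task under TDM."""
    assert 0 < B <= P, "budget must fit in the wheel period"
    return C + (P - B) * ceil(C / B)

# example: C = 5 needs ceil(5/2) = 3 slices of B = 2 out of every P = 10
print(tdm_response_time(5, 2, 10))   # 5 + (10 - 2) * 3 = 29
```

Note that the bound depends only on the task's own C and its reserved (B, P) budget, which is exactly what makes the analysis independent of the other tasks.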
Compilation
- Key obstacles for applying dataflow analysis:
  - modeling effort, model correctness, does the application fit the model?
- Potential of a compiler-based approach:
  - automatic optimization and mapping of the task graph
  - verify the tool instead of the generated parallel application
Multiprocessor compiler Omphale
- input: a sequential OIL program
- internal: a Structured Variable Phase Dataflow (SVPDF) model
- outputs: a dataflow model and an executable of the task graph
OIL program example

    state = ACQUISITION;
    while (1) {
      input(out in1);
      switch (state) {
        case ACQUISITION: {
          detect(in1, out state);
        }
        case RECEIVE: {
          decode1(in1, out state, out o1);
          decode2(o1, out o2);
          output(o2);
        }
      }
    }
Resulting task graph

[Figure: task graph with tasks in, det, dec1, dec2, and out communicating via the buffers in1, state, o1, and o2.]

- Every OIL function becomes a task
- Every variable becomes a so-called circular buffer (CB)
  - potentially with multiple readers and writers
  - not a Kahn process network!
  - buffer capacities determine the amount of pipeline parallelism
Simulink block diagram
- Best option: a parallel or a sequential specification?
Research directions
- Generalize/apply the dataflow analysis approach
  - computation of the most suitable task graph
  - used for optimization inside a compiler
- Allow more sources of non-determinism
  - e.g. not only FIFO-order communication
- Programming-language design for real-time parallel applications
  - an alternative to synchronous languages
- Design of robust cyber-physical systems
  - study trade-offs between the physical and cyber parts, especially concerning missed deadlines
- Improve fault tolerance
  - define mechanisms to recover from the loss of events in self-timed systems
Summary
- Data-driven task execution is preferred over time-triggered periodic task execution
  - tolerates more uncertainty
- Timed-dataflow analysis has become a useful abstraction for the design and programming of multiprocessor systems
  - based on the-earlier-the-better refinement theory
  - restrictions enable/simplify analysis
- Real-time multiprocessor compilation tools can hide complexity but are still in their infancy