From dataflow basics to programming of ASICs
Marco Bekooij
[email protected]/[email protected]
May 30, 2016

Outline
- Challenges of multiprocessor programming
  • industrial practice of safety-critical modem/radar systems
- Message of the talk
- Robust system design
- Implications of timed-dataflow analysis
- Use of timed dataflow in compilers
- Future research directions

Why embedded multiprocessor systems
- CMOS scaling:
  • the number of transistors continues to increase due to CMOS scaling
  • the clock frequency of processors does not increase
- Embedded multiprocessor systems make it possible to exploit the potential of CMOS technology

Multiprocessor programming is challenging
- Challenges as a result of concurrency:
  • application partitioning: load balancing; dependencies limit parallelism
  • functional determinism: task-schedule-dependent behavior
  • deadlock/starvation: no progress
  • shared resources: e.g. processor sharing, bus sharing, memory-space sharing
  • worst-case performance: WCET analysis, throughput analysis

Towards autonomous driving
- The focus of NXP is on CAR2X modems and radar components
  • part of safety-critical systems

Industrial design practice
- Industrial practice: design, test, and debug until the behavior is deemed satisfactory
  • increasing system complexity results in relatively longer test and debug times

Problem
- Programming multicores is very challenging as a result of real-time requirements and the need to exploit concurrency
- Use of abstractions to steer design decisions is desirable

Message of the talk
- Data-driven task execution is preferred over time-triggered/periodic task execution
- Timed-dataflow models are useful abstractions for the design and programming of multicore systems
  • enforcing structure in the system can simplify programming and analysis
- Compilation tools can hide the modeling effort

Classical real-time task model
- Classical periodic real-time model: (P, C, D)
  • period P
  • release r_i of the i-th execution of a task
  • worst-case execution time C
  • relative deadline D
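The (P, C, D) bookkeeping above can be sketched in a few lines. This is an illustrative Python sketch, not tooling from the talk; the class and function names are invented, and the EDF utilization bound used below is standard scheduling theory rather than a claim from the slides.

```python
# Illustrative sketch of the classical periodic task model (P, C, D).
# Names are hypothetical; only the parameters come from the slide.
from dataclasses import dataclass

@dataclass
class PeriodicTask:
    P: float  # period between releases r_i and r_{i+1}
    C: float  # worst-case execution time (WCET)
    D: float  # relative deadline: execution i must finish by r_i + D

    def release(self, i, r0=0.0):
        """Release time r_i of the i-th execution."""
        return r0 + i * self.P

    def deadline(self, i, r0=0.0):
        return self.release(i, r0) + self.D

def utilization(tasks):
    """sum(C/P); <= 1 is necessary on one processor, and sufficient
    for EDF when D = P (standard result, not from the slides)."""
    return sum(t.C / t.P for t in tasks)

tasks = [PeriodicTask(P=10, C=4, D=10), PeriodicTask(P=20, C=8, D=20)]
print(utilization(tasks))  # → 0.8, so this set fits on one processor
```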
[Figure: timeline with releases r_i, r_{i+1}, r_{i+2} spaced P apart; each execution takes at most C, may be preempted, and must finish within the relative deadline D]

Periodic scheduling
[Figure: tasks τ1 and τ2 released with period P on processors 1 and 2; an execution that runs longer than P causes an error]
- The execution times of tasks must be smaller than the period

Alternative workload model
- Assume C_i + C_{i+1} ≤ 2·P
- One execution every P on average is possible
- There is no feasible periodic schedule with period P
[Figure: schedules of τ1 and τ2 in which long and short executions alternate]
- Requires a data-driven schedule
- Variation in the arrival times of data is absorbed by FIFO buffers

Average execution-time inaccuracy
[Figure: execution-time characterization with its uncertainty and overestimation relative to the true average]
- The running average is closer to the average execution time
  • improved throughput guarantee

Time-triggered versus data-driven
[Figure: time-triggered (periodic) schedule versus data-driven schedule of tasks τ1 and τ2 between source s_o and sink s_i]
- Data-driven task execution results in more scheduling freedom
  • results in a higher guaranteed throughput and a lower latency

Guarantees
- Kopetz principle: arguably, any definitive statement is in fact a statement about a model and not a statement about the thing being modeled
- Consequence: guarantees are given under a load hypothesis and a fault hypothesis
- Workload characterization, e.g. WCET, is not guaranteed! How problematic is this?

Data-driven scheduling
- Assume workload characterization of tasks using their WCETs
- What happens if the WCET assumption is faulty?
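The alternative workload model a few slides up can be made concrete with a small self-timed simulation. The numbers are made up: inputs arrive strictly periodically with period P, execution times alternate between 1.5·P and 0.5·P (so C_i + C_{i+1} = 2·P), and the input FIFO lets the task start as soon as both its data and the processor are available.

```python
# Self-timed (data-driven) execution: finish(i) = max(arrival_i, finish(i-1)) + C_i.
# The FIFO in front of the task is what allows an input to wait for the task.
P = 10.0
arrivals = [i * P for i in range(8)]
C = [15.0, 5.0] * 4              # alternating; no single execution fits in P

finish = []
f_prev = 0.0
for a, c in zip(arrivals, C):
    f_prev = max(a, f_prev) + c  # start when data AND processor are ready
    finish.append(f_prev)

# A periodic schedule with period P is infeasible (15 > 10), yet the
# data-driven schedule sustains one execution per P on average:
print([f - a for f, a in zip(finish, arrivals)])  # lateness stays bounded
```

The lateness alternates between 15 and 10 time units and never grows, which is exactly the "variation is absorbed by FIFO buffers" point of the slide.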
[Figure: when a task overruns its WCET, the time-triggered schedule shows undefined functional behavior, while in the data-driven schedule the data still arrives in time]
- Data-driven schedules are more robust against faulty WCET assumptions
- Still, a buffer underrun/overflow can occur!

Robustness
- A system is robust if:
  • bounded disturbances have bounded effects
  • the effect of a sporadic disturbance disappears over time
- Disturbances:
  • noise
  • variation in parameters
  • faulty assumptions!

Deterministic embedded system HW/SW design
[Figure: layer stack from CMOS technology (non-deterministic due to noise and parameter variations) through transistors, gates, synchronous CPUs, synchronous systems, and periodic tasks (100% deterministic) up to signal-processing applications and the environment (non-deterministic due to noise and uncertainty)]
- Puts difficult-to-meet requirements on hardware and software
  • the system design must guarantee that the WCETs hold
  • the classical real-time systems view
  • realization attempt: the PRET processor

Robust system design
[Figure: the same stack, but with data-driven tasks on a GALS multiprocessor; only the synchronous CPUs, gates, and transistors are 100% deterministic, and faulty load characterizations are tolerated alongside the noise and uncertainty of the environment]
- Relaxes the requirements on hardware and software
- Applied in practice: e.g. packet retransmission
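The robustness definition above ("the effect of a sporadic disturbance disappears over time") can be observed directly in a self-timed schedule. Hypothetical numbers again: one execution overruns its assumed WCET, and the backlog drains because the average load leaves slack.

```python
# One sporadic WCET overrun in a self-timed schedule: the extra lateness
# it causes decays over the following executions (illustrative numbers).
P = 10.0
C = [8.0] * 10
C[3] = 20.0                   # a single overrun beyond the assumed WCET of 8

f = 0.0
lateness = []
for i, c in enumerate(C):
    f = max(i * P, f) + c     # data-driven: start on max(data, processor)
    lateness.append(f - i * P)

print(lateness)  # jumps to 20 at the overrun, then decays back to 8
```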
- Complicates system evaluation
  • probabilistic characterization of the system

Classical task model
- Communication during a critical section
  • shared variables can be read and written
[Figure: tasks τ1 (a = 3; wait(s0); x = b; signal(s0)) and τ2 (y = a; wait(s0); b = 2; signal(s0)); two interleavings yield different values for x and y]
- Non-deterministic functional behavior because it is task-schedule dependent

Kahn Process Networks (KPN)
[Figure: processes P1–P4 connected by FIFO channels]
- Functionally deterministic behavior
  • a producing process only writes and a consuming process only reads
- FIFOs enable pipeline parallelism
  • they also absorb variation
  • however, sufficient FIFO buffer capacities cannot be computed: KPNs are Turing complete
- Untimed
  • throughput, latency?

Timed dataflow
[Figure: actors f and g with firing durations ρ0 and ρ1; the input token x0 = (t0, v0, i0) of stream x = ⟨x0, x1, ...⟩ results in the output token y0 = (t0 + ρ0 + ρ1, (g ∘ f)(v0), i0) of stream y = ⟨y0, y1, ...⟩]
- Actors have a firing duration
- Actors consume tokens at the start and produce tokens at the finish of an actor firing
- Functionally deterministic behavior if the firing rules are sequential
  • the dataflow model is more expressive than KPN

Mismatch with reality
[Figure: a dataflow actor with constant firing duration ρ̂, for which enabling, start, and consumption coincide (ê(i) = ŝ(i) = ĉ(i)) as do finish and production (f̂(i) = p̂(i)), versus a task with distinct enabling e(i), start s(i), consumption c(i), production p(i), and finish f(i) times]
- Actors have atomic consumption and production at the start and finish
- For analysis, the actors must have constant firing durations

The-earlier-the-better refinement
[Figure: abstraction/refinement of components: refinement relates A with behaviors a(i), b(j) to A′ with behaviors a′(i), b′(j), with A′ ⊑ A]
- Component A′ is better than A, and A must be temporally monotone
- Dataflow actors can be deterministic abstractions of tasks

Graph refinement
[Figure: graph G with components A, B, C and edges a, b, d, e; graph G′ with components A, B, C′ and edges a′, b′, d′, e′]
- Refined components imply refined graphs, i.e. the refined graph G′ is better than G
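The functional determinism that KPN (and, under sequential firing rules, timed dataflow) guarantees can be sketched with threads and FIFOs. This is an illustrative Python sketch, not code from the talk; note that the queues here are unbounded, which sidesteps exactly the buffer-sizing question the slides raise.

```python
# Kahn-style pipeline: each process only reads its input FIFO and writes its
# output FIFO, so the output is the same for every thread interleaving.
import queue
import threading

def producer(out_q):
    for v in range(5):
        out_q.put(v)
    out_q.put(None)                    # end-of-stream marker (an assumption)

def stage(in_q, out_q, func):
    while (v := in_q.get()) is not None:
        out_q.put(func(v))
    out_q.put(None)

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=producer, args=(q1,)),
    threading.Thread(target=stage, args=(q1, q2, lambda v: v + 1)),  # like f
    threading.Thread(target=stage, args=(q2, q3, lambda v: v * 2)),  # like g
]
for t in threads:
    t.start()

result = []
while (v := q3.get()) is not None:
    result.append(v)
for t in threads:
    t.join()

print(result)  # [2, 4, 6, 8, 10] on every run, independent of the schedule
```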
- Can be used to create a temporally deterministic dataflow graph abstraction
  • one behavior simplifies analysis

Monotonicity
[Figure: cyclic dataflow graph with actors v0, v1, v2, all with firing duration T, and single initial tokens on the edges]
- Lower firing durations result in earlier production of tokens
  • given a functionally deterministic dataflow graph
- It is sufficient to show that a schedule exists that meets the temporal requirements
  • impossible for Turing-complete dataflow graphs
  • also requires independent analysability of the execution times of tasks

Timed dataflow analysis
- Throughput, latency, and buffer capacities must be decidable
  • static applications: HSDF, SDF, CSDF, CSDFa
  • dynamic applications: VRDF, VPDF, SADF
- VRDF example:
[Figure: actors v0 and v1 connected with parameterized rate n, with n ∈ {1, 2}]

Denotational semantics of timed dataflow
[Figure: an event-triggered system is abstracted by a timed-dataflow model, whose semantics can be given as a labeled transition system, max-plus algebra, a set of (convex) constraints, or a trace algebra]
- Labeled transition system (find worst-case behavior by means of execution or model checking)
- Max-plus algebra (symbolic evaluation, linear system theory)
- System of inequalities (reason with precedence constraints and periodic schedules; convex optimization in polynomial time; closed-form expressions)
- Trace algebra (analysis of non-deterministic behavior)

Discrete event model
- The dataflow model is a discrete event model
  • closely related to timed Petri nets, in particular marked graphs
- Abstraction/refinement theory is used to create a deterministic queueing-system abstraction, which can be analysed using max-plus linear system theory:

    x(k + 1) = A ⊗ x(k) ⊕ B ⊗ u(k)
    y(k)     = C ⊗ x(k) ⊕ D ⊗ u(k)

  with A, B, C, and D respectively the dynamics, input, output, and feed-through matrices

Independent analysability
- Components are first characterized as functions
- Then the overall behavior is determined, i.e., f(g(x)) = (f ∘ g)(x)
- The assumption is that the behavior of the functions is not affected by composition!
  • does this hold for the execution times of tasks?
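The max-plus state equations above can be executed directly: ⊕ is max and ⊗ is +, with ε = −∞ as the additive identity. The two-actor cycle below is an invented example (firing durations 3 and 2), not one from the talk.

```python
# Max-plus evaluation of x(k+1) = A (x) x(k) for an autonomous two-actor cycle.
EPS = float("-inf")  # epsilon, the max-plus "zero"

def mp_matvec(M, v):
    """Max-plus matrix-vector product: (M (x) v)_i = max_j (M_ij + v_j)."""
    return [max(m + x for m, x in zip(row, v)) for row in M]

# Invented example: two actors in a cycle with firing durations 3 and 2.
A = [[EPS, 3.0],
     [2.0, EPS]]

x = [0.0, 0.0]                # initial token production times
for k in range(4):
    x = mp_matvec(A, x)       # x(k+1) = A (x) x(k); no input term here
print(x)  # [10.0, 10.0]: production times grow by 5 every 2 iterations
```

The growth rate of 5 time units per 2 iterations is the cycle mean of A, i.e. the inverse of the guaranteed throughput of the modeled graph.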
Memory port sharing breaks independent analysability
[Figure: processors 1 and 2 access a shared memory through a static-priority (HP/LP) arbiter]
- Use of static priorities breaks compositionality

Multiport memory
[Figure: processors 1 and 2 each have a dedicated port into a multiport memory]
- Expensive and not scalable

Memory port sharing does not break independent analysability
[Figure: processors 1 and 2 access a shared memory through a round-robin arbiter]
- Results in more variation in the execution times

Efficient memory port sharing
[Figure: producer P and consumer C on separate processors communicate through FIFOs; latency-critical (LC) and latency-tolerant (LT) requests are handled by the arbiters in front of the memories]
- The WCET of a task can be determined in isolation, i.e., without knowledge about the characteristics of other tasks
  • the arbiter guarantees a reserved budget ⇒ independently analysable
  • posting of writes minimizes the number of processor stall cycles

Run-time scheduling
- Starvation-free scheduling, e.g. time-division multiplexing, round-robin
  • the worst-case response time can be computed independently of the execution rates of other tasks: ρ̂ = C + (P − B)·⌈C/B⌉
  • throughput analysis and buffer sizing using linear programs
  • convex search space
- Non-starvation-free scheduling, e.g. fixed-priority preemptive
  • response times depend on the execution rates of other tasks
  • throughput and buffer capacities can be computed using an iterative algorithm
    - relies on iterative fixed-point computation of monotone functions
  • non-monotone behavior: e.g. smaller buffer capacities can improve the throughput
  • only applicable to static dataflow graphs, e.g. SDF

Compilation
- Key obstacles to applying dataflow analysis are:
  • modeling effort; model correctness; does the application fit in the model?
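The starvation-free response-time bound ρ̂ = C + (P − B)·⌈C/B⌉ from the run-time scheduling slide can be checked in a few lines. The reading of the symbols is an assumption on my part: B is the budget guaranteed in every replenishment period P of the arbiter, and C is the amount of work to be served.

```python
# Worst-case response time of C units of work under a budget scheduler that
# guarantees B units of service in every period P (formula from the slide;
# interpreting B and P as budget / replenishment period is an assumption).
import math

def worst_case_response_time(C, B, P):
    return C + (P - B) * math.ceil(C / B)

# 5 units of work with a budget of 2 out of every 10 time units:
print(worst_case_response_time(C=5, B=2, P=10))  # → 5 + 8*3 = 29
```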
- Potential of a compiler-based approach:
  • automatic optimization and mapping of the task graph
  • verify the tool instead of the generated parallel application

Multiprocessor compiler Omphale
- input: a sequential OIL program
- internal: Structured Variable Phase Dataflow (SVPDF) model
- outputs: a dataflow model and an executable of the task graph

OIL program example

    state = ACQUISITION;
    while (1) {
      input(out in1);
      switch (state) {
        case ACQUISITION: {
          detect(in1, out state);
        }
        case RECEIVE: {
          decode1(in1, out state, out o1);
          decode2(o1, out o2);
          output(o2);
        }
      }
    }

Resulting task graph
[Figure: task graph with tasks in, det, dec1, dec2, and out communicating via the buffers in1, state, o1, and o2]
- Every OIL function becomes a task
- Every variable becomes a so-called circular buffer (CB)
  • potentially with multiple readers and writers
  • not a Kahn process network!
  • buffer capacities determine the amount of pipeline parallelism

Simulink block diagram
- Best option: a parallel or a sequential specification?

Research directions
- Generalize/apply the dataflow analysis approach
  • computation of the most suitable task graph
  • used for optimization inside a compiler
- Allow more sources of non-determinism
  • e.g. not only FIFO-ordered communication
- Programming-language design for real-time parallel applications
  • an alternative to synchronous languages
- Design of robust cyber-physical systems
  • study trade-offs between the physical and the cyber part, especially concerning missed deadlines
- Improve fault tolerance
  • define mechanisms to recover from loss of events in self-timed systems

Summary
- Data-driven task execution is preferred over time-triggered periodic task execution
  • tolerates more uncertainty
- Timed-dataflow analysis has become a useful abstraction for the design and programming of multiprocessor systems
  • based on the-earlier-the-better refinement theory
  • restrictions enable/simplify analysis
- Real-time multiprocessor compilation tools can hide complexity but are still in their infancy