Design & Co-design of Embedded Systems
HW/SW Partitioning Algorithms + Process Scheduling
Maziar Goudarzi, Fall 2005

Today
- Introduction
- Preliminaries
- Hardware/Software Partitioning
- Distributed System Co-Synthesis (next session)

Reference: Wayne Wolf, "Hardware/Software Co-Synthesis Algorithms," Chapter 2 of Hardware/Software Co-Design: Principles and Practice, Eds. J. Staunstrup and W. Wolf, Kluwer Academic Publishers, 1997.

Topics
- Introduction
- A classification
- Examples: Vulcan, Cosyma

Introduction to HW/SW Partitioning
- The first variety of co-synthesis applications.
- Definition: a HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture.
- Usually, the multiprocessor architecture is one CPU plus some ASICs on the CPU bus.

Introduction to HW/SW Partitioning (cont'd)
- Terminology:
  - Allocation: synthesis methods that design the multiprocessor topology along with the PEs and the SW architecture.
  - Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes so that they get executed.

Introduction to HW/SW Partitioning (cont'd)
- In most partitioning algorithms:
  - The type of the CPU is fixed and given.
  - The ASICs must be synthesized: what function should each ASIC implement, and what characteristics should the implementation have?
- These are single-rate synthesis problems; a CDFG is the starting model.

HW/SW Partitioning (cont'd)
- Normal use of the architectural components:
  - The CPU performs the less computationally intensive functions.
  - ASICs are used to accelerate core functions.
- Where is this used?
  - High-performance applications: no CPU is fast enough for the operations.
  - Low-cost applications: ASIC accelerators allow use of a much smaller, cheaper CPU.

A Classification
- Criterion: optimization strategy, i.e., the trade-off between performance and cost.
  - Primal approach: performance is the primary goal. First, put all functionality in ASICs; then progressively move more to the CPU to reduce cost.
  - Dual approach: cost is the primary goal. First, put all functions in the CPU; then move operations to the ASIC until the performance goal is met.

A Classification (cont'd)
- Example co-synthesis systems:
  - Vulcan (Stanford): primal strategy.
  - Cosyma (Braunschweig, Germany): dual strategy.

Co-Synthesis Algorithms — HW/SW Partitioning Examples: Vulcan

Partitioning Examples: Vulcan
- Gupta, De Micheli, Stanford University.
- Primal approach:
  1. All-HW initial implementation.
  2. Iteratively move functionality to the CPU to reduce cost.
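To make the two strategies concrete, here is a minimal Python sketch of the primal and dual greedy loops. Everything in it (hw_cost, speedup, meets_performance, the set representation of the partitions) is an illustrative placeholder for this sketch, not Vulcan's or Cosyma's actual interface.

```python
def primal_partition(tasks, hw_cost, meets_performance):
    """Primal (Vulcan-style): start all-HW, move tasks to SW to cut cost."""
    hw, sw = set(tasks), set()
    # Try to move the most expensive hardware to software first.
    for t in sorted(tasks, key=hw_cost, reverse=True):
        hw.discard(t)
        sw.add(t)
        if not meets_performance(hw, sw):  # move broke a timing constraint:
            sw.discard(t)                  # back it out
            hw.add(t)
    return hw, sw

def dual_partition(tasks, speedup, meets_performance):
    """Dual (Cosyma-style): start all-SW, move tasks to HW to gain speed."""
    hw, sw = set(), set(tasks)
    candidates = sorted(tasks, key=speedup, reverse=True)
    while candidates and not meets_performance(hw, sw):
        t = candidates.pop(0)              # biggest expected speedup first
        sw.discard(t)
        hw.add(t)
    return hw, sw
```

Both loops are greedy: the primal one backs out a move as soon as performance breaks, while the dual one stops moving tasks once the performance goal is met.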
- System specification language: HardwareC, which is compiled into a flow graph.

Partitioning Examples: Vulcan (cont'd)
[Figure: the HardwareC fragment "x=a; y=b;" compiles into a flow graph in which a nop node leads to the parallel operations x=a and y=b; the fragment "if (c>d) x=e; else y=f;" compiles into a cond node whose outgoing edges, labeled c>d and c<=d, lead to x=e and y=f respectively.]

Partitioning Examples: Vulcan (cont'd)
Flow graph definition:
- A variation of a (single-rate) task graph.
- Nodes represent operations, typically low-level ones such as multiply and add.
- Edges represent data dependencies; each edge carries a Boolean condition under which the edge is traversed.

Partitioning Examples: Vulcan (cont'd)
- The flow graph is executed repeatedly at some rate.
- Each node can have initiation-time constraints: t(vi) + lij ≤ t(vj) ≤ t(vi) + uij.
- Each node can have rate constraints: mi ≤ Ri ≤ Mi.

Partitioning Examples: Vulcan (cont'd)
Vulcan co-synthesis algorithm:
- The partitioning quantum is a thread; the algorithm divides the flow graph into threads and allocates them.
- A thread boundary is determined by:
  1. (always) a non-deterministic delay element, such as a wait for an external variable;
  2. (by choice) other points of the flow graph.
- Target architecture: CPU + co-processor (multiple ASICs).

Partitioning Examples: Vulcan (cont'd)
- Allocation: primal approach.
- Scheduling is done by a scheduler running on the target CPU, which:
  - is generated as part of the synthesis process;
  - schedules all threads (both HW and SW threads);
  - cannot be static, because some threads have non-deterministic initiation times.

Partitioning Examples: Vulcan (cont'd)
- Cost estimation:
  - SW implementation:
    - Code size: relatively straightforward.
    - Data size: the biggest challenge; Vulcan puts some effort into finding bounds for each thread.
  - HW implementation: ?

Partitioning Examples: Vulcan (cont'd)
- Performance estimation: for both the SW and the HW implementation, derived from the flow graph and the basic execution times of the operators.

Partitioning Examples: Vulcan (cont'd)
Algorithm details:
- Partitioning goal: map each thread to one of two partitions, the CPU set FS or the co-processor set FH, such that the required execution rate is met and the total cost is minimized.

Partitioning Examples: Vulcan (cont'd)
Algorithm steps:
1. Put all threads in the FH set.
2. Iteratively:
   2.1. Move some operations to FS:
        2.1.1. Select a group of operations to move to FS.
        2.1.2. Check performance feasibility by computing the worst-case delay through the flow graph, given the new thread times (sketched below).
        2.1.3. Do the move, if feasible.
   2.2. Incrementally update the cost function to reflect the new partition.
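The feasibility check in step 2.1.2 amounts to a longest-path computation over the flow graph. Below is a minimal sketch under stated assumptions: the graph is a DAG given as {node: [successors]} with every node present as a key, and delay[] holds each thread's execution time on its currently assigned PE. It illustrates the idea only; it is not Vulcan's actual implementation.

```python
from collections import deque

def worst_case_delay(succ, delay):
    """Critical-path length through a DAG, computed in topological order."""
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    ready = deque(v for v in succ if indeg[v] == 0)
    finish = {v: delay[v] for v in succ}   # worst-case completion per node
    while ready:
        v = ready.popleft()
        for w in succ[v]:
            finish[w] = max(finish[w], finish[v] + delay[w])
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.append(w)
    return max(finish.values())

def feasible(succ, delay, deadline):
    """Step 2.1.2: accept a tentative move only if timing still holds."""
    return worst_case_delay(succ, delay) <= deadline

# Example: a diamond-shaped flow graph; the critical path is a -> c -> d.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(worst_case_delay(g, {"a": 1, "b": 2, "c": 3, "d": 1}))  # -> 5
```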
Partitioning Examples: Vulcan (cont'd)
Algorithm details (cont'd) — the Vulcan cost function:

f(w) = c1·Sh(FH) − c2·Ss(FS) + c3·B − c4·P + c5·|m|

- c1..c5: weight constants.
- Sh(), Ss(): size functions for the HW and SW partitions.
- B: bus utilization (< 1).
- P: processor utilization (< 1).
- |m|: total number of variables to be transferred between the CPU and the co-processor.

Partitioning Examples: Vulcan (cont'd)
Complementary notes:
- A heuristic to minimize communication: once a thread is moved to FS, its immediate successors are placed on the list for evaluation in the next iteration.
- No back-tracking: once a thread is assigned to FS, it remains there.
- Experimental results: the produced designs are considerably faster than all-SW implementations, yet much cheaper than all-HW designs.

Co-Synthesis Algorithms — HW/SW Partitioning Examples: Cosyma

Partitioning Examples: Cosyma
- Rolf Ernst et al., Technical University of Braunschweig, Germany.
- Dual approach:
  1. All-SW initial implementation.
  2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective.
- System specification language: Cx, compiled into an ESG (Extended Syntax Graph), which is much like a CDFG.

Partitioning Examples: Cosyma (cont'd)
Cosyma co-synthesis algorithm:
- The partitioning quantum is a basic block: a branch-free block of the program.
- Target architecture: CPU + accelerator ASIC(s).
- Aspects to cover: scheduling, allocation, cost estimation, performance estimation, algorithm details.

Partitioning Examples: Cosyma (cont'd)
Performance estimation:
- SW implementation: done by examining the object code generated for the basic block by a compiler.
- HW implementation:
  - Assumes one operator per clock cycle.
  - Creates a list schedule for the DFG of the basic block (see the sketch below).
  - The depth of the list gives the number of clock cycles required.
- Communication:
  - Done by data-flow analysis of the adjacent basic blocks.
  - In shared memory, the cost is proportional to the number of variables to be accessed.
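The HW time estimate can be illustrated with a tiny list scheduler. The DFG representation ({op: [predecessors]}) and the functional-unit limit are assumptions made for this sketch, not Cosyma's actual data structures; the one-operator-per-clock-cycle rule follows the slide above.

```python
def list_schedule_depth(preds, units_per_cycle=2):
    """Clock cycles needed for a basic block's DFG (one cycle per operator).

    preds maps each operation to the operations it depends on; the DFG
    must be acyclic. units_per_cycle models the functional-unit limit.
    """
    unscheduled = set(preds)
    done, cycles = set(), 0
    while unscheduled:
        # Operations whose predecessors have all finished are ready now.
        ready = [op for op in unscheduled if all(p in done for p in preds[op])]
        issued = ready[:units_per_cycle]   # respect the resource limit
        done.update(issued)
        unscheduled.difference_update(issued)
        cycles += 1
    return cycles

# Example DFG for: t1 = a*b; t2 = c*d; t3 = t1 + t2
dfg = {"t1": [], "t2": [], "t3": ["t1", "t2"]}
print(list_schedule_depth(dfg))  # -> 2 cycles with two units per cycle
```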
Partitioning Examples: Cosyma (cont'd)
Algorithm steps — the change in execution time caused by moving basic block b from the CPU to the ASIC:

Δc(b) = w · (tHW(b) − tSW(b) + tcom(Z) − tcom(Z ∪ b)) × It(b)

- w: constant weight.
- tHW(b), tSW(b): execution times of basic block b in HW and in SW.
- tcom(Z): estimated communication time between the CPU and the accelerator ASIC, given the set Z of basic blocks implemented on the ASIC.
- It(b): total number of times that b is executed.

Partitioning Examples: Cosyma (cont'd)
Experimental results:
- Moving only basic blocks to HW gives a typical speedup of only 2x.
  - Reason: limited intra-basic-block parallelism.
  - Cure: implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC. Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining.
  - Result: speedups of 2.7 to 9.7; CPU times of 35 to 304 seconds on a typical workstation.

Summary
- What co-synthesis is.
- Various keywords used in the classification of co-synthesis algorithms.
- HW/SW partitioning: one broad category of co-synthesis algorithms.
- Criteria by which a co-synthesis algorithm is categorized.

Processes and Operating Systems
- Scheduling policies: RMS; EDF.
- Scheduling modeling assumptions.
- Interprocess communication.
- Power management.
- Reference (and slides): Wayne Wolf, "Computers as Components: Principles of Embedded Computing System Design," Chapter 6 (Processes and Operating Systems), MKP, 2001.

Metrics
How do we evaluate a scheduling policy?
- Ability to satisfy all deadlines.
- CPU utilization: the percentage of time devoted to useful work.
- Scheduling overhead: the time required to make a scheduling decision.

Rate-Monotonic Scheduling
- RMS (Liu and Layland): a widely used, analyzable scheduling policy.
- Its analysis is known as Rate-Monotonic Analysis (RMA).

RMA Model
- All processes run on a single CPU.
- Zero context-switch time.
- No data dependencies between processes.
- Process execution time is constant.
- The deadline is at the end of the period.
- The highest-priority ready process runs.

Process Parameters
- Ti is the computation time of process i; ti is the period of process i.
[Figure: one period of process Pi, of length ti, containing its computation time Ti.]

Rate-Monotonic Analysis
- Response time: the time required to finish a process.
- Critical instant: the scheduling state that gives the worst response time.
- The critical instant occurs when all higher-priority processes are ready to execute.

Critical Instant
[Figure: interfering processes P1, P2, and P3 delay P4; the critical instant for P4 occurs when all three higher-priority processes are released simultaneously.]

RMS Priorities
- Optimal (fixed) priority assignment:
  - The shortest-period process gets the highest priority, i.e., priority is inversely proportional to period.
  - Ties are broken arbitrarily.
- No fixed-priority scheme does better.
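Rate-monotonic priority assignment is simple enough to state in a few lines. The Process structure and the example task set below are illustrative, not taken from the referenced text.

```python
from collections import namedtuple

Process = namedtuple("Process", ["name", "period", "comp_time"])

def rm_priorities(processes):
    """Order processes from highest to lowest RMS priority.

    Shorter period means higher priority; ties may be broken arbitrarily,
    which sorted() does here by preserving input order.
    """
    return sorted(processes, key=lambda p: p.period)

tasks = [Process("P1", period=4, comp_time=1),
         Process("P2", period=6, comp_time=2),
         Process("P3", period=12, comp_time=3)]
for prio, p in enumerate(rm_priorities(tasks), start=1):
    print(f"priority {prio}: {p.name} (period {p.period})")
# -> P1 gets the highest priority, then P2, then P3.
```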
RMS Example
[Figure: RMS schedule from time 0 to 10; P1 (the shorter-period, higher-priority process) runs at the start of each of its periods, and P2 runs in the remaining slots.]

RMS CPU Utilization
- Utilization for n processes is U = Σi Ti / ti.
- As the number of tasks approaches infinity, the maximum guaranteed utilization approaches ln 2 ≈ 69%.

RMS CPU Utilization (cont'd)
- RMS cannot asymptotically guarantee use of 100% of the CPU, even with zero context-switch overhead.
- Idle cycles must be kept available to handle the worst-case scenario.
- However, RMS guarantees that all processes will always meet their deadlines.

RMS Implementation
- Efficient implementation: scan the processes; choose the highest-priority active process.

Earliest-Deadline-First Scheduling
- EDF: a dynamic-priority scheduling scheme.
- The process closest to its deadline has the highest priority.
- Requires recalculating process priorities at every timer interrupt.

EDF Example
[Figure: EDF schedule of P1 and P2; at every point, the process with the nearest deadline runs.]

EDF Analysis
- EDF can use 100% of the CPU.
- But EDF may miss a deadline. (A minimal simulation sketch follows the last slide of this section.)

EDF Implementation
- On each timer interrupt: compute the time to each deadline; choose the process closest to its deadline.
- Generally considered too expensive to use in practice.

Fixing Scheduling Problems
What if your set of processes is unschedulable?
- Change the deadlines in the requirements.
- Reduce the execution times of the processes.
- Get a faster CPU.

Priority Inversion
- Priority inversion: a low-priority process keeps a high-priority process from running.
- Improper use of system resources can cause scheduling problems:
  - A low-priority process grabs an I/O device.
  - A high-priority process needs the I/O device, but can't get it until the low-priority process is done.
- This can cause deadlock.

Solving Priority Inversion
- Give priorities to system resources.
- Have a process inherit the priority of a resource that it requests: a low-priority process inherits the priority of the device if that priority is higher.

Data Dependencies
- Data dependencies allow us to improve utilization: they restrict the combinations of processes that can run simultaneously.
[Figure: a dependency edge from P1 to P2, so P1 and P2 can't run simultaneously.]

Context-Switching Time
- Non-zero context-switch time can push the limits of a tight schedule.
- Its effects are hard to calculate; they depend on the order of context switches.
- In practice, OS context-switch overhead is small.

What About Interrupts?
- Interrupts take time away from processes.
- Perform the minimum work possible in the interrupt handler.
[Figure: timeline in which the OS and an interrupt handler run between the executions of P1, P2, and P3.]

Device Processing Structure
- The interrupt service routine (ISR) performs minimal I/O: get register values, put register values.
- An interrupt service process/thread performs most of the device's function.
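The EDF sketch promised above: a minimal discrete-time simulation under the RMA-style assumptions (single CPU, constant execution times, deadline at the end of the period, zero context-switch cost). The task-set format and the parameters are illustrative.

```python
from functools import reduce
from math import gcd

def edf_simulate(tasks):
    """tasks: list of (name, period, comp_time). Returns missed deadlines."""
    hyper = reduce(lambda a, b: a * b // gcd(a, b), [p for _, p, _ in tasks])
    remaining = {name: 0 for name, _, _ in tasks}  # work left per task
    deadline = {name: 0 for name, _, _ in tasks}   # current absolute deadline
    missed = []
    for t in range(hyper):
        for name, period, comp in tasks:
            if t % period == 0:                    # new job released
                if remaining[name] > 0:            # old job still unfinished:
                    missed.append((name, t))       # its deadline was missed
                remaining[name] = comp
                deadline[name] = t + period
        # Run the ready task with the earliest deadline for one time unit.
        ready = [n for n in remaining if remaining[n] > 0]
        if ready:
            run = min(ready, key=lambda n: deadline[n])
            remaining[run] -= 1
    # Work left at the end of the hyperperiod also missed its deadline.
    missed += [(n, hyper) for n in remaining if remaining[n] > 0]
    return missed

# Utilization 2/5 + 4/7 ≈ 0.97 exceeds the RMS bound, yet EDF succeeds.
print(edf_simulate([("P1", 5, 2), ("P2", 7, 4)]))  # -> [] (no misses)
```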
Summary
- Two major scheduling policies: RMS and EDF.
- Other parameters affecting scheduling: priority inversion, data dependencies, context-switching time, interrupts.

Assignment
- Questions 6.17 to 6.26 from Chapter 6 of the textbook.
- Due date: two weeks from now, Sunday, Azar 27th.