COMPUTER ARCHITECTURE GROUP

Many-core Architecture Characterization of the Path Planning Workload
CogArch Workshop 2015

Omer Khan
Assistant Professor, Electrical and Computer Engineering
University of Connecticut
Contact Email: [email protected]


Path Planning in Cognitive Computing

[Figure: cognitive driving loop. Acquisition & Perception (Sensors) feeds a Planner and Scheduler, which drive the Controllers and Actors that turn Desired Driving Behavior into Actual Driving Behavior; a source S and destination D are connected by a candidate path. Image from KIT Institute for Anthropomatics.]

• Collision-free path?
• Most efficient?


Path Planning: A Challenge Application

Path planning must simultaneously satisfy:
• Real-time performance
• Energy constraints
• Resiliency constraints


Path Planning: Efficiency via Parallelization

Shortest Path Dijkstra Algorithm, O(N·logN + E):

  Initialize Nodes(), Edges()
  For (each Node u)        -- Outer Loop: visits each node once
    For (each Edge of u)   -- Inner Loop:
                              1. Calculates the distance from the current node to each neighbor
                              2. Checks for the next best node (u) among the neighbors

[Figure: example weighted graph with edge weights between 0.5 and 1.3.]

Inner Loop parallelization
• Each thread is assigned a set of the current node's edges to relax
• BARRIERs are applied to synchronize threads during each iteration
• Pro: work efficient, since there are no redundant computations compared to sequential
• Con: hard to parallelize due to the lack of edge-level parallelism (bi-directional search improves concurrency)

Outer Loop parallelization (LOCK each node to avoid potential races among threads relaxing shared nodes; apply a BARRIER to synchronize threads after each iteration)
• Convergence based: divide the nodes among threads; each thread relaxes its nodes iteratively until the distances converge (i.e., no change)
  • Pro: highly parallelizable due to node-level parallelism
  • Con: work inefficient due to many redundant relaxations for nodes that stabilize before convergence
• Range based: dynamically distribute a "range of nodes" among threads; each thread relaxes its set of nodes one by one
  • Pro: work efficient and exploits node-level parallelism (bi-directional search improves concurrency)
  • Con: needs intelligent scheduling for dynamic work balancing
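To make the loop structure above concrete, here is a minimal sequential C++ sketch of Dijkstra with the slide's outer and inner loops labeled. The graph layout and names (Edge, adj, dist) are illustrative assumptions, not code from the talk; the inner-loop parallel variant discussed above would split adj[u] across threads with a barrier per outer iteration.

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Edge = std::pair<int, float>;  // (neighbor id, edge weight)

// Returns the shortest distance from src to every node.
std::vector<float> dijkstra(const std::vector<std::vector<Edge>>& adj, int src) {
    std::vector<float> dist(adj.size(), std::numeric_limits<float>::infinity());
    // Min-heap of (tentative distance, node): pops the next best node.
    std::priority_queue<std::pair<float, int>,
                        std::vector<std::pair<float, int>>,
                        std::greater<>> pq;
    dist[src] = 0.0f;
    pq.push({0.0f, src});
    while (!pq.empty()) {                 // Outer Loop: visits each node once
        auto [du, u] = pq.top();
        pq.pop();
        if (du > dist[u]) continue;       // skip stale heap entries
        for (auto [v, w] : adj[u]) {      // Inner Loop: relax each edge of u
            if (du + w < dist[v]) {       // distance from u to neighbor v
                dist[v] = du + w;
                pq.push({dist[v], v});
            }
        }
    }
    return dist;
}
```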
Characterization Space

• Simulated a tiled 256-core NoC-based multicore
• Algorithms: Single- and Multiple-Objective Shortest Path (SOSP/MOSP)
  • Dijkstra: visits each node once, hence high work complexity
  • Heuristic algorithms: useful for pre-processed input graphs
    • A*, D*: the number of nodes visited is dramatically reduced
  • ∆-Stepping: visits each node once, but the work done per node is determined by "delta"
  • Martin's Algorithm: similar to Dijkstra, but considers an additional objective when evaluating the next best node
• Parallelization strategies
  • Inner Loop
  • Outer Loop: Convergence and Range based
• Inputs (methodology similar to GTgraph from Georgia Tech)
  • Adjacency-list representation with randomly distributed edge weights
  • Graph configurations
    • Number of nodes: 16K to 4M
    • Sparse graph: 4 to 32 edges/node
    • Dense graph: 8K edges/node


Characterization Objectives

• Path planning is challenging because it
  • operates on unstructured data
  • has complex dependence patterns between tasks that are known only during program execution
• The characterization reveals four areas where computing must adapt at runtime to exploit execution efficiency:
  1. Dynamic Workload Selection and Balancing
  2. Concurrency Controls
  3. Input Dependence
  4. Accuracy of Computation


Dense Graph (16K nodes, 8K edges/node)

Shortest Path Algorithm                      Completion Time (ms)   Threads   Accuracy (%)
Dijkstra Sequential (baseline)               14200                  1         100
  Inner Loop Par                             549                    32
  Bi-directional Inner Loop Par              356                    96
  Convergence based Outer Loop Par           6300                   160
  Range-based Outer Loop Par                 7691                   256
  Bi-directional Range-based Outer Loop Par  7424                   256
D* Sequential                                74                     1         97
  D* Bi-directional Inner Loop Par           2.51                   32
Martin's Sequential (baseline)               14500                  1         100
  Bi-directional Inner Loop Par              321                    128
  Bi-directional Range-based Outer Par       2859                   256

• Comments: convergence based is work inefficient; range-based incurs extra communication; the inner loop has good concurrency and gives ~40X speedup for Dijkstra. D* shows very high work efficiency. Martin's achieves ~45X speedup.


Sparse Graph (16K nodes, 16 edges/node)

Shortest Path Algorithm                      Completion Time (ms)   Threads   Accuracy (%)
Dijkstra Sequential (baseline)               30                     1         100
  Inner Loop Par                             30                     1
  Bi-directional Inner Loop Par              14                     2
  Convergence based Outer Loop Par           377                    256
  Range-based Outer Loop Par                 10.8                   128
  Bi-directional Range-based Outer Loop Par  6.4                    192
∆-Stepping Inner Par (∆=50%)                 2.9                    16        80
D* Sequential                                4                      1         97
  D* Bi-directional Inner Loop Par           3.4                    2

• Comments: the inner loop has low concurrency; the convergence based outer loop is work inefficient; range-based bi-directional gives ~5X speedup. ∆-Stepping is work efficient.

Ø Motivates the need to adapt to "choices" along (1) input dependence, (2) dynamic workload balancing, (3) concurrency controls, and (4) the accuracy of computation


"Situational Scheduler" for Adaptation

Ø Algorithmic choices: the user/programmer selects an algorithm or heuristic based on static information such as input characteristics or solution accuracy requirements
Ø Architectural choices: utilize runtime information to make decisions regarding concurrency control and workload balancing methods

[Figure: a many-core substrate coupled to a performance monitoring/decision engine.]

• Develop models that predict the choices that improve the efficiency of computation
• May utilize heuristic, control-theoretic, or even machine learning methods
• The goal is a decision engine with low overhead and high accuracy in predicting the right decision


Architecture Innovations for Extreme Efficiency
Range-based Outer Loop Parallel Dijkstra

Master Thread (also participates as a worker thread):
  1. Create work: while all nodes are not visited
     • Calculate the next range of nodes to relax based on the degree of the graph
     • Distribute the nodes in the current range among the worker threads
  2. "Here is work": hand the range to the workers

Worker Thread (only one shown here; the others do the same work):
  3. Wait, then do the assigned work: while the assigned nodes are not visited (update the Q array)
     • LOCK the node to be relaxed
     • For all neighbor nodes: update the D array with the new distance, if the new distance is less than the old distance (RELAX function)
     • UNLOCK the node that was relaxed
     • BARRIER to synchronize all threads
  4. Done
  5. All Done

Accelerate Computation | Accelerate Communication | Accelerate Data Access
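Below is a much-simplified C++20 sketch of the lock/barrier skeleton behind this scheme; it is not the authors' implementation. The master's degree-aware dynamic range computation and the Q (visited) array are elided, so the loop degenerates to a run-to-convergence pass over static per-thread ranges, but the per-node LOCK, the RELAX update to the D array, and the end-of-iteration BARRIER match the steps above. The toy graph, thread count, and names are assumptions.

```cpp
// g++ -std=c++20 -pthread range_dijkstra.cpp
#include <atomic>
#include <barrier>
#include <cstdio>
#include <limits>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct Graph {  // adjacency list: adj[u] holds (neighbor, weight) pairs
    std::vector<std::vector<std::pair<int, float>>> adj;
};

int main() {
    // Toy 5-node graph, no self-loops (the two-lock relax below assumes this).
    Graph g{{
        {{1, 1.3f}, {2, 0.8f}},  // node 0 (source)
        {{3, 0.5f}},             // node 1
        {{3, 1.1f}, {4, 1.2f}},  // node 2
        {{4, 1.3f}},             // node 3
        {}                       // node 4
    }};
    const int N = (int)g.adj.size();
    const int T = 2;  // worker threads
    std::vector<float> D(N, std::numeric_limits<float>::infinity());  // D array
    D[0] = 0.0f;
    std::vector<std::mutex> locks(N);  // per-node LOCKs
    std::atomic<bool> changed{false};
    std::atomic<bool> done{false};
    // BARRIER at the end of each iteration; the completion step checks
    // convergence, standing in for the master's "create work" phase.
    std::barrier sync(T, [&]() noexcept { done = !changed.exchange(false); });

    auto worker = [&](int tid) {
        // A static N/T split stands in for the master's degree-aware ranges.
        const int lo = tid * N / T, hi = (tid + 1) * N / T;
        do {
            for (int u = lo; u < hi; ++u) {
                for (auto [v, w] : g.adj[u]) {
                    std::scoped_lock lk(locks[u], locks[v]);  // LOCK both endpoints
                    if (D[u] + w < D[v]) {                    // RELAX function
                        D[v] = D[u] + w;
                        changed = true;
                    }
                }  // UNLOCK on scope exit
            }
            sync.arrive_and_wait();  // BARRIER to synchronize all threads
        } while (!done);             // no updates in a full pass: All Done
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < T; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
    for (int u = 0; u < N; ++u) std::printf("D[%d] = %.1f\n", u, D[u]);
}
```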
Data Access Efficiency

• Data locality in path planning is challenging due to unstructured, data-dependent accesses
• Locality-aware protocols [ISCA'13, HPCA'14]
  • Exploit the runtime variability in the locality/reuse of data at the various layers of the on-chip cache hierarchy
  • Intelligent fine-grain data allocation/replication at private and shared caches
    • Locality-aware private-L1 allocation/replication
    • Locality-aware shared-L2 replication
• Locality-aware Private Caching [ISCA'13]
  • Privately cache high-locality lines at the L1 cache
  • Remotely access (at word level) low-locality lines at the L2 cache
  • Allocation is based on the locality of data, dynamically profiled using cache-line-level in-hardware locality classifiers


Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra

• Reducing sharing misses

[Figure: L1 cache miss breakdown (%) into Cold, Capacity, Sharing, and Word misses, for Baseline vs. Locality-aware Private Caching.]

• Sharing misses (expensive) are turned into word misses (cheap) as more cache lines with low locality are identified by the hardware classifiers


Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra

• Energy consumption tradeoffs

[Figure: energy consumption for Baseline vs. Locality-aware Private Caching, broken down into DRAM, Network Link, Network Router, Directory, L2 Cache, L1-D Cache, and L1-I Cache.]

• Reduces invalidations, asynchronous write-backs, and cache-line ping-ponging


Locality-Aware Data Access: Range-based Outer Loop Parallel Dijkstra

• Completion time tradeoffs

[Figure: completion time (ns) for Baseline vs. Locality-aware Private Caching, broken down into Synchronization, L2Home-OffChip, L2Home-Sharers, L2Home-Waiting, L1Cache-L2Home, and Compute.]

• Less time is spent waiting for coherence traffic to be serviced
• Critical section time reduction -> synchronization time reduction


How about Accelerating Computation?

[Figure: accelerator design space from the Aladdin tool [Shao et al., ISCA'14], plotting area (um^2) vs. cycles for DIJKSTRA and FFT; DIJKSTRA exhibits a large latency gap.]

• Accelerating computation alone, even under an idealistic data access setup, is not sufficient!
Ø Must address the data dependencies that lead to fine-grain communication bottlenecks


The Case for a Many-core Accelerator

[Figure: a conventional system (cores with coherence and communication over shared memory) versus our proposal (Core+ACC tiles with send()/receive() messages for communication, coherence over shared memory for data access, and locality-aware data access).]

• Our proposal: accelerate computation + data access + communication


Why Accelerate Communication? Example

[Figure: many flow cores feed two ordering cores, either through a lockless shared data structure and queues (shared memory) or via send()/recv() (explicit messaging).]

• The ordering core receives packets (potentially out of order) from many flow cores, and it reorders and commits the packets
• Several shared memory versions were implemented (we show only the best: the lockless shared data structure implementation)
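The flow-core/ordering-core pattern can be emulated in software, as in the sketch below. On the proposed hardware, send() and receive() are ISA-level instructions; here a hypothetical condition-variable mailbox stands in for the on-chip network, and all names (Packet, Mailbox, FLOW_CORES) are illustrative assumptions rather than the paper's code.

```cpp
#include <condition_variable>
#include <cstdio>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Packet { int seq; int src; };

// Stand-in for the on-chip network endpoint at the ordering core.
class Mailbox {
    std::queue<Packet> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(Packet p) {        // flow core side: send()
        { std::lock_guard<std::mutex> lk(m_); q_.push(p); }
        cv_.notify_one();
    }
    Packet receive() {           // ordering core side: blocking receive()
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Packet p = q_.front();
        q_.pop();
        return p;
    }
};

int main() {
    constexpr int FLOW_CORES = 4, PKTS_PER_CORE = 3;
    Mailbox mbox;

    // Flow cores: each emits its slice of a global sequence; the
    // interleaving across cores is arbitrary, so arrival order at the
    // ordering core is not commit order.
    std::vector<std::thread> flow;
    for (int c = 0; c < FLOW_CORES; ++c)
        flow.emplace_back([&mbox, c] {
            for (int i = 0; i < PKTS_PER_CORE; ++i)
                mbox.send({i * FLOW_CORES + c, c});
        });

    // Ordering core: stash out-of-order arrivals, commit the in-order prefix.
    std::map<int, Packet> stash;
    int next = 0;
    while (next < FLOW_CORES * PKTS_PER_CORE) {
        Packet p = mbox.receive();
        stash[p.seq] = p;
        while (stash.count(next)) {
            std::printf("commit seq=%d (from flow core %d)\n",
                        next, stash[next].src);
            stash.erase(next++);
        }
    }
    for (auto& t : flow) t.join();
}
```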
Why Accelerate Communication? Example

[Figure: latency per packet vs. core count (2 to 256 cores) for Shared Memory vs. Explicit Messaging.]

Methods
• In-house simulator with an ADL front-end: simple in-order RISC cores
• Compiler support for send() and receive() instructions
• BARC'15 paper

• Shared memory: 20 cycles/packet at 256 cores is its best result
• Explicit messaging: 10 cycles/packet at 4 cores
• ~2X latency advantage by using point-to-point communication and by avoiding data ping-ponging


What about Resilience?

• Opens a new research direction that trades off program accuracy against efficient resiliency [CAL'15]

[Figure: resiliency methods (n-Modular, Software, Symptom-based) plotted by performance/energy/power overhead vs. software error vulnerability (coverage), with program accuracy as a new axis for resiliency methods.]

• Redundancy alone can achieve resiliency but hurts efficiency
• Our approach: given correctness-guarantee constraints, selectively apply resilience to the code that is crucial for program correctness and output


Declarative Resilience
D* Heuristic Algorithm for Path Planning

• The heuristic algorithm D* (aka A*) is work efficient and popular in applications that use pre-processed input graphs

[Figure: a correct shortest path and a slightly perturbed shortest path from source S to destination D.]

• Declarative resilience allows the heuristic calculation to be treated as non-crucial code, so minor perturbations can be tolerated


Declarative Resilience
D* Heuristic Algorithm for Path Planning

Sequential pseudo-code for D*:

  while not at the destination node
    RESILIENCE OFF
      for all neighbor nodes
        Lookup edge weights for neighboring nodes
        Use heuristic to calculate the next node with minimum distance
    RESILIENCE ON
      Go to the next best node

Considerations for program correctness of non-crucial code:
1. Unroll the for loop to remove all control flow instructions
2. No stores to globally visible memory, i.e., all updates are local
3. Local store address calculation is protected using redundancy
4. The next-node calculation is checked for "within bounds"
   • Based on the current node ID and the degree of the graph
   • If the bounds are violated, the next node is not updated (i.e., the current node is re-executed)
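A small C++ sketch of how the RESILIENCE OFF/ON annotations might wrap the D* step. The markers are modeled here as no-op macros (the real mechanism belongs to the resilience runtime), the loop is left rolled for brevity (rule 1 would unroll it), and the types and names are assumptions; rules 2 and 4, local-only updates and the bounds check before committing the next node, are shown directly.

```cpp
#include <limits>
#include <utility>
#include <vector>

// Hypothetical markers for the declarative-resilience runtime; no-ops here
// so the control flow reads end to end.
#define RESILIENCE_OFF  /* non-crucial region: soft errors tolerated   */
#define RESILIENCE_ON   /* crucial region: protected, e.g. re-executed */

struct Node {
    std::vector<std::pair<int, float>> edges;  // (neighbor, weight)
    float heuristic;                           // pre-processed estimate to goal
};

// One step of the search: choose the next node among current's neighbors.
int next_best_node(const std::vector<Node>& graph, int current) {
    RESILIENCE_OFF
    // Non-crucial: look up edge weights and apply the heuristic. Per the
    // slide's rules, all updates stay in locals (no globally visible stores).
    float best = std::numeric_limits<float>::infinity();
    int next = current;
    for (auto [v, w] : graph[current].edges)
        if (w + graph[v].heuristic < best) {
            best = w + graph[v].heuristic;
            next = v;
        }

    RESILIENCE_ON
    // Crucial: bounds-check the unprotected result before committing the
    // move. If a soft error drove 'next' out of range, keep the current
    // node so this step is simply re-executed.
    if (next < 0 || next >= (int)graph.size())
        return current;
    return next;
}
```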
Declarative Resilience Results
D* Heuristic Algorithm for Path Planning

[Figure: normalized completion time for Baseline, Re-Execution, and Declarative Resilience, broken down into Resilience-On-Delay, Re-Execution-Time, Network-Recv-Stall-Time, Synchronization-Stall-Time, Compute-Time, and Memory-Stall-Time.]

• Re-executing all instructions incurs a 30% performance overhead [COMPUTER'13]
• The performance overhead of declarative resilience is 8%
• Protects all crucial code, and guards against the side effects of non-crucial code


Summary

• Exploiting concurrency in path planning is non-trivial because (1) it operates on unstructured data, and (2) complex dependence patterns between tasks are known only during program execution
Ø Develop a "Situational Scheduler" that adapts to (1) input dependence, (2) dynamic workload variations, (3) exploitable concurrency, and (4) accuracy requirements
Ø Many-core architectures must accelerate computation, communication, and data access for extreme efficiency
Ø Resiliency of computation must be considered a first-order metric; a declarative resiliency method can reduce the efficiency overheads of resilience