Many-core Architecture Characterization of the Path Planning Workload

CogArch Workshop 2015
Omer Khan, Assistant Professor
Electrical and Computer Engineering, University of Connecticut
Contact Email: [email protected]

Path Planning in Cognitive Computing
[Figure: cognitive driving pipeline with components Desired Driving Behavior, Planner, Scheduler, Controllers, Actors, Actual Driving Behavior, Sensors, and Acquisition & Perception, planning a path from source S to destination D. Image from KIT Institute for Anthropomatics]
•  Collision-free path?
•  Most efficient?
Path Planning: A Challenge Application
[Figure: path planning constrained simultaneously by real-time performance, energy constraints, and resiliency constraints]
Path Planning: Efficiency via Parallelization
Shortest Path: Dijkstra's Algorithm

  Initialize Nodes(), Edges()
  For (each Node u)          -- Outer Loop: visits each node once
    For (each Edge of u)     -- Inner Loop:
                                1. Calculates distance from current node to each neighbor
                                2. Checks for the next best node (u) among the neighbors

Complexity: O(N·log N + E)

[Figure: example weighted graph with edge weights between 0.5 and 1.3]

Inner Loop parallelization (see the sketch after this list)
•  Each thread is assigned a set of the current node's edges to relax
•  BARRIERs are applied to synchronize threads during each iteration
•  Pro: work efficient, since there are no redundant computations compared to sequential
•  Con: hard to parallelize due to lack of edge-level parallelism (bi-directional search improves concurrency)

Outer Loop parallelization
•  Convergence based: divide nodes among threads, then each thread relaxes its nodes iteratively until the distances converge (i.e., no change)
   •  Con: work inefficient due to many redundant relaxations for nodes that stabilize before convergence
   •  Pro: highly parallelizable due to node-level parallelism
•  Range based: dynamically distribute a "range of nodes" among threads, then each thread relaxes its set of nodes one by one
   •  Pro: work efficient and exploits node-level parallelism (bi-directional search improves concurrency)
   •  Con: needs intelligent scheduling for dynamic work balancing
•  Both variants LOCK each node to avoid potential races among threads relaxing shared nodes, and apply a BARRIER to synchronize threads after each iteration
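To make the inner-loop strategy concrete, here is a minimal C++ sketch (mine, not the talk's simulator code): the sequential Dijkstra skeleton is kept, and only the relaxation of the current node's edges is split across threads via C++17 parallel algorithms. The Edge/Graph types and the O(N²) next-node scan are simplifications, and the sketch assumes a simple graph (no self-loops or parallel edges) so that concurrent relaxations write distinct entries of D.

#include <algorithm>
#include <execution>
#include <limits>
#include <vector>

struct Edge { int to; double w; };
using Graph = std::vector<std::vector<Edge>>;   // adjacency list

std::vector<double> dijkstra_inner_parallel(const Graph& adj, int src) {
    const std::size_t n = adj.size();
    constexpr double INF = std::numeric_limits<double>::infinity();
    std::vector<double> D(n, INF);       // tentative distances
    std::vector<bool> visited(n, false);
    D[src] = 0.0;

    for (std::size_t iter = 0; iter < n; ++iter) {
        // Sequential outer loop: check for the next best node u (a simple
        // O(N) scan here; a real implementation would use a priority queue).
        std::size_t u = n;
        double best = INF;
        for (std::size_t v = 0; v < n; ++v)
            if (!visited[v] && D[v] < best) { best = D[v]; u = v; }
        if (u == n) break;               // remaining nodes are unreachable
        visited[u] = true;

        // Parallel inner loop: threads share u's edges; each relaxation
        // writes a distinct D[e.to] on a simple graph, so no locks are
        // needed, and the parallel for_each returning acts as the barrier.
        std::for_each(std::execution::par, adj[u].begin(), adj[u].end(),
                      [&](const Edge& e) {
                          const double nd = D[u] + e.w;
                          if (nd < D[e.to]) D[e.to] = nd;   // RELAX
                      });
    }
    return D;
}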
Characterization Space
•  Simulated a tiled 256-core NOC-based multicore
•  Algorithms: Single- and Multiple-Objective Shortest Path (SOSP/MOSP)
   •  Dijkstra: visits each node once, hence high work complexity
   •  Heuristic algorithms: useful for pre-processed input graphs
      •  A*, D*: the number of nodes visited is dramatically reduced
   •  ∆-Stepping: visits each node once, but the work done per node is determined by "delta"
   •  Martin's Algorithm: similar to Dijkstra, but considers an additional objective when evaluating the next best node
•  Parallelization strategies
   •  Inner Loop
   •  Outer Loop: convergence and range based
•  Inputs (methodology similar to GTgraph from Georgia Tech)
   •  Adjacency-list representation with randomly distributed edge weights
   •  Graph configurations
      •  Number of nodes: 16K – 4M
      •  Sparse graph: 4 – 32 edges/node
      •  Dense graph: 8K edges/node
Characterization Objectives
•  Path planning is challenging because it
   •  operates on unstructured data
   •  has complex dependence patterns between tasks that are known only during program execution
•  The characterization reveals four areas where computing must adapt at runtime to exploit execution efficiency:
   1.  Dynamic Workload Selection and Balancing
   2.  Concurrency Controls
   3.  Input Dependence
   4.  Accuracy of Computation
Dense Graph (16K nodes, 8K edges/node)
Shortest Path Algorithm                   | Completion Time (ms) | Threads | Accuracy (%) | Comments
Dijkstra Sequential (baseline)            | 14200 |   1 | 100 |
Inner Loop Par                            |   549 |  32 |     |
Bi-directional Inner Loop Par             |   356 |  96 |     | Inner loop has good concurrency, ~40X speedup
Convergence-based Outer Loop Par          |  6300 | 160 |     | Convergence-based is work inefficient
Range-based Outer Loop Par                |  7691 | 256 |     | Range-based incurs extra communication
Bi-directional Range-based Outer Loop Par |  7424 | 256 |     |
D* Sequential                             |    74 |   1 |     |
D* Bi-directional Inner Loop Par          |  2.51 |  32 |  97 | Very high work efficiency
Martin's Sequential (baseline)            | 14500 |   1 | 100 |
Bi-directional Inner Loop Par             |   321 | 128 |     | ~45X speedup
Bi-directional Range-based Outer Par      |  2859 | 256 |     |
Sparse Graph (16K nodes, 16 edges)
Bi-directional Range-based Outer Loop Par |   6.4 | 192 |     | Range-based bi-directional gives ~5X speedup
∆-Stepping Inner Par (∆=50%)              |   2.9 |  16 |  80 | Work efficient
D* Sequential                             |   4   |   1 |  97 |
D* Bi-directional Inner Loop Par          |   3.4 |   2 |     |

➢  Motivates the need to adapt to "choices" along (1) input dependence, (2) dynamic workload balancing, (3) concurrency controls, and (4) the accuracy of computation
“Situational Scheduler” for Adaptation
Ø  User/programmer selects an algorithm or heuris9c based on sta9c informa9on such as input characteris5cs or solu5on accuracy requirements Algorithmic Choices Ø  U9lize run9me informa9on to make decision regarding concurrency control and workload balancing methods Architectural Choices Many-core Substrate
Performance Monitoring/ Decision Engine •  Develop models that predict the choices to improve efficiency of computa9on •  May u9lize heuris9c or control theore9c or even machine learning methods •  Goal is to design the decision engine to be low overhead with high accuracy of predic9ng the right decision COMPUTER
ARCHITECTURE GROUP
9
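As a concrete illustration, here is a toy C++ sketch of the static (algorithmic) half of such a decision engine. Everything in it is an assumption made for illustration: the Strategy names, the choose_strategy() function, and the degree/accuracy thresholds are invented here, loosely guided by the trends in the dense- and sparse-graph tables above; it is not the talk's actual scheduler.

#include <cstddef>

enum class Strategy {
    InnerLoopParallel,        // dense graphs: inner loop has enough concurrency
    RangeBasedOuterParallel,  // sparse graphs: node-level parallelism wins
    DeltaStepping,            // sparse graphs + relaxed accuracy: fastest observed
};

// Pick a parallelization strategy from static input characteristics and the
// solution-accuracy requirement (all thresholds hypothetical).
Strategy choose_strategy(std::size_t nodes, std::size_t edges,
                         double required_accuracy) {
    const double avg_degree =
        static_cast<double>(edges) / static_cast<double>(nodes);
    if (avg_degree >= 1024.0)        // dense: inner loop gave ~40X speedup
        return Strategy::InnerLoopParallel;
    if (required_accuracy <= 0.80)   // sparse, accuracy relaxed: ∆-stepping won
        return Strategy::DeltaStepping;
    return Strategy::RangeBasedOuterParallel;  // sparse, exact: ~5X bi-directional
}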
Architecture Innovations for Extreme Efficiency
Range-based Outer Loop Parallel Dijkstra
Master Thread (also participates as a worker thread):

  while all nodes are not visited
    •  Calculate the next range of nodes to relax, based on the degree of the graph   (1. Create work)
    •  Distribute the nodes in the current range among worker threads                 (2. Here is work)
    BARRIER to synchronize all threads                                                (5. All done)

Worker Thread (only one shown here; the others do the same work):

  while assigned nodes are not visited (update Q array)                               (3. Wait and do the assigned work)
    •  LOCK the node to be relaxed
    •  for all neighbor nodes
       •  Update the D array with the new distance, if the new distance is less than the older distance (RELAX function)
    •  UNLOCK the node that was relaxed
    BARRIER to synchronize all threads                                                (4. Done)

Accelerate Computation | Accelerate Communication | Accelerate Data Access
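Below is a compact, runnable C++ rendering of this pseudocode; it is a sketch under my own simplifying assumptions, not the talk's implementation. In particular, the master's degree-based range heuristic is replaced by picking the RANGE_SIZE unvisited nodes with the smallest tentative distance, and a node whose distance later improves is re-marked unvisited (label-correcting), so the result stays exact for nonnegative edge weights.

#include <algorithm>
#include <barrier>
#include <limits>
#include <mutex>
#include <thread>
#include <vector>

struct Edge { int to; double w; };               // nonnegative weights assumed
using Graph = std::vector<std::vector<Edge>>;
constexpr double INF = std::numeric_limits<double>::infinity();
constexpr std::size_t RANGE_SIZE = 64;           // stand-in for degree heuristic

std::vector<double> sssp_range_based(const Graph& adj, int src, int nthreads) {
    const std::size_t n = adj.size();
    std::vector<double> D(n, INF);               // D array: tentative distances
    std::vector<char> visited(n, 0);             // stands in for the Q array
    std::vector<std::mutex> node_lock(n);        // per-node LOCKs
    std::vector<int> range;                      // master-produced work
    bool done = false;
    D[src] = 0.0;
    std::barrier sync(nthreads);

    auto thread_fn = [&](int tid) {
        while (true) {
            if (tid == 0) {                      // master: 1. create work
                range.clear();
                for (std::size_t u = 0; u < n; ++u)
                    if (!visited[u] && D[u] < INF) range.push_back(int(u));
                const std::size_t k = std::min(RANGE_SIZE, range.size());
                std::partial_sort(range.begin(), range.begin() + k, range.end(),
                                  [&](int a, int b) { return D[a] < D[b]; });
                range.resize(k);
                for (int u : range) visited[u] = 1;
                done = range.empty();
            }
            sync.arrive_and_wait();              // 2. here is work
            if (done) break;                     // 5. all done
            for (std::size_t i = tid; i < range.size(); i += nthreads) {
                const int u = range[i];          // 3. do the assigned work
                double du;
                { std::scoped_lock lk(node_lock[u]); du = D[u]; }
                for (const Edge& e : adj[u]) {
                    std::scoped_lock lk(node_lock[e.to]);  // LOCK node being relaxed
                    if (du + e.w < D[e.to]) {              // RELAX function
                        D[e.to] = du + e.w;
                        visited[e.to] = 0;       // improved: must be revisited
                    }
                }                                // UNLOCK on scope exit
            }
            sync.arrive_and_wait();              // 4. done: BARRIER per iteration
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(thread_fn, t);
    for (auto& th : pool) th.join();
    return D;
}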
Data Access Efficiency
•  Data locality in path planning is challenging due to unstructured, data-dependent accesses
•  Locality-aware protocols [ISCA'13, HPCA'14]
   •  Exploit the runtime variability in locality/reuse of data at various layers of the on-chip cache hierarchy
   •  Intelligent fine-grain data allocation/replication at private and shared caches
      •  Locality-aware private-L1 allocation/replication
      •  Locality-aware shared-L2 replication
•  Locality-aware Private Caching [ISCA'13]
   •  Privately cache high-locality lines at the L1 cache
   •  Remotely access (at word level) low-locality lines at the L2 cache
   •  Allocation is based on the locality of data, dynamically profiled using cache-line-level in-hardware locality classifiers
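As a software-only illustration of the classifier idea: the sketch below tracks per-line reuse with a saturating counter and allocates a line in the private L1 only once its observed reuse crosses a threshold. The class name, the threshold, and the halving-on-eviction aging policy are inventions for this sketch; the actual ISCA'13 classifiers are hardware structures, not code like this.

#include <cstdint>
#include <unordered_map>

class LocalityClassifier {
public:
    explicit LocalityClassifier(unsigned threshold) : threshold_(threshold) {}

    // Called on each access; returns true if the line should be cached
    // privately in L1, false if it should be a remote word access at L2.
    bool allocate_in_l1(std::uint64_t line_addr) {
        auto& c = reuse_[line_addr];
        if (c < 255) ++c;                    // saturating reuse counter
        return c >= threshold_;
    }

    // Called when the line is evicted: age the counter so stale history
    // does not keep classifying the line as high-locality forever.
    void on_eviction(std::uint64_t line_addr) {
        auto it = reuse_.find(line_addr);
        if (it != reuse_.end()) it->second /= 2;
    }

private:
    unsigned threshold_;
    std::unordered_map<std::uint64_t, std::uint8_t> reuse_;
};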
Locality-Aware Data Access
Range-based Outer Loop Parallel Dijkstra
[Chart: L1 cache miss breakdown (%) for baseline vs. locality-aware private caching, split into cold, capacity, sharing, and word misses]
•  Reducing sharing misses: sharing misses (expensive) are turned into word misses (cheap) as more cache lines with low locality are identified by the hardware classifiers
Locality-Aware Data Access
Range-based Outer Loop Parallel Dijkstra
[Chart: energy consumption for baseline vs. locality-aware private caching, broken down into DRAM, network link, network router, directory, L2 cache, L1-D cache, and L1-I cache]
•  Energy consumption tradeoffs: reduced invalidations, asynchronous write-backs, and cache-line ping-ponging
Locality-Aware Data Access
Range-based Outer Loop Parallel Dijkstra
[Chart: completion time (ns) for baseline vs. locality-aware private caching, broken down into compute, L1Cache-L2Home, L2Home-Waiting, L2Home-Sharers, L2Home-OffChip, and synchronization time]
•  Completion time tradeoffs: less time is spent waiting for coherence traffic to be serviced
•  Critical-section time reduction → synchronization time reduction
How about Accelerating Computation?

[Chart: accelerator area (µm²) vs. completion time (cycles) across the Aladdin [Shao et al., ISCA'14] tool's design space for DIJKSTRA and FFT; DIJKSTRA exhibits a large latency gap]
•  Accelerating computation alone, even under an idealistic data-access setup, is not sufficient!
➢  Must address the data dependencies that lead to fine-grain communication bottlenecks
The Case for a Many-core Accelerator
[Diagram: Conventional system — plain cores communicate and access data through coherence over shared memory. Our proposal — Core+ACC tiles (each core paired with an accelerator) use Send()/Receive() messages plus coherence and communication over shared memory, with locality-aware data access]
➢  Accelerate computation + data access + communication
Why Accelerate Communication? Example
[Diagram: many flow cores feed ordering cores. Shared-memory version — flow cores insert packets into a lockless shared data structure in shared memory, and the ordering core recv()s them through queues. Explicit-messaging version — flow cores send() packets directly to the ordering core]
•  The ordering core receives packets (potentially out of order) from many flow cores, and it reorders and commits the packets
•  Several shared-memory versions were implemented (we show only the best: the lockless shared data structure implementation)
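To illustrate the ordering core's job, here is a small C++ sketch of the reorder-and-commit loop. The msg_receive() and commit() functions are hypothetical stand-ins for the compiler-supported send()/receive() instructions mentioned under Methods on the next slide (the real ISA interface is not given in the deck), and the assumption that each packet carries a sequence number is mine.

#include <cstdint>
#include <map>

struct Packet { std::uint64_t seq; std::uint32_t payload; };

// Assumed primitives: blocking receive from any flow core, and a commit
// step that hands the packet to the next pipeline stage.
Packet msg_receive();              // hypothetical hardware message receive
void commit(const Packet& p);      // hypothetical downstream commit

void ordering_core_loop() {
    std::uint64_t next_seq = 0;
    std::map<std::uint64_t, Packet> pending;   // out-of-order arrivals

    for (;;) {
        Packet p = msg_receive();              // may arrive out of order
        pending.emplace(p.seq, p);
        // Drain every packet that is now in sequence.
        for (auto it = pending.find(next_seq); it != pending.end();
             it = pending.find(next_seq)) {
            commit(it->second);
            pending.erase(it);
            ++next_seq;
        }
    }
}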
Why Accelerate Communication? Example

[Chart: latency per packet vs. core count (2–256 cores) for shared memory vs. explicit messaging]

Methods
•  In-house simulator with ADL front-end: simple in-order RISC cores
•  Compiler support for send() and receive() instructions
•  BARC'15 paper
•  Shared memory: 20 cycles/packet at 256 cores is the best result
•  Explicit messaging: 10 cycles/packet at 4 cores
•  ~2X latency advantage using point-to-point communication and by avoiding data ping-ponging
What about Resilience?
•  Redundancy alone can achieve resiliency but hurts efficiency
•  Our approach: given correctness-guarantee constraints, selectively apply resilience to the code that is crucial for program correctness and output
•  Opens a new research direction that trades off program accuracy with efficient resiliency [CAL'15]

[Chart: resiliency methods (n-modular, software, symptom-based) positioned by performance/energy/power overheads, software error vulnerability (coverage), and program accuracy]
Declarative Resilience
D* Heuristic Algorithm for Path Planning
•  The heuristic algorithm D* (Dynamic A*) is work efficient and popular in applications that use pre-processed input graphs

[Figure: two S→D path examples — the correct shortest path and a slightly perturbed shortest path]

•  Declarative resilience allows the heuristic calculation to be considered non-crucial code, and hence minor perturbations can be tolerated
Declarative Resilience
D* Heuristic Algorithm for Path Planning
Sequential pseudo-code for D*:

  while not at the destination node
    RESILIENCE OFF
      •  for all neighbor nodes
         •  Lookup edge weights for neighboring nodes
         •  Use heuristic to calculate the next node with minimum distance
    RESILIENCE ON
    Go to the next best node

Considerations for program correctness of non-crucial code (a sketch follows below):
1.  Unroll the for loop to remove all control-flow instructions
2.  No stores to globally visible memory, i.e., all updates are local
3.  Local store address calculation is protected using redundancy
4.  Next-node calculation is checked for "within bounds"
    •  Based on the current node ID and the degree of the graph
    •  If the bounds are violated, the next node is not updated (i.e., the current node is re-executed)
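To make the marked-region idea concrete, here is a small C++ sketch. The resilience_off()/resilience_on() markers, the next_best_node() helper, and the g/h cost arrays are hypothetical names invented for this sketch (the deck does not define the annotation API); only the structure — an unprotected heuristic selection followed by a crucial bounds check — follows the considerations above.

#include <cstddef>
#include <limits>
#include <vector>

extern void resilience_off();   // hypothetical: relax redundancy checking
extern void resilience_on();    // hypothetical: restore full protection

// One step of the search: pick the neighbor minimizing g + h.
std::size_t next_best_node(const std::vector<std::size_t>& neighbors,
                           const std::vector<double>& g,   // path cost so far
                           const std::vector<double>& h,   // heuristic estimate
                           std::size_t current, std::size_t num_nodes) {
    resilience_off();           // heuristic calculation is non-crucial
    std::size_t best = current;
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t v : neighbors) {     // (a real version would unroll this
        const double cost = g[v] + h[v];  //  loop, per consideration #1)
        if (cost < best_cost) { best_cost = cost; best = v; }
    }
    resilience_on();
    // Consideration #4: crucial bounds check on the possibly faulty result.
    if (best >= num_nodes)      // bounds violated: do not update the next node
        return current;         // caller re-executes the current node
    return best;
}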
Declarative Resilience Results
D* Heuristic Algorithm for Path Planning
[Chart: completion time (normalized) for baseline, re-execution, and declarative resilience, broken down into memory stall, compute, synchronization stall, network-receive stall, re-execution, and resilience-on-delay time]
•  Re-executing all instructions incurs 30% performance overhead [COMPUTER'13]
•  Declarative resilience performance overhead is 8%
   •  Protects all crucial code, and protects against the side effects of non-crucial code
Summary
•  Exploiting concurrency in path planning is non-trivial because (1) it operates on unstructured data, and (2) complex dependence patterns between tasks are known only during program execution
➢  Develop a "Situational Scheduler" that adapts to (1) input dependence, (2) dynamic workload variations, (3) exploitable concurrency, and (4) accuracy requirements
➢  Many-core architectures must accelerate computation, communication, and data accesses for extreme efficiency
➢  Resiliency of computation must be considered a first-order metric; a declarative resiliency method potentially reduces the efficiency overheads of resilience