Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Jack Sampson*, Rubén González†, Jean-Francois Collard¤, Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡
*UCSD  †UPC Barcelona  ¤Hewlett-Packard Laboratories  ‡UCSD/Microsoft

Motivations
CMPs are not just small multiprocessors
– Different computation/communication ratio
– Different shared resources
The inter-core fabric offers the potential to support optimizations and acceleration
– CMPs for vector and streaming workloads

Fine-Grained Parallelism
CMPs in the role of vector processors
– Software synchronization is still expensive
– Can target inner-loop parallelism
Barriers are a straightforward organizing tool
– Opportunity for hardware acceleration
Faster barriers allow greater parallelism
– 1.2x – 6.4x on 256-element vectors
– 3x – 12.2x on 1024-element vectors

Accelerating Barriers
Barrier filters: a new method for barrier synchronization
– No dedicated networks
– No new instructions
– Changes only in the shared memory system
– CMP-friendly design point
Competitive with a dedicated barrier network
– Achieves 77%–95% of dedicated-network performance

Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary

Observation and Intuition
Observations
– Barriers need to stall forward progress
– There already exist events that stall processors
Co-opt and extend existing stall behavior
– Cache misses
  • Either the I-cache or the D-cache suffices

High-Level Barrier Behavior
A thread can be in one of three states:
1. Executing
– Perform work
– Enforce memory ordering
– Signal arrival at the barrier
2. Blocking
– Stall at the barrier until all threads arrive
3. Resuming
– Release from the barrier

Barrier Filter Example
CMP augmented with a filter
– Private L1
– Shared, banked L2
Filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: EXECUTING

Example: Memory Ordering
Before/after ordering for memory
– Each thread executes a memory fence
Filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: EXECUTING

Example: Signaling Arrival
Communication with the filter
– Each thread invalidates a designated cache line
The invalidation propagates to the shared L2 cache
The filter snoops the invalidation
– Checks the address for a match
– Records the arrival
Filter state: arrived-counter 0 → 1; Thread C: EXECUTING → BLOCKING
Filter state: arrived-counter 1 → 2; Thread A: EXECUTING → BLOCKING

Example: Stalling
Thread A attempts to fetch the invalidated data
The fill request is not satisfied
– This is the thread-stalling mechanism
Filter state: arrived-counter = 2; Threads A, C: BLOCKING; Thread B: EXECUTING

Example: Release
The last thread signals arrival
Barrier release
– The counter resets
– The filter state for all threads switches
Filter state: arrived-counter 2 → 0; Threads A, B, C: RESUMING
After release
– New cache-fill requests are served
– The filter serves pending cache fills
Filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: RESUMING
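The walkthrough above corresponds to a short per-thread sequence: fence, invalidate the arrival line, re-fetch it (the filter withholds the fill until everyone has arrived), then touch the exit line. The sketch below is a minimal illustration of that sequence, not code from the talk: the allocation call and struct layout are hypothetical, the x86 intrinsics merely stand in for "memory fence" and "invalidate the designated cache line", and the load in step 3 only stalls on hardware that actually implements the barrier filter.

/*
 * Minimal sketch of the per-thread barrier-filter protocol described above.
 * barrier_filter_create() is a hypothetical library/OS call; in the talk the
 * OS allocates the filter and returns the designated addresses.
 */
#include <stdint.h>
#include <emmintrin.h>                    /* _mm_mfence, _mm_clflush */

struct filter_barrier {
    volatile uint8_t *arrive;             /* designated arrival cache line */
    volatile uint8_t *exit;               /* designated exit cache line    */
};

/* Hypothetical call: allocates a filter for nthreads and returns the
 * signal addresses the OS chose. */
struct filter_barrier barrier_filter_create(int nthreads);

static void filter_barrier_wait(struct filter_barrier *b)
{
    uint8_t tmp;

    _mm_mfence();                         /* 1. enforce memory ordering          */
    _mm_clflush((const void *)b->arrive); /* 2. signal arrival: invalidate line  */
    tmp = *b->arrive;                     /* 3. fill request; the filter withholds
                                                the fill until all threads arrive */
    tmp = *b->exit;                       /* 4. touch the exit address so the
                                                filter knows this thread saw the
                                                release (needed for re-entry)     */
    (void)tmp;
}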
Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary

Software Interface
Communication requirements
– Let the hardware know the number of threads
– Let the threads know the signal addresses
Barrier filters as a virtualized resource
– Library interface
– Pure-software fallback
User scenario
– The application calls the OS to create a barrier for a given number of threads
– The OS allocates a barrier filter and relays the address and thread count to it
– The OS returns the address to the application

Barrier Filter Hardware
Additional hardware: an "address filter"
– In the controller for the shared memory level
– A state table and associated FSMs
– Snoops invalidations and fill requests for the designated addresses
Makes use of existing instructions and the existing interconnect network

Barrier Filter Internals
Each barrier filter supports one barrier
– Barrier state
– Per-thread state and FSMs
Multiple barrier filters
– In each controller
– In banked caches, at a particular bank

Why have an exit address?
Needed for re-entry to barriers
– When does Resuming become Executing again?
– Additional fill requests may be issued
Delivery is not a guarantee of receipt
– Context switches
– Migration
– Cache eviction

Ping-Pong Optimization
Draws from sense-reversal barriers
– Entry and exit operations act as duals
Two alternating arrival addresses
– Each conveys exit for the other's barrier
– Eliminates the explicit invalidate of the exit address
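For reference, the entry/exit duality the ping-pong optimization borrows is the same idea as the local-sense flip in a classic sense-reversing software barrier, the flavor of centralized spin barrier typically used as a software baseline. A minimal sketch in C11 atomics follows; the names are illustrative and this is not the paper's code.

#include <stdatomic.h>
#include <stdbool.h>

struct sr_barrier {
    atomic_int  count;        /* threads still to arrive this episode */
    int         nthreads;
    atomic_bool sense;        /* global sense, flipped each episode   */
};

static void sr_barrier_init(struct sr_barrier *b, int nthreads)
{
    atomic_init(&b->count, nthreads);
    b->nthreads = nthreads;
    atomic_init(&b->sense, false);
}

/* Each thread keeps a thread-local sense and flips it every episode, so
 * consecutive barriers cannot be confused -- the role the two alternating
 * arrival addresses play in the ping-pong filter. */
static void sr_barrier_wait(struct sr_barrier *b, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last arrival: reset the counter, then release everyone */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                 /* spin until released */
    }
}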
Outline
Introduction
Barrier Filter Overview
Barrier Filter Implementation
Results
Summary

Methodology
Used a modified version of SMT-Sim
Experiments were performed with 7 different barrier implementations
– Software: centralized, combining tree
– Hardware: filter barrier (4 variants), dedicated barrier network
Performance was examined over a set of parallelizable kernels
– Livermore loops 2, 3, 6
– EEMBC kernels: autocorrelation, Viterbi

Benchmark Selection
Barriers are seen as heavyweight operations
– Infrequently executed in most workloads
Example: Ocean from SPLASH-2
– On a simulated 16-core CMP: 4% of time spent in barriers
Barriers will be used more frequently on CMPs

Latency Micro-benchmark
Average time of a barrier execution (in isolation)
– # threads = # cores
Notable effects due to bus saturation
– The barrier filter scales well up until this point
Filters are closer to the dedicated network than to software
– A significant speedup over software is still exhibited

Autocorrelation Kernel
On a 16-core CMP
– 7.98x speedup for the dedicated network
– 7.31x speedup for the best filter barrier
– 3.86x speedup for the best software barrier
Significant speedup opportunities with fast barriers

Viterbi Kernel
Viterbi on a 4-core CMP
Not all applications can scale to an arbitrary number of cores
– Viterbi performance is higher on 4 or 8 cores than on 16 cores

Livermore Loops
Livermore Loop 3 on a 16-core CMP
Serial/parallel crossover
– The hardware barriers reach crossover on a 4x smaller problem
Reduction in parallelism to avoid false sharing

Result Summary
Fine-grained parallelism on CMPs
– Significant speedups possible
  • 1.2x – 6.4x on 256-element vectors
  • 3x – 12.2x on 1024-element vectors
– False sharing affects problem size and scaling
Faster barriers allow greater parallelism
– Hardware approaches extend the range of worthwhile problem sizes
Barrier filters give competitive performance
– 77%–95% of dedicated-network performance

Conclusions
Fast barriers
– Can organize fine-grained data parallelism on a CMP
CMPs can act in a vector-processor role
– Exploit inner-loop parallelism
Barrier filters
– A CMP-oriented fast barrier

Questions?

Extra Graphs
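As a concrete illustration of the inner-loop parallelism the Livermore Loop 3 results refer to, here is a minimal sketch, not from the talk, of the kernel (an inner product) split across cores, with a barrier separating the partial-sum phase from the combine phase. A standard pthread barrier stands in for the hardware filter, the core and problem sizes are only examples, and the padding of the partial sums reflects the false-sharing concern noted in the results.

#include <pthread.h>
#include <stdio.h>

#define NCORES 16                        /* example core count */
#define N      1024                      /* example vector length */

static double z[N], x[N];
/* pad each partial sum to its own cache line to avoid false sharing */
static struct { double q; char pad[64 - sizeof(double)]; } partial[NCORES];
static pthread_barrier_t bar;
static double q_total;

static void *loop3_worker(void *arg)
{
    int tid   = (int)(long)arg;
    int chunk = N / NCORES;
    int lo    = tid * chunk, hi = lo + chunk;
    double q  = 0.0;

    for (int k = lo; k < hi; k++)        /* Livermore Loop 3: q += z[k] * x[k] */
        q += z[k] * x[k];
    partial[tid].q = q;

    pthread_barrier_wait(&bar);          /* all partial sums are now visible */

    if (tid == 0) {                      /* one thread combines the partials */
        double sum = 0.0;
        for (int t = 0; t < NCORES; t++)
            sum += partial[t].q;
        q_total = sum;
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NCORES];

    for (int i = 0; i < N; i++) { z[i] = 1.0; x[i] = 2.0; }
    pthread_barrier_init(&bar, NULL, NCORES);
    for (long t = 0; t < NCORES; t++)
        pthread_create(&th[t], NULL, loop3_worker, (void *)t);
    for (int t = 0; t < NCORES; t++)
        pthread_join(th[t], NULL);
    printf("q = %f\n", q_total);
    return 0;
}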