
Exploiting Fine-Grained Data Parallelism
with Chip Multiprocessors and Fast Barriers
Jack Sampson*, Rubén González†, Jean-Francois Collard¤,
Norman P. Jouppi¤, Mike Schlansker¤, Brad Calder‡
*UCSD †UPC Barcelona ¤Hewlett-Packard Laboratories ‡UCSD/Microsoft
Motivations

CMPs are not just small multiprocessors
– Different computation/communication ratio
– Different shared resources

Inter-core fabric offers potential to support optimizations/acceleration
– CMPs for vector, streaming workloads
Fine-grained Parallelism

CMPs in role of vector processors
– Software synchronization still expensive
– Can target inner-loop parallelism

Barriers a straightforward organizing tool
– Opportunity for hardware acceleration

Faster barriers allow greater parallelism
– 1.2x – 6.4x on 256 element vectors
– 3x – 12.2x on 1024 element vectors
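For context, a minimal sketch (in C) of the kind of inner-loop strip-mining a cheap barrier makes profitable; barrier_wait(), NCORES, and daxpy_phase are illustrative names, not from the paper.

#include <stddef.h>

#define NCORES 16                           /* assumed core count for the sketch */

extern void barrier_wait(int barrier_id);   /* hypothetical fast-barrier call    */

/* One inner-loop phase of a vector computation, strip-mined across cores.
   The next phase reads all of y, so every core must reach the barrier
   before any core continues; a fast barrier keeps this worthwhile even
   for short vectors. */
void daxpy_phase(double a, const double *x, double *y, size_t n, int tid)
{
    size_t chunk = (n + NCORES - 1) / NCORES;       /* contiguous slice per core */
    size_t lo = (size_t)tid * chunk;
    size_t hi = (lo + chunk < n) ? lo + chunk : n;

    for (size_t i = lo; i < hi; i++)
        y[i] = a * x[i] + y[i];

    barrier_wait(0);                                /* synchronize before next phase */
}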
Accelerating Barriers

Barrier Filters: a new method for barrier synchronization
– No dedicated networks
– No new instructions
– Changes only in shared memory system
– CMP-friendly design point

Competitive with dedicated barrier network
– Achieves 77%-95% of dedicated network performance
Outline

Introduction

Barrier Filter Overview

Barrier Filter Implementation

Results

Summary
Observation and Intuition

Observations
– Barriers need to stall forward progress
– There exist events that already stall processors

Co-opt and extend existing stall behavior
– Cache misses
• Either I-Cache or D-Cache suffices
High Level Barrier Behavior

A thread can be in one of three states
1. Executing
– Perform work
– Enforce memory ordering
– Signal arrival at barrier
2. Blocking
– Stall at barrier until all arrive
3. Resuming
– Release from barrier
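A minimal sketch (in C) of how a thread might walk through these three states with the filter mechanism described on the next slides; ARRIVAL, EXIT, cacheline_invalidate, and the exact exit handshake are illustrative assumptions, not the paper's actual interface.

/* Illustrative sketch only: assumes the runtime handed each thread a
   designated arrival line and exit line, and that the ISA exposes some
   user-level cache-line invalidate (CLFLUSH-like). */
extern volatile char *ARRIVAL[];   /* per-thread arrival cache lines (assumed) */
extern volatile char *EXIT[];      /* per-thread exit cache lines (assumed)    */
extern void cacheline_invalidate(volatile char *line);

void filter_barrier(int tid)
{
    /* 1. Executing: finish work, then enforce memory ordering. */
    __sync_synchronize();

    /* Signal arrival: the invalidation propagates to the shared L2,
       where the filter snoops it and bumps its arrived-counter. */
    cacheline_invalidate(ARRIVAL[tid]);

    /* 2. Blocking: re-fetch the invalidated line; the filter withholds
       the fill until every thread has arrived, stalling this thread. */
    char tmp = *ARRIVAL[tid];
    (void)tmp;

    /* 3. Resuming: acknowledge release on the exit line so the filter
       knows this thread has really left and the barrier can be reused. */
    cacheline_invalidate(EXIT[tid]);
}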
Barrier Filter Example

CMP augmented with filter
– Private L1
– Shared, banked L2
[Diagram: filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: EXECUTING]
Example: Memory Ordering

Memory operations before/after the barrier are ordered
– Each thread executes a memory fence

[Diagram: filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: EXECUTING]
Example: Signaling Arrival

Communication with filter
– Each thread invalidates a designated cache line

[Diagram: filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: EXECUTING]
Example: Signaling Arrival

Invalidation propagates to shared L2 cache

Filter snoops the invalidation
– Checks address for match
– Records arrival

[Diagram: filter state: arrived-counter advances 0 to 1; the arriving thread switches from EXECUTING to BLOCKING]
Example: Signaling Arrival

Invalidation propagates to shared L2 cache

Filter snoops the invalidation
– Checks address for match
– Records arrival

[Diagram: filter state: arrived-counter advances 1 to 2; two threads now BLOCKING, one still EXECUTING]
Example: Stalling

Thread A attempts to fetch the invalidated data

Fill request not satisfied
– Thread stalling mechanism

[Diagram: filter state: arrived-counter = 2; Thread A: BLOCKING, Thread B: EXECUTING, Thread C: BLOCKING]
Example: Release

Last thread signals arrival

Barrier release
– Counter resets
– Filter state for all threads switches

[Diagram: filter state: arrived-counter resets 2 to 0; all threads switch to RESUMING]
Example: Release

After release
– New cache-fill requests served
– Filter serves pending cache fills

[Diagram: filter state: # threads = 3, arrived-counter = 0; Threads A, B, C: RESUMING]
Outline

Introduction

Barrier Filter Overview

Barrier Filter Implementation

Results

Summary
Software Interface

Communication requirements
– Let hardware know # of threads
– Let threads know signal addresses

Barrier filters as virtualized resource
– Library interface
– Pure software fallback

User scenario
– Application calls OS to create barrier with # threads
– OS allocates barrier filter, relays address and # threads
– OS returns address to application
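One hedged sketch of how the library interface for this scenario could look in C; the names bf_create/bf_wait, the handle fields, and the fallback behavior are assumptions for illustration, not the paper's actual API.

#include <stdint.h>

/* Hypothetical per-thread view of one barrier. */
typedef struct {
    int       barrier_id;     /* barrier filter allocated by the OS (or -1)   */
    uintptr_t arrival_addr;   /* this thread's designated arrival cache line  */
    uintptr_t exit_addr;      /* this thread's designated exit cache line     */
} bf_handle_t;

/* Application asks the OS for a barrier over nthreads; the OS allocates a
   barrier filter if one is free (else arranges a pure software fallback)
   and returns the calling thread's signal addresses in *out. */
int  bf_create(int nthreads, int tid, bf_handle_t *out);

/* Wait at the barrier: fence, invalidate the arrival line, stall on reload. */
void bf_wait(const bf_handle_t *h);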
Barrier Filter Hardware

Additional hardware: “address filter”
– In controller for shared memory level
– State table, associated FSMs
– Snoops invalidations and fill requests for designated addresses

Makes use of existing instructions and
existing interconnect network
Barrier Filter Internals

Each barrier filter supports one barrier
– Barrier state
– Per-thread state, FSMs

Multiple barrier filters
– In each controller
– In banked caches, at a particular bank
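A rough sketch (in C) of the per-barrier state one such filter might keep in the shared-cache controller; the field names, widths, and release logic below are assumptions for illustration, not the paper's actual tables.

#include <stdint.h>

#define MAX_THREADS 16

typedef enum { EXECUTING, BLOCKING, RESUMING } thread_fsm_t;

typedef struct {
    int          nthreads;                  /* participating threads            */
    int          arrived;                   /* arrived-counter                  */
    uintptr_t    arrival_line[MAX_THREADS]; /* snooped arrival addresses        */
    uintptr_t    exit_line[MAX_THREADS];    /* snooped exit addresses           */
    thread_fsm_t fsm[MAX_THREADS];          /* per-thread FSM state             */
    int          fill_pending[MAX_THREADS]; /* fill request currently withheld? */
} barrier_filter_t;

/* Filter reaction to a snooped invalidation of a thread's arrival line. */
static void on_arrival_invalidate(barrier_filter_t *f, int tid)
{
    f->fsm[tid] = BLOCKING;
    if (++f->arrived == f->nthreads) {      /* last arrival: release barrier    */
        f->arrived = 0;
        for (int t = 0; t < f->nthreads; t++)
            f->fsm[t] = RESUMING;           /* pending fills can now be served  */
    }
}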
Why have an exit address?

Needed for re-entry to barriers
– When does Resuming again become Executing?
– Additional fill requests may be issued

Delivery is not a guarantee of receipt
– Context switches
– Migration
– Cache eviction
Ping-Pong Optimization

Draws from sense reversal barriers
– Entry and exit operations as duals

Two alternating arrival addresses
– Each conveys exit to the other’s barrier
– Eliminates explicit invalidate of exit address
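A hedged sketch of the ping-pong variant from the thread's side, reusing the same hypothetical cacheline_invalidate and per-thread address arrays as in the earlier sketch; this is one plausible reading of the optimization, not the paper's exact code.

extern volatile char *ARRIVAL_PING[];   /* arrival lines, even barrier episodes (assumed) */
extern volatile char *ARRIVAL_PONG[];   /* arrival lines, odd barrier episodes (assumed)  */
extern void cacheline_invalidate(volatile char *line);

void filter_barrier_pingpong(int tid)
{
    static __thread int sense = 0;      /* per-thread: which address set this episode uses */
    volatile char *line = sense ? ARRIVAL_PONG[tid] : ARRIVAL_PING[tid];

    __sync_synchronize();               /* memory ordering before signaling arrival        */
    cacheline_invalidate(line);         /* arrival here also tells the filter this thread
                                           exited the other address's barrier episode,
                                           so no explicit exit invalidate is needed         */
    char tmp = *line;                   /* stall: fill withheld until all threads arrive    */
    (void)tmp;

    sense ^= 1;                         /* alternate address sets for the next barrier      */
}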
Outline

Introduction

Barrier Filter Overview

Barrier Filter Implementation

Results

Summary
Methodology

Used a modified version of SMTSIM

We performed experiments using 7 different
barrier implementations
– Software:
• Centralized, combining tree
– Hardware:
• Filter barrier (4 variants), dedicated barrier network

We examined performance over a set of parallelizable kernels
– Livermore Loops 2, 3, 6
– EEMBC kernels: autocorrelation, Viterbi
Benchmark Selection

Barriers are seen as heavyweight operations
– Infrequently executed in most workloads

Example: Ocean from SPLASH-2
– On simulated 16-core CMP: 4% of time in barriers

Barriers will be used more frequently on CMPs
Latency Micro-benchmark

Average time of barrier execution (in isolation)
– #threads = #cores

Notable effects due to bus saturation
– Barrier filter scales well up until this point

Filters closer to dedicated network than software
– Significant speedup vs. software still exhibited
Autocorrelation Kernel

On 16-core CMP
– 7.98x speedup for dedicated network
– 7.31x speedup for best filter barrier
– 3.86x speedup for best software barrier

Significant speedup opportunities with fast barriers
Viterbi Kernel
Viterbi on 4-core CMP

Not all applications can scale to an arbitrary number of cores

Viterbi performance higher on 4 or 8 cores than on 16 cores
Livermore Loops
Livermore Loop 3 on 16-core CMP

Serial/parallel crossover
– With HW barriers, crossover occurs at a 4x smaller problem size

Reduction in parallelism to avoid false sharing
Result Summary

Fine-grained parallelism on CMPs
– Significant speedups possible
• 1.2x – 6.4x on 256 element vectors
• 3x – 12.2x on 1024 element vectors
– False sharing affects problem size/scaling

Faster barriers allow greater parallelism
– HW approaches extend the range of problem sizes worth parallelizing

Barrier filters give competitive performance
– 77% - 95% of dedicated network performance
Conclusions

Fast barriers
– Can organize fine-grained data parallelism on a CMP

CMPs can act in a vector processor role
– Exploit inner-loop parallelism

Barrier filters
– CMP-oriented fast barrier
(FIN)
 Questions?