CISC 879 : Advanced Parallel Programming

Exploring Memory Consistency for
Massively-Threaded Throughput-Oriented Processors
Blake A. Hechtman and Daniel J. Sorin
Rahul Deore
Dept. of Computer & Information Sciences
University of Delaware
Outline
• Motivation
• Background
• Memory Consistency
• MTTOPs
• Experimentation
• Conclusion
Some slides adapted from Hechtman and Sorin, and from S. Zuckerman et al. (Memory Consistency)
Motivation
Motivating example (1)
Thread 1        Thread 2
x = 1           y = 1
r1 = y          r2 = x
Initially, x = y = 0. Is it possible to have r1 = r2 = 0?
Motivation
A similar motivating example (2), from "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" (Leslie Lamport, 1979):
Process 1                     Process 2
A := 1                        B := 1
if (B = 0) then               if (A = 0) then
    critical section;             critical section;
    A := 0                        B := 0
else something else           else something else
Initially, A = B = 0. Is mutual exclusion guaranteed?
Example source: Leslie Lamport, 1979
Sequential Consistency
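The mutual exclusion question for Lamport's two-process example can be checked by brute force. The following is a minimal sketch, not taken from the slides: it explores every interleaving of the two processes, once with each process performing its store before its load (program order, as sequential consistency requires) and once with the load performed first (an assumed reordering, used here only to illustrate what goes wrong without SC). The names `step` and `mutual_exclusion_holds` and the state encoding are hypothetical.

```python
def step(state, p, order):
    """Advance process p by one step; return the new state, or None if p is done.

    state = (pc0, pc1, flagA, flagB, seen0, seen1); process 0 owns A, process 1 owns B.
    pc 0 and 1: the two memory operations, in the order given by `order`
    pc 2: branch on the value loaded from the other process's flag
    pc 3: inside the critical section (the next step clears the own flag)
    pc 4: finished
    """
    pcs, flags, seen = list(state[0:2]), list(state[2:4]), list(state[4:6])
    pc = pcs[p]
    if pc in (0, 1):
        if order[pc] == "store":
            flags[p] = 1                 # A := 1 (or B := 1)
        else:
            seen[p] = flags[1 - p]       # load the other process's flag
        pcs[p] = pc + 1
    elif pc == 2:
        pcs[p] = 3 if seen[p] == 0 else 4  # enter critical section, or "something else"
    elif pc == 3:
        flags[p] = 0                     # leave critical section: A := 0 (or B := 0)
        pcs[p] = 4
    else:
        return None
    return tuple(pcs + flags + seen)


def mutual_exclusion_holds(order):
    """Search all interleavings; False if both processes can be in the critical section at once."""
    start = (0, 0, 0, 0, None, None)
    visited = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        if state[0] == 3 and state[1] == 3:
            return False
        for p in (0, 1):
            nxt = step(state, p, order)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                frontier.append(nxt)
    return True


if __name__ == "__main__":
    print(mutual_exclusion_holds(("store", "load")))  # program order: prints True
    print(mutual_exclusion_holds(("load", "store")))  # load hoisted above store: prints False
```

In this toy model, mutual exclusion holds only when each process's store is performed before its load, which is the ordering that sequential consistency guarantees.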
Background
What is memory consistency all about?
Memory Consistency
• Q. What happens when at least two concurrent memory operations arrive at the same memory location x?
• Data race?
Memory Consistency
• Uniform Memory Consistency Models
  • Strongest MCMs
  • Weaker MCMs
• Non-Uniform Memory Consistency Models
  • Hardware-Oriented MCMs
  • Software- and Programmer-Oriented MCMs
Slides adapted from S. Zuckerman et al.
Atomic Consistency
A system is AC if:
• All memory operations are issued and performed in some total order
• Real-time constraint
• Memory operations must follow program order
• Strongest MCM that was conceived
• Never implemented
Sequential Consistency
A system is SC if:
• All memory operations appear to follow some total order
• Memory operations (appear to) follow program order
• Lamport's definition of SC:
  "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."
Back to our example…
Thread 1        Thread 2
x = 1           y = 1
r1 = y          r2 = x
Initially, x = y = 0. Is it possible to have r1 = r2 = 0?
NO → There is no total order, seen identically by Thread 1 and Thread 2 and consistent with each thread's program order, in which r1 = r2 = 0.
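The "NO" answer can be confirmed by enumerating every sequentially consistent execution of the example. The sketch below is an illustration written for this purpose, not code from the paper: it generates all interleavings that preserve each thread's program order and collects the resulting (r1, r2) pairs. The helper names `interleavings`, `run`, and `sc_outcomes` and the tuple encoding of operations are assumptions.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves the order within each."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one total order of operations against a single shared memory."""
    memory = {"x": 0, "y": 0}          # initially x = y = 0
    registers = {}
    for kind, target, source in schedule:
        if kind == "store":
            memory[target] = source     # e.g. x = 1
        else:
            registers[target] = memory[source]  # e.g. r1 = y
    return registers["r1"], registers["r2"]


THREAD_1 = [("store", "x", 1), ("load", "r1", "y")]
THREAD_2 = [("store", "y", 1), ("load", "r2", "x")]


def sc_outcomes():
    """All (r1, r2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD_1, THREAD_2)}


if __name__ == "__main__":
    print(sorted(sc_outcomes()))  # prints [(0, 1), (1, 0), (1, 1)]
```

Of the six possible interleavings, none produces (0, 0): whichever store comes first in the total order is visible to the other thread's later load.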
Coherence (cache consistency)
• Cache coherence is the consistency of shared resource data that ends up stored in multiple local caches
Image Source: Wikipedia
Coherence (cache consistency)
Coherence is achieved if:
• For each memory location x, there is a total order of all the memory operations dealing with x
• Memory operations on x follow the program order
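Back to our example…
Thread 1        Thread 2
x = 1           y = 1
r1 = y          r2 = x
Initially, x = y = 0. Is it possible to have r1 = r2 = 0?
YES! → Coherence orders operations per location only, so each thread's load may be performed before its store:
Thread 1: r1 = y, x := 1
Thread 2: r2 = x, y := 1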
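A small enumeration shows how the (0, 0) outcome becomes reachable once only per-location ordering is enforced. This is a simplified sketch under a stated assumption: because each thread touches x and y exactly once, the model lets each thread's two operations be performed in either order, then interleaves the threads. It is not a full coherence-protocol model, and the names `interleavings`, `run`, and `coherence_only_outcomes` are hypothetical.

```python
from itertools import permutations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves the order within each."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one order of operations against a single shared memory."""
    memory = {"x": 0, "y": 0}          # initially x = y = 0
    registers = {}
    for kind, target, source in schedule:
        if kind == "store":
            memory[target] = source
        else:
            registers[target] = memory[source]
    return registers["r1"], registers["r2"]


THREAD_1 = [("store", "x", 1), ("load", "r1", "y")]
THREAD_2 = [("store", "y", 1), ("load", "r2", "x")]


def coherence_only_outcomes():
    """All (r1, r2) results when a thread's operations on different locations may swap."""
    outcomes = set()
    for t1 in permutations(THREAD_1):      # assumed: store/load to different locations may reorder
        for t2 in permutations(THREAD_2):
            for schedule in interleavings(list(t1), list(t2)):
                outcomes.add(run(schedule))
    return outcomes


if __name__ == "__main__":
    print(sorted(coherence_only_outcomes()))  # prints [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The extra outcome (0, 0) arises exactly from the schedule shown on the slide, where both loads are performed before either store.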
High-level Overview
Some of my observations
Most of the examples assume a sequentially consistent memory model
Image source: preshing.com
Why this reordering?
Performance
MTTOPs
• Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and being used for general-purpose programming
• Conventional wisdom favors weak consistency on MTTOPs
• This paper implements SC, TSO, and RMO on MTTOPs
• Experiments show that strong consistency models are viable for MTTOPs
Slides adapted from Hechtman and Sorin
MTTOPs
• Massively Threaded Throughput-Oriented
  • 4-16 core clusters
  • 8-64 threads wide SIMD
  • 64-128 deep SMT
  → Thousands of concurrent threads
• Massively Threaded Throughput-Oriented
  • Sacrifice latency for throughput
  • Heavily banked caches and memories
  • Many cores, each of which is simple
MTTOPs Examples
Up to 61 cores
Up to 16 cores
64 and 72 cores
2688 simple cores per chip
MTTOPs Conventional Wisdom
• Highly parallel systems benefit from less ordering
  • (Graphics doesn't need ordering)
• Strong consistency seems likely to limit memory-level parallelism (MLP)
• Strong consistency is likely to suffer extra latencies
Weak ordering helps CPUs; does it help MTTOPs?
Memory Models Experimented
• SC
• SC with write buffer
• Total store order (TSO)
• Relaxed memory order (RMO)
MTTOP System Configuration
Differences
Loads per Store
Weak consistency reduces the impact of store latency on performance
CPUs: prior work shows CPUs perform 2-4 loads per store
[Figure: loads per store on a log-scale axis from 1 to 10000]
MTTOPs perform more loads per store → store latency optimizations will not be as critical to MTTOP performance
Differences
Outstanding L1 cache misses
Weak consistency enables more outstanding L1 misses per thread

                      CPU Core               MTTOP Core (CU/SM)
Threads per core      4                      64
SIMD width            4                      64
L1 miss rate          0.1                    0.5
SC misses per core    1.6 (too few misses)   2048 (enough misses)
…                     …                      …
RMO misses per core   6.4                    8192

MTTOPs have more outstanding L1 cache misses → thread reordering enabled by weak consistency is less important to handle memory latency
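The "SC misses per core" row can be reproduced from the three rows above it. The sketch below assumes the row is the product of threads per core, SIMD width, and L1 miss rate. That formula is inferred from the table's numbers rather than stated on the slide, and the function name `sc_misses_per_core` is hypothetical. The elided rows are left out. The final lines only compare the RMO row with the SC row as given.

```python
def sc_misses_per_core(threads_per_core, simd_width, l1_miss_rate):
    """Assumed formula: threads per core x SIMD width x L1 miss rate."""
    return threads_per_core * simd_width * l1_miss_rate


# Inputs copied from the table.
cpu_sc = sc_misses_per_core(threads_per_core=4, simd_width=4, l1_miss_rate=0.1)
mttop_sc = sc_misses_per_core(threads_per_core=64, simd_width=64, l1_miss_rate=0.5)

# RMO values copied from the table; only their ratio to the SC row is computed here.
cpu_rmo, mttop_rmo = 6.4, 8192
cpu_ratio = cpu_rmo / cpu_sc
mttop_ratio = mttop_rmo / mttop_sc

if __name__ == "__main__":
    print("CPU   SC misses per core:", cpu_sc)
    print("MTTOP SC misses per core:", mttop_sc)
    print("RMO / SC ratio (CPU, MTTOP):", cpu_ratio, mttop_ratio)
```

Under this reading, the MTTOP core already sustains over a thousand times as many outstanding misses as the CPU core under SC, which is the slide's argument for why the additional misses enabled by RMO matter less on MTTOPs.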
Results – Benchmarks used
• Ported Rodinia benchmarks
  • bfs, hotspot, kmeans, and nn
• Handwritten benchmarks
  • dijkstra, 2dconv, and matrix_mul
Results
MTTOP Consistency Model Performance Comparison
[Figure: bar chart of speedup (y-axis, 0 to 1.6) for SC, SC_WB, TSO, and RMO on 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, and nn; annotation: "Significant load reordering"]
Conclusion
• Strong consistency should not be ruled out for MTTOPs on the basis of performance
• Improving store performance with write buffers appears unnecessary
• Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)
Conventional wisdom may be wrong about MTTOPs
THANK YOU
QUESTIONS ?