Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors
Blake A. Hechtman and Daniel J. Sorin

Rahul Deore
Dept. of Computer & Information Sciences, University of Delaware
CISC 879 : Advanced Parallel Programming

Outline
• Motivation
• Background
• Memory Consistency
• MTTOPs
• Experimentation
• Conclusion
Some slides adapted from Hechtman and Sorin and from S. Zuckerman et al., "Memory Consistency"

Motivation
Motivating example (1)

  Thread 1      Thread 2
  x = 1         y = 1
  r1 = y        r2 = x

Initially x = y = 0. Is it possible to have r1 = r2 = 0?

Motivation
A similar motivating example (2), from "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", Leslie Lamport, 1979

  Process 1                           Process 2
  A := 1                              B := 1
  if (B = 0) then                     if (A = 0) then
      critical section; A := 0            critical section; B := 0
  else something else                 else something else

Initially A = B = 0. Is mutual exclusion guaranteed?
Example source: Leslie Lamport, 1979 (sequential consistency)

Background
What is memory consistency all about?

Memory Consistency
• Q. What happens when at least two concurrent memory operations arrive at the same memory location x?
• Data race?

Memory Consistency
• Uniform memory consistency models
  • Strongest MCMs
  • Weaker MCMs
• Non-uniform memory consistency models
  • Hardware-oriented MCMs
  • Software- and programmer-oriented MCMs
Slides adapted from S. Zuckerman et al.

Atomic Consistency
A system is AC if:
• All memory operations are issued and performed in some total order
  • Real-time constraint
• Memory operations must follow program order

• Strongest MCM that was conceived
• Never implemented

Sequential Consistency
A system is SC if:
• All memory operations appear to follow some total order
• Memory operations (appear to) follow program order
Lamport's definition of SC:
"... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."

Back to our example…

  Thread 1      Thread 2
  x = 1         y = 1
  r1 = y        r2 = x

Initially x = y = 0. Is it possible to have r1 = r2 = 0?
NO
→ No total order that respects both threads' program order yields r1 = r2 = 0. r1 = 0 requires r1 = y to come before y = 1, and r2 = 0 requires r2 = x to come before x = 1. Combined with program order (x = 1 before r1 = y, y = 1 before r2 = x), these constraints form a cycle.

Coherence (cache consistency)
• Cache coherence is the consistency of shared resource data that ends up stored in multiple local caches
Image source: Wikipedia

Coherence (cache consistency)
Coherence is achieved if:
• For each memory location x, there is a total order of all the memory operations dealing with x
• Memory operations on x follow the program order

Back to our example…

  Thread 1      Thread 2
  x = 1         y = 1
  r1 = y        r2 = x

Initially x = y = 0. Is it possible to have r1 = r2 = 0?
YES!
→ Coherence only orders operations on the same location, so each thread's load may be performed before its own store:
  Thread 1: r1 = y, x := 1
  Thread 2: r2 = x, y := 1

High-level Overview

Some of my observations
• Most of the examples assume a sequentially consistent memory model
Image source: preshing.com

Why this reordering?
Performance

MTTOPs
• Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and used for general-purpose programming
• Conventional wisdom favors weak consistency on MTTOPs
• This paper implements SC, TSO, and RMO on MTTOPs
• Experiments show that strong consistency models are viable for MTTOPs
Slides adapted from Hechtman and Sorin

MTTOPs
Massively threaded:
• 4-16 core clusters
• 8-64 threads wide SIMD
• 64-128 deep SMT
• Thousands of concurrent threads
Throughput-oriented:
• Sacrifice latency for throughput
  • Heavily banked caches and memories
  • Many cores, each of which is simple

MTTOPs Examples
• Up to 61 cores
• Up to 16 cores
• 64 and 72 cores
• 2688 simple cores per chip

MTTOPs Conventional Wisdom
• Highly parallel systems benefit from less ordering
  • (Graphics doesn't need ordering)
• Strong consistency seems likely to limit memory-level parallelism (MLP)
• Strong consistency is likely to suffer extra latencies
Weak ordering helps CPUs; does it help MTTOPs?
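
To make concrete what "less ordering" permits, the two-thread example from the Motivation slides can be checked by brute force. The following is a minimal sketch, not taken from the paper: it assumes a single global order in which operations are performed, and it models the coherence-only case by letting each thread's store and load (which touch different locations) be performed in either order. All names (`T1`, `T2`, `interleavings`, `outcomes`) are illustrative.

```python
from itertools import permutations

# Each operation is (kind, location, destination register or None).
T1 = [("st", "x", None), ("ld", "y", "r1")]   # Thread 1: x = 1; r1 = y
T2 = [("st", "y", None), ("ld", "x", "r2")]   # Thread 2: y = 1; r2 = x


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(order):
    """Execute one global order starting from x = y = 0; return (r1, r2)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, loc, reg in order:
        if kind == "st":
            mem[loc] = 1
        else:
            regs[reg] = mem[loc]
    return regs["r1"], regs["r2"]


def outcomes(keep_program_order):
    """Collect all (r1, r2) results reachable under the chosen ordering rule."""
    # Assumption: without program order, a thread's two operations may swap,
    # because they access different locations and coherence is per-location.
    t1_orders = [T1] if keep_program_order else [list(p) for p in permutations(T1)]
    t2_orders = [T2] if keep_program_order else [list(p) for p in permutations(T2)]
    results = set()
    for a in t1_orders:
        for b in t2_orders:
            for order in interleavings(a, b):
                results.add(run(order))
    return results


sc_outcomes = outcomes(keep_program_order=True)
relaxed_outcomes = outcomes(keep_program_order=False)
print("SC:     ", sorted(sc_outcomes))
print("relaxed:", sorted(relaxed_outcomes))
```

Under these assumptions, r1 = r2 = 0 never appears in the SC set but does appear once the per-thread order is relaxed, which matches the NO and YES answers on the "Back to our example" slides.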

Memory Models Experimented
• SC
• SC with write buffer
• Total store order (TSO)
• Relaxed memory order (RMO)

MTTOP System Configuration

Differences: Loads per Store
• Weak consistency reduces the impact of store latency on performance
• Prior work shows CPUs perform 2-4 loads per store
• MTTOPs perform more loads per store
→ Store latency optimizations will not be as critical to MTTOP performance
[Chart: loads per store, log-scale axis from 1 to 10000]

Differences: Outstanding L1 cache misses
• Weak consistency enables more outstanding L1 misses per thread

                        CPU Core                MTTOP Core (CU/SM)
  Threads per core      4                       64
  SIMD width            4                       64
  L1 miss rate          0.1                     0.5
  SC misses per core    1.6 (too few misses)    2048 (enough misses)
  …                     …                       …
  RMO misses per core   6.4                     8192

• MTTOPs have more outstanding L1 cache misses
→ Thread reordering enabled by weak consistency is less important for handling memory latency

Results: Benchmarks used
• Ported Rodinia benchmarks
  • bfs, hotspot, kmeans, and nn
• Handwritten benchmarks
  • dijkstra, 2dconv, and matrix_mul

Results
[Chart: MTTOP Consistency Model Performance Comparison. Y-axis: speedup (0 to 1.6). Series: SC, SC_WB, TSO, RMO. Benchmarks: 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, nn. Annotation: "Significant load reordering".]

Conclusion
• Strong consistency should not be ruled out for MTTOPs on the basis of performance
• Improving store performance with write buffers appears unnecessary
• Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)
Conventional wisdom may be wrong about MTTOPs

THANK YOU
QUESTIONS?
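
The figures on the "Outstanding L1 cache misses" slide are consistent with a simple product, which the sketch below reproduces. Two things here are assumptions rather than statements from the slides: that SC misses per core equal threads per core × SIMD width × L1 miss rate, and that the RMO row is 4× the SC row (a factor inferred only from the ratios 6.4/1.6 and 8192/2048, not from the elided table row). The function and parameter names are illustrative.

```python
def outstanding_misses(threads_per_core, simd_width, l1_miss_rate, per_thread_factor=1):
    """Estimate outstanding L1 misses per core as a plain product.

    per_thread_factor is a hypothetical multiplier for how many misses each
    thread may have in flight; 1 models SC, 4 is inferred from the RMO row.
    """
    return threads_per_core * simd_width * l1_miss_rate * per_thread_factor


# Values copied from the table; the factor 4 is an inferred assumption.
cpu_sc = outstanding_misses(4, 4, 0.1)
cpu_rmo = outstanding_misses(4, 4, 0.1, per_thread_factor=4)
mttop_sc = outstanding_misses(64, 64, 0.5)
mttop_rmo = outstanding_misses(64, 64, 0.5, per_thread_factor=4)

print("CPU   SC/RMO:", round(cpu_sc, 1), round(cpu_rmo, 1))
print("MTTOP SC/RMO:", round(mttop_sc), round(mttop_rmo))
```

Read this way, the slide's argument is that an MTTOP core already exposes thousands of outstanding misses under SC, so the extra misses per thread that RMO allows matter far less than they do on a CPU core.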