Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University of Illinois at Urbana-Champaign *IBM Research QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. http://iacoma.cs.uiuc.edu Motivation • CMPs are becoming standard components • cheaper to build medium size machines – 32 to 128 cores (multi-CMP) • shared memory, cache coherent – easier to program, easier to manage • supporting cache coherence is difficult QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 2 Cache coherence solutions strategy ordered network? pros cons snoopy broadcast bus yes simple difficult to scale directory based protocol no scalable indirection, extra hardware snoopy embedded ring no simple long latencies • other proposals (e.g. token coherence) QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 3 Contributions • family of adaptive coherence protocols for rings • two were chosen as best options compared to fastest state-of-the-art scheme high performance scheme performance energy consumption Superset Aggressive QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss energy conscious scheme performance energy consumption Superset Conservative Flexible Snooping 4 Multi-CMP multiprocessor CMP Proc + L1 + L2 local network memory • coherence protocol used: only one supplier if line is cached QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 5 Ring in action R S S S Lazy cmp supplier predictor R R Oracle Eager snoop request response QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 6 Ring in action R S S S Lazy R R Oracle Eager latency snoops messages • goal: adaptive schemes that approximate Oracle’s behavior QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 7 Primitive snooping actions • snoop and then forward + fewer messages • forward and then snoop + shorter latency • forward only + fewer snoops + shorter latency – false negative predictions not allowed QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss X Flexible Snooping X 8 Predictors and algorithms set of addresses: node can supply in predictor predictor / algorithm action on negative prediction action on positive prediction Subset forward then snoop snoop Super set Con forward forward then snoop Agg Exact QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss snoop then forward forward Flexible Snooping snoop 9 Algorithms algorithm negative Subset forward then snoop S u p e r set C o n forward A g g Exact QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. positive snoop Lazy Eager snoop then forward Oracle / Exact Subset forward then snoop forward Per miss service: SupersetCon number of messages SupersetAgg snoop Karin Strauss Flexible Snooping 10 Predictor implementation • Subset – associative table: subset of addresses that can be supplied by node • Superset – bloom filter: superset of addresses that can be supplied by node – associative table (exclude cache): addresses that recently suffered false positives • Exact – associative table: all addresses that can be supplied by node – downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 11 Downgrading S E A Negative effects: • writes by this node need to snoop other nodes B A QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. • reads and writes by other nodes need to fetch line from memory Karin Strauss Flexible Snooping 12 Experiments • 8 CMPs, 4 ooo cores each = 32 cores – private L2 caches • on-chip bus interconnect • off-chip 2D torus interconnect with embedded unidirectional ring • per node predictors: latency of 3 processor cycles • sesc simulator (sesc.sourceforge.net) • SPLASH-2, SPECjbb, SPECweb QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 13 Execution time Normalized execution time 1.2 1 0.8 0.6 0.4 0.2 0 SPLASH-2 SPECjbb SPECweb Lazy Eager Oracle Subset SupersetCon SupersetAgg Exact • performance of most flexible snooping algorithms is similar to Eager • the fastest of all algorithms is SupersetAgg QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 14 Normalized energy consumption Miss service energy 2 3.22 Lazy Eager Oracle Subset SupersetCon 1.5 1 0.5 SupersetAgg Exact 0 SPLASH-2 SPECjbb SPECweb • algorithms that eagerly forward messages use more energy • SupersetCon is least energy-hungry algorithm QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 15 Normalized energy consumption Normalized execution time Most cost-effective algorithms 1.2 1 0.8 0.6 0.4 0.2 0 2 SPLASH-2 SPECjbb SPECweb 3.22 1.5 1 0.5 0 SPLASH-2 SPECjbb SPECweb Lazy Eager Oracle Subset SupersetCon SupersetAgg Exact Lazy Eager Oracle Subset SupersetCon SupersetAgg Exact • high performance: Superset Aggressive • faster than Eager at lower energy consumption • energy conscious: Superset Conservative • slightly slower than Eager at much lower energy consumption QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 16 Most cost-effective algorithms compared to fastest state-of-the-art scheme (Eager) Superset Aggressive high performance scheme energy performance consumption Superset Conservative energy conscious scheme energy performance consumption can be combined by only changing forwarding policy QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 17 Conclusions • proposed flexible snooping, a family of adaptive protocols for embedded rings • two chosen protocols – high performance: Superset Aggressive – energy conservation: Superset Conservative – can be selected dynamically • embedded-ring protocols more attractive QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 18 Arch map http://iacoma.cs.uiuc.edu/students/archmap/archmap.html Google: architecture conference map (1st hit) QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Karin Strauss Flexible Snooping 19 Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University of Illinois at Urbana-Champaign *IBM Research QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. http://iacoma.cs.uiuc.edu
© Copyright 2024 Paperzz