Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors
Carmelo Acosta, Francisco J. Cazorla, Alex Ramírez, Mateo Valero
UPC-Barcelona / Barcelona Supercomputing Center
CMP-MSI, Feb. 11th, 2007

Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions

Introduction
- As process technology advances, the key question becomes what to do with the growing transistor budget.
- The current trend is to replicate cores:
  Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad
  AMD: Opteron Dual-Core, Opteron Quad-Core
  IBM: POWER4, POWER5
  Sun Microsystems: Niagara T1, Niagara T2
- [Die photos: POWER4 (CMP) and POWER5 (CMP+SMT).] The memory subsystem (shown in green) spreads over more than half of the chip area.
- Each L1 cache is connected to each L2 bank through a bus-based interconnection network.

Goal
- Is prior research in the SMT field directly applicable to the new CMP+SMT scenario?
- No: well-known SMT ideas have to be revisited, starting with the instruction fetch policy.

ICOUNT
- [Figure: fetch stage and ROB occupancy per thread under ICOUNT.]
- ICOUNT keeps the processor's resources balanced between the running threads.
- When a thread's fetch stalls on an L2 miss, all the resources devoted to that (blue) thread remain unused until the L2 miss is resolved.

FLUSH
- [Figure: fetch stage and ROB occupancy per thread under FLUSH.]
- FLUSH is triggered on an L2 miss: all the resources devoted to the pending instructions of the missing (blue) thread are freed.
- The offending thread is stalled while its L2 miss is outstanding, and the freed resources allow the remaining threads to make additional forward progress.
- Because an L2 miss is detected late, FLUSH relies on L2 miss prediction.

Single vs Multi Core
- [Diagram: one core (I$, D$) vs. four cores, each with private I$ and D$, connected to shared L2 banks b0-b3.]
- The multicore puts more pressure on both the interconnection network and the shared L2 banks.
- As a result, the L2 access latency becomes more unpredictable, which is bad for FLUSH.

Overview (recap): Introduction, Simulation Methodology, Results, Conclusions

Simulation Methodology
- Trace-driven SMT simulator derived from SMTsim.
- C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = number of threads per core).
- [Diagram: cores with private I$ and D$ sharing L2 banks b0-b3; table of core details (* per thread).]
- Instruction fetch policies: ICOUNT and FLUSH.
- Workloads classified by type:
  ILP: all threads have good memory behavior.
  MEM: all threads have bad memory behavior.
  MIX: mixes both types of threads.

Overview (recap): Introduction, Simulation Methodology, Results, Conclusions

Results: Single-Core (2 threads)
- FLUSH yields a 22% average speedup over ICOUNT, mainly on MEM and MIX workloads.

Results: Multi-Core (2 threads/core)
- The more cores, the lower the speedup: FLUSH drops to a 9% average slowdown with respect to ICOUNT on the four-core configuration.

Results: L2 Hit Latency on the Multi-Core
- [Plot: L2 hit latency in cycles.] The more cores, the higher and the more dispersed the L2 hit latency.

Results: L2 Miss Prediction
- In one four-core example, the best choice is to predict an L2 miss after 90 cycles.
- But in another four-core example, the best choice is not to predict the L2 miss at all.

Overview (recap): Introduction, Simulation Methodology, Results, Conclusions
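Note: the following is a minimal sketch of how a FLUSH-style fetch policy with threshold-based L2-miss prediction could be modeled in a simulator. It is not the authors' SMTsim-derived code; the structure names, fields, and helper functions are illustrative assumptions, and only the 90-cycle threshold comes from the four-core example above.

/*
 * Minimal sketch (illustrative, not the paper's simulator code) of
 * ICOUNT fetch selection combined with a FLUSH policy that uses a
 * latency threshold to predict L2 misses.  All names, fields and
 * values except the 90-cycle threshold are assumptions.
 */
#include <stdio.h>
#include <stdbool.h>

#define NUM_THREADS      2
#define MISS_PRED_CYCLES 90   /* predict an L2 miss after this many cycles */

struct thread_ctx {
    int  in_flight;        /* instructions holding shared resources (ICOUNT metric) */
    int  mem_stall_cycles; /* cycles the oldest load has been outstanding */
    bool fetch_stalled;    /* fetch gated after a predicted L2 miss (FLUSH) */
};

/* ICOUNT: fetch from the non-stalled thread with the fewest in-flight instructions. */
static int icount_pick(struct thread_ctx t[NUM_THREADS])
{
    int best = -1;
    for (int i = 0; i < NUM_THREADS; i++) {
        if (t[i].fetch_stalled)
            continue;
        if (best < 0 || t[i].in_flight < t[best].in_flight)
            best = i;
    }
    return best; /* -1 means no thread can fetch this cycle */
}

/* FLUSH: once a load is presumed to miss in L2, squash the thread's
 * pending instructions and stall its fetch until the miss resolves. */
static void flush_policy(struct thread_ctx *t)
{
    if (!t->fetch_stalled && t->mem_stall_cycles >= MISS_PRED_CYCLES) {
        t->in_flight = 0;        /* free the shared resources it was holding */
        t->fetch_stalled = true; /* stop fetching until the data returns */
    }
}

int main(void)
{
    struct thread_ctx t[NUM_THREADS] = {
        { .in_flight = 12, .mem_stall_cycles = 0   }, /* well-behaved thread */
        { .in_flight = 48, .mem_stall_cycles = 120 }, /* thread stuck on a long load */
    };

    for (int i = 0; i < NUM_THREADS; i++)
        flush_policy(&t[i]);

    printf("fetch goes to thread %d\n", icount_pick(t));
    return 0;
}

With a fixed threshold like this, L2 hits that take longer than the threshold on a loaded multicore interconnect are treated as misses and flushed unnecessarily, which is consistent with the slowdown reported in the multicore results above.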
Conclusions
- Future high-degree CMPs open challenging new research topics in CMP+SMT cooperation.
- The characteristics of the CMP's outer cache level and interconnection network may heavily affect SMT intra-core performance.
- For example, FLUSH relies on a predictable L2 hit latency, which is heavily disturbed in a CMP+SMT scenario.
- FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from the single-core to the quad-core configuration.

Thank you. Questions?