Low Latency Scheduling Algorithm for Shared Memory Communications over Optical Networks
Muhammad Ridwan Madarbux, Anouk Van Laer, Philip M. Watts
Electronic and Electrical Engineering Department, University College London

MOTIVATION
• Scaling chip multiprocessors (CMPs) increases thermal problems, with a negative impact on performance
• Photonic networks-on-chip (NoCs) have been shown to have lower power consumption
[Figure: chip multiprocessor tile (core, L1 cache, shared L2, network interface) connected to the optical plane]

MOTIVATION
• Latency in switched photonic networks is dominated by scheduling: request, arbitration and grant
• Scheduling generates a significant overhead in shared memory systems
  – 8 B control messages
  – 16–256 B data messages
[Figure: request, arbitration, grant and data exchange between the source core, the arbiter and the destination core]

CIRCUIT SWITCHING FOR SHARED MEMORY
• Reduce scheduling latency by circuit switching large flows of data between two cores
• Maximise the proportion of messages carried on circuits to minimise contention and improve latency
  – The benefit reduces for circuit periods > 100 clock cycles
• A backup network is required

This work proposes a new scheduling algorithm which intelligently uses information from the cache hierarchy to set up optical circuits.

SIMULATION PARAMETERS
• Benchmarks: PARSEC benchmark suite
  – Blackscholes – calculates the price of options
  – Canneal – optimisation of routing in chip design
  – Dedup – compression using data deduplication
  – Ferret – content similarity search server
  – Fluidanimate – simulates fluids for animations
  – Freqmine – data mining
  – Streamcluster – online clustering of input streams
  – Swaptions – calculates the price of swaptions (Monte Carlo)
  – Vips – image processing
  – X264 – video encoder
• CPU: 32 cores @ 1.2 GHz
• Caches:
  – Private L1 = 16 kB
  – Shared L2 = 1 MB in total (distributed among the tiles)
  – Cache line size = 64 B
  – Cache coherence protocol = MESI
[The parameter table also lists the operating system, kernel, memory and interconnection network]

COMMON COMMUNICATION PATTERNS IN THE MESI COHERENCE PROTOCOL
[Figure: message sequence diagrams over time: a 3-message pattern (Request, Response, Unblock between the requesting L1 and the L2 bank) and a 5-message pattern that also involves another L1 (L1*) caching the address]

CACHE COHERENCE PROTOCOL BASED COMMUNICATION PATTERNS
[Chart: communication patterns]

LATENCY IMPROVEMENTS
[Figure: cycle-by-cycle timing diagrams (source core, arbiter with path allocator and optical switch, destination core) comparing arbitration per message with arbitration per pattern; carrying the REQ, RES and UNB messages of a pattern over one circuit gives 67% overhead savings]

LATENCY IMPROVEMENTS
[Charts: time taken per pattern and head latency per message for each benchmark (BL – Blackscholes, CA – Canneal, DE – Dedup, FE – Ferret, FL – Fluidanimate, FR – Freqmine, ST – Streamcluster, SW – Swaptions, VI – Vips, X2 – X264); improvement of up to 70.6%]

EFFECT OF CONTENTION
[Chart: effect of contention for the same benchmarks; annotations: 4x, < 2%]

CONCLUSION
• We have proposed an algorithm that intelligently uses information from the cache hierarchy to set up optical paths
• The algorithm provides significant latency reductions of up to 70.6% (for Swaptions)
• The results shown are for on-chip networks; larger networks with a longer time of flight will benefit more from this algorithm
  – Examples: multi-socket servers or rack-scale networks

FUTURE WORK
• Considering the implications of this algorithm for the energy consumption of the allocator and control circuits
• Examining the performance of the algorithm under heavier traffic loads, given that the PARSEC benchmarks only lightly load the network
• Measuring the performance improvements in full-system gem5 simulation

THANK YOU FOR YOUR ATTENTION
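APPENDIX: OVERHEAD ARITHMETIC

The 67% figure follows from arbitrating once per coherence pattern instead of once per message. A minimal sketch of that arithmetic, where the function names and the pattern table are illustrative assumptions rather than the authors' code:

```python
# Illustrative sketch: scheduling overhead when the arbiter grants one
# optical circuit per MESI communication pattern instead of one per message.

# Message counts of the two common MESI patterns described above.
PATTERNS = {
    "read_from_l2": 3,      # Request, Response, Unblock
    "read_with_sharer": 5,  # exchange also involving another L1 (L1*)
}

def arbitrations(n_messages: int, per_pattern: bool) -> int:
    """Request/arbitrate/grant rounds needed to deliver a pattern."""
    return 1 if per_pattern else n_messages

def overhead_saving(n_messages: int) -> float:
    """Fraction of scheduling rounds removed by per-pattern circuits."""
    return 1 - arbitrations(n_messages, True) / arbitrations(n_messages, False)

for name, n in PATTERNS.items():
    print(f"{name}: {overhead_saving(n):.0%} fewer arbitration rounds")
# read_from_l2: 67% fewer arbitration rounds
# read_with_sharer: 80% fewer arbitration rounds
```

For the 3-message read pattern this reproduces the 67% overhead saving shown in the timing diagram; the measured end-to-end latency reductions (up to 70.6%) also depend on how often each pattern occurs and on contention.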